Refactor codebase following Clean Code principles and add 229 tests

- Extract helpers to reduce function sizes (listing_tasks, app.py, query.py, listing_fetcher)
- Replace nonlocal mutations with _PipelineState dataclass in listing_tasks
- Fix bugs: isinstance→equality check in repository, verify_exp for OIDC tokens
- Consolidate duplicate filter methods in listing_repository
- Move hardcoded config to env vars with backward-compatible defaults
- Simplify CLI decorator to auto-build QueryParameters
- Add deprecation docstring to data_access.py
- Test count: 158 → 387 (all passing)
parent 7e05b3c971
commit 150342bb9e
48 changed files with 5029 additions and 990 deletions
.claude/CLAUDE.md (new file, 95 lines)
@@ -0,0 +1,95 @@
# Realestate Crawler

## Project Overview

A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI. The repo contains three sub-projects:

- **`crawler/`**: Main application (Python backend + React frontend). Has its own `CLAUDE.md` with detailed architecture docs.
- **`immoweb/`**: Separate scraper (Node.js, legacy/reference).
- **`vqa/`**: Visual QA / testing tooling.

## Command Execution

**All project commands run inside Docker containers.** The dev environment uses Docker Compose: start it first, then exec into containers for any operations.

- **Infrastructure commands** (docker compose, kubectl): Run locally on the Mac
- **All project commands** (pytest, poetry, alembic, python, mypy, ruff, etc.): Run inside the `app` container via `docker compose exec app <command>`

See `.claude/skills/` for detailed skills on dev environment, building, and deploying.

## Quick Reference

| Action | Command | Where |
|--------|---------|-------|
| Start all services | `./start.sh` | `crawler/` |
| Rebuild & start | `./start.sh --build` | `crawler/` |
| Stop services | `./start.sh --down` | `crawler/` |
| Run tests | `pytest tests/ -v --cov=. --cov-report=term-missing` | `crawler/` |
| Type check | `mypy .` | `crawler/` |
| Format code | `yapf --style .style.yapf --recursive .` | `crawler/` |
| Lint (CI runs Ruff) | `ruff check .` | `crawler/` |
| DB migration | `alembic upgrade head` | `crawler/` |
| New migration | `alembic revision -m "description"` | `crawler/` |
| Frontend dev | `cd frontend && ./start.sh` | `crawler/` |

## Tech Stack

- **Backend:** Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- **Frontend:** React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- **Database:** MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- **Infrastructure:** Docker Compose (dev), Kubernetes (prod), Drone CI

## Code Conventions

- **Type checking:** Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- **Formatting:** YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- **Async:** Scraper layer uses `async/await` throughout with `aiohttp`.
- **Models:** SQLModel for DB entities, Pydantic for request/response validation.
- **Service layer:** `services/` contains unified handlers used by both CLI and API. Add new business logic here, not in API routes or CLI commands directly.
- **Repository pattern:** `repositories/` for database queries. Don't put raw SQL in services or API.
- **Tests:** pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.
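
The service and repository conventions above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not code from this repository: `Listing`, `ListingRepository`, `InMemoryListingRepository`, and `ListingService` are hypothetical names, and a real repository would presumably query through SQLModel rather than filter a list.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a DB entity such as RentListing."""
    listing_id: int
    price: int
    bedrooms: int


class ListingRepository(Protocol):
    """Query interface; all data access stays behind this boundary."""

    def find_by_max_price(self, max_price: int) -> list[Listing]:
        ...


class InMemoryListingRepository:
    """Test double that satisfies ListingRepository without a database."""

    def __init__(self, listings: list[Listing]) -> None:
        self._listings = listings

    def find_by_max_price(self, max_price: int) -> list[Listing]:
        return [item for item in self._listings if item.price <= max_price]


class ListingService:
    """Business logic shared by CLI and API; it never builds queries itself."""

    def __init__(self, repository: ListingRepository) -> None:
        self._repository = repository

    def cheapest_per_bedroom(self, max_price: int) -> dict[int, Listing]:
        # Keep the lowest-priced listing for each bedroom count.
        best: dict[int, Listing] = {}
        for listing in self._repository.find_by_max_price(max_price):
            current = best.get(listing.bedrooms)
            if current is None or listing.price < current.price:
                best[listing.bedrooms] = listing
        return best
```

Because the service depends only on the `ListingRepository` protocol, a unit test can pass in the in-memory double while an API route or CLI command passes a database-backed implementation.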

## Architecture Layers (in `crawler/`)

```
API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
       ↑                      ↑                           ↑
  Auth (OIDC)         Core Logic (rec/)           SQLModel (models/)
                              ↑
                     Celery Tasks (tasks/)
```

- `rec/`: Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/`: Orchestration. Query splitting, listing fetching, caching.
- `api/`: FastAPI routes, auth middleware, metrics.
- `models/`: `RentListing`, `BuyListing` SQLModel entities.
- `tasks/`: Celery background tasks with Redis broker.
- `config/`: Env-var-based configuration (scraper settings, schedules).
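
For readers unfamiliar with the circuit breaker mentioned under `rec/`, the general technique can be sketched as below. This is an assumption-laden illustration, not the interface of the project's own module: the class name, method names, thresholds, and the injected `clock` parameter are all hypothetical.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Hypothetical sketch of the pattern, not this project's implementation."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        """Closed: always allow. Open: refuse until the cool-down has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        # Reaching the threshold opens the breaker (or restarts the cool-down).
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

A scraper loop would call `allow_request()` before each fetch and report the outcome with `record_success()` or `record_failure()`, so repeated failures pause traffic instead of hammering a rate-limited endpoint.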

## Key Design Decisions

- **Query splitting** works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- **Circuit breaker** (`rec/circuit_breaker.py`) and **throttle detection** (`rec/throttle_detector.py`) protect against rate limiting.
- **Streaming API** uses NDJSON for progressive loading of large result sets.
- **Redis** serves dual duty as Celery broker and GeoJSON cache.
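
The price-band part of query splitting can be illustrated with a short sketch. It assumes a caller-supplied `count_results(low, high)` function that reports how many listings fall in an inclusive price range; that function, the name `split_price_band`, and the constant `RESULT_CAP` are hypothetical and are not taken from `services/query_splitter.py`.

```python
from typing import Callable

RESULT_CAP = 1500  # approximate cap quoted above; treated here as an assumption


def split_price_band(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Bisect [low, high] until each inclusive band fits under the cap.

    A band that has shrunk to a single price is returned even if it still
    exceeds the cap; splitting on another dimension (district, bedrooms)
    would be needed at that point.
    """
    if count_results(low, high) <= cap or low >= high:
        return [(low, high)]
    mid = (low + high) // 2
    return (
        split_price_band(low, mid, count_results, cap)
        + split_price_band(mid + 1, high, count_results, cap)
    )
```

Each returned band can then be fetched as its own query. Dense price ranges end up with narrow bands and sparse ranges with wide ones, which keeps the number of extra requests small.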

## Environment Variables

See `crawler/.env.sample` for the full list. Key ones:

- `DB_CONNECTION_STRING`: Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key
- `RIGHTMOVE_*`: Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES`: JSON array of periodic scrape configs
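
As an illustration of reading a JSON-array variable such as `SCRAPE_SCHEDULES` with a safe default, a loader might look like the sketch below. The function name `load_scrape_schedules` is hypothetical, and the keys inside each schedule object are not specified in this document, so the sketch validates only the outer shape.

```python
import json
import os
from typing import Any, Mapping


def load_scrape_schedules(
    env: Mapping[str, str] = os.environ,
) -> list[dict[str, Any]]:
    """Parse SCRAPE_SCHEDULES; an unset or blank value yields an empty list."""
    raw = env.get("SCRAPE_SCHEDULES", "").strip()
    if not raw:
        return []  # default: no periodic scrapes configured
    parsed = json.loads(raw)
    # Reject anything that is not a JSON array of objects.
    if not isinstance(parsed, list) or not all(
        isinstance(item, dict) for item in parsed
    ):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array of objects")
    schedules: list[dict[str, Any]] = parsed
    return schedules
```

Passing the environment as a parameter keeps the function easy to test: a unit test can supply a plain dictionary instead of mutating `os.environ`.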

## Git Workflow

- CI: Drone CI builds Docker images on push to `master`, deploys to K8s.
- Linting: GitHub Actions runs Ruff on PR diffs.
- Keep commits focused: one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.

## Directories to Ignore

- `node_modules/`, `__pycache__/`, `.idea/`, `crawler/data/`, `venv/`, `_cache/`