- Extract helpers to reduce function sizes (listing_tasks, app.py, query.py, listing_fetcher)
- Replace nonlocal mutations with _PipelineState dataclass in listing_tasks
- Fix bugs: isinstance→equality check in repository, verify_exp for OIDC tokens
- Consolidate duplicate filter methods in listing_repository
- Move hardcoded config to env vars with backward-compatible defaults
- Simplify CLI decorator to auto-build QueryParameters
- Add deprecation docstring to data_access.py
- Test count: 158 → 387 (all passing)
Realestate Crawler
Project Overview
A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI. The repo contains three sub-projects:
- `crawler/`: Main application (Python backend + React frontend). Has its own `CLAUDE.md` with detailed architecture docs.
- `immoweb/`: Separate scraper (Node.js, legacy/reference).
- `vqa/`: Visual QA / testing tooling.
Command Execution
Project commands run inside Docker containers; only infrastructure commands run on the host. The dev environment uses Docker Compose: start it first, then exec into containers for any operations.
- Infrastructure commands (docker compose, kubectl): run locally on the Mac.
- All project commands (pytest, poetry, alembic, python, mypy, ruff, etc.): run inside the `app` container via `docker compose exec app <command>`.
See .claude/skills/ for detailed skills on dev environment, building, and deploying.
Quick Reference
| Action | Command | Where |
|---|---|---|
| Start all services | `./start.sh` | `crawler/` |
| Rebuild & start | `./start.sh --build` | `crawler/` |
| Stop services | `./start.sh --down` | `crawler/` |
| Run tests | `pytest tests/ -v --cov=. --cov-report=term-missing` | `crawler/` |
| Type check | `mypy .` | `crawler/` |
| Format code | `yapf --style .style.yapf --recursive .` | `crawler/` |
| Lint (CI runs Ruff) | `ruff check .` | `crawler/` |
| DB migration | `alembic upgrade head` | `crawler/` |
| New migration | `alembic revision -m "description"` | `crawler/` |
| Frontend dev | `cd frontend && ./start.sh` | `crawler/` |
Tech Stack
- Backend: Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- Frontend: React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- Database: MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- Infrastructure: Docker Compose (dev), Kubernetes (prod), Drone CI
Code Conventions
- Type checking: Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- Formatting: YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- Async: Scraper layer uses `async/await` throughout with `aiohttp`.
- Models: SQLModel for DB entities, Pydantic for request/response validation.
- Service layer: `services/` contains unified handlers used by both CLI and API. Add new business logic here, not in API routes or CLI commands directly.
- Repository pattern: `repositories/` for database queries. Don't put raw SQL in services or API.
- Tests: pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.
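The service and repository conventions above can be illustrated with a minimal, fully typed sketch. Everything in it is hypothetical: `Listing`, `ListingRepository`, `ListingService`, and `build_demo_service` are illustrative names, and the standard-library `sqlite3` module stands in for SQLModel only so that the example runs without dependencies. The point is the split of responsibilities: SQL lives in the repository, business logic lives in the service.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a SQLModel entity such as RentListing."""

    id: int
    price: int
    bedrooms: int


class ListingRepository:
    """Owns every query; nothing outside this class builds SQL."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def find_by_max_price(self, max_price: int) -> list[Listing]:
        rows = self._conn.execute(
            "SELECT id, price, bedrooms FROM listing WHERE price <= ? ORDER BY price",
            (max_price,),
        ).fetchall()
        return [Listing(*row) for row in rows]


class ListingService:
    """Business logic that a CLI command and an API route could both call."""

    def __init__(self, repo: ListingRepository) -> None:
        self._repo = repo

    def cheapest_per_bedroom(self, max_price: int) -> dict[int, Listing]:
        best: dict[int, Listing] = {}
        for listing in self._repo.find_by_max_price(max_price):
            # Rows arrive sorted by price, so the first one seen per bedroom count wins.
            best.setdefault(listing.bedrooms, listing)
        return best


def build_demo_service() -> ListingService:
    """Wire the layers together over an in-memory database with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listing (id INTEGER PRIMARY KEY, price INTEGER, bedrooms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listing (id, price, bedrooms) VALUES (?, ?, ?)",
        [(1, 1500, 1), (2, 1200, 1), (3, 2400, 2), (4, 5000, 3)],
    )
    return ListingService(ListingRepository(conn))
```

Because the service only sees the repository's typed methods, a test can swap in a fake repository without touching a database.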
Architecture Layers (in crawler/)
    API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
           ↑                       ↑                             ↑
      Auth (OIDC)          Core Logic (rec/)            SQLModel (models/)
                                   ↑
                         Celery Tasks (tasks/)
- `rec/`: Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/`: Orchestration. Query splitting, listing fetching, caching.
- `api/`: FastAPI routes, auth middleware, metrics.
- `models/`: `RentListing`, `BuyListing` SQLModel entities.
- `tasks/`: Celery background tasks with Redis broker.
- `config/`: Env-var-based configuration (scraper settings, schedules).
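For readers unfamiliar with the circuit breaker pattern mentioned for `rec/`, here is a generic sketch. It is not the code in this repository: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen for illustration. After a run of consecutive failures the breaker rejects calls until a cool-down period has passed, then lets one trial call through.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the timeout elapses the breaker is "half-open": one trial call may pass.
        return self._clock() - self._opened_at < self._reset_timeout

    def call(self, func: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open (or re-open after a failed trial call) and restart the timer.
                self._opened_at = self._clock()
            raise
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A scraper would wrap each outbound request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to back off rather than retry immediately.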
Key Design Decisions
- Query splitting works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- Circuit breaker (`rec/circuit_breaker.py`) and throttle detection (`rec/throttle_detector.py`) protect against rate limiting.
- Streaming API uses NDJSON for progressive loading of large result sets.
- Redis serves dual duty as Celery broker and GeoJSON cache.
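The price-band part of query splitting can be sketched as a recursive bisection. This is an illustration of the idea only, not the logic in `services/query_splitter.py`: `split_price_bands`, `make_counter`, and the `count` callback are hypothetical, the sketch ignores the district and bedroom dimensions, and it assumes a result count for a price range is available before fetching.

```python
from __future__ import annotations

from bisect import bisect_left, bisect_right
from typing import Callable

RESULT_CAP = 1500  # approximate cap quoted in the list above


def split_price_bands(
    low: int,
    high: int,
    count: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Return inclusive price bands covering [low, high].

    Each band holds at most `cap` results, unless it is a single price
    point that cannot be split any further.
    """
    if count(low, high) <= cap or low >= high:
        return [(low, high)]
    mid = (low + high) // 2
    # Halve the band and recurse; the halves are disjoint and adjacent.
    return split_price_bands(low, mid, count, cap) + split_price_bands(
        mid + 1, high, count, cap
    )


def make_counter(prices: list[int]) -> Callable[[int, int], int]:
    """Fake result counter over an in-memory price list (stands in for a remote query)."""
    ordered = sorted(prices)

    def count(low: int, high: int) -> int:
        return bisect_right(ordered, high) - bisect_left(ordered, low)

    return count
```

With 4000 evenly spread prices and a cap of 1500, `split_price_bands(0, 3999, make_counter(list(range(4000))))` yields four bands of 1000 results each. Each level of recursion costs one extra count request per band, which is the trade-off for never hitting the cap.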
Environment Variables
See crawler/.env.sample for the full list. Key ones:
- `DB_CONNECTION_STRING`: Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key
- `RIGHTMOVE_*`: Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES`: JSON array of periodic scrape configs
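As a rough sketch of env-var-based configuration with defaults, the loader below reads a few of these variables into a typed settings object. It makes several assumptions: `Settings` and `load_settings` are hypothetical names, `RIGHTMOVE_CONCURRENCY` is an invented example of a `RIGHTMOVE_*` variable, the default values are placeholders, and schedule entries are treated as opaque JSON objects because their schema is not described here. Check `crawler/.env.sample` for the real names.

```python
from __future__ import annotations

import json
import os
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    rightmove_concurrency: int  # hypothetical RIGHTMOVE_* tuning knob
    scrape_schedules: list[dict[str, Any]] = field(default_factory=list)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to placeholder defaults."""
    schedules = json.loads(env.get("SCRAPE_SCHEDULES", "[]"))
    if not isinstance(schedules, list):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array")
    return Settings(
        # Placeholder default; the real local-dev URL is an assumption here.
        db_connection_string=env.get("DB_CONNECTION_STRING", "sqlite:///local.db"),
        rightmove_concurrency=int(env.get("RIGHTMOVE_CONCURRENCY", "4")),
        scrape_schedules=schedules,
    )
```

Passing the environment as a mapping keeps the loader easy to test: a plain dict can replace `os.environ` without patching globals.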
Git Workflow
- CI: Drone CI builds Docker images on push to `master`, deploys to K8s.
- Linting: GitHub Actions runs Ruff on PR diffs.
- Keep commits focused — one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.
Directories to Ignore
`node_modules/`, `__pycache__/`, `.idea/`, `crawler/data/`, `venv/`, `_cache/`