# Realestate Crawler

## Project Overview

A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI.

## Command Execution

**All commands run inside Docker containers.** The dev environment uses Docker Compose — start it first, then exec into containers for any operations.

- **Infrastructure commands** (docker compose, kubectl) — Run locally on the Mac
- **All project commands** (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the `app` container via `docker compose exec app `

See `.claude/skills/` for detailed skills on dev environment, building, and deploying.

## Quick Reference

| Action | Command |
|--------|---------|
| Start all services | `./start.sh` |
| Rebuild & start | `./start.sh --build` |
| Stop services | `./start.sh --down` |
| Run tests | `pytest tests/ -v --cov=. --cov-report=term-missing` |
| Type check | `mypy .` |
| Format code | `yapf --style .style.yapf --recursive .` |
| Lint (CI runs Ruff) | `ruff check .` |
| DB migration | `alembic upgrade head` |
| New migration | `alembic revision -m "description"` |
| Frontend dev | `cd frontend && ./start.sh` |

## Tech Stack

- **Backend:** Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- **Frontend:** React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- **Database:** MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- **Infrastructure:** Docker Compose (dev), Kubernetes (prod), Woodpecker CI

## Code Conventions

- **Type checking:** Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- **Formatting:** YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- **Async:** Scraper layer uses `async/await` throughout with `aiohttp`.
- **Models:** SQLModel for DB entities, Pydantic for request/response validation.
- **Service layer:** `services/` contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
- **Repository pattern:** `repositories/` for database queries. Don't put raw SQL in services or API.
- **Tests:** pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.

## Architecture Layers

```
API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
       ↑                      ↑                            ↑
  Auth (OIDC)         Core Logic (rec/)           SQLModel (models/)
                              ↑
                     Celery Tasks (tasks/)
```

- `rec/` — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/` — Orchestration. Query splitting, listing fetching, caching.
- `api/` — FastAPI routes, auth middleware, metrics.
- `models/` — `RentListing`, `BuyListing` SQLModel entities.
- `tasks/` — Celery background tasks with Redis broker.
- `config/` — Env-var-based configuration (scraper settings, schedules).

## Key Design Decisions

- **Query splitting** works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- **Circuit breaker** (`rec/circuit_breaker.py`) and **throttle detection** (`rec/throttle_detector.py`) protect against rate limiting.
- **Streaming API** uses NDJSON for progressive loading of large result sets.
- **Redis** serves dual duty as Celery broker and GeoJSON cache.

## Environment Variables

See `.env.sample` for the full list. Key ones:

- `DB_CONNECTION_STRING` — Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND` — Redis URLs
- `ROUTING_API_KEY` — Google Maps API key
- `RIGHTMOVE_*` — Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES` — JSON array of periodic scrape configs

## Git Workflow

- CI: Woodpecker CI (`.woodpecker/`) builds Docker images on push to `master`, deploys to K8s.
- Linting: GitHub Actions runs Ruff on PR diffs.
- Keep commits focused — one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.

## Directories to Ignore

- `node_modules/`, `__pycache__/`, `.idea/`, `data/`, `venv/`, `_cache/`
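
## Example: Price-Band Splitting (Illustrative)

The query-splitting decision described under Key Design Decisions can be illustrated with a minimal sketch of the price-band step alone. This is not the code in `services/query_splitter.py`; the function name `split_price_band`, the `count_results` callback, and the half-open `[low, high)` band convention are all assumptions made for illustration. The idea is to keep halving a price range until every band reports fewer results than the cap, so that no single query is truncated.

```python
from typing import Callable

# Approximate result cap mentioned in this document; treated as a tunable assumption.
RESULT_CAP = 1500


def split_price_band(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Halve the half-open band [low, high) until each band is below `cap`.

    `count_results(low, high)` is a hypothetical callback that returns how many
    listings fall in the band; a real implementation would query the upstream API.
    """
    # Stop when the band fits under the cap, or when it cannot be split further.
    if count_results(low, high) < cap or high - low <= 1:
        return [(low, high)]

    mid = (low + high) // 2
    # Recurse on both halves; the bands stay contiguous and non-overlapping.
    return (
        split_price_band(low, mid, count_results, cap)
        + split_price_band(mid, high, count_results, cap)
    )
```

Called with a price range and a counting function, it returns a list of contiguous bands that together cover the original range. Splitting by district and bedrooms, which the document also mentions, is outside the scope of this sketch.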