Refactor codebase following Clean Code principles and add 229 tests

- Extract helpers to reduce function sizes (listing_tasks, app.py, query.py, listing_fetcher)
- Replace nonlocal mutations with _PipelineState dataclass in listing_tasks
- Fix bugs: isinstance→equality check in repository, verify_exp for OIDC tokens
- Consolidate duplicate filter methods in listing_repository
- Move hardcoded config to env vars with backward-compatible defaults
- Simplify CLI decorator to auto-build QueryParameters
- Add deprecation docstring to data_access.py
- Test count: 158 → 387 (all passing)
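
The `nonlocal`-to-dataclass change listed above can be illustrated with a minimal sketch. Only the name `_PipelineState` comes from the commit message; the fields (`fetched`, `failed`, `errors`) and the functions `_process_item` and `run_pipeline` are hypothetical and are not taken from listing_tasks.

```python
from dataclasses import dataclass, field


@dataclass
class _PipelineState:
    """Mutable counters shared by the pipeline helpers.

    Field names are illustrative assumptions, not the real ones.
    """

    fetched: int = 0
    failed: int = 0
    errors: list[str] = field(default_factory=list)


def _process_item(item: str, state: _PipelineState) -> None:
    """Update the shared state explicitly instead of through `nonlocal`."""
    if item.startswith("bad"):
        state.failed += 1
        state.errors.append(f"could not process {item}")
    else:
        state.fetched += 1


def run_pipeline(items: list[str]) -> _PipelineState:
    """Hypothetical driver: create the state once and pass it to each helper."""
    state = _PipelineState()
    for item in items:
        _process_item(item, state)
    return state
```

Passing the state object as an argument lets each extracted helper be tested on its own, which nested closures mutating `nonlocal` variables do not allow.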

2026-02-07 20:19:57 +00:00

4.7 KiB

Realestate Crawler

Project Overview

A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI. The repo contains three sub-projects:

crawler/ — Main application (Python backend + React frontend). Has its own CLAUDE.md with detailed architecture docs.
immoweb/ — Separate scraper (Node.js, legacy/reference).
vqa/ — Visual QA / testing tooling.

Command Execution

All project commands run inside Docker containers; only infrastructure commands run on the host. The dev environment uses Docker Compose: start it first, then exec into the app container for any project operation.

Infrastructure commands (docker compose, kubectl) — Run locally on the Mac
All project commands (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the app container via docker compose exec app <command>

See .claude/skills/ for detailed skills on dev environment, building, and deploying.

Quick Reference

Action	Command	Where
Start all services	`./start.sh`	`crawler/`
Rebuild & start	`./start.sh --build`	`crawler/`
Stop services	`./start.sh --down`	`crawler/`
Run tests	`pytest tests/ -v --cov=. --cov-report=term-missing`	`crawler/`
Type check	`mypy .`	`crawler/`
Format code	`yapf --style .style.yapf --recursive .`	`crawler/`
Lint (CI runs Ruff)	`ruff check .`	`crawler/`
DB migration	`alembic upgrade head`	`crawler/`
New migration	`alembic revision -m "description"`	`crawler/`
Frontend dev	`cd frontend && ./start.sh`	`crawler/`

Tech Stack

Backend: Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
Frontend: React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
Database: MySQL 9 (prod) / SQLite (local dev), Alembic migrations
Infrastructure: Docker Compose (dev), Kubernetes (prod), Drone CI

Code Conventions

Type checking: Strict mypy (disallow_untyped_defs=true). All functions need type annotations.
Formatting: YAPF (.style.yapf config) + Ruff linter (CI on PRs).
Async: Scraper layer uses async/await throughout with aiohttp.
Models: SQLModel for DB entities, Pydantic for request/response validation.
Service layer: services/ contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
Repository pattern: repositories/ for database queries. Don't put raw SQL in services or API.
Tests: pytest with pytest-asyncio (auto mode). Unit tests in tests/unit/, integration in tests/integration/.

Architecture Layers (in `crawler/`)

API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
     ↑                      ↑                           ↑
  Auth (OIDC)         Core Logic (rec/)            SQLModel (models/)
                           ↑
                    Celery Tasks (tasks/)

rec/ — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
services/ — Orchestration. Query splitting, listing fetching, caching.
api/ — FastAPI routes, auth middleware, metrics.
models/ — RentListing, BuyListing SQLModel entities.
tasks/ — Celery background tasks with Redis broker.
config/ — Env-var-based configuration (scraper settings, schedules).
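
To make the circuit breaker mentioned under rec/ more concrete, here is a minimal sketch of the general pattern. It is not the code in rec/circuit_breaker.py: the class names, the default thresholds, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Opens after N consecutive failures and allows a retry after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def is_open(self) -> bool:
        """True while the cooldown since the last trip has not yet elapsed."""
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self._reset_timeout

    def call(self, func: Callable[[], T]) -> T:
        """Run func unless the breaker is open; track consecutive failures."""
        if self.is_open:
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In this sketch a failed trial call after the cooldown immediately re-opens the breaker, because the failure count is only reset on success.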

Key Design Decisions

Query splitting works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See services/query_splitter.py.
Circuit breaker (rec/circuit_breaker.py) and throttle detection (rec/throttle_detector.py) protect against rate limiting.
Streaming API uses NDJSON for progressive loading of large result sets.
Redis serves dual duty as Celery broker and GeoJSON cache.
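
The price-band part of the query splitting described above can be sketched as a recursive bisection: if a band would return more results than the cap, split it at the midpoint and check each half. This is a simplified illustration, not the logic in services/query_splitter.py. `count_results` stands in for a hypothetical call that reports how many listings match a price band, and the cap of 1500 is the approximate figure quoted above.

```python
from typing import Callable

RESULT_CAP = 1500  # approximate cap quoted above; treat as an assumption


def split_price_band(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Return inclusive price bands (low, high) that each fit under the cap.

    A band is bisected until its count is within the cap or it is a single
    price point and cannot be split any further.
    """
    if low >= high or count_results(low, high) <= cap:
        return [(low, high)]
    mid = (low + high) // 2
    return (
        split_price_band(low, mid, count_results, cap)
        + split_price_band(mid + 1, high, count_results, cap)
    )
```

Dense price ranges end up with narrow bands and sparse ranges with wide ones, so the number of queries adapts to the data. A band that still exceeds the cap at a single price point would need a different dimension, such as district or bedrooms, to be split further.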

Environment Variables

See crawler/.env.sample for the full list. Key ones:

DB_CONNECTION_STRING — Database URL
CELERY_BROKER_URL / CELERY_RESULT_BACKEND — Redis URLs
ROUTING_API_KEY — Google Maps API key
RIGHTMOVE_* — Scraper tuning (concurrency, delays, thresholds, proxy)
SCRAPE_SCHEDULES — JSON array of periodic scrape configs
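
A minimal sketch of how env-var-based settings with fallback defaults might be read. The variable `RIGHTMOVE_MAX_CONCURRENCY`, every default value, and the shape of each `SCRAPE_SCHEDULES` entry are assumptions for illustration; crawler/.env.sample remains the authoritative list.

```python
import json
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical settings object; the real config/ module may differ."""

    db_connection_string: str
    max_concurrency: int
    schedules: list[dict[str, Any]]


def load_config(env: Mapping[str, str] = os.environ) -> ScraperConfig:
    """Read settings from the environment, falling back to assumed defaults."""
    return ScraperConfig(
        # Default value is an assumption, chosen to match "SQLite (local dev)".
        db_connection_string=env.get("DB_CONNECTION_STRING", "sqlite:///local.db"),
        # Hypothetical RIGHTMOVE_* variable name and default.
        max_concurrency=int(env.get("RIGHTMOVE_MAX_CONCURRENCY", "5")),
        # SCRAPE_SCHEDULES is documented as a JSON array; empty when unset.
        schedules=json.loads(env.get("SCRAPE_SCHEDULES", "[]")),
    )
```

Taking the environment as a parameter keeps the function easy to test: a plain dict can be passed in place of `os.environ`.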

Git Workflow

CI: Drone CI builds Docker images on push to master and deploys them to Kubernetes.
Linting: GitHub Actions runs Ruff on PR diffs.
Keep commits focused — one logical change per commit.
Group related files (e.g., code + its tests) in the same commit.

Directories to Ignore

node_modules/, __pycache__/, .idea/, crawler/data/, venv/, _cache/
