wrongmove/.claude/CLAUDE.md
Viktor Barzin 150342bb9e
Refactor codebase following Clean Code principles and add 229 tests
- Extract helpers to reduce function sizes (listing_tasks, app.py, query.py, listing_fetcher)
  - Replace nonlocal mutations with _PipelineState dataclass in listing_tasks
  - Fix bugs: isinstance→equality check in repository, verify_exp for OIDC tokens
  - Consolidate duplicate filter methods in listing_repository
  - Move hardcoded config to env vars with backward-compatible defaults
  - Simplify CLI decorator to auto-build QueryParameters
  - Add deprecation docstring to data_access.py
  - Test count: 158 → 387 (all passing)
2026-02-07 20:19:57 +00:00


Realestate Crawler

Project Overview

A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI. The repo contains three sub-projects:

  • crawler/ — Main application (Python backend + React frontend). Has its own CLAUDE.md with detailed architecture docs.
  • immoweb/ — Separate scraper (Node.js, legacy/reference).
  • vqa/ — Visual QA / testing tooling.

Command Execution

All project commands run inside Docker containers. The dev environment uses Docker Compose: start it first, then exec into the app container for any operation.

  • Infrastructure commands (docker compose, kubectl) — Run locally on the Mac
  • All project commands (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the app container via docker compose exec app <command>

See .claude/skills/ for detailed skills on dev environment, building, and deploying.
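
The routing rule above can be sketched as a small helper. This is only an illustration, not code from the repo: `wrap_command` and `INFRA_TOOLS` are hypothetical names, and the only facts taken from this document are the `docker compose exec app <command>` form and the two infrastructure tools.

```python
from __future__ import annotations

import shlex

# Tools this document says run locally on the host, not in the container.
INFRA_TOOLS = {"docker", "kubectl"}


def wrap_command(command: str, service: str = "app") -> list[str]:
    """Return the argv that runs `command` in the right place.

    Hypothetical helper: infrastructure commands pass through unchanged,
    everything else is prefixed with `docker compose exec <service>`.
    """
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    if argv[0] in INFRA_TOOLS:
        return argv  # infrastructure: run on the host as-is
    return ["docker", "compose", "exec", service, *argv]
```

For example, `wrap_command("mypy .")` yields the argv for `docker compose exec app mypy .`, while `wrap_command("kubectl get pods")` is returned untouched.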

Quick Reference

| Action              | Command                                            | Where    |
|---------------------|----------------------------------------------------|----------|
| Start all services  | ./start.sh                                         | crawler/ |
| Rebuild & start     | ./start.sh --build                                 | crawler/ |
| Stop services       | ./start.sh --down                                  | crawler/ |
| Run tests           | pytest tests/ -v --cov=. --cov-report=term-missing | crawler/ |
| Type check          | mypy .                                             | crawler/ |
| Format code         | yapf --style .style.yapf --recursive .             | crawler/ |
| Lint (CI runs Ruff) | ruff check .                                       | crawler/ |
| DB migration        | alembic upgrade head                               | crawler/ |
| New migration       | alembic revision -m "description"                  | crawler/ |
| Frontend dev        | cd frontend && ./start.sh                          | crawler/ |

Tech Stack

  • Backend: Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
  • Frontend: React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
  • Database: MySQL 9 (prod) / SQLite (local dev), Alembic migrations
  • Infrastructure: Docker Compose (dev), Kubernetes (prod), Drone CI

Code Conventions

  • Type checking: Strict mypy (disallow_untyped_defs=true). All functions need type annotations.
  • Formatting: YAPF (.style.yapf config) + Ruff linter (CI on PRs).
  • Async: Scraper layer uses async/await throughout with aiohttp.
  • Models: SQLModel for DB entities, Pydantic for request/response validation.
  • Service layer: services/ contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
  • Repository pattern: repositories/ for database queries. Don't put raw SQL in services or API.
  • Tests: pytest with pytest-asyncio (auto mode). Unit tests in tests/unit/, integration in tests/integration/.
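
A minimal sketch of how the service and repository conventions fit together, with every function fully annotated as strict mypy requires. All names here (`Listing`, `ListingRepository`, `InMemoryListingRepository`, `ListingService`) are hypothetical, and the sketch uses plain dataclasses instead of SQLModel so it runs with the standard library alone.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a SQLModel entity."""

    listing_id: int
    price: int
    bedrooms: int


class ListingRepository(Protocol):
    """Data-access boundary: queries live behind this interface."""

    def find_by_bedrooms(self, bedrooms: int) -> list[Listing]: ...


class InMemoryListingRepository:
    """Test double; a real repository would query the database."""

    def __init__(self, rows: list[Listing]) -> None:
        self._rows = rows

    def find_by_bedrooms(self, bedrooms: int) -> list[Listing]:
        return [row for row in self._rows if row.bedrooms == bedrooms]


class ListingService:
    """Business logic that both a CLI command and an API route could call."""

    def __init__(self, repository: ListingRepository) -> None:
        self._repository = repository

    def cheapest(self, bedrooms: int) -> Optional[Listing]:
        # The service never touches storage directly, only the repository.
        matches = self._repository.find_by_bedrooms(bedrooms)
        return min(matches, key=lambda row: row.price, default=None)
```

Because the service depends only on the `ListingRepository` protocol, a unit test can pass in the in-memory double while an integration test wires in a database-backed implementation.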

Architecture Layers (in crawler/)

API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
     ↑                      ↑                           ↑
  Auth (OIDC)         Core Logic (rec/)            SQLModel (models/)
                           ↑
                    Celery Tasks (tasks/)

  • rec/ — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
  • services/ — Orchestration. Query splitting, listing fetching, caching.
  • api/ — FastAPI routes, auth middleware, metrics.
  • models/ — RentListing, BuyListing SQLModel entities.
  • tasks/ — Celery background tasks with Redis broker.
  • config/ — Env-var-based configuration (scraper settings, schedules).
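
For readers unfamiliar with the circuit breaker mentioned under rec/, the general pattern can be sketched as below. This is a generic illustration under assumed defaults, not the contents of rec/circuit_breaker.py: the class name, thresholds, and the injectable `clock` parameter are all hypothetical.

```python
from __future__ import annotations

import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic breaker: open after N consecutive failures, retry after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,  # assumed default
        reset_after: float = 30.0,  # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

Injecting the clock keeps the breaker testable without sleeping: a test can advance a fake clock past `reset_after` and check that the next call goes through.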

Key Design Decisions

  • Query splitting works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See services/query_splitter.py.
  • Circuit breaker (rec/circuit_breaker.py) and throttle detection (rec/throttle_detector.py) protect against rate limiting.
  • Streaming API uses NDJSON for progressive loading of large result sets.
  • Redis serves dual duty as Celery broker and GeoJSON cache.
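
The price-band part of query splitting can be sketched as a recursive halving of the price range. This is a simplified illustration, not the code in services/query_splitter.py: `split_price_bands` and the `count_results` callback are hypothetical, the cap is the approximate figure quoted above, and the district and bedroom dimensions are left out.

```python
from __future__ import annotations

from typing import Callable

RESULT_CAP = 1500  # approximate cap quoted in this document


def split_price_bands(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Split the half-open range [low, high) into bands that fit under `cap`.

    `count_results(low, high)` is a hypothetical callback returning how many
    listings a query for that band would match.
    """
    # Stop when the band fits, or when it cannot be narrowed any further.
    if count_results(low, high) <= cap or high - low <= 1:
        return [(low, high)]
    mid = (low + high) // 2
    return split_price_bands(low, mid, count_results, cap) + split_price_bands(
        mid, high, count_results, cap
    )
```

Dense price ranges end up with narrow bands and sparse ranges stay wide, so the number of queries grows only where the cap would otherwise truncate results.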

Environment Variables

See crawler/.env.sample for the full list. Key ones:

  • DB_CONNECTION_STRING — Database URL
  • CELERY_BROKER_URL / CELERY_RESULT_BACKEND — Redis URLs
  • ROUTING_API_KEY — Google Maps API key
  • RIGHTMOVE_* — Scraper tuning (concurrency, delays, thresholds, proxy)
  • SCRAPE_SCHEDULES — JSON array of periodic scrape configs
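
A minimal sketch of reading a few of these variables with fallbacks. The variable names come from the list above; the `Settings` class, `load_settings`, the default URLs, and the shape of each schedule entry are assumptions for illustration and may not match config/.

```python
from __future__ import annotations

import json
import os
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container."""

    db_connection_string: str
    celery_broker_url: str
    scrape_schedules: list[dict[str, Any]]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping (defaults are assumed values)."""
    schedules = json.loads(env.get("SCRAPE_SCHEDULES", "[]"))
    if not isinstance(schedules, list):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array")
    return Settings(
        # Assumed local-dev fallbacks, not values taken from .env.sample.
        db_connection_string=env.get("DB_CONNECTION_STRING", "sqlite:///local.db"),
        celery_broker_url=env.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
        scrape_schedules=schedules,
    )
```

Taking the environment as a parameter lets tests pass a plain dict instead of mutating `os.environ`.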

Git Workflow

  • CI: Drone CI builds Docker images on push to master, deploys to K8s.
  • Linting: GitHub Actions runs Ruff on PR diffs.
  • Keep commits focused — one logical change per commit.
  • Group related files (e.g., code + its tests) in the same commit.

Directories to Ignore

  • node_modules/, __pycache__/, .idea/, crawler/data/, venv/, _cache/