wrongmove/.claude/CLAUDE.md
Viktor Barzin eafbc1ac52
Flatten repo structure: move crawler/ to root, remove vqa/ and immoweb/
The crawler subdirectory was the only active project. Moving it to the
repo root simplifies paths and removes the unnecessary nesting. The
vqa/ and immoweb/ directories were legacy/unused and have been removed.

Updated .drone.yml, .gitignore, .claude/ docs, and skills to reflect
the new flat structure.
2026-02-07 23:01:20 +00:00


Realestate Crawler

Project Overview

A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI.

Command Execution

All project commands run inside Docker containers. The dev environment uses Docker Compose: start it first, then exec into the containers for any operation.

  • Infrastructure commands (docker compose, kubectl) — Run locally on the Mac
  • All project commands (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the app container via docker compose exec app <command>

See .claude/skills/ for detailed skills on dev environment, building, and deploying.

Quick Reference

| Action | Command |
| --- | --- |
| Start all services | ./start.sh |
| Rebuild & start | ./start.sh --build |
| Stop services | ./start.sh --down |
| Run tests | pytest tests/ -v --cov=. --cov-report=term-missing |
| Type check | mypy . |
| Format code | yapf --style .style.yapf --recursive . |
| Lint (CI runs Ruff) | ruff check . |
| DB migration | alembic upgrade head |
| New migration | alembic revision -m "description" |
| Frontend dev | cd frontend && ./start.sh |

Tech Stack

  • Backend: Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
  • Frontend: React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
  • Database: MySQL 9 (prod) / SQLite (local dev), Alembic migrations
  • Infrastructure: Docker Compose (dev), Kubernetes (prod), Drone CI

Code Conventions

  • Type checking: Strict mypy (disallow_untyped_defs=true). All functions need type annotations.
  • Formatting: YAPF (configured in .style.yapf) for formatting, plus Ruff for linting (runs in CI on PRs).
  • Async: Scraper layer uses async/await throughout with aiohttp.
  • Models: SQLModel for DB entities, Pydantic for request/response validation.
  • Service layer: services/ contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
  • Repository pattern: repositories/ for database queries. Don't put raw SQL in services or API.
  • Tests: pytest with pytest-asyncio (auto mode). Unit tests in tests/unit/, integration in tests/integration/.

Architecture Layers

API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
     ↑                      ↑                           ↑
  Auth (OIDC)         Core Logic (rec/)            SQLModel (models/)
                           ↑
                    Celery Tasks (tasks/)
  • rec/ — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
  • services/ — Orchestration. Query splitting, listing fetching, caching.
  • api/ — FastAPI routes, auth middleware, metrics.
  • models/ — RentListing, BuyListing SQLModel entities.
  • tasks/ — Celery background tasks with Redis broker.
  • config/ — Env-var-based configuration (scraper settings, schedules).
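
As a rough illustration of the circuit breaker idea mentioned for rec/, here is a generic sketch. It is not the contents of rec/circuit_breaker.py; the class name, method names, and thresholds are assumptions chosen for the example.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects calls
    until `reset_after` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: allow a trial request once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


# Usage with a fake clock (values are illustrative).
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow_request()
now[0] = 10.0
trial_allowed = breaker.allow_request()
```

A scraper loop would check allow_request() before each outbound call and report the outcome with record_success() or record_failure(), so a run of throttled responses pauses traffic instead of hammering the upstream site.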

Key Design Decisions

  • Query splitting works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See services/query_splitter.py.
  • Circuit breaker (rec/circuit_breaker.py) and throttle detection (rec/throttle_detector.py) protect against rate limiting.
  • Streaming API uses NDJSON for progressive loading of large result sets.
  • Redis serves dual duty as Celery broker and GeoJSON cache.
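
The price-band part of query splitting can be sketched as a recursive halving. This is a simplified illustration under stated assumptions, not the code in services/query_splitter.py: split_price_band, count_results, and fake_count are hypothetical names, and the district and bedroom dimensions are left out.

```python
from typing import Callable

RESULT_CAP = 1500  # approximate cap described above


def split_price_band(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Halve the half-open band [low, high) until each piece fits under `cap`."""
    # Stop when the band fits, or when it cannot be divided any further.
    if count_results(low, high) <= cap or high - low <= 1:
        return [(low, high)]
    mid = (low + high) // 2
    return (
        split_price_band(low, mid, count_results, cap)
        + split_price_band(mid, high, count_results, cap)
    )


def fake_count(low: int, high: int) -> int:
    """Stand-in for a real result-count query: one listing per price unit."""
    return high - low


bands = split_price_band(0, 4000, fake_count)
```

With the fake counter, the 4000-wide band is halved twice, giving four bands of 1000 results each. In practice count_results would be a cheap request that returns only the total hit count for the band.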

Environment Variables

See .env.sample for the full list. Key ones:

  • DB_CONNECTION_STRING — Database URL
  • CELERY_BROKER_URL / CELERY_RESULT_BACKEND — Redis URLs
  • ROUTING_API_KEY — Google Maps API key
  • RIGHTMOVE_* — Scraper tuning (concurrency, delays, thresholds, proxy)
  • SCRAPE_SCHEDULES — JSON array of periodic scrape configs
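
A minimal sketch of reading these variables follows. The Settings class, load_settings function, default values, and the "name" and "cron" keys inside SCRAPE_SCHEDULES are all assumptions for illustration; check .env.sample and config/ for the real names and schema.

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    celery_broker_url: str
    scrape_schedules: list[dict[str, Any]]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a few of the variables listed above; the defaults are assumptions."""
    schedules = json.loads(env.get("SCRAPE_SCHEDULES", "[]"))
    if not isinstance(schedules, list):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array")
    return Settings(
        db_connection_string=env.get("DB_CONNECTION_STRING", "sqlite:///local.db"),
        celery_broker_url=env.get("CELERY_BROKER_URL", "redis://localhost:6379/0"),
        scrape_schedules=schedules,
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({
    "DB_CONNECTION_STRING": "sqlite:///dev.db",
    "SCRAPE_SCHEDULES": '[{"name": "nightly", "cron": "0 2 * * *"}]',  # hypothetical schema
})
```

Validating the JSON shape at startup surfaces a malformed SCRAPE_SCHEDULES value immediately rather than when the first periodic task fires.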

Git Workflow

  • CI: Drone CI builds Docker images on push to master and deploys them to K8s.
  • Linting: GitHub Actions runs Ruff on PR diffs.
  • Keep commits focused — one logical change per commit.
  • Group related files (e.g., code + its tests) in the same commit.

Directories to Ignore

  • node_modules/, __pycache__/, .idea/, data/, venv/, _cache/