# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square-meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.
## Development Environment

**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:

- All shell commands (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- Starting the project: use the remote executor to run `./start.sh`
- Running tests: use the remote executor to run `pytest`
- Any CLI operations: use the remote executor to run `python main.py ...`

Never run commands directly on the local machine; always route them through the remote executor.
## Commands

### Setup and Run (Docker - Recommended)

```shell
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh

# Rebuild images and start
./start.sh --build

# Stop all containers
./start.sh --down

# View logs
./start.sh --logs
```
### Setup and Run (Local with Poetry)

```shell
# Install dependencies
poetry install && cp .env.sample .env

# Start backend locally (requires Redis running)
./start.sh --local

# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```
### CLI Operations

The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):

```shell
# Dump listings from the Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2

# Download floorplan images
python main.py dump-images

# Extract square meters from floorplans using OCR
python main.py detect-floorplan

# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10

# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```
### Testing

```shell
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing

# Run type checker
mypy .
```
### Database Migrations

```shell
alembic upgrade head               # Apply migrations
alembic revision -m "description"  # Create a new migration
```
### Code Formatting

```shell
yapf --style .style.yapf --recursive .
```
## Architecture

### Core Data Flow

- **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
- **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
- **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
- **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from an external Authentik service
- **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with a Redis broker
### Key Models

- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: DEPRECATED. Legacy `Listing` dataclass for filesystem-based data access; use `models.listing.RentListing` or `models.listing.BuyListing` instead.
### Services Layer (Unified CLI and API)

**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and the HTTP API use. This ensures consistency and code reuse.

High-level services (use these in the CLI and API):

- `listing_service.py`: Listing operations
  - `get_listings()`: Retrieve listings from the database
  - `refresh_listings()`: Fetch new listings from Rightmove (sync or async)
  - `download_images()`: Download floorplan images
  - `detect_floorplans()`: Run OCR on floorplans
  - `calculate_routes()`: Calculate transit routes
- `export_service.py`: Export operations
  - `export_to_csv()`: Export listings to a CSV file
  - `export_to_geojson()`: Export listings to GeoJSON (file or in-memory)
- `district_service.py`: District management
  - `get_all_districts()`: Get the district name → region ID mapping
  - `get_district_names()`: Get the list of district names
  - `validate_districts()`: Validate district names
- `task_service.py`: Background task management
  - `get_task_status()`: Get Celery task status
  - `get_user_tasks()`: Get all tasks for a user
  - `add_task_for_user()`: Associate a task with a user

Low-level services (internal implementation):

- `listing_fetcher.py`: Fetches listing data from the Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square-meter detection
- `route_calculator.py`: Calculates transit routes using the Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction
### Query Splitting System

Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch all matching listings.

How it works:

1. **Initial split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (a minimal API request) to get `totalAvailableResults`
3. **Adaptive split**: If results exceed the threshold (1,200), the price range is binary-split
4. **Recursive refinement**: Splitting continues until all subqueries are under the threshold
5. **Full fetch**: Each subquery fetches up to 60 pages (1,500 results max)

```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
          ↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
          ↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓
Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```

Key components:

- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with a `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result-count probing, `create_session()` for connection pooling
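The recursive refinement above can be sketched as follows. This is a simplified, price-only illustration: `SubQuery` echoes the real dataclass in `services/query_splitter.py`, but the actual class has more fields, and the `probe` callback here stands in for `probe_query()`.

```python
from dataclasses import dataclass
from typing import Callable, List

SPLIT_THRESHOLD = 1200  # RIGHTMOVE_SPLIT_THRESHOLD
MIN_PRICE_BAND = 100    # RIGHTMOVE_MIN_PRICE_BAND


@dataclass
class SubQuery:
    """Simplified stand-in for the real SubQuery dataclass."""
    min_price: int
    max_price: int


def split_until_under_threshold(
    query: SubQuery, probe: Callable[[SubQuery], int]
) -> List[SubQuery]:
    """Binary-split a price range until each subquery's probed result
    count is under the threshold, or the band is too narrow to split."""
    count = probe(query)
    band = query.max_price - query.min_price
    if count <= SPLIT_THRESHOLD or band <= MIN_PRICE_BAND:
        return [query]
    mid = (query.min_price + query.max_price) // 2
    return (split_until_under_threshold(SubQuery(query.min_price, mid), probe)
            + split_until_under_threshold(SubQuery(mid, query.max_price), probe))
```

With a probe that reports 0.75 results per £ of band width, a £1000–£5000 query (3,000 results) splits into four subqueries of 750 results each, all safely under the 1,200 threshold.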
### Processing Pipeline

`ListingProcessor` runs sequential steps defined in `listing_processor.py`:

1. `FetchListingDetailsStep`: Get property details from the API
2. `FetchImagesStep`: Download floorplan images
3. `DetectFloorplanStep`: OCR to extract square meters from floorplans
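The step pattern behind this pipeline can be sketched as below. The step class names match the list above, but the bodies (and the `process()` driver) are illustrative stand-ins, not the real implementation in `listing_processor.py`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Step(ABC):
    """One pipeline stage; receives and returns the listing context."""

    @abstractmethod
    def run(self, listing: Dict) -> Dict: ...


class FetchListingDetailsStep(Step):
    def run(self, listing: Dict) -> Dict:
        listing["details"] = {"id": listing["id"]}  # stand-in for an API call
        return listing


class FetchImagesStep(Step):
    def run(self, listing: Dict) -> Dict:
        listing["images"] = ["floorplan.png"]  # stand-in for a download
        return listing


class DetectFloorplanStep(Step):
    def run(self, listing: Dict) -> Dict:
        listing["sqm"] = 75.0  # stand-in for the OCR result
        return listing


def process(listing: Dict, steps: List[Step]) -> Dict:
    # Steps run sequentially; each enriches the listing the previous produced.
    for step in steps:
        listing = step.run(listing)
    return listing
```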
### Floorplan OCR

`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square-meter values from floorplan images.
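The text-parsing half of that extraction can be sketched with the standard library alone; the real module also handles the pytesseract call and the image preprocessing. The regex and the `extract_sqm` helper below are hypothetical, not the actual code.

```python
import re
from typing import Optional

# Matches values like "85.5 sq m", "9 sqm", "70 m2", or "70 m²" in OCR text.
SQM_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m\b|m2\b|m²)",
    re.IGNORECASE,
)


def extract_sqm(ocr_text: str) -> Optional[float]:
    """Return the largest square-meter value found in OCR text, or None.

    Floorplans often list per-room areas alongside a total; taking the
    maximum is a simple heuristic for picking the total.
    """
    values = [float(m.group(1)) for m in SQM_PATTERN.finditer(ocr_text)]
    return max(values) if values else None
```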
### Repository Pattern

`repositories/listing_repository.py` handles database operations with SQLModel sessions.
## Environment Variables

- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing
### Scraper Configuration

These control the query splitting behavior (see `.env.sample` for defaults):

| Variable | Default | Description |
|---|---|---|
| `RIGHTMOVE_MAX_CONCURRENT` | `5` | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | `100` | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | `1200` | Split a query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | `100` | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | `60` | Max pages per subquery (60 × 25 = 1,500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
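Loading this table from the environment might look like the following sketch. The variable names and defaults mirror the table above, but the class and field names are assumptions; see `config/scraper_config.py` for the real loader.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScraperConfig:
    """Illustrative env-backed configuration for the scraper."""
    max_concurrent: int
    request_delay_ms: int
    split_threshold: int
    min_price_band: int
    max_pages: int
    proxy_url: Optional[str]

    @classmethod
    def from_env(cls) -> "ScraperConfig":
        # Unset variables fall back to the defaults from the table above.
        return cls(
            max_concurrent=int(os.getenv("RIGHTMOVE_MAX_CONCURRENT", "5")),
            request_delay_ms=int(os.getenv("RIGHTMOVE_REQUEST_DELAY_MS", "100")),
            split_threshold=int(os.getenv("RIGHTMOVE_SPLIT_THRESHOLD", "1200")),
            min_price_band=int(os.getenv("RIGHTMOVE_MIN_PRICE_BAND", "100")),
            max_pages=int(os.getenv("RIGHTMOVE_MAX_PAGES", "60")),
            proxy_url=os.getenv("RIGHTMOVE_PROXY_URL"),  # None when unset
        )
```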
## Project Structure

- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)
## Type Checking

The project uses a strict mypy configuration with `disallow_untyped_defs = true`. Run `mypy .` to check types.
## Exploration Preferences

- Always ignore the `node_modules` directory when exploring the codebase
## Git Workflow

**IMPORTANT**: After completing work items, always create separate commits for each logical change:

- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)