# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.

## Development Environment

**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:

- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`

Never run commands directly on the local machine - always route them through the remote executor.

## Commands

### Setup and Run (Docker - Recommended)

```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh

# Rebuild images and start
./start.sh --build

# Stop all containers
./start.sh --down

# View logs
./start.sh --logs
```

### Setup and Run (Local with Poetry)

```bash
# Install dependencies
poetry install && cp .env.sample .env

# Start backend locally (requires Redis running)
./start.sh --local

# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```

### CLI Operations

The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):

```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2

# Download floorplan images
python main.py dump-images

# Extract square meters from floorplans using OCR
python main.py detect-floorplan

# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10

# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```

### Testing

```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing

# Run type checker
mypy .
```

### Database Migrations

```bash
alembic upgrade head               # Apply migrations
alembic revision -m "description"  # Create new migration
```

### Code Formatting

```bash
yapf --style .style.yapf --recursive .
```

## Architecture

### Core Data Flow

1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker

### Key Models

- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.

### Services Layer (Unified CLI and API)

**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.
#### High-level services (use these in CLI and API):

- **`listing_service.py`**: Listing operations
  - `get_listings()` - Retrieve listings from database
  - `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
  - `download_images()` - Download floorplan images
  - `detect_floorplans()` - Run OCR on floorplans
  - `calculate_routes()` - Calculate transit routes
- **`export_service.py`**: Export operations
  - `export_to_csv()` - Export listings to CSV file
  - `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)
- **`district_service.py`**: District management
  - `get_all_districts()` - Get district name → region ID mapping
  - `get_district_names()` - Get list of district names
  - `validate_districts()` - Validate district names
- **`task_service.py`**: Background task management
  - `get_task_status()` - Get Celery task status
  - `get_user_tasks()` - Get all tasks for a user
  - `add_task_for_user()` - Associate task with user

#### Low-level services (internal implementation):

- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction

### Query Splitting System

Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.

#### How it works:

1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed the threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under the threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages (1,500 results max)

```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
          ↓ split by price
£1000-£3000: 1,800 (still over!)  |  £3000-£5000: 1,200 ✓
          ↓ split again
£1000-£2000: 900 ✓  |  £2000-£3000: 900 ✓

Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```

#### Key components:

- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling

### Processing Pipeline

`ListingProcessor` runs sequential steps defined in `listing_processor.py`:

1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans

### Floorplan OCR

`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.

### Repository Pattern

`repositories/listing_repository.py` handles database operations with SQLModel sessions.
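The recursive price-band splitting described in the Query Splitting System section can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `SubQuery` fields, the `probe` callback, and the recursion shape are illustrative, not the actual `QuerySplitter`/`SubQuery` implementation in `services/query_splitter.py`.

```python
# Illustrative sketch of adaptive query splitting (not the real QuerySplitter API).
from dataclasses import dataclass
from typing import Callable

SPLIT_THRESHOLD = 1200  # mirrors RIGHTMOVE_SPLIT_THRESHOLD
MIN_PRICE_BAND = 100    # mirrors RIGHTMOVE_MIN_PRICE_BAND


@dataclass
class SubQuery:
    min_price: int
    max_price: int


def split_query(q: SubQuery, probe: Callable[[SubQuery], int]) -> list[SubQuery]:
    """Recursively binary-split the price range until every subquery's
    probed result count is under the threshold (or the band is too narrow)."""
    total = probe(q)  # cheap API probe returning totalAvailableResults
    band = q.max_price - q.min_price
    if total <= SPLIT_THRESHOLD or band <= MIN_PRICE_BAND:
        return [q]
    mid = (q.min_price + q.max_price) // 2
    return (split_query(SubQuery(q.min_price, mid), probe)
            + split_query(SubQuery(mid, q.max_price), probe))
```

For example, with a fake probe that spreads 3,000 results uniformly over £1000-£5000, `split_query(SubQuery(1000, 5000), probe)` recurses twice and yields four £1000-wide subqueries of 750 results each, all safely under the cap.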
## Environment Variables

- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing

### Scraper Configuration

These control the query splitting behavior (see `.env.sample` for defaults):

| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |

## Project Structure

- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)

## Type Checking

The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.
## Exploration Preferences

- Always ignore `node_modules` directory when exploring the codebase

## Git Workflow

**IMPORTANT**: After completing work items, always create separate commits for each logical change:

- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)