CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.

Development Environment

IMPORTANT: This project runs on a remote host, not locally. Always use the remote executor to run commands:

  • All shell commands (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
  • Starting the project: Use the remote executor to run ./start.sh
  • Running tests: Use the remote executor to run pytest
  • Any CLI operations: Use the remote executor to run python main.py ...

Never run commands directly on the local machine; always route them through the remote executor.

Commands

# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh

# Rebuild images and start
./start.sh --build

# Stop all containers
./start.sh --down

# View logs
./start.sh --logs

Setup and Run (Local with Poetry)

# Install dependencies
poetry install && cp .env.sample .env

# Start backend locally (requires Redis running)
./start.sh --local

# Start frontend (from frontend/ directory)
cd frontend && ./start.sh

CLI Operations

The main CLI (main.py) uses Click with a --data-dir option (default: data/rs/):

# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2

# Download floorplan images
python main.py dump-images

# Extract square meters from floorplans using OCR
python main.py detect-floorplan

# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10

# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
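As a rough sketch of how the Click group with a shared --data-dir option might feed its subcommands (the command and option names follow the examples above, but the body is illustrative and not main.py's actual implementation):

```python
import click

@click.group()
@click.option("--data-dir", default="data/rs/", show_default=True,
              help="Directory where listing JSON and images are stored.")
@click.pass_context
def cli(ctx, data_dir):
    """Hypothetical sketch of main.py's Click group wiring --data-dir."""
    ctx.obj = {"data_dir": data_dir}

@cli.command("dump-listings")
@click.option("--type", "listing_type",
              type=click.Choice(["rent", "buy"]), default="rent")
@click.option("--min-price", type=int)
@click.option("--max-price", type=int)
@click.option("--min-bedrooms", type=int)
@click.pass_context
def dump_listings(ctx, listing_type, min_price, max_price, min_bedrooms):
    # The real command would call into the services layer; this just
    # demonstrates how the shared context reaches a subcommand.
    click.echo(f"{listing_type} listings -> {ctx.obj['data_dir']}")
```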

Testing

# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing

# Run type checker
mypy .

Database Migrations

alembic upgrade head    # Apply migrations
alembic revision -m "description"  # Create new migration

Code Formatting

yapf --style .style.yapf --recursive .

Architecture

Core Data Flow

  1. Scraping (rec/query.py): Fetches listing IDs and details from Rightmove's Android API
  2. Processing (listing_processor.py): Pipeline with steps for fetching details, downloading images, and OCR detection
  3. Storage: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in data/rs/<listing_id>/
  4. API (api/app.py): FastAPI endpoints authenticated via JWT from external Authentik service
  5. Background Tasks (tasks/listing_tasks.py): Celery tasks for async listing processing with Redis broker
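The JSON side of the storage step (3) can be sketched as follows; the details.json filename is an assumption for illustration, not necessarily the layout the crawler actually writes:

```python
import json
from pathlib import Path

def save_listing(data_dir: Path, listing_id: str, details: dict) -> Path:
    """Persist raw listing details as JSON under <data_dir>/<listing_id>/.
    The 'details.json' filename is assumed for this sketch."""
    listing_dir = data_dir / listing_id
    listing_dir.mkdir(parents=True, exist_ok=True)
    path = listing_dir / "details.json"
    path.write_text(json.dumps(details, indent=2))
    return path

def load_listing(data_dir: Path, listing_id: str) -> dict:
    """Read a previously saved listing back from disk."""
    return json.loads((data_dir / listing_id / "details.json").read_text())
```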

Key Models

  • models/listing.py: SQLModel entities (RentListing, BuyListing) with QueryParameters for filtering
  • data_access.py: DEPRECATED - Legacy Listing dataclass for filesystem-based data access. Use models.listing.RentListing or models.listing.BuyListing instead.
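A minimal model of what QueryParameters-style filtering looks like, sketched with a plain dataclass rather than SQLModel; the field names are assumptions standing in for the real model in models/listing.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryParameters:
    """Illustrative stand-in for the filtering parameters attached to
    listing queries; field names here are assumed, not the real schema."""
    min_price: Optional[int] = None
    max_price: Optional[int] = None
    min_bedrooms: Optional[int] = None

    def matches(self, listing: dict) -> bool:
        """Check whether a listing record satisfies every set filter."""
        if self.min_price is not None and listing["price"] < self.min_price:
            return False
        if self.max_price is not None and listing["price"] > self.max_price:
            return False
        if self.min_bedrooms is not None and listing["bedrooms"] < self.min_bedrooms:
            return False
        return True
```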

Services Layer (Unified CLI and API)

IMPORTANT: The services/ directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.

High-level services (use these in CLI and API):

  • listing_service.py: Listing operations

    • get_listings() - Retrieve listings from database
    • refresh_listings() - Fetch new listings from Rightmove (sync or async)
    • download_images() - Download floorplan images
    • detect_floorplans() - Run OCR on floorplans
    • calculate_routes() - Calculate transit routes
  • export_service.py: Export operations

    • export_to_csv() - Export listings to CSV file
    • export_to_geojson() - Export listings to GeoJSON (file or in-memory)
  • district_service.py: District management

    • get_all_districts() - Get district name → region ID mapping
    • get_district_names() - Get list of district names
    • validate_districts() - Validate district names
  • task_service.py: Background task management

    • get_task_status() - Get Celery task status
    • get_user_tasks() - Get all tasks for a user
    • add_task_for_user() - Associate task with user
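The unified-handler pattern above boils down to one service function with two thin wrappers. This sketch uses invented names and a fake in-memory "database" purely to show the shape; the real signatures live in services/:

```python
from typing import List, Optional

def get_listings(listing_type: str,
                 min_price: Optional[int] = None) -> List[dict]:
    """Single source of truth used by both the CLI and the HTTP API.
    The fake rows below stand in for the repository layer."""
    fake_db = [
        {"type": "rent", "price": 2500},
        {"type": "rent", "price": 1800},
        {"type": "buy", "price": 350000},
    ]
    rows = [r for r in fake_db if r["type"] == listing_type]
    if min_price is not None:
        rows = [r for r in rows if r["price"] >= min_price]
    return rows

def cli_get_listings(listing_type: str) -> int:
    """CLI wrapper: print rows and return an exit code."""
    for row in get_listings(listing_type):
        print(row)
    return 0

def api_get_listings(listing_type: str) -> dict:
    """API wrapper: return a JSON-serialisable payload."""
    return {"listings": get_listings(listing_type)}
```

Both wrappers stay trivial because all filtering logic lives in the shared service function, which is the point of the unified layer.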

Low-level services (internal implementation):

  • listing_fetcher.py: Fetches listing data from Rightmove API
  • image_fetcher.py: Downloads floorplan images
  • floorplan_detector.py: OCR-based square meter detection
  • route_calculator.py: Calculates transit routes using Google Maps API
  • query_splitter.py: Intelligent query splitting to maximize data extraction

Query Splitting System

Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch all matching listings.

How it works:

  1. Initial Split: Queries are split by district and bedroom count
  2. Probe: Each subquery is probed (minimal API request) to get totalAvailableResults
  3. Adaptive Split: If results exceed threshold (1,200), the price range is binary-split
  4. Recursive Refinement: Splitting continues until all subqueries are under threshold
  5. Full Fetch: Each subquery fetches up to 60 pages (1,500 results max)
Worked example:

Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
              ↓ split by price
£1000-£3000: 1,800 (still over!)  |  £3000-£5000: 1,200 ✓
        ↓ split again
£1000-£2000: 900 ✓  |  £2000-£3000: 900 ✓

Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
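The recursion in steps 2-4 can be sketched as follows. This is a simplified model: it splits only on price, probe is a stand-in for the real probe_query() call in rec/query.py, and the constants mirror the scraper configuration defaults:

```python
from dataclasses import dataclass
from typing import Callable, List

SPLIT_THRESHOLD = 1200  # RIGHTMOVE_SPLIT_THRESHOLD
MIN_PRICE_BAND = 100    # RIGHTMOVE_MIN_PRICE_BAND

@dataclass
class SubQuery:
    min_price: int
    max_price: int

def split_query(query: SubQuery,
                probe: Callable[[SubQuery], int]) -> List[SubQuery]:
    """Recursively binary-split the price range until each subquery's
    probed result count is under the threshold, or the band is too
    narrow to split further."""
    count = probe(query)  # minimal API request in the real system
    band = query.max_price - query.min_price
    if count <= SPLIT_THRESHOLD or band < 2 * MIN_PRICE_BAND:
        return [query]
    mid = query.min_price + band // 2
    return (split_query(SubQuery(query.min_price, mid), probe)
            + split_query(SubQuery(mid, query.max_price), probe))
```

Because every split halves the band and recursion stops below twice the minimum band width, the process always terminates even when a narrow price range stays over the threshold.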

Key components:

  • config/scraper_config.py: Configuration with env var loading
  • services/query_splitter.py: QuerySplitter class with SubQuery dataclass
  • rec/query.py: probe_query() for result count probing, create_session() for connection pooling

Processing Pipeline

ListingProcessor runs sequential steps defined in listing_processor.py:

  1. FetchListingDetailsStep - Get property details from API
  2. FetchImagesStep - Download floorplan images
  3. DetectFloorplanStep - OCR to extract square meters from floorplans
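The sequential-step pattern can be sketched like this; class and method names are assumptions standing in for listing_processor.py's real API, and each step body is a stub where the real fetch/download/OCR work would happen:

```python
class Step:
    """Base class for one pipeline stage."""
    def run(self, listing: dict) -> dict:
        raise NotImplementedError

class FetchListingDetailsStep(Step):
    def run(self, listing: dict) -> dict:
        listing["details"] = {"price": 2000}  # stand-in for an API call
        return listing

class FetchImagesStep(Step):
    def run(self, listing: dict) -> dict:
        listing["images"] = ["floorplan.png"]  # stand-in for a download
        return listing

class DetectFloorplanStep(Step):
    def run(self, listing: dict) -> dict:
        listing["sqm"] = 85.0  # stand-in for OCR detection
        return listing

class ListingProcessor:
    """Runs each step in order, threading the listing dict through."""
    def __init__(self, steps):
        self.steps = steps

    def process(self, listing: dict) -> dict:
        for step in self.steps:
            listing = step.run(listing)
        return listing
```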

Floorplan OCR

rec/floorplan.py uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
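The preprocessing and OCR itself require OpenCV and pytesseract; what follows is only a sketch of the downstream parsing step, with an assumed regex that may differ from the logic in rec/floorplan.py:

```python
import re

# Assumed pattern for pulling a square-meter value out of OCR'd floorplan
# text; variations like "85.5 sq m", "85.5 sqm", and "85.5 m²" are covered.
SQM_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m\b|sqm\b|m²)", re.IGNORECASE)

def extract_sqm(ocr_text: str):
    """Return the first square-meter value found in OCR output, or None."""
    match = SQM_PATTERN.search(ocr_text)
    return float(match.group(1)) if match else None
```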

Repository Pattern

repositories/listing_repository.py handles database operations with SQLModel sessions.
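A minimal sketch of the repository pattern, written over stdlib sqlite3 for self-containment; the real listing_repository.py uses SQLModel sessions, and the table and method names here are illustrative:

```python
import sqlite3
from typing import Optional

class ListingRepository:
    """Encapsulates all SQL so callers never touch the connection directly."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS rent_listing ("
            "id TEXT PRIMARY KEY, price INTEGER, bedrooms INTEGER)")

    def add(self, listing_id: str, price: int, bedrooms: int) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO rent_listing VALUES (?, ?, ?)",
            (listing_id, price, bedrooms))
        self.conn.commit()

    def get(self, listing_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT id, price, bedrooms FROM rent_listing WHERE id = ?",
            (listing_id,)).fetchone()
        return None if row is None else {
            "id": row[0], "price": row[1], "bedrooms": row[2]}
```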

Environment Variables

  • DB_CONNECTION_STRING: Database URL (SQLite default: sqlite:///data/wrongmove.db)
  • CELERY_BROKER_URL / CELERY_RESULT_BACKEND: Redis URLs
  • ROUTING_API_KEY: Google Maps API key for transit routing

Scraper Configuration

These control the query splitting behavior (see .env.sample for defaults):

  • RIGHTMOVE_MAX_CONCURRENT (default: 5): Maximum concurrent HTTP requests
  • RIGHTMOVE_REQUEST_DELAY_MS (default: 100): Delay between requests, in milliseconds
  • RIGHTMOVE_SPLIT_THRESHOLD (default: 1200): Split a query when probed results exceed this
  • RIGHTMOVE_MIN_PRICE_BAND (default: 100): Minimum price band width; queries are not split below this
  • RIGHTMOVE_MAX_PAGES (default: 60): Maximum pages per subquery (60 × 25 results/page = 1,500)
  • RIGHTMOVE_PROXY_URL (default: unset): SOCKS proxy URL (e.g., socks5://localhost:9050 for Tor)
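Loading these variables with their defaults can be sketched as below; the variable names mirror the list above, but the real config/scraper_config.py may be structured differently:

```python
import os
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    """Env-var-backed scraper settings with the documented defaults."""
    max_concurrent: int = 5
    request_delay_ms: int = 100
    split_threshold: int = 1200
    min_price_band: int = 100
    max_pages: int = 60
    proxy_url: str = ""

    @classmethod
    def from_env(cls) -> "ScraperConfig":
        """Read each RIGHTMOVE_* variable, falling back to the default."""
        return cls(
            max_concurrent=int(os.environ.get("RIGHTMOVE_MAX_CONCURRENT", 5)),
            request_delay_ms=int(os.environ.get("RIGHTMOVE_REQUEST_DELAY_MS", 100)),
            split_threshold=int(os.environ.get("RIGHTMOVE_SPLIT_THRESHOLD", 1200)),
            min_price_band=int(os.environ.get("RIGHTMOVE_MIN_PRICE_BAND", 100)),
            max_pages=int(os.environ.get("RIGHTMOVE_MAX_PAGES", 60)),
            proxy_url=os.environ.get("RIGHTMOVE_PROXY_URL", ""),
        )
```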

Project Structure

  • main.py: CLI entry point
  • api/: FastAPI application with auth middleware
  • config/: Configuration modules (scraper settings, scheduled tasks)
  • models/: SQLModel database entities
  • repositories/: Database access layer
  • rec/: Core business logic (query, floorplan OCR, routing, districts)
  • services/: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
  • tasks/: Celery background tasks
  • frontend/: React/Vite frontend with Caddy proxy
  • alembic/: Database migrations
  • tests/: Test suite (unit and integration tests)

Type Checking

The project uses strict mypy configuration with disallow_untyped_defs=true. Run mypy . to check types.

Exploration Preferences

  • Always ignore node_modules directory when exploring the codebase

Git Workflow

IMPORTANT: After completing work items, always create separate commits for each logical change:

  • Keep each commit focused on one feature/fix
  • Do not include unrelated files
  • Use descriptive commit messages
  • Group related files together (e.g., tests with the code they test)