# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.
## Development Environment
**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:
- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`
Never run commands directly on the local machine; always route them through the remote executor.
## Commands
### Setup and Run (Docker - Recommended)
```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh
# Rebuild images and start
./start.sh --build
# Stop all containers
./start.sh --down
# View logs
./start.sh --logs
```
### Setup and Run (Local with Poetry)
```bash
# Install dependencies
poetry install && cp .env.sample .env
# Start backend locally (requires Redis running)
./start.sh --local
# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```
### CLI Operations
The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):
```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2
# Download floorplan images
python main.py dump-images
# Extract square meters from floorplans using OCR
python main.py detect-floorplan
# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10
# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```
### Testing
```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing
# Run type checker
mypy .
```
### Database Migrations
```bash
alembic upgrade head # Apply migrations
alembic revision -m "description" # Create new migration
```
### Code Formatting
```bash
yapf --style .style.yapf --recursive .
```
## Architecture
### Core Data Flow
1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker
### Key Models
- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.
### Services Layer (Unified CLI and API)
**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.
#### High-level services (use these in CLI and API):
- **`listing_service.py`**: Listing operations
- `get_listings()` - Retrieve listings from database
- `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
- `download_images()` - Download floorplan images
- `detect_floorplans()` - Run OCR on floorplans
- `calculate_routes()` - Calculate transit routes
- **`export_service.py`**: Export operations
- `export_to_csv()` - Export listings to CSV file
- `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)
- **`district_service.py`**: District management
- `get_all_districts()` - Get district name → region ID mapping
- `get_district_names()` - Get list of district names
- `validate_districts()` - Validate district names
- **`task_service.py`**: Background task management
- `get_task_status()` - Get Celery task status
- `get_user_tasks()` - Get all tasks for a user
- `add_task_for_user()` - Associate task with user
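The shared-handler pattern above can be sketched as follows. This is an illustrative, self-contained example, not the project's actual code: the `Listing` stand-in, the in-memory rows, and the two entry-point wrappers are all hypothetical.

```python
# Sketch of the unified-services pattern: one handler, two entry points.
from dataclasses import dataclass


@dataclass
class Listing:
    id: int
    price: int


def get_listings(min_price: int = 0) -> list[Listing]:
    """Single source of truth used by both the CLI and the API."""
    rows = [Listing(1, 1800), Listing(2, 2500)]  # stand-in for a DB query
    return [listing for listing in rows if listing.price >= min_price]


def cli_list(min_price: int) -> None:
    """A Click command would delegate here and print results."""
    for listing in get_listings(min_price=min_price):
        print(listing.id, listing.price)


def api_list(min_price: int) -> list[dict]:
    """A FastAPI endpoint would delegate here and serialize results."""
    return [vars(listing) for listing in get_listings(min_price=min_price)]
```

Because filtering lives only in `get_listings()`, a behavior change (say, a new filter) is made once and both interfaces pick it up.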
#### Low-level services (internal implementation):
- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction
### Query Splitting System
Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.
#### How it works:
1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages (1,500 results max)
```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓
Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```
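The recursive refinement above can be sketched in a few lines. This is a hedged sketch, not the real `QuerySplitter`: `probe` stands in for `probe_query()`, and `threshold` / `min_band` mirror `RIGHTMOVE_SPLIT_THRESHOLD` / `RIGHTMOVE_MIN_PRICE_BAND`.

```python
# Sketch of the adaptive price-band split: binary-split a range until
# every band is under the result cap (or too narrow to split further).
from typing import Callable


def split_query(min_price: int, max_price: int,
                probe: Callable[[int, int], int],
                threshold: int = 1200,
                min_band: int = 100) -> list[tuple[int, int]]:
    total = probe(min_price, max_price)       # minimal API request
    band = max_price - min_price
    if total <= threshold or band <= min_band:
        return [(min_price, max_price)]       # leaf: fetch this subquery fully
    mid = min_price + band // 2               # binary split on price
    return (split_query(min_price, mid, probe, threshold, min_band)
            + split_query(mid, max_price, probe, threshold, min_band))
```

Running this against the example in the diagram (£1000–£5000 probing at 3,000 results) yields exactly the three final subqueries shown.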
#### Key components:
- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling
### Processing Pipeline
`ListingProcessor` runs sequential steps defined in `listing_processor.py`:
1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans
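The sequential-step structure can be sketched like this. Step names match the list above, but the step bodies and the `dict`-based listing are placeholders, not the actual implementation.

```python
# Illustrative sketch of a ListingProcessor-style sequential pipeline.
from abc import ABC, abstractmethod


class Step(ABC):
    @abstractmethod
    def run(self, listing: dict) -> dict: ...


class FetchListingDetailsStep(Step):
    def run(self, listing: dict) -> dict:
        listing["details"] = "fetched"          # placeholder for the API call
        return listing


class FetchImagesStep(Step):
    def run(self, listing: dict) -> dict:
        listing["images"] = ["floorplan.png"]   # placeholder download
        return listing


class DetectFloorplanStep(Step):
    def run(self, listing: dict) -> dict:
        listing["sqm"] = 72.5                   # placeholder OCR result
        return listing


def process(listing: dict, steps: list[Step]) -> dict:
    for step in steps:  # each step enriches the listing and passes it on
        listing = step.run(listing)
    return listing
```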
### Floorplan OCR
`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
### Repository Pattern
`repositories/listing_repository.py` handles database operations with SQLModel sessions.
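The repository pattern hides SQL behind intent-revealing methods. A minimal sketch, using stdlib `sqlite3` so it stays self-contained (the real `listing_repository.py` uses SQLModel sessions, and the table and method names here are hypothetical):

```python
# Minimal repository sketch: callers never see SQL, only domain methods.
import sqlite3


class ListingRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS rent_listing "
            "(id INTEGER PRIMARY KEY, price INTEGER)")

    def add(self, listing_id: int, price: int) -> None:
        self.conn.execute(
            "INSERT INTO rent_listing VALUES (?, ?)", (listing_id, price))

    def find_cheaper_than(self, max_price: int) -> list[tuple[int, int]]:
        cur = self.conn.execute(
            "SELECT id, price FROM rent_listing WHERE price < ? ORDER BY price",
            (max_price,))
        return cur.fetchall()
```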
## Environment Variables
- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing
### Scraper Configuration
These control the query splitting behavior (see `.env.sample` for defaults):
| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
## Project Structure
- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)
## Type Checking
The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.
## Exploration Preferences
- Always ignore `node_modules` directory when exploring the codebase
## Git Workflow
**IMPORTANT**: After completing work items, always create separate commits for each logical change:
- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)