# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.
## Development Environment
**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:
- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`
Never run commands directly on the local machine; always route them through the remote executor.
## Commands
### Setup and Run (Docker - Recommended)
```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh
# Rebuild images and start
./start.sh --build
# Stop all containers
./start.sh --down
# View logs
./start.sh --logs
```
### Setup and Run (Local with Poetry)
```bash
# Install dependencies
poetry install && cp .env.sample .env
# Start backend locally (requires Redis running)
./start.sh --local
# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```
### CLI Operations
The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):
```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2
# Download floorplan images
python main.py dump-images
# Extract square meters from floorplans using OCR
python main.py detect-floorplan
# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10
# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```
### Testing
```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing
# Run type checker
mypy .
```
### Database Migrations
```bash
alembic upgrade head # Apply migrations
alembic revision -m "description" # Create new migration
```
### Code Formatting
```bash
yapf --style .style.yapf --recursive .
```
## Architecture
### Core Data Flow
1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker
### Key Models
- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.
### Services Layer (Unified CLI and API)
**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.
#### High-level services (use these in CLI and API):
- **`listing_service.py`**: Listing operations
- `get_listings()` - Retrieve listings from database
- `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
- `download_images()` - Download floorplan images
- `detect_floorplans()` - Run OCR on floorplans
- `calculate_routes()` - Calculate transit routes
- **`export_service.py`**: Export operations
- `export_to_csv()` - Export listings to CSV file
- `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)
- **`district_service.py`**: District management
- `get_all_districts()` - Get district name → region ID mapping
- `get_district_names()` - Get list of district names
- `validate_districts()` - Validate district names
- **`task_service.py`**: Background task management
- `get_task_status()` - Get Celery task status
- `get_user_tasks()` - Get all tasks for a user
- `add_task_for_user()` - Associate task with user
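The shared-handler pattern above can be sketched as follows. This is an illustrative, self-contained example, not the project's actual code: the `Listing` stand-in, the in-memory rows, and the two entry-point wrappers are all hypothetical.

```python
# Sketch of the unified-services pattern: one handler, two entry points.
from dataclasses import dataclass


@dataclass
class Listing:
    id: int
    price: int


def get_listings(min_price: int = 0) -> list[Listing]:
    """Single source of truth used by both the CLI and the API."""
    rows = [Listing(1, 1800), Listing(2, 2500)]  # stand-in for a DB query
    return [listing for listing in rows if listing.price >= min_price]


def cli_list(min_price: int) -> None:
    """A Click command would delegate here and print results."""
    for listing in get_listings(min_price=min_price):
        print(listing.id, listing.price)


def api_list(min_price: int) -> list[dict]:
    """A FastAPI endpoint would delegate here and serialize results."""
    return [vars(listing) for listing in get_listings(min_price=min_price)]
```

Because filtering lives only in `get_listings()`, a behavior change (say, a new filter) is made once and both interfaces pick it up.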
#### Low-level services (internal implementation):
- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction
### Query Splitting System
Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.
#### How it works:
1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages (1,500 results max)
```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓
Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```
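The recursive refinement above can be sketched in a few lines. This is a hedged sketch, not the real `QuerySplitter`: `probe` stands in for `probe_query()`, and `threshold` / `min_band` mirror `RIGHTMOVE_SPLIT_THRESHOLD` / `RIGHTMOVE_MIN_PRICE_BAND`.

```python
# Sketch of the adaptive price-band split: binary-split a range until
# every band is under the result cap (or too narrow to split further).
from typing import Callable


def split_query(min_price: int, max_price: int,
                probe: Callable[[int, int], int],
                threshold: int = 1200,
                min_band: int = 100) -> list[tuple[int, int]]:
    total = probe(min_price, max_price)       # minimal API request
    band = max_price - min_price
    if total <= threshold or band <= min_band:
        return [(min_price, max_price)]       # leaf: fetch this subquery fully
    mid = min_price + band // 2               # binary split on price
    return (split_query(min_price, mid, probe, threshold, min_band)
            + split_query(mid, max_price, probe, threshold, min_band))
```

Running this against the example in the diagram (£1000–£5000 probing at 3,000 results) yields exactly the three final subqueries shown.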
#### Key components:
- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling
### Processing Pipeline
`ListingProcessor` runs sequential steps defined in `listing_processor.py`:
1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans
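The sequential-step structure can be sketched like this. Step names match the list above, but the step bodies and the `dict`-based listing are placeholders, not the actual implementation.

```python
# Illustrative sketch of a ListingProcessor-style sequential pipeline.
from abc import ABC, abstractmethod


class Step(ABC):
    @abstractmethod
    def run(self, listing: dict) -> dict: ...


class FetchListingDetailsStep(Step):
    def run(self, listing: dict) -> dict:
        listing["details"] = "fetched"          # placeholder for the API call
        return listing


class FetchImagesStep(Step):
    def run(self, listing: dict) -> dict:
        listing["images"] = ["floorplan.png"]   # placeholder download
        return listing


class DetectFloorplanStep(Step):
    def run(self, listing: dict) -> dict:
        listing["sqm"] = 72.5                   # placeholder OCR result
        return listing


def process(listing: dict, steps: list[Step]) -> dict:
    for step in steps:  # each step enriches the listing and passes it on
        listing = step.run(listing)
    return listing
```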
### Floorplan OCR
`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
### Repository Pattern
`repositories/listing_repository.py` handles database operations with SQLModel sessions.
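The repository pattern hides SQL behind intent-revealing methods. A minimal sketch, using stdlib `sqlite3` so it stays self-contained (the real `listing_repository.py` uses SQLModel sessions, and the table and method names here are hypothetical):

```python
# Minimal repository sketch: callers never see SQL, only domain methods.
import sqlite3


class ListingRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS rent_listing "
            "(id INTEGER PRIMARY KEY, price INTEGER)")

    def add(self, listing_id: int, price: int) -> None:
        self.conn.execute(
            "INSERT INTO rent_listing VALUES (?, ?)", (listing_id, price))

    def find_cheaper_than(self, max_price: int) -> list[tuple[int, int]]:
        cur = self.conn.execute(
            "SELECT id, price FROM rent_listing WHERE price < ? ORDER BY price",
            (max_price,))
        return cur.fetchall()
```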
## Environment Variables
- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing
### Scraper Configuration
These control the query splitting behavior (see `.env.sample` for defaults):
| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
## Project Structure
- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)
## Type Checking
The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.
## Exploration Preferences
- Always ignore `node_modules` directory when exploring the codebase
## Git Workflow
**IMPORTANT**: After completing work items, always create separate commits for each logical change:
- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)