# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.
## Development Environment
**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:
- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`
Never run commands directly on the local machine - always route them through the remote executor.
## Commands
### Setup and Run (Docker - Recommended)
```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh
# Rebuild images and start
./start.sh --build
# Stop all containers
./start.sh --down
# View logs
./start.sh --logs
```
### Setup and Run (Local with Poetry)
```bash
# Install dependencies
poetry install && cp .env.sample .env
# Start backend locally (requires Redis running)
./start.sh --local
# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```
### CLI Operations
The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):
```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2
# Download floorplan images
python main.py dump-images
# Extract square meters from floorplans using OCR
python main.py detect-floorplan
# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10
# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```
### Testing
```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing
# Run type checker
mypy .
```
### Database Migrations
```bash
alembic upgrade head # Apply migrations
alembic revision -m "description" # Create new migration
```
### Code Formatting
```bash
yapf --style .style.yapf --recursive .
```
## Architecture
### Core Data Flow
1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker
### Key Models
- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.
### Services Layer (Unified CLI and API)
**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.
#### High-level services (use these in CLI and API):
- **`listing_service.py`**: Listing operations
- `get_listings()` - Retrieve listings from database
- `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
- `download_images()` - Download floorplan images
- `detect_floorplans()` - Run OCR on floorplans
- `calculate_routes()` - Calculate transit routes
- **`export_service.py`**: Export operations
- `export_to_csv()` - Export listings to CSV file
- `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)
- **`district_service.py`**: District management
- `get_all_districts()` - Get district name → region ID mapping
- `get_district_names()` - Get list of district names
- `validate_districts()` - Validate district names
- **`task_service.py`**: Background task management
- `get_task_status()` - Get Celery task status
- `get_user_tasks()` - Get all tasks for a user
- `add_task_for_user()` - Associate task with user
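The unified-handler pattern above can be sketched as follows. This is illustrative only (the fake data, wrapper names, and signatures are assumptions, not the actual code in `services/listing_service.py`): one service function holds the logic, and both the CLI command and the API endpoint stay as thin wrappers around it.

```python
from typing import List

def get_listings(listing_type: str, max_price: int) -> List[dict]:
    """Single source of truth for listing retrieval (sketch with fake data)."""
    fake_db = [{"type": "rent", "price": 2500}, {"type": "rent", "price": 5000}]
    return [item for item in fake_db
            if item["type"] == listing_type and item["price"] <= max_price]

def cli_handler(listing_type: str, max_price: int) -> None:
    # What a Click command body would do: call the service, print the result.
    for listing in get_listings(listing_type, max_price):
        print(listing)

def api_handler(listing_type: str, max_price: int) -> dict:
    # What a FastAPI endpoint body would do: call the same service, return JSON.
    return {"listings": get_listings(listing_type, max_price)}
```

Because both layers delegate to the same function, filtering behavior cannot drift between CLI and API.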
#### Low-level services (internal implementation):
- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction
### Query Splitting System
Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.
#### How it works:
1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages (1,500 results max)
```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓
Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```
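The adaptive split above can be sketched as a short recursion. This is a minimal illustration, not the actual `QuerySplitter` implementation: the `SubQuery` shape, `probe` callback, and constants are assumptions mirroring the env vars documented below, and the fake probe reproduces the result counts from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, List

SPLIT_THRESHOLD = 1200   # cf. RIGHTMOVE_SPLIT_THRESHOLD
MIN_PRICE_BAND = 100     # cf. RIGHTMOVE_MIN_PRICE_BAND

@dataclass
class SubQuery:
    min_price: int
    max_price: int

def split_until_under_cap(q: SubQuery,
                          probe: Callable[[SubQuery], int]) -> List[SubQuery]:
    """Recursively binary-split a price range until each band's probed
    result count is under the threshold (or the band is too narrow)."""
    total = probe(q)
    band = q.max_price - q.min_price
    if total <= SPLIT_THRESHOLD or band <= MIN_PRICE_BAND:
        return [q]
    mid = q.min_price + band // 2
    return (split_until_under_cap(SubQuery(q.min_price, mid), probe)
            + split_until_under_cap(SubQuery(mid, q.max_price), probe))

# Fake probe reproducing the example diagram (a real probe hits the API):
def fake_probe(q: SubQuery) -> int:
    counts = {(1000, 5000): 3000, (1000, 3000): 1800, (3000, 5000): 1200,
              (1000, 2000): 900, (2000, 3000): 900}
    return counts[(q.min_price, q.max_price)]

subqueries = split_until_under_cap(SubQuery(1000, 5000), fake_probe)
# → price bands (1000-2000), (2000-3000), (3000-5000), each under the cap
```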
#### Key components:
- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling
### Processing Pipeline
`ListingProcessor` runs sequential steps defined in `listing_processor.py`:
1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans
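The step sequence can be pictured with a minimal sketch like the following. It is an assumption about the shape of the pipeline, not the real classes in `listing_processor.py`: each step transforms a listing record and hands it to the next.

```python
from abc import ABC, abstractmethod
from typing import List

class ProcessingStep(ABC):
    """One stage of the listing pipeline (illustrative base class)."""
    @abstractmethod
    def run(self, listing: dict) -> dict: ...

class FetchListingDetailsStep(ProcessingStep):
    def run(self, listing: dict) -> dict:
        listing["details"] = {"fetched": True}  # stand-in for the real API call
        return listing

class ListingProcessor:
    """Runs steps in order, threading the listing through each one."""
    def __init__(self, steps: List[ProcessingStep]):
        self.steps = steps

    def process(self, listing: dict) -> dict:
        for step in self.steps:
            listing = step.run(listing)
        return listing

processor = ListingProcessor([FetchListingDetailsStep()])
result = processor.process({"id": 123})
```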
### Floorplan OCR
`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
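After pytesseract returns raw text, the square-meter value still has to be parsed out of it. The actual extraction logic lives in `rec/floorplan.py`; the pattern below is a hypothetical sketch of that post-OCR parsing step, handling common floorplan spellings like "74.5 sq m", "74.5 m2", or "74.5 m²".

```python
import re
from typing import Optional

# Hypothetical post-OCR parsing: match a number followed by a
# square-meter unit in the forms typically printed on floorplans.
SQM_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m|sqm|m2|m²)", re.IGNORECASE)

def extract_sqm(ocr_text: str) -> Optional[float]:
    """Return the first square-meter figure found in OCR output, if any."""
    match = SQM_PATTERN.search(ocr_text)
    return float(match.group(1)) if match else None

print(extract_sqm("TOTAL FLOOR AREA: 74.5 sq m (802 sq ft)"))  # → 74.5
```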
### Repository Pattern
`repositories/listing_repository.py` handles database operations with SQLModel sessions.
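The repository's role is to keep SQL access behind a narrow interface. As a self-contained sketch of that shape (using stdlib `sqlite3` here purely for illustration; the real repository uses SQLModel sessions and a different schema):

```python
import sqlite3
from typing import Optional, Tuple

class ListingRepository:
    """Illustrative repository: callers never touch SQL directly."""
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS listing (id INTEGER PRIMARY KEY, price INTEGER)")

    def add(self, listing_id: int, price: int) -> None:
        self.conn.execute("INSERT INTO listing VALUES (?, ?)", (listing_id, price))

    def get(self, listing_id: int) -> Optional[Tuple[int, int]]:
        return self.conn.execute(
            "SELECT id, price FROM listing WHERE id = ?", (listing_id,)).fetchone()

repo = ListingRepository(sqlite3.connect(":memory:"))
repo.add(1, 2500)
```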
## Environment Variables
- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing
### Scraper Configuration
These control the query splitting behavior (see `.env.sample` for defaults):
| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
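The env-var loading in `config/scraper_config.py` can be sketched roughly as follows (field names and the `from_env` constructor are assumptions for illustration; only the variable names and defaults come from the table above):

```python
import os
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    """Sketch of env-var-backed scraper settings."""
    max_concurrent: int
    request_delay_ms: int
    split_threshold: int
    min_price_band: int

    @classmethod
    def from_env(cls) -> "ScraperConfig":
        # Fall back to the documented defaults when a variable is unset.
        return cls(
            max_concurrent=int(os.environ.get("RIGHTMOVE_MAX_CONCURRENT", "5")),
            request_delay_ms=int(os.environ.get("RIGHTMOVE_REQUEST_DELAY_MS", "100")),
            split_threshold=int(os.environ.get("RIGHTMOVE_SPLIT_THRESHOLD", "1200")),
            min_price_band=int(os.environ.get("RIGHTMOVE_MIN_PRICE_BAND", "100")),
        )

config = ScraperConfig.from_env()
```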
## Project Structure
- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)
## Type Checking
The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.
## Exploration Preferences
- Always ignore `node_modules` directory when exploring the codebase
## Git Workflow
**IMPORTANT**: After completing work items, always create separate commits for each logical change:
- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)