# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.
## Development Environment
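**Tests, CLI operations, migrations, type checks, and linting run inside Docker containers.** Starting the environment, building images, and deploying run locally. Start the dev environment with Docker Compose, then exec into the `app` container: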
- **Start dev environment**: `docker compose up -d` (locally)
- **Building/pushing images**: `docker build` / `docker push` (locally)
- **Deploying to K8s**: `kubectl` (locally, context: `kubernetes-admin@kubernetes`)
- **Running tests**: `docker compose exec app pytest tests/ -v`
- **CLI operations**: `docker compose exec app python main.py ...`
- **Migrations**: `docker compose exec app alembic upgrade head`
- **Type check**: `docker compose exec app mypy .`
- **Linting**: `docker compose exec app ruff check .`
Always ensure containers are running (`docker compose up -d`) before executing commands.
See `.claude/skills/` for detailed skills on dev environment, building, and deploying.
## Commands
### Setup and Run (Docker - Recommended)
```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh
# Rebuild images and start
./start.sh --build
# Stop all containers
./start.sh --down
# View logs
./start.sh --logs
```
### Setup and Run (Local with Poetry)
```bash
# Install dependencies
poetry install && cp .env.sample .env
# Start backend locally (requires Redis running)
./start.sh --local
# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```
### CLI Operations
The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):
```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2
# Download floorplan images
python main.py dump-images
# Extract square meters from floorplans using OCR
python main.py detect-floorplan
# Calculate transit routes (uses Google Maps API quota)
python main.py routing --destination-address 'Address' -m transit -l 10
# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```
### Testing
```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing
# Run type checker
mypy .
```
### Database Migrations
```bash
alembic upgrade head # Apply migrations
alembic revision -m "description" # Create new migration
```
### Code Formatting
```bash
yapf --style .style.yapf --recursive .
```
## Architecture
### Core Data Flow
1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker
### Key Models
- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.
### Services Layer (Unified CLI and API)
**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.
#### High-level services (use these in CLI and API):
- **`listing_service.py`**: Listing operations
  - `get_listings()` - Retrieve listings from database
  - `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
  - `download_images()` - Download floorplan images
  - `detect_floorplans()` - Run OCR on floorplans
  - `calculate_routes()` - Calculate transit routes
- **`export_service.py`**: Export operations
  - `export_to_csv()` - Export listings to CSV file
  - `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)
- **`district_service.py`**: District management
  - `get_all_districts()` - Get district name → region ID mapping
  - `get_district_names()` - Get list of district names
  - `validate_districts()` - Validate district names
- **`task_service.py`**: Background task management
  - `get_task_status()` - Get Celery task status
  - `get_user_tasks()` - Get all tasks for a user
  - `add_task_for_user()` - Associate task with user
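
The "one handler, two entry points" idea can be sketched roughly as below. This is an illustration only, not the project's code: the `get_listings` signature, the `Listing` dataclass, the in-memory `_FAKE_DB`, and the `cli_list` / `api_list` entry points are all hypothetical stand-ins.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Listing:
    # Simplified stand-in for a database entity (assumption).
    listing_id: int
    price: int


# In-memory stand-in for the database (assumption).
_FAKE_DB = [Listing(1, 1800), Listing(2, 2500), Listing(3, 4200)]


def get_listings(min_price: int = 0, max_price: Optional[int] = None) -> List[Listing]:
    """Service-layer handler: the single place where the filtering logic lives."""
    return [
        item
        for item in _FAKE_DB
        if item.price >= min_price and (max_price is None or item.price <= max_price)
    ]


def cli_list(min_price: int, max_price: int) -> str:
    """Hypothetical CLI command: formats the service output as tab-separated text."""
    rows = get_listings(min_price, max_price)
    return "\n".join(f"{row.listing_id}\t{row.price}" for row in rows)


def api_list(min_price: int, max_price: int) -> List[Dict[str, int]]:
    """Hypothetical HTTP handler: returns the same service output as JSON-ready dicts."""
    return [
        {"id": row.listing_id, "price": row.price}
        for row in get_listings(min_price, max_price)
    ]
```

Both entry points only format the result; any change to the filtering rules happens once, in the service function.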
#### Low-level services (internal implementation):
- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction
### Query Splitting System
Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.
#### How it works:
1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages of 25 results (1,500 results max)
```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓
Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```
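
The recursive price split in the diagram above could look roughly like this. It is a minimal sketch, not the actual `QuerySplitter`: `PriceBand`, `split_by_price`, and `fake_probe` are hypothetical names, the probe is a lookup table instead of an API request, and the stop rule for the minimum band width is one possible reading of `RIGHTMOVE_MIN_PRICE_BAND`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List

SPLIT_THRESHOLD = 1200  # mirrors RIGHTMOVE_SPLIT_THRESHOLD
MIN_PRICE_BAND = 100    # mirrors RIGHTMOVE_MIN_PRICE_BAND


@dataclass(frozen=True)
class PriceBand:
    # Hypothetical stand-in for SubQuery; only the price range is modelled.
    min_price: int
    max_price: int


def split_by_price(band: PriceBand, probe: Callable[[PriceBand], int]) -> List[PriceBand]:
    """Binary-split a price band until every band is at or under the threshold."""
    count = probe(band)
    width = band.max_price - band.min_price
    # Assumption: stop splitting once the halves would be narrower than the minimum band.
    if count <= SPLIT_THRESHOLD or width // 2 < MIN_PRICE_BAND:
        return [band]
    mid = band.min_price + width // 2
    # Halves share the boundary price, as in the diagram; handling of that
    # boundary (to avoid duplicates) is left out of this sketch.
    left = split_by_price(PriceBand(band.min_price, mid), probe)
    right = split_by_price(PriceBand(mid, band.max_price), probe)
    return left + right


def fake_probe(band: PriceBand) -> int:
    """Stand-in for a probe request, using the counts from the diagram."""
    counts = {
        (1000, 5000): 3000,
        (1000, 3000): 1800,
        (3000, 5000): 1200,
        (1000, 2000): 900,
        (2000, 3000): 900,
    }
    return counts[(band.min_price, band.max_price)]


bands = split_by_price(PriceBand(1000, 5000), fake_probe)
```

With the diagram's counts, `bands` holds the three ranges £1000-£2000, £2000-£3000, and £3000-£5000.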
#### Key components:
- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling
### Processing Pipeline
`ListingProcessor` runs sequential steps defined in `listing_processor.py`:
1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans
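
As a rough picture of how sequential steps compose, consider the sketch below. The class names, the `run` method, and the dict-based listing are assumptions made for illustration; the real step classes in `listing_processor.py` may have a different interface, and the values returned here are placeholders.

```python
from __future__ import annotations

from typing import Any, Dict, List, Protocol

ListingData = Dict[str, Any]


class Step(Protocol):
    """Hypothetical step interface: take listing data, return enriched data."""

    def run(self, listing: ListingData) -> ListingData: ...


class FetchDetailsSketch:
    def run(self, listing: ListingData) -> ListingData:
        # Placeholder for an API call.
        return {**listing, "details": f"details for {listing['id']}"}


class FetchImagesSketch:
    def run(self, listing: ListingData) -> ListingData:
        # Placeholder for an image download.
        return {**listing, "images": ["floorplan.png"]}


class DetectFloorplanSketch:
    def run(self, listing: ListingData) -> ListingData:
        # Placeholder for OCR; 72.5 is a made-up value.
        area = 72.5 if listing.get("images") else None
        return {**listing, "square_meters": area}


class ProcessorSketch:
    """Runs each step in order, feeding the output of one into the next."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def process(self, listing: ListingData) -> ListingData:
        for step in self.steps:
            listing = step.run(listing)
        return listing


processor = ProcessorSketch(
    [FetchDetailsSketch(), FetchImagesSketch(), DetectFloorplanSketch()]
)
result = processor.process({"id": 42})
```

Order matters in this design: the OCR step depends on the images that the previous step attached.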
### Floorplan OCR
`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
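
Once OCR has produced text, the square meter value still has to be picked out of it. The sketch below covers only that parsing step, under stated assumptions: the function name `extract_square_meters`, the regular expression, and the "take the largest value" heuristic are illustrative and not taken from `rec/floorplan.py`. The pytesseract call and the thresholding are not shown.

```python
import re
from typing import Optional

# Matches a number followed by a square meter unit such as
# "sq m", "sq. m", "sqm", "square metres", "m2", or "m²".
_SQM_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:(?:sq|square)\.?\s*m(?:et(?:re|er)s?)?\b|m2\b|m²)",
    re.IGNORECASE,
)


def extract_square_meters(ocr_text: str) -> Optional[float]:
    """Return the largest square meter value found in OCR text, or None.

    Assumption: a floorplan may list per-room areas alongside the total,
    so the largest value is treated as the total area.
    """
    values = [float(match) for match in _SQM_PATTERN.findall(ocr_text)]
    return max(values) if values else None
```

For a line such as `GROSS INTERNAL AREA 775 sq ft / 72.0 sq m`, the square feet figure is skipped because its unit does not match, and the function returns `72.0`.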
### Repository Pattern
`repositories/listing_repository.py` handles database operations with SQLModel sessions.
## Environment Variables
- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing
### Scraper Configuration
These control the query splitting behavior (see `.env.sample` for defaults):
| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
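
One way to load these variables with the defaults from the table is sketched below. The `ScraperConfig` class and its field names are hypothetical and may differ from what `config/scraper_config.py` defines; the 25 results per page figure comes from the `RIGHTMOVE_MAX_PAGES` row.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional

RESULTS_PER_PAGE = 25  # from the "60 × 25 = 1500" note in the table


@dataclass(frozen=True)
class ScraperConfig:
    # Hypothetical config object; defaults mirror the table above.
    max_concurrent: int = 5
    request_delay_ms: int = 100
    split_threshold: int = 1200
    min_price_band: int = 100
    max_pages: int = 60
    proxy_url: Optional[str] = None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ScraperConfig":
        """Build a config from environment variables, falling back to defaults."""
        source = os.environ if env is None else env

        def as_int(name: str, default: int) -> int:
            return int(source.get(name, default))

        return cls(
            max_concurrent=as_int("RIGHTMOVE_MAX_CONCURRENT", 5),
            request_delay_ms=as_int("RIGHTMOVE_REQUEST_DELAY_MS", 100),
            split_threshold=as_int("RIGHTMOVE_SPLIT_THRESHOLD", 1200),
            min_price_band=as_int("RIGHTMOVE_MIN_PRICE_BAND", 100),
            max_pages=as_int("RIGHTMOVE_MAX_PAGES", 60),
            # An unset or empty proxy URL means "no proxy".
            proxy_url=source.get("RIGHTMOVE_PROXY_URL") or None,
        )

    @property
    def max_results_per_subquery(self) -> int:
        return self.max_pages * RESULTS_PER_PAGE


# Passing a mapping instead of reading os.environ keeps the example deterministic.
config = ScraperConfig.from_env({"RIGHTMOVE_SPLIT_THRESHOLD": "1000"})
```

Here `config.split_threshold` is overridden to 1000 while every other field keeps its default, so `config.max_results_per_subquery` is 1500.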
## Project Structure
- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)
## Type Checking
The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.
## Exploration Preferences
- Always ignore `node_modules` directory when exploring the codebase
## Git Workflow
**IMPORTANT**: After completing work items, always create separate commits for each logical change:
- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)
- **After each meaningful change, ask the user if they want to commit and push the changes**