# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.

## Development Environment

**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:

- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`

Never run commands directly on the local machine; always route them through the remote executor.

## Commands

### Setup and Run (Docker - Recommended)

```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh

# Rebuild images and start
./start.sh --build

# Stop all containers
./start.sh --down

# View logs
./start.sh --logs
```

### Setup and Run (Local with Poetry)

```bash
# Install dependencies and create .env from the sample
poetry install && cp .env.sample .env

# Start backend locally (requires Redis running)
./start.sh --local

# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```

### CLI Operations

The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):

```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2

# Download floorplan images
python main.py dump-images

# Extract square meters from floorplans using OCR
python main.py detect-floorplan

# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10

# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```

### Testing

```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing

# Run type checker
mypy .
```

### Database Migrations

```bash
alembic upgrade head                    # Apply migrations
alembic revision -m "description"       # Create new migration
```

### Code Formatting

```bash
yapf --style .style.yapf --recursive .
```

## Architecture

### Core Data Flow

1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWTs issued by an external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing, with Redis as the broker

### Key Models

- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.

### Services Layer (Unified CLI and API)

**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and HTTP API use. This ensures consistency and code reuse.

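
The idea of one handler shared by both entry points can be sketched as follows. This is a minimal illustration only: `Listing`, `find_listings`, `cli_adapter`, `api_adapter`, and the in-memory `_FAKE_DB` are hypothetical names invented for the example and do not reflect the actual signatures in `services/`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a database row."""
    listing_id: int
    price: int


# Hypothetical in-memory data in place of a real database.
_FAKE_DB = [Listing(1, 1800), Listing(2, 2500), Listing(3, 3900)]


def find_listings(max_price: Optional[int] = None) -> List[Listing]:
    """Shared handler: the only place where filtering logic lives."""
    return [row for row in _FAKE_DB if max_price is None or row.price <= max_price]


def cli_adapter(max_price: Optional[int]) -> str:
    """CLI side: call the handler, format the result as tab-separated text."""
    rows = find_listings(max_price)
    return "\n".join(f"{row.listing_id}\t{row.price}" for row in rows)


def api_adapter(params: Dict[str, str]) -> Dict[str, List[Dict[str, int]]]:
    """API side: parse query parameters, call the same handler, return JSON-ready data."""
    raw = params.get("max_price")
    rows = find_listings(int(raw) if raw is not None else None)
    return {"listings": [{"id": row.listing_id, "price": row.price} for row in rows]}


print(cli_adapter(2500))
print(api_adapter({"max_price": "2500"}))
```

Both adapters stay thin: they only translate input and output, so a change to the filtering rule happens in one function.
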
#### High-level services (use these in CLI and API):

- **`listing_service.py`**: Listing operations
  - `get_listings()` - Retrieve listings from database
  - `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
  - `download_images()` - Download floorplan images
  - `detect_floorplans()` - Run OCR on floorplans
  - `calculate_routes()` - Calculate transit routes

- **`export_service.py`**: Export operations
  - `export_to_csv()` - Export listings to CSV file
  - `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)

- **`district_service.py`**: District management
  - `get_all_districts()` - Get district name → region ID mapping
  - `get_district_names()` - Get list of district names
  - `validate_districts()` - Validate district names

- **`task_service.py`**: Background task management
  - `get_task_status()` - Get Celery task status
  - `get_user_tasks()` - Get all tasks for a user
  - `add_task_for_user()` - Associate task with user

#### Low-level services (internal implementation):

- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction

### Query Splitting System

Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.

#### How it works:

1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If results exceed threshold (1,200), the price range is binary-split
4. **Recursive Refinement**: Splitting continues until all subqueries are under threshold
5. **Full Fetch**: Each subquery fetches up to 60 pages (1,500 results max)

```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
                    ↓ split by price
    £1000-£3000: 1,800 (still over!)  |  £3000-£5000: 1,200 ✓
                    ↓ split again
    £1000-£2000: 900 ✓  |  £2000-£3000: 900 ✓

Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```

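
The adaptive and recursive steps above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the `QuerySplitter` implementation: `PriceBand`, `split_until_under_threshold`, and `fake_probe` are hypothetical names, the probe is faked with the counts from the diagram, and how the real code treats band boundaries is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List

SPLIT_THRESHOLD = 1200  # mirrors the RIGHTMOVE_SPLIT_THRESHOLD default
MIN_PRICE_BAND = 100    # mirrors the RIGHTMOVE_MIN_PRICE_BAND default


@dataclass(frozen=True)
class PriceBand:
    """Hypothetical stand-in for a subquery restricted to one price range."""
    low: int
    high: int


def split_until_under_threshold(
    band: PriceBand,
    probe: Callable[[PriceBand], int],
) -> List[PriceBand]:
    """Binary-split `band` until each piece reports at most SPLIT_THRESHOLD results."""
    count = probe(band)
    width = band.high - band.low
    # Stop when the band is small enough to fetch, or too narrow to split further.
    if count <= SPLIT_THRESHOLD or width <= MIN_PRICE_BAND:
        return [band]
    mid = band.low + width // 2
    return (
        split_until_under_threshold(PriceBand(band.low, mid), probe)
        + split_until_under_threshold(PriceBand(mid, band.high), probe)
    )


# Fake probe that reproduces the result counts from the diagram above.
FAKE_COUNTS = {
    (1000, 5000): 3000,
    (1000, 3000): 1800,
    (3000, 5000): 1200,
    (1000, 2000): 900,
    (2000, 3000): 900,
}


def fake_probe(band: PriceBand) -> int:
    return FAKE_COUNTS[(band.low, band.high)]


bands = split_until_under_threshold(PriceBand(1000, 5000), fake_probe)
print([(b.low, b.high) for b in bands])
```

With the fake counts, the function ends with the same three price bands as the diagram, whose counts sum to the original 3,000 results.
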
#### Key components:

- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling

### Processing Pipeline

`ListingProcessor` runs sequential steps defined in `listing_processor.py`:

1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans

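
A sequential step pipeline of this shape can be sketched as below. The names `ListingContext`, `Step`, `run_pipeline`, and the three `Stub...` classes are hypothetical and only illustrate the pattern; the real step classes in `listing_processor.py` may have different interfaces, and the stub values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class ListingContext:
    """Hypothetical state handed from one step to the next."""
    listing_id: int
    data: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


class Step(Protocol):
    """Anything with a name and a run() method counts as a step."""
    name: str

    def run(self, ctx: ListingContext) -> None: ...


class StubFetchDetails:
    name = "fetch_details"

    def run(self, ctx: ListingContext) -> None:
        ctx.data["details"] = {"id": ctx.listing_id}


class StubFetchImages:
    name = "fetch_images"

    def run(self, ctx: ListingContext) -> None:
        # A real step would download files; here we only record a placeholder path.
        ctx.data["floorplans"] = [f"{ctx.listing_id}/floorplan.png"]


class StubDetectFloorplan:
    name = "detect_floorplan"

    def run(self, ctx: ListingContext) -> None:
        # Relies on the previous step having stored floorplan paths.
        ctx.data["square_meters"] = 72.5 if ctx.data.get("floorplans") else None


def run_pipeline(steps: List[Step], ctx: ListingContext) -> ListingContext:
    """Run the steps strictly in order, recording each one that finishes."""
    for step in steps:
        step.run(ctx)
        ctx.completed.append(step.name)
    return ctx


ctx = run_pipeline(
    [StubFetchDetails(), StubFetchImages(), StubDetectFloorplan()],
    ListingContext(listing_id=123),
)
print(ctx.completed)
```

The order matters because each step reads what an earlier step wrote, which is why the steps run sequentially.
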
### Floorplan OCR

`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.

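
Once OCR has produced text (for example from pytesseract's `image_to_string`), the square meter value still has to be parsed out of it. The sketch below shows one way to do that with a regular expression. It is an assumption for illustration, not the logic in `rec/floorplan.py`: the function name `extract_square_meters`, the pattern, and the choice to return the largest match are all hypothetical, and numbers with thousands separators are not handled.

```python
import re
from typing import Optional

# Matches values such as "72.5 sq m", "72.5 sq. m", "72.5 sq metres",
# "72.5sqm", "72.5 m2" and "72.5 m²". Square feet are deliberately not matched.
SQM_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m(?:et(?:re|er)s?)?|sqm|m2|m²)\b",
    re.IGNORECASE,
)


def extract_square_meters(ocr_text: str) -> Optional[float]:
    """Return the largest square meter value found in OCR text, or None.

    Taking the largest value is an assumption: floorplans often print
    per-floor areas next to the total, and the total is the biggest.
    """
    values = [float(match) for match in SQM_PATTERN.findall(ocr_text)]
    return max(values) if values else None


print(extract_square_meters("GROSS INTERNAL AREA 780 sq ft / 72.5 sq m"))
print(extract_square_meters("Total 64 m²"))
print(extract_square_meters("no area here"))
```
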
### Repository Pattern

`repositories/listing_repository.py` handles database operations with SQLModel sessions.

## Environment Variables

- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing

### Scraper Configuration

These control the query splitting behavior (see `.env.sample` for defaults):

| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 pages × 25 results = 1,500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |

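
Loading these variables with the defaults from the table can be sketched as follows. The `ScraperSettings` dataclass and `load_settings` function are hypothetical names for illustration; the actual loader in `config/scraper_config.py` may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperSettings:
    """Hypothetical settings object; defaults copied from the table above."""
    max_concurrent: int = 5
    request_delay_ms: int = 100
    split_threshold: int = 1200
    min_price_band: int = 100
    max_pages: int = 60
    proxy_url: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Build settings from an environment mapping, falling back to defaults."""

    def as_int(name: str, default: int) -> int:
        return int(env.get(name, default))

    return ScraperSettings(
        max_concurrent=as_int("RIGHTMOVE_MAX_CONCURRENT", 5),
        request_delay_ms=as_int("RIGHTMOVE_REQUEST_DELAY_MS", 100),
        split_threshold=as_int("RIGHTMOVE_SPLIT_THRESHOLD", 1200),
        min_price_band=as_int("RIGHTMOVE_MIN_PRICE_BAND", 100),
        max_pages=as_int("RIGHTMOVE_MAX_PAGES", 60),
        # Treat an unset or empty variable as "no proxy".
        proxy_url=env.get("RIGHTMOVE_PROXY_URL") or None,
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"RIGHTMOVE_MAX_PAGES": "10"})
print(settings)
```

Accepting the environment as a parameter makes the loader easy to test without touching the real process environment.
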
## Project Structure

- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer: high-level handlers (listing_service, export_service, district_service, task_service) and low-level modules (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)

## Type Checking

The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.

## Exploration Preferences

- Always ignore `node_modules` directory when exploring the codebase

## Git Workflow

**IMPORTANT**: After completing work items, always create separate commits for each logical change:

- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)