# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A real estate listing crawler and aggregator that scrapes property listings from Rightmove UK, extracts square meter data from floorplan images using OCR, calculates transit routes, and provides a web UI for browsing listings.

## Development Environment

**IMPORTANT**: This project runs on a remote host, not locally. Always use the remote executor to run commands:

- **All shell commands** (Python, pytest, poetry, alembic, etc.) must be executed via the remote executor
- **Starting the project**: Use the remote executor to run `./start.sh`
- **Running tests**: Use the remote executor to run `pytest`
- **Any CLI operations**: Use the remote executor to run `python main.py ...`

Never run commands directly on the local machine - always route them through the remote executor.

## Commands

### Setup and Run (Docker - Recommended)

```bash
# Start all services (Redis, MySQL, API, Celery) with Docker
./start.sh

# Rebuild images and start
./start.sh --build

# Stop all containers
./start.sh --down

# View logs
./start.sh --logs
```

### Setup and Run (Local with Poetry)

```bash
# Install dependencies
poetry install && cp .env.sample .env

# Start backend locally (requires Redis running)
./start.sh --local

# Start frontend (from frontend/ directory)
cd frontend && ./start.sh
```

### CLI Operations

The main CLI (`main.py`) uses Click with a `--data-dir` option (default: `data/rs/`):

```bash
# Dump listings from Rightmove API
python main.py dump-listings --type rent --min-price 2000 --max-price 4000 --min-bedrooms 2

# Download floorplan images
python main.py dump-images

# Extract square meters from floorplans using OCR
python main.py detect-floorplan

# Calculate transit routes (consumes Google Maps API calls)
python main.py routing --destination-address 'Address' -m transit -l 10

# Export to GeoJSON for visualization
python main.py export-immoweb -O output.js --type rent [filter options]
```

### Testing

```bash
# Run tests with coverage
pytest tests/ -v --cov=. --cov-report=term-missing

# Run type checker
mypy .
```

### Database Migrations

```bash
alembic upgrade head              # Apply migrations
alembic revision -m "description" # Create new migration
```

### Code Formatting

```bash
yapf --style .style.yapf --recursive .
```

## Architecture

### Core Data Flow

1. **Scraping** (`rec/query.py`): Fetches listing IDs and details from Rightmove's Android API
2. **Processing** (`listing_processor.py`): Pipeline with steps for fetching details, downloading images, and OCR detection
3. **Storage**: SQLModel/SQLAlchemy with MySQL or SQLite, plus JSON files in `data/rs/<listing_id>/`
4. **API** (`api/app.py`): FastAPI endpoints authenticated via JWT from external Authentik service
5. **Background Tasks** (`tasks/listing_tasks.py`): Celery tasks for async listing processing with Redis broker

### Key Models

- `models/listing.py`: SQLModel entities (`RentListing`, `BuyListing`) with `QueryParameters` for filtering
- `data_access.py`: **DEPRECATED** - Legacy `Listing` dataclass for filesystem-based data access. Use `models.listing.RentListing` or `models.listing.BuyListing` instead.

### Services Layer (Unified CLI and API)

**IMPORTANT**: The `services/` directory contains unified handler functions that both the CLI and the HTTP API call. Put shared logic here rather than in a CLI command or an API route, so that both entry points behave identically and the code is reused.
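As a rough illustration of that layering, the sketch below keeps the filtering logic in one service function and has two thin entry points call it. Everything here is hypothetical: the `Listing` dataclass, the in-memory data, the `get_listings()` signature, and the `cli_list()` / `api_list()` wrappers stand in for a Click command and a FastAPI route and are not taken from the repository.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical, simplified listing record (not the project's model)."""
    id: int
    type: str
    price: int


# Hypothetical in-memory stand-in for the database.
_FAKE_DB: List[Listing] = [
    Listing(1, "rent", 2500),
    Listing(2, "rent", 4500),
    Listing(3, "buy", 450000),
]


def get_listings(listing_type: str, max_price: Optional[int] = None) -> List[Listing]:
    """Service function: the only place where the filtering logic lives."""
    rows = [row for row in _FAKE_DB if row.type == listing_type]
    if max_price is not None:
        rows = [row for row in rows if row.price <= max_price]
    return rows


def cli_list(listing_type: str, max_price: Optional[int] = None) -> str:
    """What a CLI command body might do: call the service, format as text."""
    rows = get_listings(listing_type, max_price)
    return "\n".join(f"{row.id}\t{row.price}" for row in rows)


def api_list(listing_type: str, max_price: Optional[int] = None) -> Dict[str, object]:
    """What an HTTP handler might do: call the same service, return JSON-able data."""
    rows = get_listings(listing_type, max_price)
    return {"count": len(rows), "listings": [asdict(row) for row in rows]}
```

Because both wrappers only format the service's return value, a change to the filtering rules needs to be made in one place.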

#### High-level services (use these in CLI and API):

- **`listing_service.py`**: Listing operations
  - `get_listings()` - Retrieve listings from database
  - `refresh_listings()` - Fetch new listings from Rightmove (sync or async)
  - `download_images()` - Download floorplan images
  - `detect_floorplans()` - Run OCR on floorplans
  - `calculate_routes()` - Calculate transit routes

- **`export_service.py`**: Export operations
  - `export_to_csv()` - Export listings to CSV file
  - `export_to_geojson()` - Export listings to GeoJSON (file or in-memory)

- **`district_service.py`**: District management
  - `get_all_districts()` - Get district name → region ID mapping
  - `get_district_names()` - Get list of district names
  - `validate_districts()` - Validate district names

- **`task_service.py`**: Background task management
  - `get_task_status()` - Get Celery task status
  - `get_user_tasks()` - Get all tasks for a user
  - `add_task_for_user()` - Associate task with user

#### Low-level services (internal implementation):

- `listing_fetcher.py`: Fetches listing data from Rightmove API
- `image_fetcher.py`: Downloads floorplan images
- `floorplan_detector.py`: OCR-based square meter detection
- `route_calculator.py`: Calculates transit routes using Google Maps API
- `query_splitter.py`: Intelligent query splitting to maximize data extraction

### Query Splitting System

Rightmove's API caps search results at ~1,500 listings per query. The query splitting system works around this limitation to fetch **all matching listings**.

#### How it works:

1. **Initial Split**: Queries are split by district and bedroom count
2. **Probe**: Each subquery is probed (minimal API request) to get `totalAvailableResults`
3. **Adaptive Split**: If a subquery's result count exceeds the threshold (default 1,200), its price range is split in two
4. **Recursive Refinement**: Splitting continues until every subquery is at or under the threshold, or its price band reaches the minimum width (`RIGHTMOVE_MIN_PRICE_BAND`)
5. **Full Fetch**: Each subquery fetches up to 60 pages (60 × 25 = 1,500 results max)

```
Original: 2BR, £1000-£5000 → 3,000 results (over cap!)
    ↓ split by price
£1000-£3000: 1,800 (still over!) | £3000-£5000: 1,200 ✓
    ↓ split again
£1000-£2000: 900 ✓ | £2000-£3000: 900 ✓

Final: 3 subqueries → 900 + 900 + 1,200 = 3,000 total results ✓
```
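
The adaptive split and recursive refinement steps can be sketched as a small recursive function. This is a minimal illustration under stated assumptions, not the code in `services/query_splitter.py`: `PriceBand`, `split_by_price()`, and `fake_probe()` are hypothetical names, the probe is a lookup table that reproduces the example above instead of an API request, and the stop rule (never produce a band narrower than `min_band`) is one possible reading of `RIGHTMOVE_MIN_PRICE_BAND`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PriceBand:
    """Hypothetical stand-in for one price-bounded subquery."""
    min_price: int
    max_price: int


def split_by_price(
    band: PriceBand,
    probe: Callable[[PriceBand], int],
    threshold: int = 1200,
    min_band: int = 100,
) -> List[PriceBand]:
    """Halve a price band recursively until each piece is at or under threshold."""
    count = probe(band)
    width = band.max_price - band.min_price
    # Stop when the band is small enough to fetch in full, or when halving
    # it would produce bands narrower than min_band (assumed stop rule).
    if count <= threshold or width < 2 * min_band:
        return [band]
    mid = band.min_price + width // 2
    # Simplification: both halves include the midpoint price; a real
    # implementation would need to avoid fetching that price twice.
    lower = PriceBand(band.min_price, mid)
    upper = PriceBand(mid, band.max_price)
    return (
        split_by_price(lower, probe, threshold, min_band)
        + split_by_price(upper, probe, threshold, min_band)
    )


# Result counts copied from the worked example above.
EXAMPLE_COUNTS: Dict[Tuple[int, int], int] = {
    (1000, 5000): 3000,
    (1000, 3000): 1800,
    (3000, 5000): 1200,
    (1000, 2000): 900,
    (2000, 3000): 900,
}


def fake_probe(band: PriceBand) -> int:
    """Stand-in for a probe request that reads totalAvailableResults."""
    return EXAMPLE_COUNTS[(band.min_price, band.max_price)]


bands = split_by_price(PriceBand(1000, 5000), fake_probe)
for band in bands:
    print(band, fake_probe(band))
```

With the example counts, the function returns the same three bands as the diagram: £1000-£2000, £2000-£3000, and £3000-£5000.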

#### Key components:

- `config/scraper_config.py`: Configuration with env var loading
- `services/query_splitter.py`: `QuerySplitter` class with `SubQuery` dataclass
- `rec/query.py`: `probe_query()` for result count probing, `create_session()` for connection pooling

### Processing Pipeline

`ListingProcessor` runs sequential steps defined in `listing_processor.py`:

1. `FetchListingDetailsStep` - Get property details from API
2. `FetchImagesStep` - Download floorplan images
3. `DetectFloorplanStep` - OCR to extract square meters from floorplans
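
The sequential-step pattern can be sketched as follows. This is a hedged, self-contained illustration and not the contents of `listing_processor.py`: `ListingContext`, the `Step` protocol, the three stand-in step classes, and `process()` are all hypothetical, and the step bodies are placeholders for the real API, download, and OCR work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class ListingContext:
    """Hypothetical per-listing state handed from one step to the next."""
    listing_id: int
    data: Dict[str, object] = field(default_factory=dict)


class Step(Protocol):
    def run(self, ctx: ListingContext) -> ListingContext: ...


class FetchDetails:
    def run(self, ctx: ListingContext) -> ListingContext:
        ctx.data["details"] = {"id": ctx.listing_id}  # placeholder for an API call
        return ctx


class FetchImages:
    def run(self, ctx: ListingContext) -> ListingContext:
        if "details" not in ctx.data:
            raise RuntimeError("details must be fetched before images")
        ctx.data["images"] = [f"{ctx.listing_id}_floorplan.png"]  # placeholder download
        return ctx


class DetectFloorplan:
    def run(self, ctx: ListingContext) -> ListingContext:
        if "images" not in ctx.data:
            raise RuntimeError("images must be downloaded before OCR")
        ctx.data["square_meters"] = None  # placeholder for an OCR result
        return ctx


def process(ctx: ListingContext, steps: List[Step]) -> ListingContext:
    """Run each step in order; every step sees the output of the previous one."""
    for step in steps:
        ctx = step.run(ctx)
    return ctx


result = process(ListingContext(listing_id=42), [FetchDetails(), FetchImages(), DetectFloorplan()])
print(list(result.data))
```

Keeping each stage behind the same `run()` interface means a step can be reordered, skipped, or tested in isolation without touching the others.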

### Floorplan OCR

`rec/floorplan.py` uses pytesseract with image preprocessing (adaptive thresholding) to extract square meter values from floorplan images.
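
Once OCR has produced text, the square meter value still has to be picked out of it. The sketch below covers only that text-parsing half and assumes the OCR output is already a string; the regular expression, the `extract_square_meters()` name, and the choice to return the largest figure (on the assumption that the total area is the largest one printed) are illustrative and not taken from `rec/floorplan.py`.

```python
import re
from typing import Optional

# Matches figures such as "72.5 sq m", "72 sqm", "72 sq. metres", "72.5 m²", "72 m2".
# The lookbehind stops a match from starting in the middle of a number,
# so values with thousands separators ("1,200 sq m") are skipped, not misread.
_SQM_PATTERN = re.compile(
    r"(?<![\d.,])(\d+(?:\.\d+)?)\s*(?:sq\.?\s*m(?:et(?:er|re)s?)?\b|m²|m2\b)",
    re.IGNORECASE,
)


def extract_square_meters(ocr_text: str) -> Optional[float]:
    """Return the largest square-meter figure found in OCR text, or None."""
    values = [float(match.group(1)) for match in _SQM_PATTERN.finditer(ocr_text)]
    return max(values) if values else None


print(extract_square_meters("Total area: 72.5 sq m (780 sq ft)"))
```

Square-foot figures are ignored because the pattern requires a metre unit after the number.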

### Repository Pattern

`repositories/listing_repository.py` handles database operations with SQLModel sessions.

## Environment Variables

- `DB_CONNECTION_STRING`: Database URL (SQLite default: `sqlite:///data/wrongmove.db`)
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key for transit routing

### Scraper Configuration

These control the query splitting behavior (see `.env.sample` for defaults):

| Variable | Default | Description |
|----------|---------|-------------|
| `RIGHTMOVE_MAX_CONCURRENT` | 5 | Max concurrent HTTP requests |
| `RIGHTMOVE_REQUEST_DELAY_MS` | 100 | Delay between requests (ms) |
| `RIGHTMOVE_SPLIT_THRESHOLD` | 1200 | Split query when results exceed this |
| `RIGHTMOVE_MIN_PRICE_BAND` | 100 | Minimum price band width (won't split below) |
| `RIGHTMOVE_MAX_PAGES` | 60 | Max pages per subquery (60 × 25 = 1500) |
| `RIGHTMOVE_PROXY_URL` | - | SOCKS proxy URL (e.g., `socks5://localhost:9050` for Tor) |
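
One way to load these variables with the defaults from the table is a frozen dataclass populated from the environment. This is a sketch under assumptions, not the code in `config/scraper_config.py`: `ScraperSettings`, `load_settings()`, and `PAGE_SIZE` are hypothetical names, and the page size of 25 is inferred from the "60 × 25 = 1500" note in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PAGE_SIZE = 25  # assumed results per page, inferred from "60 × 25 = 1500"


@dataclass(frozen=True)
class ScraperSettings:
    """Hypothetical settings container mirroring the table above."""
    max_concurrent: int = 5
    request_delay_ms: int = 100
    split_threshold: int = 1200
    min_price_band: int = 100
    max_pages: int = 60
    proxy_url: Optional[str] = None

    @property
    def max_results_per_subquery(self) -> int:
        return self.max_pages * PAGE_SIZE


def load_settings(env: Optional[Mapping[str, str]] = None) -> ScraperSettings:
    """Build settings from a mapping (os.environ by default), falling back to defaults."""
    source: Mapping[str, str] = os.environ if env is None else env

    def read_int(name: str, default: int) -> int:
        raw = source.get(name, "").strip()
        return int(raw) if raw else default

    return ScraperSettings(
        max_concurrent=read_int("RIGHTMOVE_MAX_CONCURRENT", 5),
        request_delay_ms=read_int("RIGHTMOVE_REQUEST_DELAY_MS", 100),
        split_threshold=read_int("RIGHTMOVE_SPLIT_THRESHOLD", 1200),
        min_price_band=read_int("RIGHTMOVE_MIN_PRICE_BAND", 100),
        max_pages=read_int("RIGHTMOVE_MAX_PAGES", 60),
        proxy_url=source.get("RIGHTMOVE_PROXY_URL") or None,  # empty string means no proxy
    )


defaults = load_settings({})
print(defaults.split_threshold, defaults.max_results_per_subquery)
```

Passing a plain dict instead of reading `os.environ` directly keeps the loader easy to exercise in tests.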

## Project Structure

- `main.py`: CLI entry point
- `api/`: FastAPI application with auth middleware
- `config/`: Configuration modules (scraper settings, scheduled tasks)
- `models/`: SQLModel database entities
- `repositories/`: Database access layer
- `rec/`: Core business logic (query, floorplan OCR, routing, districts)
- `services/`: Service layer modules, both high-level (listing_service, export_service, district_service, task_service) and low-level (listing_fetcher, image_fetcher, floorplan_detector, route_calculator, query_splitter)
- `tasks/`: Celery background tasks
- `frontend/`: React/Vite frontend with Caddy proxy
- `alembic/`: Database migrations
- `tests/`: Test suite (unit and integration tests)

## Type Checking

The project uses strict mypy configuration with `disallow_untyped_defs=true`. Run `mypy .` to check types.

## Exploration Preferences

- Always ignore `node_modules` directory when exploring the codebase

## Git Workflow

**IMPORTANT**: After completing work items, always create separate commits for each logical change:

- Keep each commit focused on one feature/fix
- Do not include unrelated files
- Use descriptive commit messages
- Group related files together (e.g., tests with the code they test)