Real Estate Crawler - Backend Documentation

A property listing aggregator that scrapes Rightmove UK, extracts floor area (square meters) from floorplan images via OCR, and calculates transit routes.

Quick Start

# Docker (recommended) - starts Redis, MySQL, API, and Celery
./start.sh

# Or run locally with Poetry
poetry install
./start.sh --local

The API is then available at http://localhost:5001

Dependencies

Dependency      Purpose
Python 3.11+    Runtime
Redis           Celery message broker
MySQL/SQLite    Database
Tesseract OCR   Floorplan text extraction
Docker          Containerized deployment

Python Packages (key)

  • fastapi + uvicorn - HTTP API
  • celery - Background tasks
  • sqlmodel - ORM
  • pytesseract + opencv - OCR
  • aiohttp - Async HTTP client

API Endpoints

Health Check

curl http://localhost:5001/api/status
# {"status": "OK"}

Get Listings

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing?limit=10"

Get Listings as GeoJSON

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing_geojson?listing_type=RENT&min_bedrooms=2&max_price=3000"

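A client consuming this endpoint will usually want coordinates and properties per listing. The sketch below assumes the response is a standard GeoJSON FeatureCollection with Point geometries (coordinates in longitude, latitude order); the feature_points helper, the sample payload, and its property key are hypothetical and only illustrate the shape.

```python
def feature_points(collection: dict) -> list[tuple[float, float, dict]]:
    """Return (longitude, latitude, properties) for every Point feature."""
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    points = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Point":
            continue  # skip features without a usable point geometry
        lon, lat = geometry["coordinates"][:2]
        points.append((lon, lat, feature.get("properties") or {}))
    return points


# Illustrative payload only; the real property names are not documented here.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-0.1276, 51.5034]},
            "properties": {"example_key": "example value"},
        },
        {"type": "Feature", "geometry": None, "properties": {}},
    ],
}

points = feature_points(sample)
print(len(points))
```
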
Refresh Listings (async)

curl -X POST -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/refresh_listings?listing_type=RENT&min_bedrooms=2&max_bedrooms=3&min_price=2000&max_price=4000"
# {"task_id": "abc123", "message": "Task abc123 started"}

Check Task Status

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/task_status?task_id=abc123"
# {"task_id": "abc123", "status": "SUCCESS", "result": "..."}
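
The refresh call returns immediately, so a client has to poll /api/task_status until the task finishes. A minimal polling sketch follows, assuming the status field carries Celery state names (the sample response shows SUCCESS; FAILURE and REVOKED are Celery's other finished states). The wait_for_task helper and the injected fetch_status callable are hypothetical and not part of the project.

```python
import time
from typing import Callable

# Celery states after which a task will not change again.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the task is finished or attempts run out."""
    for attempt in range(max_attempts):
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")


# Stand-in for GET /api/task_status; a real fetcher would make the HTTP call.
_responses = iter([
    {"task_id": "abc123", "status": "PENDING"},
    {"task_id": "abc123", "status": "STARTED"},
    {"task_id": "abc123", "status": "SUCCESS", "result": "..."},
])


def fake_fetch(task_id: str) -> dict:
    return next(_responses)


final = wait_for_task(fake_fetch, "abc123", sleep=lambda seconds: None)
print(final["status"])
```

In real use, fetch_status would issue the authenticated GET shown above and return the decoded JSON body.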

Get Districts

curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/get_districts"
# {"Westminster": "REGION^93965", "Camden": "REGION^93934", ...}
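
Each value in the response pairs a location kind with a numeric ID, separated by "^". A small sketch of splitting such an identifier, assuming every value follows the KIND^NUMBER pattern seen in the sample; parse_location_id is a hypothetical helper.

```python
def parse_location_id(identifier: str) -> tuple[str, int]:
    """Split an identifier such as 'REGION^93965' into ('REGION', 93965)."""
    kind, _, number = identifier.partition("^")
    if not kind or not number.isdigit():
        raise ValueError(f"unexpected location identifier: {identifier!r}")
    return kind, int(number)


# Values copied from the sample response above.
districts = {"Westminster": "REGION^93965", "Camden": "REGION^93934"}

print(parse_location_id(districts["Camden"]))  # ('REGION', 93934)
```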

CLI Commands

# Fetch listings from Rightmove
python main.py dump-listings -t rent --min-bedrooms 2 --max-price 4000

# Download floorplan images
python main.py dump-images

# Run OCR on floorplans
python main.py detect-floorplan

# Calculate transit routes
python main.py routing -d "10 Downing Street, London" -m TRANSIT -l 10

# Export to GeoJSON
python main.py export-immoweb -O output.geojson -t rent --min-bedrooms 2

# Export to CSV
python main.py export-csv -O output.csv -t rent

# List available districts
python main.py list-districts
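
The commands above can be chained from a script. The sketch below assumes they are meant to run in the order listed (fetch, download images, OCR, export), which the source does not state explicitly; build_commands and run_pipeline are hypothetical helpers, and only flags shown above are used.

```python
import subprocess
import sys

# Subcommands and flags copied from the examples above.
PIPELINE = [
    ["dump-listings", "-t", "rent", "--min-bedrooms", "2", "--max-price", "4000"],
    ["dump-images"],
    ["detect-floorplan"],
    ["export-csv", "-O", "output.csv", "-t", "rent"],
]


def build_commands(steps=PIPELINE, python=sys.executable, entry="main.py"):
    """Return one argv list per pipeline step."""
    return [[python, entry, *step] for step in steps]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Only print the commands here; call run_pipeline(commands) to execute them.
commands = build_commands(python="python")
for argv in commands:
    print(" ".join(argv))
```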

Query Parameters

Parameter        Type      Description
listing_type     string    Listing type: RENT or BUY
min_bedrooms     int       Minimum bedrooms
max_bedrooms     int       Maximum bedrooms
min_price        int       Minimum price
max_price        int       Maximum price
min_sqm          int       Minimum square meters
district         string    District name (repeatable)
furnish_types    string    FURNISHED, UNFURNISHED, or PART_FURNISHED
last_seen_days   int       Only listings seen in the last N days
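
These parameters can be assembled into a query string programmatically. The sketch below assumes that "repeatable" means the district key is repeated once per value (district=A&district=B); build_listing_url is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5001"  # from the Quick Start section


def build_listing_url(endpoint: str, **filters) -> str:
    """Build a URL for endpoint, dropping None filters and repeating list values."""
    params = {key: value for key, value in filters.items() if value is not None}
    # doseq=True turns a list value into one key=value pair per element.
    query = urlencode(params, doseq=True)
    return f"{BASE_URL}{endpoint}?{query}" if query else f"{BASE_URL}{endpoint}"


url = build_listing_url(
    "/api/listing_geojson",
    listing_type="RENT",
    min_bedrooms=2,
    max_price=3000,
    min_sqm=None,  # omitted from the query string
    district=["Camden", "Westminster"],
)
print(url)
```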

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   CLI       │     │   HTTP API  │     │   Celery    │
│  (main.py)  │     │ (api/app.py)│     │   Worker    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                  ┌────────▼────────┐
                  │    Services     │
                  │ (services/*.py) │
                  └────────┬────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
       │ Repository  │ │ Redis │ │  Rightmove  │
       │  (MySQL)    │ │       │ │     API     │
       └─────────────┘ └───────┘ └─────────────┘
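
The diagram shows three entry points sharing one service layer over a repository. A minimal sketch of that layering follows. Every class, function, and field name here is hypothetical, the in-memory repository stands in for MySQL, and the listings are made-up sample data; it illustrates the pattern rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Listing:
    id: int
    bedrooms: int
    price: int


class ListingRepository(Protocol):
    def find(self, min_bedrooms: Optional[int], max_price: Optional[int]) -> list[Listing]: ...


class InMemoryListingRepository:
    """Stand-in for the database-backed repository."""

    def __init__(self, listings: list[Listing]) -> None:
        self._listings = list(listings)

    def find(self, min_bedrooms: Optional[int], max_price: Optional[int]) -> list[Listing]:
        return [
            item for item in self._listings
            if (min_bedrooms is None or item.bedrooms >= min_bedrooms)
            and (max_price is None or item.price <= max_price)
        ]


class ListingService:
    """Business logic lives here, independent of how it is invoked."""

    def __init__(self, repository: ListingRepository) -> None:
        self._repository = repository

    def search(self, min_bedrooms: Optional[int] = None,
               max_price: Optional[int] = None) -> list[Listing]:
        found = self._repository.find(min_bedrooms, max_price)
        return sorted(found, key=lambda item: item.price)


# Thin adapters: each entry point only formats the shared service result.
def cli_handler(service: ListingService, min_bedrooms: int, max_price: int) -> str:
    rows = service.search(min_bedrooms, max_price)
    return "\n".join(f"{r.id}\t{r.bedrooms} bed\t{r.price}" for r in rows)


def api_handler(service: ListingService, min_bedrooms: int, max_price: int) -> list[dict]:
    rows = service.search(min_bedrooms, max_price)
    return [{"id": r.id, "bedrooms": r.bedrooms, "price": r.price} for r in rows]


service = ListingService(InMemoryListingRepository([
    Listing(1, 1, 1800), Listing(2, 2, 2900), Listing(3, 3, 3500), Listing(4, 2, 2500),
]))
print(cli_handler(service, 2, 3000))
print(api_handler(service, 2, 3000))
```

Because the adapters hold no filtering logic, a Celery task could call the same service method without duplicating it.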

Environment Variables

# Database
DB_CONNECTION_STRING=mysql://user:pass@localhost:3306/wrongmove

# Redis (Celery)
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0

# Google Maps (optional, for routing)
ROUTING_API_KEY=your_api_key
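
One way to read these variables at startup is sketched below. It assumes the database and Celery variables are required and that only ROUTING_API_KEY is optional, as the comment above suggests; the Settings class and load_settings function are hypothetical, and the project may apply its own defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these three must be set; ROUTING_API_KEY may be absent.
REQUIRED = ("DB_CONNECTION_STRING", "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND")


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    celery_broker_url: str
    celery_result_backend: str
    routing_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from env, failing fast if a required variable is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        db_connection_string=env["DB_CONNECTION_STRING"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend=env["CELERY_RESULT_BACKEND"],
        routing_api_key=env.get("ROUTING_API_KEY") or None,
    )


# Example values copied from the block above, without a routing key.
example_env = {
    "DB_CONNECTION_STRING": "mysql://user:pass@localhost:3306/wrongmove",
    "CELERY_BROKER_URL": "redis://localhost:6379/0",
    "CELERY_RESULT_BACKEND": "redis://localhost:6379/0",
}
settings = load_settings(example_env)
print(settings.routing_api_key)  # None
```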

Authentication

All API endpoints except /api/status require a JWT bearer token issued by Authentik (OIDC), sent in the Authorization header.

# Get token from Authentik, then:
curl -H "Authorization: Bearer $TOKEN" http://localhost:5001/api/listing
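
The same request can be made from Python using only the standard library. This is a sketch: authed_request and fetch_json are hypothetical helpers, the token value is a placeholder, and obtaining a real token from Authentik is outside its scope.

```python
import json
import urllib.request


def authed_request(url: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying the bearer token in the Authorization header."""
    return urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON body (needs a running API)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network.
request = authed_request("http://localhost:5001/api/listing?limit=10", token="example-token")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

With the API running and a valid token, fetch_json(request) would return the decoded response.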

Project Structure

├── main.py              # CLI entry point
├── api/app.py           # FastAPI application
├── services/            # Business logic (shared by CLI + API)
│   ├── listing_service.py
│   ├── export_service.py
│   ├── district_service.py
│   └── task_service.py
├── repositories/        # Database access
├── models/              # SQLModel entities
├── rec/                 # Core logic (query, OCR, routing)
├── tasks/               # Celery background tasks
└── tests/               # Test suite

Running Tests

pytest tests/ -v --cov=.
mypy .