# Real Estate Crawler - Backend Documentation

A property listing aggregator that scrapes Rightmove UK, extracts floor area (in square meters) from floorplan images via OCR, and calculates transit routes.
## Quick Start

```bash
# Docker (recommended) - starts Redis, MySQL, API, and Celery
./start.sh

# Or run locally with Poetry
poetry install
./start.sh --local
```

The API is available at `http://localhost:5001`.
## Dependencies

| Dependency | Purpose |
|------------|---------|
| Python 3.11+ | Runtime |
| Redis | Celery message broker |
| MySQL/SQLite | Database |
| Tesseract OCR | Floorplan text extraction |
| Docker | Containerized deployment |
### Python Packages (key)
- `fastapi` + `uvicorn` - HTTP API
- `celery` - Background tasks
- `sqlmodel` - ORM
- `pytesseract` + `opencv` - OCR
- `aiohttp` - Async HTTP client
## API Endpoints

### Health Check
```bash
curl http://localhost:5001/api/status
# {"status": "OK"}
```
### Get Listings
```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing?limit=10"
```
### Get Listings as GeoJSON
```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing_geojson?listing_type=RENT&min_bedrooms=2&max_price=3000"
```
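The response can be consumed with the Python standard library alone. The sketch below assumes the endpoint returns a GeoJSON `FeatureCollection` of `Point` features, with coordinates in the standard GeoJSON order (longitude, then latitude). The response shape is not documented here, so `point_coordinates`, the `price` property, and the sample payload are all hypothetical.

```python
import json


def point_coordinates(geojson_text: str) -> list[tuple[float, float]]:
    """Return (latitude, longitude) pairs for every Point feature."""
    collection = json.loads(geojson_text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")

    points = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Point":
            # GeoJSON stores positions as [longitude, latitude, ...]
            lon, lat = geometry["coordinates"][:2]
            points.append((lat, lon))
    return points


# Hypothetical payload, for illustration only (not captured from the real API)
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-0.1276, 51.5034]},
            "properties": {"price": 2500},
        }
    ],
})

print(point_coordinates(SAMPLE))
```

Swapping the order to (latitude, longitude) is a convenience for map libraries that expect it; keep the GeoJSON order if the consumer reads GeoJSON directly.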
### Refresh Listings (async)
```bash
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/refresh_listings?listing_type=RENT&min_bedrooms=2&max_bedrooms=3&min_price=2000&max_price=4000"
# {"task_id": "abc123", "message": "Task abc123 started"}
```
### Check Task Status
```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/task_status?task_id=abc123"
# {"task_id": "abc123", "status": "SUCCESS", "result": "..."}
```
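Because a refresh runs in the background, a client typically polls `/api/task_status` until the task finishes. The following is a minimal sketch of such a loop, assuming the `status` field carries Celery state names (only `SUCCESS` is shown above; `FAILURE` and `REVOKED` are Celery's other terminal states). `wait_for_task` and `fetch_status` are hypothetical names; `fetch_status` stands for any callable that performs the GET request and returns the decoded JSON.

```python
import time
from typing import Callable

# Celery's terminal states. Assumption: the API passes these names through unchanged.
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status(task_id) until the task reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        sleep(interval)


# Usage with a stand-in fetcher that simulates three successive responses
_responses = iter([
    {"task_id": "abc123", "status": "PENDING"},
    {"task_id": "abc123", "status": "STARTED"},
    {"task_id": "abc123", "status": "SUCCESS", "result": "..."},
])
final = wait_for_task(lambda _task_id: next(_responses), "abc123", sleep=lambda _s: None)
print(final["status"])
```

Passing the fetcher and the sleep function in as arguments keeps the loop independent of any HTTP client and makes it easy to exercise without a running server.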
### Get Districts
```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/get_districts"
# {"Westminster": "REGION^93965", "Camden": "REGION^93934", ...}
```
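The response maps district names to region identifiers, so a client can validate user-supplied names before sending a query. A minimal sketch, assuming the mapping shape shown in the comment above; `resolve_districts` is a hypothetical helper, and the case-insensitive matching is a design choice of this example rather than documented API behaviour.

```python
import json


def resolve_districts(mapping: dict[str, str], names: list[str]) -> dict[str, str]:
    """Return {canonical_name: region_id} for the requested district names."""
    by_lower = {name.lower(): (name, region) for name, region in mapping.items()}
    resolved: dict[str, str] = {}
    unknown: list[str] = []
    for name in names:
        hit = by_lower.get(name.strip().lower())
        if hit is None:
            unknown.append(name)
        else:
            canonical, region = hit
            resolved[canonical] = region
    if unknown:
        raise KeyError(f"unknown districts: {', '.join(unknown)}")
    return resolved


# The two entries shown in the example response above
DISTRICTS = json.loads('{"Westminster": "REGION^93965", "Camden": "REGION^93934"}')
print(resolve_districts(DISTRICTS, ["camden", " Westminster"]))
```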
## CLI Commands

```bash
# Fetch listings from Rightmove
python main.py dump-listings -t rent --min-bedrooms 2 --max-price 4000

# Download floorplan images
python main.py dump-images

# Run OCR on floorplans
python main.py detect-floorplan

# Calculate transit routes
python main.py routing -d "10 Downing Street, London" -m TRANSIT -l 10

# Export to GeoJSON
python main.py export-immoweb -O output.geojson -t rent --min-bedrooms 2

# Export to CSV
python main.py export-csv -O output.csv -t rent

# List available districts
python main.py list-districts
```
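The first three commands can be chained into one run. The sketch below builds the argument lists and executes them in sequence, stopping at the first failure. It assumes the commands are meant to run in the order listed above (fetch, download images, OCR), which this document does not state explicitly; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and only flags that appear in the examples above are used.

```python
import subprocess
import sys


def pipeline_commands(listing_type: str = "rent", min_bedrooms: int = 2,
                      max_price: int = 4000) -> list[list[str]]:
    """Build argv lists for the fetch -> images -> OCR steps."""
    base = [sys.executable, "main.py"]
    return [
        base + ["dump-listings", "-t", listing_type,
                "--min-bedrooms", str(min_bedrooms),
                "--max-price", str(max_price)],
        base + ["dump-images"],
        base + ["detect-floorplan"],
    ]


def run_pipeline(commands: list[list[str]], runner=subprocess.run) -> None:
    """Run each command in order; check=True raises on a non-zero exit code."""
    for argv in commands:
        runner(argv, check=True)


# Print the commands instead of running them; call run_pipeline(...) from the
# project root to actually execute the steps.
for argv in pipeline_commands():
    print(" ".join(argv[1:]))
```

Passing argument lists rather than a shell string avoids quoting problems, for example with an address passed to `routing -d`.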
## Query Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `listing_type` | RENT/BUY | Listing type (rent or buy) |
| `min_bedrooms` | int | Minimum bedrooms |
| `max_bedrooms` | int | Maximum bedrooms |
| `min_price` | int | Minimum price |
| `max_price` | int | Maximum price |
| `min_sqm` | int | Minimum square meters |
| `district` | string | District name (repeatable) |
| `furnish_types` | string | FURNISHED/UNFURNISHED/PART_FURNISHED |
| `last_seen_days` | int | Only listings seen in last N days |
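These parameters can be assembled into a query string programmatically. A minimal sketch using `urllib.parse.urlencode`, where `doseq=True` turns a list value into a repeated key. It assumes "repeatable" means `district=A&district=B`; `listing_url` is a hypothetical helper, and which endpoints accept which parameters is not stated in this table.

```python
from urllib.parse import urlencode


def listing_url(base_url: str, endpoint: str = "listing_geojson", **filters) -> str:
    """Build an API URL, skipping filters that are None."""
    params = {key: value for key, value in filters.items() if value is not None}
    # doseq=True expands list values into repeated keys (district=A&district=B)
    query = urlencode(params, doseq=True)
    url = f"{base_url}/api/{endpoint}"
    return f"{url}?{query}" if query else url


print(listing_url(
    "http://localhost:5001",
    listing_type="RENT",
    min_bedrooms=2,
    max_price=None,  # omitted from the query string
    district=["Westminster", "Camden"],
))
```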
## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     CLI     │     │  HTTP API   │     │   Celery    │
│  (main.py)  │     │ (api/app.py)│     │   Worker    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                  ┌────────▼────────┐
                  │    Services     │
                  │ (services/*.py) │
                  └────────┬────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
       │ Repository  │ │ Redis │ │  Rightmove  │
       │   (MySQL)   │ │       │ │     API     │
       └─────────────┘ └───────┘ └─────────────┘
```
## Environment Variables

```bash
# Database
DB_CONNECTION_STRING=mysql://user:pass@localhost:3306/wrongmove

# Redis (Celery)
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0

# Google Maps (optional, for routing)
ROUTING_API_KEY=your_api_key
```
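One way to read these variables at startup is a small settings object that fails early when something is missing. This is a sketch only: `Settings` and `load_settings` are hypothetical names, and treating the first three variables as required is an assumption (the project may fall back to defaults such as SQLite). Only `ROUTING_API_KEY` is described as optional above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these three must be set; the project itself may define defaults.
REQUIRED = ("DB_CONNECTION_STRING", "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND")


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    celery_broker_url: str
    celery_result_backend: str
    routing_api_key: Optional[str]  # optional, only needed for routing


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping (the process environment by default)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        db_connection_string=env["DB_CONNECTION_STRING"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend=env["CELERY_RESULT_BACKEND"],
        routing_api_key=env.get("ROUTING_API_KEY") or None,
    )


# Usage with an explicit mapping, mirroring the values above
example = load_settings({
    "DB_CONNECTION_STRING": "mysql://user:pass@localhost:3306/wrongmove",
    "CELERY_BROKER_URL": "redis://localhost:6379/0",
    "CELERY_RESULT_BACKEND": "redis://localhost:6379/0",
})
print(example.routing_api_key)
```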
## Authentication

All API endpoints except `/api/status` require a JWT bearer token issued by Authentik (OIDC).

```bash
# Get token from Authentik, then:
curl -H "Authorization: Bearer $TOKEN" http://localhost:5001/api/listing
```
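The same request can be made from Python with the standard library. A minimal sketch, assuming the token has already been obtained from Authentik (the token flow itself is not covered here); `authorized_request` and `get_json` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen


def authorized_request(url: str, token: str, method: str = "GET") -> Request:
    """Build a request carrying the JWT as a bearer token."""
    return Request(
        url,
        method=method,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def get_json(url: str, token: str, timeout: float = 10.0):
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    for example when the token is missing or rejected.
    """
    with urlopen(authorized_request(url, token), timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network
request = authorized_request("http://localhost:5001/api/listing", "example-token")
print(request.get_method(), request.full_url)
```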
## Project Structure

```
├── main.py              # CLI entry point
├── api/app.py           # FastAPI application
├── services/            # Business logic (shared by CLI + API)
│   ├── listing_service.py
│   ├── export_service.py
│   ├── district_service.py
│   └── task_service.py
├── repositories/        # Database access
├── models/              # SQLModel entities
├── rec/                 # Core logic (query, OCR, routing)
├── tasks/               # Celery background tasks
└── tests/               # Test suite
```
## Running Tests

```bash
pytest tests/ -v --cov=.
mypy .
```