Add services layer, tests, streaming UI, and cleanup legacy code
parent 5514fa6381
commit d205d15c74
62 changed files with 3729 additions and 1024 deletions
crawler/docs/BACKEND.md (Normal file, 183 lines)
@@ -0,0 +1,183 @@

# Real Estate Crawler - Backend Documentation

A property listing aggregator that scrapes Rightmove UK, extracts floor area in square meters from floorplan images via OCR, and calculates transit routes.

## Quick Start

```bash
# Docker (recommended) - starts Redis, MySQL, API, and Celery
./start.sh

# Or run locally with Poetry
poetry install
./start.sh --local
```

The API is available at `http://localhost:5001`.

## Dependencies

| Dependency | Purpose |
|------------|---------|
| Python 3.11+ | Runtime |
| Redis | Celery message broker |
| MySQL/SQLite | Database |
| Tesseract OCR | Floorplan text extraction |
| Docker | Containerized deployment |

### Python Packages (key)

- `fastapi` + `uvicorn` - HTTP API
- `celery` - Background tasks
- `sqlmodel` - ORM
- `pytesseract` + `opencv` - OCR
- `aiohttp` - Async HTTP client

## API Endpoints

### Health Check

```bash
curl http://localhost:5001/api/status
# {"status": "OK"}
```

### Get Listings

```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing?limit=10"
```

### Get Listings as GeoJSON

```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/listing_geojson?listing_type=RENT&min_bedrooms=2&max_price=3000"
```

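The response shape is not spelled out here, so the following is only a sketch of how a client might inspect it. It assumes the endpoint returns a standard GeoJSON `FeatureCollection` whose features carry `Point` geometries in `[longitude, latitude]` order; `summarize_feature_collection` is a hypothetical helper, not part of this project.

```python
def summarize_feature_collection(document: dict) -> dict:
    """Count features and compute a bounding box over Point geometries.

    Assumes a GeoJSON FeatureCollection; coordinates are [lon, lat].
    """
    if document.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")

    features = document.get("features", [])
    points = [
        feature["geometry"]["coordinates"]
        for feature in features
        if (feature.get("geometry") or {}).get("type") == "Point"
    ]

    summary = {"count": len(features), "points": len(points), "bbox": None}
    if points:
        lons = [point[0] for point in points]
        lats = [point[1] for point in points]
        # GeoJSON bbox order: west, south, east, north
        summary["bbox"] = (min(lons), min(lats), max(lons), max(lats))
    return summary
```

Feeding it the decoded JSON body of `/api/listing_geojson` would give a quick sanity check that the filters returned listings in the expected area.
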
### Refresh Listings (async)

```bash
curl -X POST -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/refresh_listings?listing_type=RENT&min_bedrooms=2&max_bedrooms=3&min_price=2000&max_price=4000"
# {"task_id": "abc123", "message": "Task abc123 started"}
```

### Check Task Status

```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/task_status?task_id=abc123"
# {"task_id": "abc123", "status": "SUCCESS", "result": "..."}
```

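Because the refresh runs in the background, a client usually has to poll `/api/task_status` until the task finishes. The sketch below shows one way to do that with the standard library. It assumes the `status` field carries raw Celery state names (the example above shows `SUCCESS`) and treats `SUCCESS`, `FAILURE`, and `REVOKED` as terminal; `fetch_task_status` and `wait_for_task` are hypothetical helper names.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:5001"  # from Quick Start
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}  # Celery's finished states


def fetch_task_status(task_id: str, token: str) -> dict:
    """GET /api/task_status and decode the JSON body."""
    query = urllib.parse.urlencode({"task_id": task_id})
    request = urllib.request.Request(
        f"{BASE_URL}/api/task_status?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def wait_for_task(
    task_id: str,
    token: str,
    fetch: Callable[[str, str], dict] = fetch_task_status,
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until the task reaches a terminal state or attempts run out."""
    for _ in range(max_attempts):
        status = fetch(task_id, token)
        if status.get("status") in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} still running after {max_attempts} polls")
```

The `fetch` parameter is injectable so the polling loop can be exercised without a running server; the interval and attempt limit are arbitrary defaults and should be tuned to how long a refresh typically takes.
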
### Get Districts

```bash
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:5001/api/get_districts"
# {"Westminster": "REGION^93965", "Camden": "REGION^93934", ...}
```

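The values in the example response follow a `KIND^NUMBER` pattern. If a client needs the two parts separately, a small parser like the one below may help. This is a sketch based only on the two sample values shown above; `LocationId` and `parse_location_id` are hypothetical names, and other identifier shapes may exist.

```python
from typing import NamedTuple


class LocationId(NamedTuple):
    kind: str    # e.g. "REGION"
    number: int  # e.g. 93965


def parse_location_id(raw: str) -> LocationId:
    """Split an identifier such as "REGION^93965" into its two parts."""
    kind, separator, number = raw.partition("^")
    if not separator or not kind or not number.isdecimal():
        raise ValueError(f"unexpected location identifier: {raw!r}")
    return LocationId(kind, int(number))


# Sample values copied from the example response above
districts = {"Westminster": "REGION^93965", "Camden": "REGION^93934"}
parsed = {name: parse_location_id(raw) for name, raw in districts.items()}
```
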
## CLI Commands

```bash
# Fetch listings from Rightmove
python main.py dump-listings -t rent --min-bedrooms 2 --max-price 4000

# Download floorplan images
python main.py dump-images

# Run OCR on floorplans
python main.py detect-floorplan

# Calculate transit routes
python main.py routing -d "10 Downing Street, London" -m TRANSIT -l 10

# Export to GeoJSON
python main.py export-immoweb -O output.geojson -t rent --min-bedrooms 2

# Export to CSV
python main.py export-csv -O output.csv -t rent

# List available districts
python main.py list-districts
```

## Query Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `listing_type` | RENT/BUY | Listing type (rent or buy) |
| `min_bedrooms` | int | Minimum number of bedrooms |
| `max_bedrooms` | int | Maximum number of bedrooms |
| `min_price` | int | Minimum price |
| `max_price` | int | Maximum price |
| `min_sqm` | int | Minimum floor area in square meters |
| `district` | string | District name (repeatable) |
| `furnish_types` | string | FURNISHED/UNFURNISHED/PART_FURNISHED |
| `last_seen_days` | int | Only listings seen in the last N days |

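As an illustration of how these parameters combine, the sketch below builds a query string with the standard library. It assumes that "repeatable" means the `district` key is simply repeated in the query string, which is the usual convention but is not confirmed here; `build_listing_query` is a hypothetical helper.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode


def build_listing_query(
    listing_type: str,
    *,
    districts: Iterable[str] = (),
    **filters: Optional[object],
) -> str:
    """Encode listing filters, skipping unset ones and repeating `district`."""
    params = [("listing_type", listing_type)]
    params += [(name, value) for name, value in filters.items() if value is not None]
    # Assumption: a repeatable parameter is sent once per value
    params += [("district", district) for district in districts]
    return urlencode(params)


query = build_listing_query(
    "RENT",
    min_bedrooms=2,
    max_price=3000,
    districts=["Westminster", "Camden"],
)
url = f"http://localhost:5001/api/listing_geojson?{query}"
```
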
## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     CLI     │     │  HTTP API   │     │   Celery    │
│  (main.py)  │     │ (api/app.py)│     │   Worker    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                  ┌────────▼────────┐
                  │    Services     │
                  │ (services/*.py) │
                  └────────┬────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
       │ Repository  │ │ Redis │ │  Rightmove  │
       │   (MySQL)   │ │       │ │     API     │
       └─────────────┘ └───────┘ └─────────────┘
```

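The point of the diagram is that the CLI, the HTTP API, and the Celery worker all go through one services layer instead of touching storage directly. The sketch below illustrates that shape in miniature. Every name in it (`Listing`, `ListingRepository`, `InMemoryListingRepository`, `ListingService`, `cli_entry`, `api_entry`) is hypothetical and chosen for illustration; it is not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Listing:
    id: int
    bedrooms: int
    price: int


class ListingRepository(Protocol):
    """Anything that can return stored listings (a database, a test fake)."""

    def all(self) -> list[Listing]: ...


class InMemoryListingRepository:
    """Stand-in for a database-backed repository."""

    def __init__(self, listings: list[Listing]) -> None:
        self._listings = list(listings)

    def all(self) -> list[Listing]:
        return list(self._listings)


class ListingService:
    """Filtering logic lives here, so every entry point behaves the same."""

    def __init__(self, repository: ListingRepository) -> None:
        self._repository = repository

    def search(self, min_bedrooms: int = 0, max_price: Optional[int] = None) -> list[Listing]:
        return [
            listing
            for listing in self._repository.all()
            if listing.bedrooms >= min_bedrooms
            and (max_price is None or listing.price <= max_price)
        ]


def cli_entry(service: ListingService) -> str:
    """A CLI command calls the service and formats text."""
    return "\n".join(f"{item.id}: {item.price}" for item in service.search(min_bedrooms=2))


def api_entry(service: ListingService) -> list[dict]:
    """An HTTP handler calls the same service and returns JSON-able data."""
    return [{"id": item.id, "price": item.price} for item in service.search(min_bedrooms=2)]
```

Depending on a `Protocol` rather than a concrete class is one way to let tests swap in an in-memory repository without a database.
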
## Environment Variables

```bash
# Database
DB_CONNECTION_STRING=mysql://user:pass@localhost:3306/wrongmove

# Redis (Celery)
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_RESULT_BACKEND=redis://localhost:6379/0

# Google Maps (optional, for routing)
ROUTING_API_KEY=your_api_key
```

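A minimal sketch of reading these variables in Python follows. It assumes the database and Celery variables are required and only `ROUTING_API_KEY` is optional, which is inferred from the "optional" comment above rather than confirmed; `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: everything not marked optional above is required
REQUIRED = ("DB_CONNECTION_STRING", "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND")


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    celery_broker_url: str
    celery_result_backend: str
    routing_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, failing fast on missing values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        db_connection_string=env["DB_CONNECTION_STRING"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend=env["CELERY_RESULT_BACKEND"],
        routing_api_key=env.get("ROUTING_API_KEY") or None,
    )
```

Passing a plain dictionary as `env` makes the loader easy to test without modifying the real process environment.
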
## Authentication

All API endpoints except `/api/status` require a JWT issued by Authentik (OIDC), sent as a bearer token in the `Authorization` header.

```bash
# Get token from Authentik, then:
curl -H "Authorization: Bearer $TOKEN" http://localhost:5001/api/listing
```

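On the client side it can be handy to check whether a token has already expired before sending a request. The sketch below decodes the JWT payload with the standard library and reads the standard `exp` claim. It does not verify the signature, so it is only a convenience check and no substitute for the server's validation; `jwt_claims`, `is_expired`, and `auth_headers` are hypothetical helpers.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_expired(token: str, now: Optional[float] = None, leeway: float = 30.0) -> bool:
    """True if the `exp` claim is in the past (or within `leeway` seconds of it)."""
    expiry = jwt_claims(token).get("exp")
    if expiry is None:
        return False  # no expiry claim present
    current = time.time() if now is None else now
    return current >= expiry - leeway


def auth_headers(token: str) -> dict:
    """Header expected by the API, as in the curl example above."""
    return {"Authorization": f"Bearer {token}"}
```
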
## Project Structure

```
├── main.py                 # CLI entry point
├── api/app.py              # FastAPI application
├── services/               # Business logic (shared by CLI + API)
│   ├── listing_service.py
│   ├── export_service.py
│   ├── district_service.py
│   └── task_service.py
├── repositories/           # Database access
├── models/                 # SQLModel entities
├── rec/                    # Core logic (query, OCR, routing)
├── tasks/                  # Celery background tasks
└── tests/                  # Test suite
```

## Running Tests

```bash
# Run the test suite with coverage
pytest tests/ -v --cov=.
# Static type checking
mypy .
```