Refactor codebase following Clean Code principles and add 229 tests

- Extract helpers to reduce function sizes (listing_tasks, app.py, query.py, listing_fetcher)
- Replace nonlocal mutations with _PipelineState dataclass in listing_tasks
- Fix bugs: isinstance→equality check in repository, verify_exp for OIDC tokens
- Consolidate duplicate filter methods in listing_repository
- Move hardcoded config to env vars with backward-compatible defaults
- Simplify CLI decorator to auto-build QueryParameters
- Add deprecation docstring to data_access.py
- Test count: 158 → 387 (all passing)
Viktor Barzin 2026-02-07 20:19:57 +00:00
parent 7e05b3c971
commit 150342bb9e
48 changed files with 5029 additions and 990 deletions

.claude/CLAUDE.md Normal file

@@ -0,0 +1,95 @@
# Realestate Crawler
## Project Overview
A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI. The repo contains three sub-projects:
- **`crawler/`** — Main application (Python backend + React frontend). Has its own `CLAUDE.md` with detailed architecture docs.
- **`immoweb/`** — Separate scraper (Node.js, legacy/reference).
- **`vqa/`** — Visual QA / testing tooling.
## Command Execution
**All project commands run inside Docker containers.** The dev environment uses Docker Compose: start it first, then exec into the containers for any project operation. Only infrastructure commands run on the host.
- **Infrastructure commands** (docker compose, kubectl) — Run locally on the Mac
- **All project commands** (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the `app` container via `docker compose exec app <command>`
See `.claude/skills/` for detailed skills on dev environment, building, and deploying.
## Quick Reference
| Action | Command | Where |
|--------|---------|-------|
| Start all services | `./start.sh` | `crawler/` |
| Rebuild & start | `./start.sh --build` | `crawler/` |
| Stop services | `./start.sh --down` | `crawler/` |
| Run tests | `pytest tests/ -v --cov=. --cov-report=term-missing` | `crawler/` |
| Type check | `mypy .` | `crawler/` |
| Format code | `yapf --style .style.yapf --recursive .` | `crawler/` |
| Lint (CI runs Ruff) | `ruff check .` | `crawler/` |
| DB migration | `alembic upgrade head` | `crawler/` |
| New migration | `alembic revision -m "description"` | `crawler/` |
| Frontend dev | `cd frontend && ./start.sh` | `crawler/` |
## Tech Stack
- **Backend:** Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- **Frontend:** React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- **Database:** MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- **Infrastructure:** Docker Compose (dev), Kubernetes (prod), Drone CI
## Code Conventions
- **Type checking:** Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- **Formatting:** YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- **Async:** Scraper layer uses `async/await` throughout with `aiohttp`.
- **Models:** SQLModel for DB entities, Pydantic for request/response validation.
- **Service layer:** `services/` contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
- **Repository pattern:** `repositories/` for database queries. Don't put raw SQL in services or API.
- **Tests:** pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.
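
The service and repository conventions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `Listing`, `ListingRepository`, `InMemoryListingRepository`, and `cheapest_listings` are hypothetical names, and an in-memory list stands in for the database so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a SQLModel entity."""
    listing_id: int
    price: int
    bedrooms: int


class ListingRepository(Protocol):
    """Data access boundary: only repositories know how rows are stored."""

    def find_by_min_bedrooms(self, min_bedrooms: int) -> list[Listing]: ...


class InMemoryListingRepository:
    """Test double; a real repository would issue SQLModel queries here."""

    def __init__(self, rows: list[Listing]) -> None:
        self._rows = rows

    def find_by_min_bedrooms(self, min_bedrooms: int) -> list[Listing]:
        return [row for row in self._rows if row.bedrooms >= min_bedrooms]


def cheapest_listings(repo: ListingRepository, min_bedrooms: int, limit: int) -> list[Listing]:
    """Service-layer function: business logic only, no storage details."""
    matches = repo.find_by_min_bedrooms(min_bedrooms)
    return sorted(matches, key=lambda listing: listing.price)[:limit]


repo = InMemoryListingRepository([
    Listing(1, 2500, 2),
    Listing(2, 1800, 1),
    Listing(3, 2100, 3),
])
result = cheapest_listings(repo, min_bedrooms=2, limit=1)
```

Because the service depends only on the repository interface, both a CLI command and an API route could call `cheapest_listings` unchanged, and unit tests can swap in the in-memory double.
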
## Architecture Layers (in `crawler/`)
```
API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
       ↑                       ↑                             ↑
  Auth (OIDC)          Core Logic (rec/)            SQLModel (models/)
                     Celery Tasks (tasks/)
```
- `rec/` — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/` — Orchestration. Query splitting, listing fetching, caching.
- `api/` — FastAPI routes, auth middleware, metrics.
- `models/` — `RentListing`, `BuyListing` SQLModel entities.
- `tasks/` — Celery background tasks with Redis broker.
- `config/` — Env-var-based configuration (scraper settings, schedules).
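
For readers unfamiliar with the circuit breaker mentioned for `rec/`, the general pattern looks roughly like the sketch below. It is a generic, simplified version (it closes fully after the cool-down instead of using a half-open probe) and does not reflect the actual contents of `rec/circuit_breaker.py`; the class name, method names, and thresholds are all assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Generic circuit breaker; thresholds and names are illustrative only."""

    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return False while the breaker is open and the cool-down has not elapsed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._reset_after_s:
            # Cool-down elapsed: close the breaker and let requests through again.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        """A successful call resets the consecutive-failure counter."""
        self._failures = 0

    def record_failure(self) -> None:
        """Open the breaker once consecutive failures reach the threshold."""
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()


# A fake clock makes the behaviour deterministic for the example.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = not breaker.allow_request()
now[0] = 31.0
recovered = breaker.allow_request()
```

Injecting the clock keeps the breaker testable without sleeping, which fits a test suite that has to run quickly inside the `app` container.
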
## Key Design Decisions
- **Query splitting** works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- **Circuit breaker** (`rec/circuit_breaker.py`) and **throttle detection** (`rec/throttle_detector.py`) protect against rate limiting.
- **Streaming API** uses NDJSON for progressive loading of large result sets.
- **Redis** serves dual duty as Celery broker and GeoJSON cache.
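
The price-band part of query splitting can be illustrated with a small recursive bisection. This is only a sketch of the idea, not the code in `services/query_splitter.py`: `split_price_bands`, `count_results`, and `fake_count` are hypothetical, and the cap of 1500 is taken from the approximate figure above.

```python
from typing import Callable

RESULT_CAP = 1500  # approximate cap mentioned above; treat as an assumption


def split_price_bands(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Split the inclusive range [low, high] into bands with at most `cap` results each."""
    if count_results(low, high) <= cap or low >= high:
        # Small enough, or a single price point that cannot be split further.
        return [(low, high)]
    mid = (low + high) // 2
    return (split_price_bands(low, mid, count_results, cap)
            + split_price_bands(mid + 1, high, count_results, cap))


def fake_count(low: int, high: int) -> int:
    """Pretend there is exactly one listing at every price from 0 to 3999."""
    return max(0, min(high, 3999) - max(low, 0) + 1)


bands = split_price_bands(0, 3999, fake_count)
# bands == [(0, 999), (1000, 1999), (2000, 2999), (3000, 3999)]
```

In a real scraper `count_results` would be a network call, so a production version would likely cache counts and bound the recursion depth; both are left out here for brevity.
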
## Environment Variables
See `crawler/.env.sample` for the full list. Key ones:
- `DB_CONNECTION_STRING` — Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND` — Redis URLs
- `ROUTING_API_KEY` — Google Maps API key
- `RIGHTMOVE_*` — Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES` — JSON array of periodic scrape configs
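
A minimal sketch of reading such variables with backward-compatible defaults is shown below. The names `RIGHTMOVE_MAX_CONCURRENCY` and `RIGHTMOVE_REQUEST_DELAY_S`, the default values, and the schedule fields are hypothetical; the source only states that `RIGHTMOVE_*` variables exist and that `SCRAPE_SCHEDULES` is a JSON array. Check `crawler/.env.sample` for the real names.

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ScraperSettings:
    max_concurrency: int
    request_delay_s: float
    schedules: list[dict[str, Any]]


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Read settings from `env`, falling back to defaults when a variable is unset."""
    # Variable names below are assumptions made for this example.
    schedules = json.loads(env.get("SCRAPE_SCHEDULES", "[]"))
    if not isinstance(schedules, list):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array")
    return ScraperSettings(
        max_concurrency=int(env.get("RIGHTMOVE_MAX_CONCURRENCY", "5")),
        request_delay_s=float(env.get("RIGHTMOVE_REQUEST_DELAY_S", "1.0")),
        schedules=schedules,
    )


defaults = load_settings({})
custom = load_settings({
    "RIGHTMOVE_MAX_CONCURRENCY": "2",
    "SCRAPE_SCHEDULES": '[{"name": "nightly", "hour": 3}]',
})
```

Passing the environment in as a mapping, instead of reading `os.environ` directly inside the function body, lets tests exercise both the default and the overridden path without touching the process environment.
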
## Git Workflow
- CI: Drone CI builds Docker images on push to `master`, deploys to K8s.
- Linting: GitHub Actions runs Ruff on PR diffs.
- Keep commits focused — one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.
## Directories to Ignore
- `node_modules/`, `__pycache__/`, `.idea/`, `crawler/data/`, `venv/`, `_cache/`


@@ -0,0 +1,113 @@
---
name: build-and-push
description: |
Build Docker images for the API and frontend, and push them to Docker Hub.
Use when: (1) user wants to build new Docker images locally, (2) push images
to the registry before deploying, (3) tag images for a release. Covers both
the Python/FastAPI backend and the React/Nginx frontend.
author: Claude Code
version: 1.0.0
date: 2026-02-06
---
# Build and Push Docker Images
All commands run locally. Images are pushed to Docker Hub under the `viktorbarzin` namespace.
## Image Registries
| Component | Docker Hub repo | Dockerfile location |
|-----------|------------------------------------|------------------------------|
| API | `viktorbarzin/realestatecrawler` | `crawler/Dockerfile` |
| Frontend | `viktorbarzin/immoweb` | `crawler/frontend/Dockerfile`|
## Building Images
### Build API image
```bash
docker build -t viktorbarzin/realestatecrawler:latest crawler/
```
### Build Frontend image
```bash
docker build -t viktorbarzin/immoweb:latest crawler/frontend/
```
### Build both
```bash
docker build -t viktorbarzin/realestatecrawler:latest crawler/ && \
docker build -t viktorbarzin/immoweb:latest crawler/frontend/
```
### Build with a specific tag (recommended for production)
```bash
# Use git commit SHA
GIT_SHA=$(git rev-parse --short HEAD)
docker build -t viktorbarzin/realestatecrawler:${GIT_SHA} -t viktorbarzin/realestatecrawler:latest crawler/
docker build -t viktorbarzin/immoweb:${GIT_SHA} -t viktorbarzin/immoweb:latest crawler/frontend/
```
## Pushing Images
### Login to Docker Hub (if not already)
```bash
docker login -u viktorbarzin
```
### Push API image
```bash
docker push viktorbarzin/realestatecrawler:latest
```
### Push Frontend image
```bash
docker push viktorbarzin/immoweb:latest
```
### Push with specific tag
```bash
GIT_SHA=$(git rev-parse --short HEAD)
docker push viktorbarzin/realestatecrawler:${GIT_SHA}
docker push viktorbarzin/realestatecrawler:latest
docker push viktorbarzin/immoweb:${GIT_SHA}
docker push viktorbarzin/immoweb:latest
```
## Build and Push Everything (Full Release)
```bash
GIT_SHA=$(git rev-parse --short HEAD)
# Build
docker build -t viktorbarzin/realestatecrawler:${GIT_SHA} -t viktorbarzin/realestatecrawler:latest crawler/
docker build -t viktorbarzin/immoweb:${GIT_SHA} -t viktorbarzin/immoweb:latest crawler/frontend/
# Push
docker push viktorbarzin/realestatecrawler:${GIT_SHA}
docker push viktorbarzin/realestatecrawler:latest
docker push viktorbarzin/immoweb:${GIT_SHA}
docker push viktorbarzin/immoweb:latest
```
## CI/CD Note
Drone CI automatically builds and pushes images on push to `master` (see `.drone.yml`).
The manual process above is for when you need to build/push outside of CI, such as:
- Hotfix deployments
- Testing image builds locally before pushing
- Deploying from a non-master branch
## Notes
- The API Dockerfile installs system deps (OpenCV, Tesseract, MariaDB client) and Python deps via Poetry
- The Frontend Dockerfile is a multi-stage build: Node builder -> Nginx runtime
- Always tag with both `:latest` and a specific tag (git SHA or version) for traceability
- Use `docker buildx` for cross-platform builds if deploying to ARM nodes
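
If the double-tagging rule above needs to be scripted, one possible approach is to generate the `docker build` argument list from the repository name and commit SHA. The helpers `image_refs` and `build_command` are hypothetical and only assemble the command; they do not run Docker.

```python
def image_refs(repository: str, git_sha: str) -> list[str]:
    """Return the SHA-pinned and `latest` references for one image."""
    if not git_sha:
        raise ValueError("git_sha must not be empty")
    return [f"{repository}:{git_sha}", f"{repository}:latest"]


def build_command(repository: str, git_sha: str, context: str) -> list[str]:
    """Assemble a `docker build` argv with one -t flag per tag."""
    command = ["docker", "build"]
    for ref in image_refs(repository, git_sha):
        command += ["-t", ref]
    return command + [context]


# "abc1234" is a placeholder SHA, as in the examples elsewhere in this document.
api_build = build_command("viktorbarzin/realestatecrawler", "abc1234", "crawler/")
```

The resulting list can be handed to `subprocess.run(api_build, check=True)`; keeping command assembly separate from execution makes the tagging logic easy to unit test.
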


@@ -0,0 +1,158 @@
---
name: deploy-to-kubernetes
description: |
Deploy the realestate-crawler to the Kubernetes cluster. Use when: (1) user wants
to deploy after building new images, (2) rollout restart to pick up new images,
(3) check deployment status, pod health, or logs in production, (4) scale
deployments up or down, (5) debug production issues.
author: Claude Code
version: 1.0.0
date: 2026-02-06
---
# Deploy to Kubernetes
All kubectl commands run locally against the K8s cluster at `10.0.20.100:6443`.
Namespace: `realestate-crawler`.
## Deployments
| Deployment | Image | Component |
|--------------------------|------------------------------------|-----------|
| realestate-crawler-api | viktorbarzin/realestatecrawler | API |
| realestate-crawler-ui | viktorbarzin/immoweb | Frontend |
## Deploying New Images
### Rolling restart (picks up :latest after push)
```bash
# Restart API deployment
kubectl rollout restart deployment/realestate-crawler-api -n realestate-crawler
# Restart Frontend deployment
kubectl rollout restart deployment/realestate-crawler-ui -n realestate-crawler
# Restart both
kubectl rollout restart deployment/realestate-crawler-api deployment/realestate-crawler-ui -n realestate-crawler
```
### Full deploy workflow (build, push, restart)
```bash
GIT_SHA=$(git rev-parse --short HEAD)
# Build and push API
docker build -t viktorbarzin/realestatecrawler:${GIT_SHA} -t viktorbarzin/realestatecrawler:latest crawler/
docker push viktorbarzin/realestatecrawler:${GIT_SHA}
docker push viktorbarzin/realestatecrawler:latest
# Build and push Frontend
docker build -t viktorbarzin/immoweb:${GIT_SHA} -t viktorbarzin/immoweb:latest crawler/frontend/
docker push viktorbarzin/immoweb:${GIT_SHA}
docker push viktorbarzin/immoweb:latest
# Restart deployments to pick up new images
kubectl rollout restart deployment/realestate-crawler-api -n realestate-crawler
kubectl rollout restart deployment/realestate-crawler-ui -n realestate-crawler
```
### Deploy a specific image tag
```bash
# Set API to a specific image version
kubectl set image deployment/realestate-crawler-api \
realestate-crawler-api=viktorbarzin/realestatecrawler:abc1234 \
-n realestate-crawler
# Set Frontend to a specific version
kubectl set image deployment/realestate-crawler-ui \
realestate-crawler-ui=viktorbarzin/immoweb:abc1234 \
-n realestate-crawler
```
## Checking Deployment Status
```bash
# List all resources in namespace
kubectl get all -n realestate-crawler
# Check deployment status
kubectl get deployments -n realestate-crawler
# Check rollout status (waits for completion)
kubectl rollout status deployment/realestate-crawler-api -n realestate-crawler
kubectl rollout status deployment/realestate-crawler-ui -n realestate-crawler
# Check pods
kubectl get pods -n realestate-crawler
# Describe a specific pod (events, conditions, image)
kubectl describe pod <pod-name> -n realestate-crawler
# Check which image a pod is running
kubectl get pods -n realestate-crawler -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```
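
The tab-separated output of the last command above is easy to post-process. The sketch below parses it and lists pods that are not running an expected tag; `parse_pod_images`, `pods_not_on_tag`, and the sample pod names are hypothetical, and the sample text merely imitates the format that the jsonpath template produces (pod name, a tab, then space-separated images).

```python
def parse_pod_images(output: str) -> dict[str, list[str]]:
    """Map pod name -> container images from tab-separated `kubectl` output."""
    pods: dict[str, list[str]] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, images = line.partition("\t")
        pods[name] = images.split()
    return pods


def pods_not_on_tag(pods: dict[str, list[str]], tag: str) -> list[str]:
    """Return pods where at least one container image does not end with `:tag`."""
    return [
        name for name, images in pods.items()
        if any(not image.endswith(f":{tag}") for image in images)
    ]


# Made-up sample imitating the command's output format.
sample = (
    "realestate-crawler-api-7d9f\tviktorbarzin/realestatecrawler:abc1234\n"
    "realestate-crawler-ui-5c6b\tviktorbarzin/immoweb:latest\n"
)
stale = pods_not_on_tag(parse_pod_images(sample), "abc1234")
```

A check like this could run after `kubectl set image` to confirm that every pod has moved to the intended tag.
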
## Viewing Logs
```bash
# API logs
kubectl logs deployment/realestate-crawler-api -n realestate-crawler --tail=100 -f
# Frontend logs
kubectl logs deployment/realestate-crawler-ui -n realestate-crawler --tail=100 -f
# Logs from a specific pod
kubectl logs <pod-name> -n realestate-crawler --tail=100 -f
# Previous container logs (if pod crashed/restarted)
kubectl logs <pod-name> -n realestate-crawler --previous
```
## Scaling
```bash
# Scale API
kubectl scale deployment/realestate-crawler-api --replicas=2 -n realestate-crawler
# Scale down (e.g., for maintenance)
kubectl scale deployment/realestate-crawler-api --replicas=0 -n realestate-crawler
```
## Rollback
```bash
# View rollout history
kubectl rollout history deployment/realestate-crawler-api -n realestate-crawler
# Rollback to previous version
kubectl rollout undo deployment/realestate-crawler-api -n realestate-crawler
# Rollback to specific revision
kubectl rollout undo deployment/realestate-crawler-api --to-revision=3 -n realestate-crawler
```
## Debugging Production Issues
```bash
# Exec into a running API pod
kubectl exec -it deployment/realestate-crawler-api -n realestate-crawler -- bash
# Run a one-off command
kubectl exec deployment/realestate-crawler-api -n realestate-crawler -- python -c "print('hello')"
# Check pod events (useful for crash loops, image pull errors)
kubectl get events -n realestate-crawler --sort-by='.lastTimestamp' | tail -20
# Port-forward to a pod for local debugging
kubectl port-forward deployment/realestate-crawler-api 5001:5001 -n realestate-crawler
```
## Notes
- Drone CI handles automated deployments on push to `master` (see `.drone.yml`)
- Use manual deployment for hotfixes, testing, or deploying from non-master branches
- The K8s cluster is at `10.0.20.100:6443` (context: `kubernetes-admin@kubernetes`)
- If pods aren't picking up new `:latest` images, check the `kubernetes-latest-tag-image-pull` skill
- Always verify the rollout completed with `kubectl rollout status` after deploying


@@ -0,0 +1,144 @@
---
name: dev-environment
description: |
Start, stop, rebuild, and manage the local Docker Compose development environment
for the realestate-crawler project. Use when: (1) user wants to start/stop the dev
environment, (2) needs to rebuild after code changes, (3) wants to check service
status or view logs, (4) needs to run database migrations.
author: Claude Code
version: 1.0.0
date: 2026-02-06
---
# Dev Environment Management
Docker Compose orchestrates the dev environment locally from `crawler/`. All project
commands (pytest, alembic, mypy, python, etc.) must run inside the `app` container
via `docker compose exec app <command>`. Only docker/kubectl commands run on the host.
## Starting the Dev Environment
```bash
# Start all services (Redis, MySQL, API, Celery worker, Celery beat)
cd crawler && docker compose up
# Start in detached mode (background)
cd crawler && docker compose up -d
# Rebuild images and start (after Dockerfile or dependency changes)
cd crawler && docker compose up --build
# Or use the start.sh helper
cd crawler && ./start.sh # foreground
cd crawler && ./start.sh --build # rebuild first
```
## Stopping the Dev Environment
```bash
cd crawler && docker compose down
# Also remove volumes (fresh database, fresh Redis)
cd crawler && docker compose down -v
# Or use the helper
cd crawler && ./start.sh --down
```
## Checking Status
```bash
# List running containers
cd crawler && docker compose ps
# Check health status
cd crawler && docker compose ps --format "table {{.Name}}\t{{.Status}}"
```
## Viewing Logs
```bash
# Follow all service logs
cd crawler && docker compose logs -f
# Follow specific service logs
cd crawler && docker compose logs -f app
cd crawler && docker compose logs -f celery
cd crawler && docker compose logs -f celery-beat
cd crawler && docker compose logs -f mysql
cd crawler && docker compose logs -f redis
# Or use the helper
cd crawler && ./start.sh --logs
```
## Restarting Individual Services
```bash
# Restart just the API (e.g., after config change)
cd crawler && docker compose restart app
# Restart Celery worker
cd crawler && docker compose restart celery
# Rebuild and restart a single service
cd crawler && docker compose up --build app
```
## Running Database Migrations
```bash
# Apply pending migrations
cd crawler && docker compose exec app alembic upgrade head
# Create a new migration
cd crawler && docker compose exec app alembic revision -m "description"
```
## Running Tests Inside Container
```bash
cd crawler && docker compose exec app pytest tests/ -v --cov=. --cov-report=term-missing
```
## Running Any Command Inside Container
All project commands must be run inside the `app` container:
```bash
# General pattern
cd crawler && docker compose exec app <command>
# Examples
cd crawler && docker compose exec app python main.py dump-listings --type rent
cd crawler && docker compose exec app mypy .
cd crawler && docker compose exec app ruff check .
cd crawler && docker compose exec app poetry install
cd crawler && docker compose exec app bash # interactive shell
```
## Services and Ports
| Service | Container | Port | Description |
|-------------|-----------------|-------|--------------------------------|
| redis | rec-redis | 6379 | Celery broker + GeoJSON cache |
| mysql | rec-mysql | 3306 | Primary database |
| app | rec-app | 5001 | FastAPI server (hot-reload) |
| celery | rec-celery | - | Background task worker |
| celery-beat | rec-celery-beat | - | Periodic task scheduler |
## Environment Variables
Key env vars are set in `docker-compose.yml`. To override locally, create a `.env` file
in `crawler/` (see `.env.sample`). Key overrides:
- `ROUTING_API_KEY` - Google Maps API key (passed from host env)
- `SCRAPE_SCHEDULES` - JSON array of periodic scrape configs (passed from host env)
## Notes
- The API server has hot-reload enabled for `api/`, `services/`, `repositories/`, and `models/` directories
- Source code is bind-mounted into containers, so local edits are reflected immediately
- Python virtualenv is stored in a named Docker volume (`app_venv`) shared across app, celery, and celery-beat
- MySQL data persists in the `mysql_data` volume; Redis data in `redis_data`
- Use `docker compose down -v` to reset all data (volumes)