Was: Woodpecker built+pushed to DockerHub, then `kubectl set image` patched
the four Deployments to a pinned numeric tag. With Deployments pinned to
:51 (immutable tag), Keel polled forever and never saw a digest bump — and
no DockerHub pull-secret meant Keel hit 401 on the private repo at every
poll. The 4-Deployment setup also had a latent ImagePullBackOff risk: if a
node was replaced, fresh pulls would fail.
Now: GHA builds+pushes (.github/workflows/build-{api,frontend}.yml) on push
to master. Cluster Deployments reference :latest with an imagePullSecret
sourced from Vault via ESO (codified in infra/stacks/real-estate-crawler/
main.tf, separate commit). Keel polls :latest, sees the new digest after
each GHA build, and rolls all four Deployments.
- .github/workflows/build-api.yml: pytest (unit + integration/regression/
e2e/test_listing_geojson) + buildx push viktorbarzin/realestatecrawler
to {<8-char-sha>, latest}.
- .github/workflows/build-frontend.yml: vitest (all 4 ex-shards in one
run) + Vite build with VITE_MAPBOX_TOKEN from GHA secret + buildx push
viktorbarzin/immoweb to {<8-char-sha>, latest}.
- .woodpecker/{api,frontend}.yml renamed to
.woodpecker/build-fallback-{api,frontend}.yml with `event: deployment`
so they no longer fire on push — kept as manual-only fallback if GHA
is down (CLAUDE.md convention from the 10 already-migrated projects).
- .claude/CLAUDE.md: Git Workflow section updated to reflect GHA as
primary + the dockerhub-pull-secret wiring.
GHA repo secrets DOCKERHUB_TOKEN and MAPBOX_TOKEN populated from Vault
fields viktor.dockerhub_registry_password and ci/global.wrongmove-mapbox-token
respectively (DOCKERHUB_USERNAME=viktorbarzin was already set).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Realestate Crawler
Project Overview
A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI.
Command Execution
All commands run inside Docker containers. The dev environment uses Docker Compose — start it first, then exec into containers for any operations.
- Infrastructure commands (`docker compose`, `kubectl`): run locally on the Mac
- All project commands (pytest, poetry, alembic, python, mypy, ruff, etc.): run inside the `app` container via `docker compose exec app <command>`
See .claude/skills/ for detailed skills on dev environment, building, and deploying.
Quick Reference
| Action | Command |
|---|---|
| Start all services | ./start.sh |
| Rebuild & start | ./start.sh --build |
| Stop services | ./start.sh --down |
| Run tests | pytest tests/ -v --cov=. --cov-report=term-missing |
| Type check | mypy . |
| Format code | yapf --style .style.yapf --recursive . |
| Lint (CI runs Ruff) | ruff check . |
| DB migration | alembic upgrade head |
| New migration | alembic revision -m "description" |
| Frontend dev | cd frontend && ./start.sh |
Tech Stack
- Backend: Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- Frontend: React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- Database: MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- Infrastructure: Docker Compose (dev), Kubernetes (prod), GitHub Actions CI (Woodpecker as manual fallback)
Code Conventions
- Type checking: Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- Formatting: YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- Async: Scraper layer uses `async/await` throughout with `aiohttp`.
- Models: SQLModel for DB entities, Pydantic for request/response validation.
- Service layer: `services/` contains unified handlers used by both CLI and API. Add new business logic here, not in API routes or CLI commands directly.
- Repository pattern: `repositories/` for database queries. Don't put raw SQL in services or API.
- Tests: pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.
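The service-layer and repository conventions above can be illustrated with a minimal, fully typed sketch. Every name here (`Listing`, `ListingRepository`, `InMemoryListingRepository`, `cheapest_listings`) is hypothetical and not taken from the codebase; the real project uses SQLModel entities and database-backed repositories rather than a dataclass and an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Listing:
    """Hypothetical stand-in for a SQLModel entity such as RentListing."""

    listing_id: int
    price: int


class ListingRepository(Protocol):
    """Hypothetical repository interface: all data access sits behind it."""

    def find_at_or_below(self, max_price: int) -> list[Listing]:
        ...


class InMemoryListingRepository:
    """Illustrative repository; a real one would query the database."""

    def __init__(self, listings: list[Listing]) -> None:
        self._listings = listings

    def find_at_or_below(self, max_price: int) -> list[Listing]:
        return [item for item in self._listings if item.price <= max_price]


def cheapest_listings(
    repo: ListingRepository, max_price: int, limit: int
) -> list[Listing]:
    """Service-layer function: business logic only, no queries, no HTTP."""
    matches = repo.find_at_or_below(max_price)
    return sorted(matches, key=lambda item: item.price)[:limit]
```

Because the service function depends only on the repository interface, a CLI command and an API route could both call it unchanged, and unit tests can pass in an in-memory repository.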
Architecture Layers
API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
       ↑                       ↑                             ↑
  Auth (OIDC)          Core Logic (rec/)            SQLModel (models/)
                               ↑
                     Celery Tasks (tasks/)
- `rec/`: Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/`: Orchestration. Query splitting, listing fetching, caching.
- `api/`: FastAPI routes, auth middleware, metrics.
- `models/`: `RentListing`, `BuyListing` SQLModel entities.
- `tasks/`: Celery background tasks with Redis broker.
- `config/`: Env-var-based configuration (scraper settings, schedules).
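For readers unfamiliar with the circuit breaker mentioned under `rec/`, the general pattern can be sketched as follows. This is not the contents of `rec/circuit_breaker.py`; the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not the project's class)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Closed: always allow. Open: allow a trial call once the cooldown passed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._reset_after_s

    def call(self, fn: Callable[[], T]) -> T:
        if not self.allow():
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = fn()
        except Exception:
            # Count the failure; at the threshold the breaker (re)opens.
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound request in `breaker.call(...)` means that after repeated failures the scraper stops sending requests for the cooldown period instead of continuing to hit a rate-limited endpoint.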
Key Design Decisions
- Query splitting works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- Circuit breaker (`rec/circuit_breaker.py`) and throttle detection (`rec/throttle_detector.py`) protect against rate limiting.
- Streaming API uses NDJSON for progressive loading of large result sets.
- Redis serves dual duty as Celery broker and GeoJSON cache.
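The price-band part of query splitting can be sketched as a recursive bisection. This is an illustrative reconstruction of the idea only, not the code in `services/query_splitter.py`: the function name, the `count_results` callback, and the exact cap value are assumptions, and the district and bedroom dimensions are omitted.

```python
from typing import Callable

RESULT_CAP = 1500  # approximate cap quoted above; the exact value is an assumption


def split_price_bands(
    low: int,
    high: int,
    count_results: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Return inclusive price bands whose result counts each fit under the cap.

    count_results(low, high) is a hypothetical callback reporting how many
    listings a query for that band would return. A band that cannot be
    narrowed further (low == high) is returned even if it exceeds the cap.
    """
    if count_results(low, high) <= cap or low >= high:
        return [(low, high)]
    mid = (low + high) // 2
    # Bisect the band and recurse on each half until every band fits.
    return split_price_bands(low, mid, count_results, cap) + split_price_bands(
        mid + 1, high, count_results, cap
    )
```

With a uniform toy distribution of one listing per price unit, `split_price_bands(0, 3999, lambda lo, hi: hi - lo + 1)` yields four bands of 1000 listings each; dense price ranges are split more finely than sparse ones because the recursion only descends where the count exceeds the cap.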
Environment Variables
See .env.sample for the full list. Key ones:
- `DB_CONNECTION_STRING`: Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`: Redis URLs
- `ROUTING_API_KEY`: Google Maps API key
- `RIGHTMOVE_*`: Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES`: JSON array of periodic scrape configs
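Since `SCRAPE_SCHEDULES` carries JSON inside an environment variable, it helps to validate it at startup. The sketch below is a hypothetical loader, not the project's `config/` code: it assumes each array element is a JSON object and that an unset variable means no schedules, and it makes no assumption about the keys inside each config.

```python
import json
import os
from typing import Any, Mapping


def load_scrape_schedules(
    env: Mapping[str, str] = os.environ,
) -> list[dict[str, Any]]:
    """Parse SCRAPE_SCHEDULES into a list of config dicts (hypothetical helper)."""
    raw = env.get("SCRAPE_SCHEDULES", "[]")  # assumption: unset means no schedules
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(parsed, list) or not all(
        isinstance(item, dict) for item in parsed
    ):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array of objects")
    return parsed
```

Passing the environment as a parameter keeps the function testable without mutating `os.environ`.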
Git Workflow
- CI: GitHub Actions (`.github/workflows/build-api.yml`, `build-frontend.yml`) builds + pushes Docker images to DockerHub on `master` push. Keel in the cluster watches `:latest` on `viktorbarzin/realestatecrawler` and `viktorbarzin/immoweb` and rolls the four `realestate-crawler-*` Deployments on digest change. No Woodpecker deploy POST; Keel is the rollout mechanism.
- Pull-secret on the namespace: `dockerhub-pull-secret`, synced from Vault `secret/viktor.dockerhub_registry_password` via ExternalSecret (codified in `infra/stacks/real-estate-crawler/main.tf`). Required because the DockerHub repos are private.
- Fallback: `.woodpecker/build-fallback-{api,frontend}.yml` (`event: deployment`, manual-only) preserves the in-cluster build path if GHA is down.
- Linting: GitHub Actions runs Ruff on PR diffs (`.github/workflows/ruff.yml`).
- Keep commits focused: one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.
Directories to Ignore
`node_modules/`, `__pycache__/`, `.idea/`, `data/`, `venv/`, `_cache/`