wrongmove/.claude/CLAUDE.md
Viktor Barzin c2acbf5d2e CI: migrate Docker build/push from Woodpecker to GitHub Actions
Was: Woodpecker built+pushed to DockerHub, then `kubectl set image` patched
the four Deployments to a pinned numeric tag. With Deployments pinned to
:51 (immutable tag), Keel polled forever and never saw a digest bump — and
no DockerHub pull-secret meant Keel hit 401 on the private repo at every
poll. The 4-Deployment setup also had a latent ImagePullBackOff risk: if a
node was replaced, fresh pulls would fail.

Now: GHA builds+pushes (.github/workflows/build-{api,frontend}.yml) on push
to master. Cluster Deployments reference :latest with an imagePullSecret
sourced from Vault via ESO (codified in infra/stacks/real-estate-crawler/
main.tf, separate commit). Keel polls :latest, sees the new digest after
each GHA build, and rolls all four Deployments.

- .github/workflows/build-api.yml: pytest (unit + integration/regression/
  e2e/test_listing_geojson) + buildx push viktorbarzin/realestatecrawler
  to {<8-char-sha>, latest}.
- .github/workflows/build-frontend.yml: vitest (all 4 ex-shards in one
  run) + Vite build with VITE_MAPBOX_TOKEN from GHA secret + buildx push
  viktorbarzin/immoweb to {<8-char-sha>, latest}.
- .woodpecker/{api,frontend}.yml renamed to
  .woodpecker/build-fallback-{api,frontend}.yml with `event: deployment`
  so they no longer fire on push — kept as manual-only fallback if GHA
  is down (CLAUDE.md convention from the 10 already-migrated projects).
- .claude/CLAUDE.md: Git Workflow section updated to reflect GHA as
  primary + the dockerhub-pull-secret wiring.

GHA repo secrets DOCKERHUB_TOKEN and MAPBOX_TOKEN populated from Vault
fields viktor.dockerhub_registry_password and ci/global.wrongmove-mapbox-token
respectively (DOCKERHUB_USERNAME=viktorbarzin was already set).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 19:11:31 +00:00


# Realestate Crawler
## Project Overview
A real estate listing aggregation platform that scrapes Rightmove UK listings, extracts square meter data from floorplan images via OCR, calculates transit routes, and serves an interactive map-based web UI.
## Command Execution
**All project commands run inside Docker containers.** The dev environment uses Docker Compose: start it first, then exec into the `app` container for project operations.
- **Infrastructure commands** (docker compose, kubectl) — Run locally on the Mac
- **All project commands** (pytest, poetry, alembic, python, mypy, ruff, etc.) — Run inside the `app` container via `docker compose exec app <command>`
See `.claude/skills/` for detailed skills on dev environment, building, and deploying.
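
The split above (infrastructure commands on the host, project commands in the `app` container) can be sketched as a small helper. This is a minimal illustration, not a script that exists in the repo; the names `app_exec_argv` and `run_in_app` are hypothetical.

```python
import shlex
import subprocess
from collections.abc import Sequence


def app_exec_argv(command: Sequence[str], service: str = "app") -> list[str]:
    """Build the argv that runs a project command inside a Compose service."""
    if not command:
        raise ValueError("command must not be empty")
    return ["docker", "compose", "exec", service, *command]


def run_in_app(command: Sequence[str]) -> int:
    """Run the command in the container and return its exit code."""
    argv = app_exec_argv(command)
    # Echo the full command so it is clear what ran and where.
    print("+", shlex.join(argv))
    return subprocess.run(argv, check=False).returncode
```

For example, `run_in_app(["pytest", "tests/", "-v"])` would invoke `docker compose exec app pytest tests/ -v`, assuming the Compose stack is already up.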
## Quick Reference
| Action | Command |
|--------|---------|
| Start all services | `./start.sh` |
| Rebuild & start | `./start.sh --build` |
| Stop services | `./start.sh --down` |
| Run tests | `pytest tests/ -v --cov=. --cov-report=term-missing` |
| Type check | `mypy .` |
| Format code | `yapf --style .style.yapf --recursive .` |
| Lint (CI runs Ruff) | `ruff check .` |
| DB migration | `alembic upgrade head` |
| New migration | `alembic revision -m "description"` |
| Frontend dev | `cd frontend && ./start.sh` |
## Tech Stack
- **Backend:** Python 3.13, FastAPI, SQLModel (SQLAlchemy 2), Celery + Redis, pytesseract/OpenCV
- **Frontend:** React 19, TypeScript, Vite, Tailwind CSS, Radix UI, Mapbox GL
- **Database:** MySQL 9 (prod) / SQLite (local dev), Alembic migrations
- **Infrastructure:** Docker Compose (dev), Kubernetes (prod), GitHub Actions CI (Woodpecker as a manual fallback)
## Code Conventions
- **Type checking:** Strict mypy (`disallow_untyped_defs=true`). All functions need type annotations.
- **Formatting:** YAPF (`.style.yapf` config) + Ruff linter (CI on PRs).
- **Async:** Scraper layer uses `async/await` throughout with `aiohttp`.
- **Models:** SQLModel for DB entities, Pydantic for request/response validation.
- **Service layer:** `services/` contains unified handlers used by both CLI and API — add new business logic here, not in API routes or CLI commands directly.
- **Repository pattern:** `repositories/` for database queries. Don't put raw SQL in services or API.
- **Tests:** pytest with `pytest-asyncio` (auto mode). Unit tests in `tests/unit/`, integration in `tests/integration/`.
## Architecture Layers
```
API (api/app.py)  ←→  Services (services/)  ←→  Repositories (repositories/)
       ↑                       ↑                             ↑
  Auth (OIDC)          Core Logic (rec/)            SQLModel (models/)
                     Celery Tasks (tasks/)
```
- `rec/` — Core scraping, OCR, routing logic. Contains circuit breaker and throttle detection.
- `services/` — Orchestration. Query splitting, listing fetching, caching.
- `api/` — FastAPI routes, auth middleware, metrics.
- `models/` — `RentListing`, `BuyListing` SQLModel entities.
- `tasks/` — Celery background tasks with Redis broker.
- `config/` — Env-var-based configuration (scraper settings, schedules).
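
The circuit breaker mentioned under `rec/` follows a well-known pattern that can be sketched as below. This is a generic closed/open/half-open breaker, not the contents of `rec/circuit_breaker.py`; the class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("upstream is cooling down")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state re-opens immediately;
            # otherwise the breaker opens once the threshold is reached.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each outbound request in `breaker.call(...)` means that after repeated failures the scraper stops hitting the upstream for `reset_timeout` seconds, then lets a single trial request through before resuming.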
## Key Design Decisions
- **Query splitting** works around Rightmove's ~1500-result API cap by adaptively splitting queries by district, bedrooms, and price bands (binary search). See `services/query_splitter.py`.
- **Circuit breaker** (`rec/circuit_breaker.py`) and **throttle detection** (`rec/throttle_detector.py`) protect against rate limiting.
- **Streaming API** uses NDJSON for progressive loading of large result sets.
- **Redis** serves dual duty as Celery broker and GeoJSON cache.
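
The price-band part of query splitting can be sketched as a recursive bisection: keep halving a band until its result count fits under the cap. This is a simplified illustration, not the code in `services/query_splitter.py`; `split_price_bands` and `make_counter` are hypothetical names, the counter runs against an in-memory list rather than a live API, and the district and bedroom dimensions are left out.

```python
from bisect import bisect_left
from collections.abc import Callable

RESULT_CAP = 1500  # approximate per-query cap described above


def split_price_bands(
    low: int,
    high: int,
    count: Callable[[int, int], int],
    cap: int = RESULT_CAP,
) -> list[tuple[int, int]]:
    """Return half-open [low, high) bands whose result counts fit under cap."""
    # A band of width 1 cannot be halved further; a real splitter would
    # have to fall back to another dimension at that point.
    if count(low, high) <= cap or high - low <= 1:
        return [(low, high)]
    mid = (low + high) // 2
    return split_price_bands(low, mid, count, cap) + split_price_bands(
        mid, high, count, cap
    )


def make_counter(prices: list[int]) -> Callable[[int, int], int]:
    """Simulate a result-count lookup over a list of listing prices."""
    ordered = sorted(prices)

    def count(low: int, high: int) -> int:
        return bisect_left(ordered, high) - bisect_left(ordered, low)

    return count
```

With six listings priced `[500, 500, 500, 1000, 1000, 2000]` and a cap of 3, `split_price_bands(0, 4000, make_counter(prices), cap=3)` yields the bands `(0, 1000)`, `(1000, 2000)`, and `(2000, 4000)`, each of which can be fetched in full.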
## Environment Variables
See `.env.sample` for the full list. Key ones:
- `DB_CONNECTION_STRING` — Database URL
- `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND` — Redis URLs
- `ROUTING_API_KEY` — Google Maps API key
- `RIGHTMOVE_*` — Scraper tuning (concurrency, delays, thresholds, proxy)
- `SCRAPE_SCHEDULES` — JSON array of periodic scrape configs
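
A minimal sketch of reading a few of these variables, including the JSON-encoded `SCRAPE_SCHEDULES`. It is not the project's `config/` module; the `Settings` dataclass and `load_settings` function are hypothetical, and because the shape of each schedule entry is not specified here, entries are only checked to be JSON objects.

```python
import json
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Settings:
    db_connection_string: str
    celery_broker_url: str
    scrape_schedules: list[dict[str, Any]]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping; defaults to the process environment."""
    # Treat an unset SCRAPE_SCHEDULES as "no periodic scrapes" (an assumption).
    schedules = json.loads(env.get("SCRAPE_SCHEDULES", "[]"))
    if not isinstance(schedules, list) or not all(
        isinstance(entry, dict) for entry in schedules
    ):
        raise ValueError("SCRAPE_SCHEDULES must be a JSON array of objects")
    return Settings(
        # Required variables raise KeyError when missing, so misconfiguration
        # fails at startup rather than at first use.
        db_connection_string=env["DB_CONNECTION_STRING"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        scrape_schedules=schedules,
    )
```

Passing the environment in as a mapping keeps the function easy to unit test without touching `os.environ`.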
## Git Workflow
- **CI: GitHub Actions** (`.github/workflows/build-api.yml`, `build-frontend.yml`) builds + pushes Docker images to DockerHub on `master` push. **Keel** in the cluster watches `:latest` on `viktorbarzin/realestatecrawler` and `viktorbarzin/immoweb` and rolls the four `realestate-crawler-*` Deployments on digest change. No Woodpecker deploy POST — Keel is the rollout mechanism.
- Pull-secret on the namespace: `dockerhub-pull-secret`, synced from Vault `secret/viktor.dockerhub_registry_password` via ExternalSecret (codified in `infra/stacks/real-estate-crawler/main.tf`). Required because the DockerHub repos are private.
- Fallback: `.woodpecker/build-fallback-{api,frontend}.yml` (event: `deployment`, manual-only) preserves the in-cluster build path if GHA is down.
- Linting: GitHub Actions runs Ruff on PR diffs (`.github/workflows/ruff.yml`).
- Keep commits focused — one logical change per commit.
- Group related files (e.g., code + its tests) in the same commit.
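
The reason Keel watches `:latest` rather than a pinned tag can be shown with a toy model: each build pushes a short-SHA tag plus `latest`, and only the moving tag ever resolves to a new digest. This is a simplified simulation, not Keel's implementation or a registry client; `FakeRegistry`, `image_tags`, and `needs_rollout` are hypothetical names.

```python
import hashlib


def image_tags(repo: str, commit_sha: str) -> list[str]:
    """Tags pushed per build: the 8-character short SHA and the moving latest."""
    return [f"{repo}:{commit_sha[:8]}", f"{repo}:latest"]


def needs_rollout(running_digest: str, registry_digest: str) -> bool:
    """A digest-watching poller acts only when the tag resolves to a new digest."""
    return running_digest != registry_digest


class FakeRegistry:
    """In-memory stand-in for an image registry, mapping tags to digests."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def push(self, repo: str, commit_sha: str, image_bytes: bytes) -> None:
        digest = "sha256:" + hashlib.sha256(image_bytes).hexdigest()
        for tag in image_tags(repo, commit_sha):
            self._digests[tag] = digest

    def resolve(self, tag: str) -> str:
        return self._digests[tag]
```

After a second push, `latest` resolves to a different digest while the first build's short-SHA tag still resolves to the old one, so a Deployment pinned to a fixed tag never sees a change to roll out.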
## Directories to Ignore
- `node_modules/`, `__pycache__/`, `.idea/`, `data/`, `venv/`, `_cache/`