diff --git a/docs/plans/2026-05-28-reddit-examples-design.md b/docs/plans/2026-05-28-reddit-examples-design.md
new file mode 100644
index 0000000..8ac7b77
--- /dev/null
+++ b/docs/plans/2026-05-28-reddit-examples-design.md
@@ -0,0 +1,250 @@
+# FIRE Reddit Examples Ingest — Design
+
+**Status**: approved 2026-05-28
+**Owner**: Viktor
+**Module**: `fire_planner/examples/`
+
+## Goal
+
+Populate the FIRE planner DB with real-world examples of people pursuing
+or living FIRE across different countries, so the planner can answer
+questions like:
+
+- *"What's the median portfolio for people FIRE'd in the Philippines?"*
+- *"For a £400k portfolio, where are people living?"*
+- *"What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"*
+
+The data is **informational only** — it feeds a new `/api/examples`
+endpoint and a small overlay block in the simulator response. It does
+**not** drive scenario inputs or COL ratios.
+
+## Non-goals
+
+- Comments scraping (posts only — comments are ~10× noisier)
+- Driving COL ratios or scenario priors from scraped data (overlay only)
+- Continuous streaming (weekly CronJob is enough)
+- Multi-language (English subs only)
+- Re-running tax / withdrawal logic on the scraped examples
+
+## Decisions (locked from brainstorming session)
+
+| Axis | Decision |
+|---|---|
+| Fields per example | Country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coast / barista / lean / FIRE / fat), is_retired, post_url, source_sub, post_date |
+| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
+| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
+| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
+| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
+| Reddit access | PRAW with existing app creds in Vault `secret/viktor:trading_bot_reddit_client_id` / `_secret`. user-agent: `fire-planner/0.1` |
+| Parallelism | Python asyncio (`asyncio.gather` across subs); not literal subagent dispatch |
+| Module layout | New `fire_planner/examples/` mirroring `col/` shape |
+| Simulator integration | Informational overlay only — simulator response gains `examples_overlay {median, p25, p75, count, sample_links[]}` keyed by target country |
+
+## Architecture
+
+```
+  ┌─────────────────────────────────────────────────────────────┐
+  │  K8s Job (bulk one-shot) + K8s CronJob (weekly delta)       │
+  │      ↓                                                      │
+  │  fire_planner.examples.cli                                  │
+  │    ├─→ praw_source.py  (async PRAW, 12 subs in parallel)    │
+  │    │     gather() top-of-all-time + top-of-year             │
+  │    ├─→ filters.py      (cheap regex pre-filter)             │
+  │    ├─→ llm_extract.py  (qwen3-8b primary, schema-locked)    │
+  │    │     └─ Tier 2 fallback: claude-agent-service           │
+  │    └─→ service.py      (upsert into fire_planner.           │
+  │                         fire_example, dedupe by reddit_id)  │
+  │      ↓                                                      │
+  │  Postgres (pg-cluster-rw, schema=fire_planner)              │
+  │      ↓                                                      │
+  │  fire_planner.api.examples (FastAPI router)                 │
+  │    ├─→ GET /examples?country=PH&fi_status=FIRE              │
+  │    └─→ GET /examples/summary?country=PH                     │
+  │          { median, p25, p75, count, sample_links[] }        │
+  │      ↓                                                      │
+  │  Simulator response gains `examples_overlay` per scenario   │
+  └─────────────────────────────────────────────────────────────┘
+```
+
+## Module layout
+
+```
+fire_planner/examples/
+  __init__.py
+  models.py          # SQLAlchemy ORM + Pydantic schemas
+  praw_source.py     # async PRAW wrapper → RawPost
+  filters.py         # MONEY_RE + LOCATION_RE keyword pre-filter
+  llm_extract.py     # qwen3-8b call → ExtractedExample + confidence
+                     # with claude-agent-service Tier 2 fallback
+  service.py         # upsert, dedupe, summary queries
+  cli.py             # `python -m fire_planner.examples ingest …`
+fire_planner/api/examples.py  # FastAPI router
+```
+
+## Data model
+
+Migration `alembic/versions/0006_fire_examples.py`:
+
+```sql
+CREATE TABLE fire_planner.fire_example (
+    id              SERIAL PRIMARY KEY,
+    reddit_id       VARCHAR(16) NOT NULL UNIQUE,  -- e.g. "abc123"
+    source_sub      VARCHAR(64) NOT NULL,
+    post_url        TEXT NOT NULL,
+    post_date       DATE NOT NULL,
+    post_title      TEXT NOT NULL,
+    -- extracted fields
+    country         VARCHAR(64),                  -- ISO country name or "unknown"
+    city            VARCHAR(128),
+    portfolio_gbp   NUMERIC(14,2),
+    annual_exp_gbp  NUMERIC(12,2),
+    age             SMALLINT,
+    family_size     SMALLINT,
+    fi_status       VARCHAR(24),                  -- accumulating|coastFIRE|
+                                                  -- baristaFIRE|leanFIRE|
+                                                  -- FIRE|fatFIRE|unknown
+    is_retired      BOOLEAN,
+    raw_currency    CHAR(3),                      -- pre-normalisation currency
+    -- extraction metadata
+    raw_excerpt     TEXT,                         -- ~500-char window
+                                                  -- that produced the data
+    llm_model       VARCHAR(64) NOT NULL,         -- "qwen3-8b" | "claude-…"
+    llm_confidence  NUMERIC(3,2),                 -- 0.00–1.00
+    extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
+    -- audit
+    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+CREATE INDEX ix_fire_example_country ON fire_planner.fire_example(country);
+CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
+CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);
+```
+
+Idempotent re-ingest via `reddit_id` UNIQUE. ON CONFLICT DO NOTHING by
+default; `--reextract` CLI flag re-runs the LLM and overwrites.
+Currency normalisation at extraction time via existing `fire_planner/fx.py`.
+
+## Data flow
+
+1. **CLI/Job spawns** — reads target sub list (default 12) and `top`
+   modes from config / flags
+2. **Fan-out** — `asyncio.gather()` one coroutine per subreddit; each
+   fetches the requested PRAW listings, dedupes by id within the sub,
+   returns ~1100 raw posts per sub for the bulk run
+3. **`filters.py`** — keep posts whose title+body match BOTH
+   `MONEY_RE` (`$|£|€|GBP|USD|EUR|million|net worth|portfolio`) AND
+   `LOCATION_RE` (country/city keyword list). Expected survival ~10–30 %
+4. **`llm_extract.py`** — POST `{title, body, source_sub}` to
+   llama-cpp endpoint with strict JSON-schema prompt. Returns
+   `ExtractedExample(country, city, portfolio_native, annual_exp_native,
+   raw_currency, age, family_size, fi_status, is_retired, confidence)`
+5. **Confidence gate** — `confidence < 0.5` OR JSON parse failure →
+   retry once via claude-agent-service with the same prompt. If still
+   fails, log + skip (counted, never inserted)
+6. **Currency normalisation** — `fx.py` → `portfolio_gbp`,
+   `annual_exp_gbp`. Spot rate at `post_date` if available, else today
+7. **Upsert** by `reddit_id` (ON CONFLICT DO NOTHING; `--reextract`
+   forces UPDATE)
+8. **Prometheus counters**:
+   - `fire_examples_scraped_total{sub}`
+   - `fire_examples_extracted_total{sub,confidence_bucket}`
+   - `fire_examples_llm_fallback_total`
+   - `fire_examples_extract_failed_total{reason}`
+
+## Error handling
+
+| Failure | Behaviour |
+|---|---|
+| PRAW rate limit | Built-in PRAW exponential back-off; emit `fire_examples_rate_limited_total` |
+| llama-cpp down | Fall through to claude-agent-service Tier 2 |
+| claude-agent-service down | Log + skip; record `fire_examples_extract_failed_total{reason="llm_unavailable"}` for later replay |
+| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
+| One subreddit fails entirely | Other 11 still complete (`gather(..., return_exceptions=True)`). Job exits 0 if ≥half succeed; otherwise exit 2 |
+| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
+| FX rate lookup fails | Insert row with NULL `portfolio_gbp` / `annual_exp_gbp`; record `raw_currency` always |
+
+## API surface
+
+```
+GET /api/examples?country=PH&fi_status=FIRE&limit=50
+  → list of FireExample objects (post_url, portfolio_gbp, ...)
+
+GET /api/examples/summary?country=PH
+  → {
+      country: "PH",
+      count: 47,
+      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
+      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
+      sample_links: ["https://reddit.com/...", ...] // top 5
+    }
+```
+
+Simulator response gains an `examples_overlay` block keyed by the
+scenario's target country, calling the same summary query under the
+hood. No new auth — same FastAPI router and auth dependency as the
+rest of the API.
+
+## Testing
+
+- **Unit**
+  - `filters.py` regex coverage (money + location keywords; positives + negatives)
+  - `llm_extract.py` with `respx` mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
+  - Currency normalisation paths via `fx.py` (incl. FX-fetch failure)
+- **Integration** (against test PG, like existing `test_ingest_wealthfolio_pg.py`)
+  - `service.py` upsert / dedupe / `--reextract` paths
+  - Summary query: median / p25 / p75 over realistic mixed dataset
+- **Fixture-driven regression suite**
+  - ~20 hand-picked real Reddit posts → JSON fixtures in
+    `tests/fixtures/reddit/` with expected `ExtractedExample` per fixture
+  - Lets us regression-test prompt changes against ground truth
+- **E2E with mocked PRAW + LLM**
+  - `respx` mocks for both; full pipeline → assert DB rows
+- **No live Reddit hits in CI** — opt-in via `LIVE_REDDIT=1` for local runs only
+
+## Deployment
+
+- **Image**: extend existing `fire-planner` Docker image (already has alembic + CLI)
+- **Bulk one-shot**: K8s `Job` running
+  `python -m fire_planner.examples ingest --all --top=all,year`
+- **Recurring delta**: K8s `CronJob` (weekly, e.g. Sundays 04:00 UTC)
+  running with `--top=week`
+- **Vault refs via ESO**:
+  - `secret/viktor:trading_bot_reddit_client_id` → env `REDDIT_CLIENT_ID`
+  - `secret/viktor:trading_bot_reddit_client_secret` → env `REDDIT_CLIENT_SECRET`
+- **Plain env vars** (Terraform configmap):
+  - `LLAMA_CPP_BASE_URL` (same value as recruiter-responder)
+  - `CLAUDE_AGENT_SERVICE_URL` (Tier 2 fallback)
+  - `REDDIT_USER_AGENT="fire-planner/0.1"`
+- **Terraform**: extend `infra/stacks/fire-planner/` with the Job + CronJob resources
+
+## Out of scope (explicit YAGNI)
+
+- Comments scraping
+- Driving COL ratios or scenario priors from scraped data
+- Continuous streaming
+- Multi-language
+- Re-running tax / withdrawal sims against scraped examples
+- Live continuous PRAW IDLE-like stream
+- UI/frontend (a single React page can come later as a separate spec)
+
+## Open considerations (revisit if signal is bad)
+
+- **Confidence threshold 0.5** is a guess. May tune after seeing real
+  qwen3-8b output on a 200-post sample.
+- **"≥half subs succeed = exit 0"** — if real failure modes correlate
+  (e.g. PRAW outage), this won't help. Alert on
+  `fire_examples_extract_failed_total` ratio instead.
+- **Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned
+  by historical karma.** Top-of-year + weekly delta provide freshness.
+
+## Migration / rollout
+
+1. Land alembic 0006_fire_examples migration on `pg-cluster-rw`
+   (cluster DB; no downtime — additive only)
+2. Land `fire_planner/examples/` module + tests; CI green
+3. Land Terraform changes (Job + CronJob) — Job runs once on apply,
+   bulk-populates the table
+4. Add `/api/examples` router; bump fire-planner image
+5. Add `examples_overlay` to simulator response (last; previous steps
+   are independent)
+
+Each step is independently revertable.
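
Sketch for data flow step 3 (`filters.py`): a minimal illustration of the "match BOTH regexes" pre-filter, not the module's actual code. The `MONEY_RE` alternatives are copied from the design; the `LOCATION_KEYWORDS` list and the `passes_prefilter` helper are hypothetical, since the design does not enumerate the country/city keywords.

```python
import re

# Alternatives taken from the design's MONEY_RE description.
# Word boundaries are an added assumption to avoid matching inside longer words.
MONEY_RE = re.compile(
    r"\$|£|€|\b(?:GBP|USD|EUR|million|net worth|portfolio)\b",
    re.IGNORECASE,
)

# Hypothetical keyword list; the real list is not given in the design.
LOCATION_KEYWORDS = ["Philippines", "Bali", "Lisbon", "Chiang Mai", "Portugal"]
LOCATION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in LOCATION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention BOTH money and a location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

Requiring both matches keeps the LLM bill down: a post that mentions a portfolio but no place (or a place but no money) cannot produce a usable row, so it never reaches extraction.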
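
Sketch for data flow steps 4 and 5 (confidence gate with Tier 2 fallback): a minimal, network-free illustration in which the two LLM backends are passed in as plain callables. The names `Extraction`, `parse_response` and `extract_with_fallback` are hypothetical, the `"claude-agent-service"` model label is a placeholder, and accepting any parseable Tier 2 answer regardless of its confidence is an assumption; the design only says a post is skipped if Tier 2 "still fails".

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.5  # the design itself calls this value a guess


@dataclass
class Extraction:
    fields: dict
    confidence: float
    llm_model: str


def parse_response(raw: str, model: str) -> Optional[Extraction]:
    """Return an Extraction, or None if the reply is not usable JSON."""
    try:
        data = json.loads(raw)
        confidence = float(data.pop("confidence"))
    except (ValueError, KeyError, TypeError, AttributeError):
        return None
    return Extraction(fields=data, confidence=confidence, llm_model=model)


def extract_with_fallback(
    post: dict,
    primary: Callable[[dict], str],
    fallback: Callable[[dict], str],
) -> Optional[Extraction]:
    """Try the primary backend; escalate to Tier 2 on low confidence or bad JSON."""
    try:
        result = parse_response(primary(post), "qwen3-8b")
    except ConnectionError:
        result = None  # primary backend down: fall through to Tier 2
    if result is not None and result.confidence >= CONFIDENCE_THRESHOLD:
        return result
    try:
        # Assumption: a parseable Tier 2 answer is accepted as-is.
        return parse_response(fallback(post), "claude-agent-service")
    except ConnectionError:
        return None  # caller logs, counts and skips; nothing is inserted
```

Injecting the backends as callables mirrors the testing plan: unit tests can exercise the gate without mocking HTTP at all, and the `respx`-based tests only need to cover the thin client layer.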
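
Sketch for the "one subreddit fails entirely" row of the error-handling table: a minimal illustration of `asyncio.gather(..., return_exceptions=True)` combined with the "exit 0 if at least half succeed, otherwise exit 2" rule. The `ingest_all` function and the `fetch` callable are hypothetical stand-ins for the PRAW wrapper.

```python
import asyncio
from typing import Awaitable, Callable


async def ingest_all(
    subs: list[str],
    fetch: Callable[[str], Awaitable[list]],
) -> tuple[dict, list, int]:
    """Fetch every subreddit concurrently; one failure must not sink the rest."""
    results = await asyncio.gather(
        *(fetch(sub) for sub in subs),
        return_exceptions=True,  # exceptions come back as values, not raises
    )
    posts: dict = {}
    failed: list = []
    for sub, result in zip(subs, results):
        if isinstance(result, BaseException):
            failed.append(sub)
        else:
            posts[sub] = result
    succeeded = len(subs) - len(failed)
    # Design rule: exit 0 if at least half of the subs succeeded, else exit 2.
    exit_code = 0 if succeeded * 2 >= len(subs) else 2
    return posts, failed, exit_code
```

A CLI entry point could call `sys.exit(exit_code)` after logging the `failed` list, so a partially successful bulk run still marks the Job as complete.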
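
Sketch for the `/api/examples/summary` payload: a minimal in-memory illustration of the median / p25 / p75 shape, using rows as plain dicts. The real query would presumably run in Postgres; `quartiles` and `summarise` are hypothetical names, the inclusive (linear-interpolation) quantile method is an assumption, and "first five rows" stands in for the design's unspecified "top 5" ordering of `sample_links`.

```python
from statistics import quantiles
from typing import Iterable, Optional


def quartiles(values: Iterable) -> Optional[dict]:
    """Median, p25 and p75 over non-NULL values; None if nothing is present."""
    # Rows inserted after an FX failure carry NULL amounts, so skip None.
    present = sorted(v for v in values if v is not None)
    if not present:
        return None
    if len(present) == 1:
        only = float(present[0])
        return {"median": only, "p25": only, "p75": only}
    p25, median, p75 = quantiles(present, n=4, method="inclusive")
    return {"median": median, "p25": p25, "p75": p75}


def summarise(country: str, rows: list, max_links: int = 5) -> dict:
    """Build a summary dict shaped like the design's example response."""
    matching = [r for r in rows if r["country"] == country]
    return {
        "country": country,
        "count": len(matching),
        "portfolio_gbp": quartiles(r["portfolio_gbp"] for r in matching),
        "annual_exp_gbp": quartiles(r["annual_exp_gbp"] for r in matching),
        # Assumption: the first rows stand in for the "top 5" links.
        "sample_links": [r["post_url"] for r in matching[:max_links]],
    }
```

Note that `count` here includes rows whose amounts are NULL; whether the API should report the number of rows or the number of rows with usable amounts is a choice the design leaves open.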