fire-planner: design — Reddit FIRE examples ingest

Brainstorm + design doc for a new fire_planner/examples/ module that scrapes 12 FIRE subreddits via PRAW, extracts structured fields with a local qwen3-8b LLM (claude-agent-service Tier 2 fallback), and exposes the data as an informational overlay in the simulator response. Locked decisions: PRAW + asyncio, hybrid regex+LLM extraction, informational overlay only (no scenario-priors coupling), new fire_planner/examples/ module mirroring col/ shape.
2026-05-28 21:32:32 +00:00 · 0907a31e0c
commit 0907a31e0c
parent 4a0ef1faf6
1 changed file with 250 additions and 0 deletions
--- a/docs/plans/2026-05-28-reddit-examples-design.md
+++ b/docs/plans/2026-05-28-reddit-examples-design.md
@@ -0,0 +1,250 @@
+# FIRE Reddit Examples Ingest — Design
+
+**Status**: approved 2026-05-28
+**Owner**: Viktor
+**Module**: `fire_planner/examples/`
+
+## Goal
+
+Populate the FIRE planner DB with real-world examples of people pursuing
+or living FIRE across different countries, so the planner can answer
+questions like:
+
+- *"What's the median portfolio for people FIRE'd in the Philippines?"*
+- *"For a £400k portfolio, where are people living?"*
+- *"What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"*
+
+The data is **informational only** — it feeds a new `/api/examples`
+endpoint and a small overlay block in the simulator response. It does
+**not** drive scenario inputs or COL ratios.
+
+## Non-goals
+
+- Comments scraping (posts only — comments are ~10× noisier)
+- Driving COL ratios or scenario priors from scraped data (overlay only)
+- Continuous streaming (weekly CronJob is enough)
+- Multi-language (English subs only)
+- Re-running tax / withdrawal logic on the scraped examples
+
+## Decisions (locked from brainstorming session)
+
+| Axis | Decision |
+|---|---|
+| Fields per example | country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coastFIRE / baristaFIRE / leanFIRE / FIRE / fatFIRE / unknown), is_retired, post_url, source_sub, post_date |
+| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
+| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
+| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
+| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
+| Reddit access | PRAW with existing app creds in Vault `secret/viktor:trading_bot_reddit_client_id` / `_secret`. user-agent: `fire-planner/0.1` |
+| Parallelism | Python asyncio (`asyncio.gather` across subs); not literal subagent dispatch |
+| Module layout | New `fire_planner/examples/` mirroring `col/` shape |
+| Simulator integration | Informational overlay only — simulator response gains `examples_overlay {median, p25, p75, count, sample_links[]}` keyed by target country |
+
+## Architecture
+
+```
+   ┌─────────────────────────────────────────────────────────────┐
+   │ K8s Job (bulk one-shot) + K8s CronJob (weekly delta)        │
+   │       ↓                                                       │
+   │  fire_planner.examples.cli                                    │
+   │       ├─→ praw_source.py  (async PRAW, 12 subs in parallel)  │
+   │       │      gather() top-of-all-time + top-of-year           │
+   │       ├─→ filters.py       (cheap regex pre-filter)           │
+   │       ├─→ llm_extract.py   (qwen3-8b primary, schema-locked)  │
+   │       │       └─ Tier 2 fallback: claude-agent-service        │
+   │       └─→ service.py       (upsert into fire_planner.         │
+   │                             fire_example, dedupe by reddit_id)│
+   │       ↓                                                       │
+   │  Postgres (pg-cluster-rw, schema=fire_planner)               │
+   │       ↓                                                       │
+   │  fire_planner.api.examples (FastAPI router)                  │
+   │       ├─→ GET /examples?country=PH&fi_status=FIRE             │
+   │       └─→ GET /examples/summary?country=PH                    │
+   │             { median, p25, p75, count, sample_links[] }       │
+   │       ↓                                                       │
+   │  Simulator response gains `examples_overlay` per scenario     │
+   └─────────────────────────────────────────────────────────────┘
+```
+
+## Module layout
+
+```
+fire_planner/examples/
+  __init__.py
+  models.py          # SQLAlchemy ORM + Pydantic schemas
+  praw_source.py     # async PRAW wrapper → RawPost
+  filters.py         # MONEY_RE + LOCATION_RE keyword pre-filter
+  llm_extract.py     # qwen3-8b call → ExtractedExample + confidence
+                     #   with claude-agent-service Tier 2 fallback
+  service.py         # upsert, dedupe, summary queries
+  cli.py             # `python -m fire_planner.examples ingest …`
+fire_planner/api/examples.py   # FastAPI router
+```
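A minimal sketch of what the `filters.py` pre-filter could look like, assuming both patterns are compiled once at import time. `LOCATION_KEYWORDS` and `passes_prefilter` are hypothetical names, and the keyword list is a tiny placeholder rather than the real country/city list. The dollar sign is escaped because a bare `$` in a regex is an end-of-line anchor and would match every post.

```python
import re

# Money signals from the design doc. "\$" is a literal dollar sign; word
# boundaries stop "EUR" from matching inside "Europe".
MONEY_RE = re.compile(
    r"\$|£|€|\b(?:GBP|USD|EUR)\b|\bmillion\b|\bnet worth\b|\bportfolio\b",
    re.IGNORECASE,
)

# Hypothetical placeholder list; the real module would carry many more entries.
LOCATION_KEYWORDS = ("philippines", "bali", "lisbon", "chiang mai", "portugal")
LOCATION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in LOCATION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention BOTH money and a location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

Only posts that pass this check would be sent on to the LLM, which is what keeps the extraction step cheap.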
+
+## Data model
+
+Migration `alembic/versions/0006_fire_examples.py`:
+
+```sql
+CREATE TABLE fire_planner.fire_example (
+  id              SERIAL PRIMARY KEY,
+  reddit_id       VARCHAR(16) NOT NULL UNIQUE,     -- e.g. "abc123"
+  source_sub      VARCHAR(64) NOT NULL,
+  post_url        TEXT NOT NULL,
+  post_date       DATE NOT NULL,
+  post_title      TEXT NOT NULL,
+  -- extracted fields
+  country         VARCHAR(64),                     -- ISO country name or "unknown"
+  city            VARCHAR(128),
+  portfolio_gbp   NUMERIC(14,2),
+  annual_exp_gbp  NUMERIC(12,2),
+  age             SMALLINT,
+  family_size     SMALLINT,
+  fi_status       VARCHAR(24),                     -- accumulating|coastFIRE|
+                                                   --   baristaFIRE|leanFIRE|
+                                                   --   FIRE|fatFIRE|unknown
+  is_retired      BOOLEAN,
+  raw_currency    CHAR(3),                         -- pre-normalisation currency
+  -- extraction metadata
+  raw_excerpt     TEXT,                            -- ~500-char window
+                                                   --   that produced the data
+  llm_model       VARCHAR(64) NOT NULL,            -- "qwen3-8b" | "claude-…"
+  llm_confidence  NUMERIC(3,2),                    -- 0.00–1.00
+  extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
+  -- audit
+  ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
+);
+CREATE INDEX ix_fire_example_country   ON fire_planner.fire_example(country);
+CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
+CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);
+```
+
+Idempotent re-ingest via `reddit_id` UNIQUE. ON CONFLICT DO NOTHING by
+default; `--reextract` CLI flag re-runs the LLM and overwrites.
+Currency normalisation at extraction time via existing `fire_planner/fx.py`.
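The default and `--reextract` write paths could be sketched as below. This uses an in-memory SQLite table as a stand-in for Postgres (both accept the `ON CONFLICT ... DO NOTHING` and `DO UPDATE` forms used here) and only a handful of the columns. `upsert_example` and `make_demo_db` are hypothetical helpers, not the planned `service.py` API.

```python
import sqlite3

_COLS = "(reddit_id, country, portfolio_gbp, llm_model)"
_VALS = "(:reddit_id, :country, :portfolio_gbp, :llm_model)"

# Default path: a second ingest of the same reddit_id is silently ignored.
INSERT_SQL = (
    f"INSERT INTO fire_example {_COLS} VALUES {_VALS} "
    "ON CONFLICT (reddit_id) DO NOTHING"
)

# --reextract path: the freshly extracted values overwrite the stored row.
REEXTRACT_SQL = (
    f"INSERT INTO fire_example {_COLS} VALUES {_VALS} "
    "ON CONFLICT (reddit_id) DO UPDATE SET "
    "country = excluded.country, "
    "portfolio_gbp = excluded.portfolio_gbp, "
    "llm_model = excluded.llm_model"
)


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for fire_planner.fire_example (subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fire_example ("
        "id INTEGER PRIMARY KEY, reddit_id TEXT NOT NULL UNIQUE, "
        "country TEXT, portfolio_gbp REAL, llm_model TEXT NOT NULL)"
    )
    return conn


def upsert_example(conn: sqlite3.Connection, row: dict, reextract: bool = False) -> None:
    """Insert one example; overwrite an existing row only when reextract is set."""
    conn.execute(REEXTRACT_SQL if reextract else INSERT_SQL, row)
```

Because the conflict target is `reddit_id`, re-running the bulk Job leaves existing rows untouched unless the flag is passed.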
+
+## Data flow
+
+1. **CLI/Job spawns** — reads target sub list (default 12) and `top`
+   modes from config / flags
+2. **Fan-out** — `asyncio.gather()` one coroutine per subreddit; each
+   fetches the requested PRAW listings, dedupes by id within the sub,
+   returns ~1100 raw posts per sub for the bulk run
+3. **`filters.py`** — keep posts whose title+body match BOTH
+   `MONEY_RE` (`\$|£|€|GBP|USD|EUR|million|net worth|portfolio`, with
+   `$` escaped so it matches a literal dollar sign, not end-of-line) AND
+   `LOCATION_RE` (country/city keyword list). Expected survival ~10–30 %
+4. **`llm_extract.py`** — POST `{title, body, source_sub}` to
+   llama-cpp endpoint with strict JSON-schema prompt. Returns
+   `ExtractedExample(country, city, portfolio_native, annual_exp_native,
+   raw_currency, age, family_size, fi_status, is_retired, confidence)`
+5. **Confidence gate** — `confidence < 0.5` OR JSON parse failure →
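+   retry once via claude-agent-service with the same prompt (a parse
+   failure first gets one stricter-prompt retry on qwen3, per the Error
+   handling table). If Tier 2 also fails, log and skip the post
+   (counted, never inserted)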
+6. **Currency normalisation** — `fx.py` → `portfolio_gbp`,
+   `annual_exp_gbp`. Uses the spot rate at `post_date` if available, else today's rate
+7. **Upsert** by `reddit_id` (ON CONFLICT DO NOTHING; `--reextract`
+   forces UPDATE)
+8. **Prometheus counters**:
+   - `fire_examples_scraped_total{sub}`
+   - `fire_examples_extracted_total{sub,confidence_bucket}`
+   - `fire_examples_llm_fallback_total`
+   - `fire_examples_extract_failed_total{reason}`
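The confidence gate in step 5 can be sketched with the two LLM backends injected as plain callables, which also makes it easy to test without HTTP mocks. Everything here is an assumption for illustration: the `Extraction` dataclass, `parse_reply`, `extract_with_fallback`, the `"tier2"` model label, and the choice to apply the same 0.5 threshold to the Tier 2 reply. The stricter-prompt retry on qwen3 is left out for brevity.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.5  # the doc's initial guess, expected to be tuned


@dataclass
class Extraction:
    fields: dict
    confidence: float
    llm_model: str


def parse_reply(raw: str, llm_model: str) -> Optional[Extraction]:
    """Return None unless the reply is a JSON object with a numeric confidence."""
    try:
        data = json.loads(raw)
        confidence = float(data.pop("confidence"))
    except (ValueError, KeyError, TypeError, AttributeError):
        return None
    return Extraction(fields=data, confidence=confidence, llm_model=llm_model)


def extract_with_fallback(
    post: dict,
    primary: Callable[[dict], str],
    fallback: Callable[[dict], str],
    primary_model: str = "qwen3-8b",
    fallback_model: str = "tier2",  # placeholder label, not the real model name
) -> Optional[Extraction]:
    """Try the primary LLM; escalate once to Tier 2 on low confidence or bad JSON."""
    result = parse_reply(primary(post), primary_model)
    if result is not None and result.confidence >= CONFIDENCE_THRESHOLD:
        return result
    result = parse_reply(fallback(post), fallback_model)
    if result is not None and result.confidence >= CONFIDENCE_THRESHOLD:
        return result
    return None  # caller logs, bumps the failure counter, and skips the post
```

Returning `None` rather than raising keeps the skip path explicit: the caller decides which counter to increment and never inserts a row.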
+
+## Error handling
+
+| Failure | Behaviour |
+|---|---|
+| PRAW rate limit | Rely on PRAW's built-in rate-limit handling (it sleeps based on Reddit's rate-limit headers); emit `fire_examples_rate_limited_total` |
+| llama-cpp down | Fall through to claude-agent-service Tier 2 |
+| claude-agent-service down | Log + skip; record `fire_examples_extract_failed_total{reason="llm_unavailable"}` for later replay |
+| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
+| One subreddit fails entirely | Other 11 still complete (`gather(..., return_exceptions=True)`). Job exits 0 if ≥half succeed; otherwise exit 2 |
+| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
+| FX rate lookup fails | Insert row with NULL `portfolio_gbp` / `annual_exp_gbp`; record `raw_currency` always |
+
+## API surface
+
+```
+GET /api/examples?country=PH&fi_status=FIRE&limit=50
+  → list of FireExample objects (post_url, portfolio_gbp, ...)
+
+GET /api/examples/summary?country=PH
+  → {
+      country: "PH",
+      count: 47,
+      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
+      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
+      sample_links: ["https://reddit.com/...", ...]   // top 5
+    }
+```
+
+Simulator response gains an `examples_overlay` block keyed by the
+scenario's target country, calling the same summary query under the
+hood. No new auth — same FastAPI router and auth dependency as the
+rest of the API.
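To make the summary shape concrete, here is a pure-Python sketch of the percentile block for one field. It assumes NULL values (for example rows where the FX lookup failed) are dropped before aggregating, and uses linear interpolation between data points, the same idea as Postgres `percentile_cont`. `summarise` is a hypothetical helper; the real query would more likely run in SQL, and `sample_links` selection is not covered here.

```python
from statistics import quantiles
from typing import Iterable, Optional


def summarise(values: Iterable[Optional[float]]) -> Optional[dict]:
    """Return {count, median, p25, p75} over the non-NULL values, or None if empty."""
    clean = sorted(v for v in values if v is not None)
    if not clean:
        return None
    if len(clean) == 1:
        # With a single data point, every percentile is that point.
        only = clean[0]
        return {"count": 1, "median": only, "p25": only, "p75": only}
    # n=4 yields the three quartile cut points; "inclusive" interpolates
    # linearly and treats the data's min and max as the 0th/100th percentiles.
    p25, median, p75 = quantiles(clean, n=4, method="inclusive")
    return {"count": len(clean), "median": median, "p25": p25, "p75": p75}
```

Calling it once for `portfolio_gbp` and once for `annual_exp_gbp` would give the two nested objects in the response above.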
+
+## Testing
+
+- **Unit**
+  - `filters.py` regex coverage (money + location keywords; positives + negatives)
+  - `llm_extract.py` with `respx` mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
+  - Currency normalisation paths via `fx.py` (incl. FX-fetch failure)
+- **Integration** (against test PG, like existing `test_ingest_wealthfolio_pg.py`)
+  - `service.py` upsert / dedupe / `--reextract` paths
+  - Summary query: median / p25 / p75 over realistic mixed dataset
+- **Fixture-driven regression suite**
+  - ~20 hand-picked real Reddit posts → JSON fixtures in
+    `tests/fixtures/reddit/` with expected `ExtractedExample` per fixture
+  - Lets us regression-test prompt changes against ground truth
+- **E2E with mocked PRAW + LLM**
+  - `respx` mocks for both; full pipeline → assert DB rows
+- **No live Reddit hits in CI** — opt-in via `LIVE_REDDIT=1` for local runs only
+
+## Deployment
+
+- **Image**: extend existing `fire-planner` Docker image (already has alembic + CLI)
+- **Bulk one-shot**: K8s `Job` running
+  `python -m fire_planner.examples ingest --all --top=all,year`
+- **Recurring delta**: K8s `CronJob` (weekly, e.g. Sundays 04:00 UTC)
+  running with `--top=week`
+- **Vault refs via ESO**:
+  - `secret/viktor:trading_bot_reddit_client_id` → env `REDDIT_CLIENT_ID`
+  - `secret/viktor:trading_bot_reddit_client_secret` → env `REDDIT_CLIENT_SECRET`
+- **Plain env vars** (Terraform configmap):
+  - `LLAMA_CPP_BASE_URL` (same value as recruiter-responder)
+  - `CLAUDE_AGENT_SERVICE_URL` (Tier 2 fallback)
+  - `REDDIT_USER_AGENT="fire-planner/0.1"`
+- **Terraform**: extend `infra/stacks/fire-planner/` with the Job + CronJob resources
+
+## Out of scope (explicit YAGNI)
+
+- Comments scraping
+- Driving COL ratios or scenario priors from scraped data
+- Continuous streaming
+- Multi-language
+- Re-running tax / withdrawal sims against scraped examples
+- Live continuous PRAW IDLE-like stream
+- UI/frontend (a single React page can come later as a separate spec)
+
+## Open considerations (revisit if signal is bad)
+
+- **Confidence threshold 0.5** is a guess. May tune after seeing real
+  qwen3-8b output on a 200-post sample.
+- **"≥half subs succeed = exit 0"** — if real failure modes correlate
+  (e.g. PRAW outage), this won't help. Alert on
+  `fire_examples_extract_failed_total` ratio instead.
+- **Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned
+  by historical karma.** Top-of-year + weekly delta provide freshness.
+
+## Migration / rollout
+
+1. Land alembic 0006_fire_examples migration on `pg-cluster-rw`
+   (cluster DB; no downtime — additive only)
+2. Land `fire_planner/examples/` module + tests; CI green
+3. Land Terraform changes (Job + CronJob) — Job runs once on apply,
+   bulk-populates the table
+4. Add `/api/examples` router; bump fire-planner image
+5. Add `examples_overlay` to simulator response (last; previous steps
+   are independent)
+
+Each step is independently revertable.