fire-planner: design — Reddit FIRE examples ingest

Brainstorm + design doc for a new fire_planner/examples/ module that
scrapes 12 FIRE subreddits via PRAW, extracts structured fields with a
local qwen3-8b LLM (claude-agent-service Tier 2 fallback), and exposes
the data as an informational overlay in the simulator response.

Locked decisions: PRAW + asyncio, hybrid regex+LLM extraction,
informational overlay only (no scenario-priors coupling), new
fire_planner/examples/ module mirroring col/ shape.

2026-05-28 21:32:32 +00:00

FIRE Reddit Examples Ingest — Design

Status: approved 2026-05-28
Owner: Viktor
Module: fire_planner/examples/

Goal

Populate the FIRE planner DB with real-world examples of people pursuing or living FIRE across different countries, so the planner can answer questions like:

"What's the median portfolio for people FIRE'd in the Philippines?"
"For a £400k portfolio, where are people living?"
"What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"

The data is informational only — it feeds a new /api/examples endpoint and a small overlay block in the simulator response. It does not drive scenario inputs or COL ratios.

Non-goals

Comments scraping (posts only — comments are ~10× noisier)
Driving COL ratios or scenario priors from scraped data (overlay only)
Continuous streaming (weekly CronJob is enough)
Multi-language (English subs only)
Re-running tax / withdrawal logic on the scraped examples

Decisions (locked from brainstorming session)

Axis	Decision
Fields per example	country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coastFIRE / baristaFIRE / leanFIRE / FIRE / fatFIRE / unknown), is_retired, post_url, source_sub, post_date
Extraction	Hybrid: cheap regex pre-filter → LLM JSON-schema extract
LLM backend	Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable
Subreddits	r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs)
Post selection	top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week
Reddit access	PRAW with existing app creds in Vault `secret/viktor:trading_bot_reddit_client_id` / `_secret`. user-agent: `fire-planner/0.1`
Parallelism	Python asyncio (`asyncio.gather` across subs); not literal subagent dispatch
Module layout	New `fire_planner/examples/` mirroring `col/` shape
Simulator integration	Informational overlay only — simulator response gains `examples_overlay {median, p25, p75, count, sample_links[]}` keyed by target country
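
The LLM backend row above can be read as a two-tier gate. The following is a minimal sketch of that gate, not the actual llm_extract.py: `primary` and `fallback` are hypothetical callables standing in for the llama-cpp and claude-agent-service clients, and the reply is assumed to be a JSON object with a numeric `confidence` key. It also assumes the same 0.5 threshold is applied to the Tier 2 reply, and it leaves out the stricter-prompt retry listed under Error handling.

```python
import json
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.5  # initial guess from the design; expected to be tuned

# Hypothetical backend signature: takes the prompt, returns the raw model reply.
Backend = Callable[[str], str]


def parse_reply(raw: str) -> Optional[dict]:
    """Decode a reply; return None if it is not a JSON object with a numeric confidence."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    return data


def extract_with_fallback(
    prompt: str, primary: Backend, fallback: Backend
) -> Optional[tuple[dict, str]]:
    """Try the primary backend, then Tier 2 on low confidence, bad JSON, or an outage.

    Returns (fields, tier), or None when both tiers fail so the caller can log and skip.
    """
    for tier, backend in (("primary", primary), ("tier2", fallback)):
        try:
            parsed = parse_reply(backend(prompt))
        except Exception:
            # Assumption: any backend error (e.g. service down) counts as a failed tier.
            parsed = None
        if parsed is not None and parsed["confidence"] >= CONFIDENCE_THRESHOLD:
            return parsed, tier
    return None
```

Returning the tier alongside the fields makes it straightforward to fill `llm_model` and to increment a fallback counter without the caller re-deriving which backend answered.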

Architecture

   ┌─────────────────────────────────────────────────────────────┐
   │ K8s Job (bulk one-shot) + K8s CronJob (weekly delta)        │
   │       ↓                                                       │
   │  fire_planner.examples.cli                                    │
   │       ├─→ praw_source.py  (async PRAW, 12 subs in parallel)  │
   │       │      gather() top-of-all-time + top-of-year           │
   │       ├─→ filters.py       (cheap regex pre-filter)           │
   │       ├─→ llm_extract.py   (qwen3-8b primary, schema-locked)  │
   │       │       └─ Tier 2 fallback: claude-agent-service        │
   │       └─→ service.py       (upsert into fire_planner.         │
   │                             fire_example, dedupe by reddit_id)│
   │       ↓                                                       │
   │  Postgres (pg-cluster-rw, schema=fire_planner)               │
   │       ↓                                                       │
   │  fire_planner.api.examples (FastAPI router)                  │
   │       ├─→ GET /examples?country=PH&fi_status=FIRE             │
   │       └─→ GET /examples/summary?country=PH                    │
   │             { median, p25, p75, count, sample_links[] }       │
   │       ↓                                                       │
   │  Simulator response gains `examples_overlay` per scenario     │
   └─────────────────────────────────────────────────────────────┘

Module layout

fire_planner/examples/
  __init__.py
  models.py          # SQLAlchemy ORM + Pydantic schemas
  praw_source.py     # async PRAW wrapper → RawPost
  filters.py         # MONEY_RE + LOCATION_RE keyword pre-filter
  llm_extract.py     # qwen3-8b call → ExtractedExample + confidence
                     #   with claude-agent-service Tier 2 fallback
  service.py         # upsert, dedupe, summary queries
  cli.py             # `python -m fire_planner.examples ingest …`
fire_planner/api/examples.py   # FastAPI router

Data model

Migration alembic/versions/0006_fire_examples.py:

CREATE TABLE fire_planner.fire_example (
  id              SERIAL PRIMARY KEY,
  reddit_id       VARCHAR(16) NOT NULL UNIQUE,     -- e.g. "abc123"
  source_sub      VARCHAR(64) NOT NULL,
  post_url        TEXT NOT NULL,
  post_date       DATE NOT NULL,
  post_title      TEXT NOT NULL,
  -- extracted fields
  country         VARCHAR(64),                     -- ISO country name or "unknown"
  city            VARCHAR(128),
  portfolio_gbp   NUMERIC(14,2),
  annual_exp_gbp  NUMERIC(12,2),
  age             SMALLINT,
  family_size     SMALLINT,
  fi_status       VARCHAR(24),                     -- accumulating|coastFIRE|
                                                   --   baristaFIRE|leanFIRE|
                                                   --   FIRE|fatFIRE|unknown
  is_retired      BOOLEAN,
  raw_currency    CHAR(3),                         -- pre-normalisation currency
  -- extraction metadata
  raw_excerpt     TEXT,                            -- ~500-char window
                                                   --   that produced the data
  llm_model       VARCHAR(64) NOT NULL,            -- "qwen3-8b" | "claude-…"
  llm_confidence  NUMERIC(3,2),                    -- 0.00–1.00
  extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  -- audit
  ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_fire_example_country   ON fire_planner.fire_example(country);
CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);

Idempotent re-ingest via reddit_id UNIQUE. ON CONFLICT DO NOTHING by default; --reextract CLI flag re-runs the LLM and overwrites. Currency normalisation at extraction time via existing fire_planner/fx.py.
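
The two conflict behaviours can be illustrated with a small sketch. It uses the standard-library sqlite3 module purely as a stand-in for Postgres (both accept this `ON CONFLICT` form), a cut-down table with only four of the columns, and a hypothetical `upsert_example` helper; the real service.py goes through SQLAlchemy and may look quite different.

```python
import sqlite3

# Cut-down stand-in for fire_planner.fire_example (assumption: four columns only).
DDL = """
CREATE TABLE fire_example (
    reddit_id     TEXT NOT NULL UNIQUE,
    country       TEXT,
    portfolio_gbp REAL,
    llm_model     TEXT NOT NULL
)
"""

# Default path: a second ingest of the same reddit_id is a no-op.
INSERT_SKIP = """
INSERT INTO fire_example (reddit_id, country, portfolio_gbp, llm_model)
VALUES (:reddit_id, :country, :portfolio_gbp, :llm_model)
ON CONFLICT (reddit_id) DO NOTHING
"""

# --reextract path: the freshly extracted values overwrite the stored row.
INSERT_OVERWRITE = """
INSERT INTO fire_example (reddit_id, country, portfolio_gbp, llm_model)
VALUES (:reddit_id, :country, :portfolio_gbp, :llm_model)
ON CONFLICT (reddit_id) DO UPDATE SET
    country       = excluded.country,
    portfolio_gbp = excluded.portfolio_gbp,
    llm_model     = excluded.llm_model
"""


def upsert_example(conn: sqlite3.Connection, row: dict, reextract: bool = False) -> None:
    """Insert one example; skip or overwrite on reddit_id conflict."""
    sql = INSERT_OVERWRITE if reextract else INSERT_SKIP
    with conn:  # commits on success, rolls back on error
        conn.execute(sql, row)
```

Keeping the two statements separate, rather than always using `DO UPDATE`, means a plain weekly re-run can never silently change rows that were already extracted.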

Data flow

CLI/Job spawns — reads target sub list (default 12) and top modes from config / flags
Fan-out — asyncio.gather() one coroutine per subreddit; each fetches the requested PRAW listings, dedupes by id within the sub, returns ~1100 raw posts per sub for the bulk run
filters.py — keep posts whose title+body match BOTH MONEY_RE ($|£|€|GBP|USD|EUR|million|net worth|portfolio) AND LOCATION_RE (country/city keyword list). Expected survival ~10–30 %
llm_extract.py — POST {title, body, source_sub} to llama-cpp endpoint with strict JSON-schema prompt. Returns ExtractedExample(country, city, portfolio_native, annual_exp_native, raw_currency, age, family_size, fi_status, is_retired, confidence)
Confidence gate — confidence < 0.5 OR JSON parse failure → retry once via claude-agent-service with the same prompt. If still fails, log + skip (counted, never inserted)
Currency normalisation — fx.py → portfolio_gbp, annual_exp_gbp. Spot rate at post_date if available, else today
Upsert by reddit_id (ON CONFLICT DO NOTHING; --reextract forces UPDATE)
Prometheus counters:
- fire_examples_scraped_total{sub}
- fire_examples_extracted_total{sub,confidence_bucket}
- fire_examples_llm_fallback_total
- fire_examples_extract_failed_total{reason}
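
The pre-filter step in the flow above can be sketched as follows. MONEY_RE is built from the tokens listed in that step; the location keyword list is a placeholder, since the real country/city list is not given here, and `passes_prefilter` is a hypothetical function name.

```python
import re

# Money pattern built from the tokens listed in the data flow.
MONEY_RE = re.compile(
    r"[$£€]|\b(?:GBP|USD|EUR|million|net worth|portfolio)\b",
    re.IGNORECASE,
)

# Placeholder keywords (assumption): the real list would cover many more places.
LOCATION_KEYWORDS = ["Philippines", "Bali", "Lisbon", "Chiang Mai", "Portugal", "Thailand"]
LOCATION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in LOCATION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention BOTH money and a known location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

For example, `passes_prefilter("Retired in Chiang Mai", "Portfolio is $600k")` is true, while a post that mentions only a location or only a money figure is dropped before it reaches the LLM.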

Error handling

Failure	Behaviour
PRAW rate limit	Built-in PRAW exponential back-off; emit `fire_examples_rate_limited_total`
llama-cpp down	Fall through to claude-agent-service Tier 2
claude-agent-service down	Log + skip; record `fire_examples_extract_failed_total{reason="llm_unavailable"}` for later replay
LLM returns malformed JSON	One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2
One subreddit fails entirely	Other 11 still complete (`gather(..., return_exceptions=True)`). Job exits 0 if ≥half succeed; otherwise exit 2
reddit_id collision	ON CONFLICT DO NOTHING (idempotent re-run)
FX rate lookup fails	Insert row with NULL `portfolio_gbp` / `annual_exp_gbp`; record `raw_currency` always
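
The "one subreddit fails entirely" row can be sketched as below. PRAW's client is blocking (Async PRAW is a separate package), so this sketch assumes each subreddit fetch is a blocking function run via `asyncio.to_thread`; `fetch_sub`, `fetch_all`, and `exit_code` are hypothetical names, and the real praw_source.py may take a different approach.

```python
import asyncio
from typing import Callable


def exit_code(results: list) -> int:
    """Return 0 when at least half of the subreddits succeeded, otherwise 2."""
    ok = sum(1 for r in results if not isinstance(r, BaseException))
    return 0 if ok * 2 >= len(results) else 2


async def fetch_all(subs: list[str], fetch_sub: Callable[[str], list]) -> dict:
    """Fetch every subreddit concurrently; a failure in one does not cancel the rest."""
    # fetch_sub is a hypothetical blocking function (e.g. one wrapping PRAW listings);
    # to_thread keeps the event loop free while each call runs.
    results = await asyncio.gather(
        *(asyncio.to_thread(fetch_sub, sub) for sub in subs),
        return_exceptions=True,  # exceptions come back as values instead of propagating
    )
    return dict(zip(subs, results))
```

A caller would run `asyncio.run(fetch_all(subs, fetch_sub))`, log the entries that are exceptions, process the rest, and pass the values to `exit_code` to decide the Job's exit status.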

API surface

GET /api/examples?country=PH&fi_status=FIRE&limit=50
  → list of FireExample objects (post_url, portfolio_gbp, ...)

GET /api/examples/summary?country=PH
  → {
      country: "PH",
      count: 47,
      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
      sample_links: ["https://reddit.com/...", ...]   // top 5
    }

Simulator response gains an examples_overlay block keyed by the scenario's target country, calling the same summary query under the hood. No new auth is needed: the examples router uses the same auth dependency as the rest of the API.
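
As an illustration of the summary shape, the sketch below computes median / p25 / p75 in plain Python with the standard-library `statistics.quantiles`. The `summarise` helper is hypothetical; the real summary query presumably runs in Postgres, where `percentile_cont` should give matching values because it uses the same linear interpolation as the "inclusive" method used here.

```python
from statistics import quantiles
from typing import Iterable, Optional


def summarise(values: Iterable[Optional[float]]) -> dict:
    """Return median, p25 and p75 for one numeric column, ignoring NULLs (None)."""
    clean = sorted(v for v in values if v is not None)
    if len(clean) < 2:
        # Older Python versions require at least two data points for quantiles().
        only = clean[0] if clean else None
        return {"median": only, "p25": only, "p75": only}
    p25, median, p75 = quantiles(clean, n=4, method="inclusive")
    return {"median": median, "p25": p25, "p75": p75}
```

Skipping None values matters because rows whose FX lookup failed are stored with NULL `portfolio_gbp` / `annual_exp_gbp` and should not distort the percentiles.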

Testing

Unit
- filters.py regex coverage (money + location keywords; positives + negatives)
- llm_extract.py with respx mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
- Currency normalisation paths via fx.py (incl. FX-fetch failure)
Integration (against test PG, like existing test_ingest_wealthfolio_pg.py)
- service.py upsert / dedupe / --reextract paths
- Summary query: median / p25 / p75 over realistic mixed dataset
Fixture-driven regression suite
- ~20 hand-picked real Reddit posts → JSON fixtures in tests/fixtures/reddit/ with expected ExtractedExample per fixture
- Lets us regression-test prompt changes against ground truth
E2E with mocked PRAW + LLM
- respx mocks for both; full pipeline → assert DB rows
No live Reddit hits in CI — opt-in via LIVE_REDDIT=1 for local runs only

Deployment

Image: extend existing fire-planner Docker image (already has alembic + CLI)
Bulk one-shot: K8s Job running python -m fire_planner.examples ingest --all --top=all,year
Recurring delta: K8s CronJob (weekly, e.g. Sundays 04:00 UTC) running with --top=week
Vault refs via ESO:
- secret/viktor:trading_bot_reddit_client_id → env REDDIT_CLIENT_ID
- secret/viktor:trading_bot_reddit_client_secret → env REDDIT_CLIENT_SECRET
Plain env vars (Terraform configmap):
- LLAMA_CPP_BASE_URL (same value as recruiter-responder)
- CLAUDE_AGENT_SERVICE_URL (Tier 2 fallback)
- REDDIT_USER_AGENT="fire-planner/0.1"
Terraform: extend infra/stacks/fire-planner/ with the Job + CronJob resources
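
On the application side, the environment variables listed above could be read once at start-up. This is only a sketch: the `ExamplesSettings` class and its field names are hypothetical, and the default user-agent is taken from the value given in this section.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "LLAMA_CPP_BASE_URL",
    "CLAUDE_AGENT_SERVICE_URL",
)


@dataclass(frozen=True)
class ExamplesSettings:
    reddit_client_id: str
    reddit_client_secret: str
    reddit_user_agent: str
    llama_cpp_base_url: str
    claude_agent_service_url: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ExamplesSettings":
        """Build settings from the environment, failing fast on missing variables."""
        env = os.environ if env is None else env
        missing = [name for name in REQUIRED_VARS if not env.get(name)]
        if missing:
            raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
        return cls(
            reddit_client_id=env["REDDIT_CLIENT_ID"],
            reddit_client_secret=env["REDDIT_CLIENT_SECRET"],
            reddit_user_agent=env.get("REDDIT_USER_AGENT", "fire-planner/0.1"),
            llama_cpp_base_url=env["LLAMA_CPP_BASE_URL"],
            claude_agent_service_url=env["CLAUDE_AGENT_SERVICE_URL"],
        )
```

Failing at start-up when a variable is missing surfaces a broken ESO or configmap reference as an immediate Job failure rather than as a run of per-post extraction errors.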

Out of scope (explicit YAGNI)

Comments scraping
Driving COL ratios or scenario priors from scraped data
Continuous streaming (including a live, IDLE-like PRAW stream)
Multi-language
Re-running tax / withdrawal sims against scraped examples
UI/frontend (a single React page can come later as a separate spec)

Open considerations (revisit if signal is bad)

Confidence threshold 0.5 is a guess. May tune after seeing real qwen3-8b output on a 200-post sample.
"≥half subs succeed = exit 0" — if real failure modes correlate (e.g. PRAW outage), this won't help. Alert on fire_examples_extract_failed_total ratio instead.
Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned by historical karma. Top-of-year + weekly delta provide freshness.

Migration / rollout

Land alembic 0006_fire_examples migration on pg-cluster-rw (cluster DB; no downtime — additive only)
Land fire_planner/examples/ module + tests; CI green
Land Terraform changes (Job + CronJob) — Job runs once on apply, bulk-populates the table
Add /api/examples router; bump fire-planner image
Add examples_overlay to simulator response (last; previous steps are independent)

Each step is independently revertable.
