fire-planner/docs/plans/2026-05-28-reddit-examples-design.md
Viktor Barzin 0907a31e0c
fire-planner: design — Reddit FIRE examples ingest
Brainstorm + design doc for a new fire_planner/examples/ module that
scrapes 12 FIRE subreddits via PRAW, extracts structured fields with a
local qwen3-8b LLM (claude-agent-service Tier 2 fallback), and exposes
the data as an informational overlay in the simulator response.

Locked decisions: PRAW + asyncio, hybrid regex+LLM extraction,
informational overlay only (no scenario-priors coupling), new
fire_planner/examples/ module mirroring col/ shape.
2026-05-28 21:32:32 +00:00


FIRE Reddit Examples Ingest — Design

Status: approved 2026-05-28
Owner: Viktor
Module: fire_planner/examples/

Goal

Populate the FIRE planner DB with real-world examples of people pursuing or living FIRE across different countries, so the planner can answer questions like:

  • "What's the median portfolio for people FIRE'd in the Philippines?"
  • "For a £400k portfolio, where are people living?"
  • "What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"

The data is informational only — it feeds a new /api/examples endpoint and a small overlay block in the simulator response. It does not drive scenario inputs or COL ratios.

Non-goals

  • Comments scraping (posts only — comments are ~10× noisier)
  • Driving COL ratios or scenario priors from scraped data (overlay only)
  • Continuous streaming (weekly CronJob is enough)
  • Multi-language (English subs only)
  • Re-running tax / withdrawal logic on the scraped examples

Decisions (locked from brainstorming session)

| Axis | Decision |
|---|---|
| Fields per example | Country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coast / barista / lean / FIRE / fat), is_retired, post_url, source_sub, post_date |
| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
| Reddit access | PRAW with existing app creds in Vault secret/viktor:trading_bot_reddit_client_id / _secret. user-agent: fire-planner/0.1 |
| Parallelism | Python asyncio (asyncio.gather across subs); not literal subagent dispatch |
| Module layout | New fire_planner/examples/ mirroring col/ shape |
| Simulator integration | Informational overlay only: simulator response gains examples_overlay {median, p25, p75, count, sample_links[]} keyed by target country |

Architecture

   ┌─────────────────────────────────────────────────────────────┐
   │ K8s Job (bulk one-shot) + K8s CronJob (weekly delta)        │
   │       ↓                                                       │
   │  fire_planner.examples.cli                                    │
   │       ├─→ praw_source.py  (async PRAW, 12 subs in parallel)  │
   │       │      gather() top-of-all-time + top-of-year           │
   │       ├─→ filters.py       (cheap regex pre-filter)           │
   │       ├─→ llm_extract.py   (qwen3-8b primary, schema-locked)  │
   │       │       └─ Tier 2 fallback: claude-agent-service        │
   │       └─→ service.py       (upsert into fire_planner.         │
   │                             fire_example, dedupe by reddit_id)│
   │       ↓                                                       │
   │  Postgres (pg-cluster-rw, schema=fire_planner)               │
   │       ↓                                                       │
   │  fire_planner.api.examples (FastAPI router)                  │
   │       ├─→ GET /examples?country=PH&fi_status=FIRE             │
   │       └─→ GET /examples/summary?country=PH                    │
   │             { median, p25, p75, count, sample_links[] }       │
   │       ↓                                                       │
   │  Simulator response gains `examples_overlay` per scenario     │
   └─────────────────────────────────────────────────────────────┘

Module layout

fire_planner/examples/
  __init__.py
  models.py          # SQLAlchemy ORM + Pydantic schemas
  praw_source.py     # async PRAW wrapper → RawPost
  filters.py         # MONEY_RE + LOCATION_RE keyword pre-filter
  llm_extract.py     # qwen3-8b call → ExtractedExample + confidence
                     #   with claude-agent-service Tier 2 fallback
  service.py         # upsert, dedupe, summary queries
  cli.py             # `python -m fire_planner.examples ingest …`
fire_planner/api/examples.py   # FastAPI router
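
A minimal sketch of the pre-filter that filters.py is described as holding. The money tokens are the ones this design lists for MONEY_RE; the location keywords are a tiny illustrative subset rather than the real list, and the function name `passes_prefilter` is hypothetical.

```python
import re

# Money tokens as listed in this design.
MONEY_RE = re.compile(
    r"[$£€]|\b(?:GBP|USD|EUR|million|net worth|portfolio)\b", re.IGNORECASE
)
# Illustrative subset only; the real country/city keyword list is not given here.
LOCATION_RE = re.compile(
    r"\b(?:Philippines|Bali|Lisbon|Chiang Mai|Portugal|Thailand)\b", re.IGNORECASE
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention both money and a known location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

Requiring both matches keeps the filter cheap while dropping posts that mention money with no place, or a place with no money, before any LLM call is made.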

Data model

Migration alembic/versions/0006_fire_examples.py:

CREATE TABLE fire_planner.fire_example (
  id              SERIAL PRIMARY KEY,
  reddit_id       VARCHAR(16) NOT NULL UNIQUE,     -- e.g. "abc123"
  source_sub      VARCHAR(64) NOT NULL,
  post_url        TEXT NOT NULL,
  post_date       DATE NOT NULL,
  post_title      TEXT NOT NULL,
  -- extracted fields
  country         VARCHAR(64),                     -- ISO country name or "unknown"
  city            VARCHAR(128),
  portfolio_gbp   NUMERIC(14,2),
  annual_exp_gbp  NUMERIC(12,2),
  age             SMALLINT,
  family_size     SMALLINT,
  fi_status       VARCHAR(24),                     -- accumulating|coastFIRE|
                                                   --   baristaFIRE|leanFIRE|
                                                   --   FIRE|fatFIRE|unknown
  is_retired      BOOLEAN,
  raw_currency    CHAR(3),                         -- pre-normalisation currency
  -- extraction metadata
  raw_excerpt     TEXT,                            -- ~500-char window
                                                   --   that produced the data
  llm_model       VARCHAR(64) NOT NULL,            -- "qwen3-8b" | "claude-…"
  llm_confidence  NUMERIC(3,2),                    -- 0.00 to 1.00
  extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  -- audit
  ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_fire_example_country   ON fire_planner.fire_example(country);
CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);

Idempotent re-ingest via reddit_id UNIQUE. ON CONFLICT DO NOTHING by default; --reextract CLI flag re-runs the LLM and overwrites. Currency normalisation at extraction time via existing fire_planner/fx.py.
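
The two conflict behaviours can be sketched as follows. This uses an in-memory SQLite table with a reduced column set as a stand-in for Postgres (both accept this ON CONFLICT syntax); the `upsert` helper and its `reextract` argument are hypothetical names, not the actual service.py API.

```python
import sqlite3

# Default path: a repeated reddit_id is silently skipped.
INSERT_SKIP = """
INSERT INTO fire_example (reddit_id, country, llm_model)
VALUES (?, ?, ?)
ON CONFLICT (reddit_id) DO NOTHING
"""
# --reextract path: a repeated reddit_id overwrites the extracted fields.
INSERT_OVERWRITE = """
INSERT INTO fire_example (reddit_id, country, llm_model)
VALUES (?, ?, ?)
ON CONFLICT (reddit_id) DO UPDATE SET
    country = excluded.country,
    llm_model = excluded.llm_model
"""


def upsert(conn, row, reextract=False):
    """Insert one example; overwrite an existing row only when reextract is set."""
    conn.execute(INSERT_OVERWRITE if reextract else INSERT_SKIP, row)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fire_example ("
    "reddit_id TEXT NOT NULL UNIQUE, country TEXT, llm_model TEXT NOT NULL)"
)
upsert(conn, ("abc123", "unknown", "qwen3-8b"))
upsert(conn, ("abc123", "PH", "qwen3-8b"))  # skipped: id already present
first = conn.execute("SELECT country FROM fire_example").fetchone()[0]
upsert(conn, ("abc123", "PH", "qwen3-8b"), reextract=True)  # overwrites
second = conn.execute("SELECT country FROM fire_example").fetchone()[0]
```

After the second call the stored country is still "unknown"; only the `reextract=True` call changes it to "PH".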

Data flow

  1. CLI/Job spawns: reads target sub list (default 12) and top modes from config / flags
  2. Fan-out: asyncio.gather(), one coroutine per subreddit; each fetches the requested PRAW listings, dedupes by id within the sub, returns ~1100 raw posts per sub for the bulk run
  3. filters.py: keep posts whose title+body match BOTH MONEY_RE ($|£|€|GBP|USD|EUR|million|net worth|portfolio) AND LOCATION_RE (country/city keyword list). Expected survival ~10 to 30 %
  4. llm_extract.py: POST {title, body, source_sub} to llama-cpp endpoint with strict JSON-schema prompt. Returns ExtractedExample(country, city, portfolio_native, annual_exp_native, raw_currency, age, family_size, fi_status, is_retired, confidence)
  5. Confidence gate: confidence < 0.5 OR JSON parse failure → retry once via claude-agent-service with the same prompt. If still fails, log + skip (counted, never inserted)
  6. Currency normalisation: fx.py → portfolio_gbp, annual_exp_gbp. Spot rate at post_date if available, else today
  7. Upsert by reddit_id (ON CONFLICT DO NOTHING; --reextract forces UPDATE)
  8. Prometheus counters:
    • fire_examples_scraped_total{sub}
    • fire_examples_extracted_total{sub,confidence_bucket}
    • fire_examples_llm_fallback_total
    • fire_examples_extract_failed_total{reason}

Error handling

| Failure | Behaviour |
|---|---|
| PRAW rate limit | Built-in PRAW exponential back-off; emit fire_examples_rate_limited_total |
| llama-cpp down | Fall through to claude-agent-service Tier 2 |
| claude-agent-service down | Log + skip; record fire_examples_extract_failed_total{reason="llm_unavailable"} for later replay |
| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
| One subreddit fails entirely | Other 11 still complete (gather(..., return_exceptions=True)). Job exits 0 if ≥ half succeed; otherwise exit 2 |
| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
| FX rate lookup fails | Insert row with NULL portfolio_gbp / annual_exp_gbp; record raw_currency always |
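
The "one subreddit fails entirely" behaviour can be sketched with asyncio.gather and return_exceptions=True. `fetch_sub` is a hypothetical stand-in that fails for one made-up sub name. The real fetch would wrap PRAW, which is a blocking client, so it would need something like asyncio.to_thread or the separate Async PRAW package; that choice is not specified in this design.

```python
import asyncio


async def fetch_sub(name: str) -> list[str]:
    """Hypothetical per-subreddit fetch; simulates an outage for one sub."""
    await asyncio.sleep(0)
    if name == "brokenSub":
        raise RuntimeError("simulated outage")
    return [f"{name}-post-1", f"{name}-post-2"]


async def ingest(subs: list[str]) -> tuple[list[str], int]:
    """Fetch all subs concurrently; return (posts, exit_code)."""
    # return_exceptions=True turns a failure into a result value,
    # so one failing sub does not cancel the others.
    results = await asyncio.gather(
        *(fetch_sub(s) for s in subs), return_exceptions=True
    )
    posts: list[str] = []
    succeeded = 0
    for sub, result in zip(subs, results):
        if isinstance(result, BaseException):
            print(f"{sub} failed: {result!r}")
            continue
        succeeded += 1
        posts.extend(result)
    # Exit 0 when at least half of the subs succeeded, otherwise 2.
    exit_code = 0 if succeeded * 2 >= len(subs) else 2
    return posts, exit_code


posts, code = asyncio.run(ingest(["leanfire", "ExpatFIRE", "brokenSub"]))
```

With two of three subs succeeding, the run keeps their four posts and the exit code is 0; a run where only the broken sub is requested would yield exit code 2.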

API surface

GET /api/examples?country=PH&fi_status=FIRE&limit=50
  → list of FireExample objects (post_url, portfolio_gbp, ...)

GET /api/examples/summary?country=PH
  → {
      country: "PH",
      count: 47,
      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
      sample_links: ["https://reddit.com/...", ...]   // top 5
    }

Simulator response gains an examples_overlay block keyed by the scenario's target country, calling the same summary query under the hood. No new auth — same FastAPI router and auth dependency as the rest of the API.
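
As an illustration of the statistics in the summary payload, here is a pure-Python sketch. The real summary query would presumably compute these in SQL; the function name `summarise` and the choice of the "inclusive" interpolation method are assumptions, not something this design fixes.

```python
from statistics import quantiles


def summarise(values: list[float]) -> dict:
    """Return count, median, p25 and p75 for a list of GBP amounts."""
    if len(values) < 2:
        # Too few points to interpolate; report the single value, if any.
        only = values[0] if values else None
        return {"count": len(values), "median": only, "p25": only, "p75": only}
    # n=4 yields the three quartile cut points in ascending order.
    p25, median, p75 = quantiles(values, n=4, method="inclusive")
    return {"count": len(values), "median": median, "p25": p25, "p75": p75}
```

For the portfolios 100, 200, 300, 400 and 500 this gives p25 = 200, median = 300 and p75 = 400. The guard for short lists matters for countries with only one or two scraped examples.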

Testing

  • Unit
    • filters.py regex coverage (money + location keywords; positives + negatives)
    • llm_extract.py with respx mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
    • Currency normalisation paths via fx.py (incl. FX-fetch failure)
  • Integration (against test PG, like existing test_ingest_wealthfolio_pg.py)
    • service.py upsert / dedupe / --reextract paths
    • Summary query: median / p25 / p75 over realistic mixed dataset
  • Fixture-driven regression suite
    • ~20 hand-picked real Reddit posts → JSON fixtures in tests/fixtures/reddit/ with expected ExtractedExample per fixture
    • Lets us regression-test prompt changes against ground truth
  • E2E with mocked PRAW + LLM
    • respx mocks for both; full pipeline → assert DB rows
  • No live Reddit hits in CI — opt-in via LIVE_REDDIT=1 for local runs only
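
The FX-failure path named in the unit-test list could be exercised along these lines. The interface of fire_planner/fx.py is not shown in this document, so `to_gbp`, `FxUnavailable` and the `rate_lookup` callable are all hypothetical; the sketch only shows the intended outcome, a None (NULL) amount when no rate can be found.

```python
from decimal import Decimal
from typing import Callable, Optional


class FxUnavailable(Exception):
    """Hypothetical error raised when no rate can be found."""


def to_gbp(
    amount: Optional[Decimal],
    currency: str,
    rate_lookup: Callable[[str], Decimal],
) -> Optional[Decimal]:
    """Convert a native amount to GBP, or return None if the lookup fails."""
    if amount is None:
        return None
    if currency == "GBP":
        return amount.quantize(Decimal("0.01"))
    try:
        rate = rate_lookup(currency)  # GBP per one unit of `currency`
    except FxUnavailable:
        return None  # stored as NULL; raw_currency is still recorded
    return (amount * rate).quantize(Decimal("0.01"))


def failing_lookup(currency: str) -> Decimal:
    """Simulates an FX source that cannot supply a rate."""
    raise FxUnavailable(currency)
```

A test can then pass a fixed-rate lambda for the success path and `failing_lookup` for the failure path, with no network access involved.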

Deployment

  • Image: extend existing fire-planner Docker image (already has alembic + CLI)
  • Bulk one-shot: K8s Job running python -m fire_planner.examples ingest --all --top=all,year
  • Recurring delta: K8s CronJob (weekly, e.g. Sundays 04:00 UTC) running with --top=week
  • Vault refs via ESO:
    • secret/viktor:trading_bot_reddit_client_id → env REDDIT_CLIENT_ID
    • secret/viktor:trading_bot_reddit_client_secret → env REDDIT_CLIENT_SECRET
  • Plain env vars (Terraform configmap):
    • LLAMA_CPP_BASE_URL (same value as recruiter-responder)
    • CLAUDE_AGENT_SERVICE_URL (Tier 2 fallback)
    • REDDIT_USER_AGENT="fire-planner/0.1"
  • Terraform: extend infra/stacks/fire-planner/ with the Job + CronJob resources

Out of scope (explicit YAGNI)

  • Comments scraping
  • Driving COL ratios or scenario priors from scraped data
  • Continuous streaming
  • Multi-language
  • Re-running tax / withdrawal sims against scraped examples
  • Live continuous PRAW IDLE-like stream
  • UI/frontend (a single React page can come later as a separate spec)

Open considerations (revisit if signal is bad)

  • Confidence threshold 0.5 is a guess. May tune after seeing real qwen3-8b output on a 200-post sample.
  • "≥half subs succeed = exit 0" — if real failure modes correlate (e.g. PRAW outage), this won't help. Alert on fire_examples_extract_failed_total ratio instead.
  • Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned by historical karma. Top-of-year + weekly delta provide freshness.

Migration / rollout

  1. Land alembic 0006_fire_examples migration on pg-cluster-rw (cluster DB; no downtime — additive only)
  2. Land fire_planner/examples/ module + tests; CI green
  3. Land Terraform changes (Job + CronJob) — Job runs once on apply, bulk-populates the table
  4. Add /api/examples router; bump fire-planner image
  5. Add examples_overlay to simulator response (last; previous steps are independent)

Each step is independently revertable.