Brainstorm + design doc for a new fire_planner/examples/ module that scrapes 12 FIRE subreddits via PRAW, extracts structured fields with a local qwen3-8b LLM (claude-agent-service Tier 2 fallback), and exposes the data as an informational overlay in the simulator response. Locked decisions: PRAW + asyncio, hybrid regex+LLM extraction, informational overlay only (no scenario-priors coupling), new fire_planner/examples/ module mirroring col/ shape.
FIRE Reddit Examples Ingest — Design
Status: approved 2026-05-28
Owner: Viktor
Module: fire_planner/examples/
Goal
Populate the FIRE planner DB with real-world examples of people pursuing or living FIRE across different countries, so the planner can answer questions like:
- "What's the median portfolio for people FIRE'd in the Philippines?"
- "For a £400k portfolio, where are people living?"
- "What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"
The data is informational only — it feeds a new /api/examples
endpoint and a small overlay block in the simulator response. It does
not drive scenario inputs or COL ratios.
Non-goals
- Comments scraping (posts only — comments are ~10× noisier)
- Driving COL ratios or scenario priors from scraped data (overlay only)
- Continuous streaming (weekly CronJob is enough)
- Multi-language (English subs only)
- Re-running tax / withdrawal logic on the scraped examples
Decisions (locked from brainstorming session)
| Axis | Decision |
|---|---|
| Fields per example | Country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coast / barista / lean / FIRE / fat), is_retired, post_url, source_sub, post_date |
| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
| Reddit access | PRAW with existing app creds in Vault secret/viktor:trading_bot_reddit_client_id / _secret. user-agent: fire-planner/0.1 |
| Parallelism | Python asyncio (asyncio.gather across subs); not literal subagent dispatch |
| Module layout | New fire_planner/examples/ mirroring col/ shape |
| Simulator integration | Informational overlay only — simulator response gains examples_overlay {median, p25, p75, count, sample_links[]} keyed by target country |
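The LLM backend row can be read as a two-tier gate. The following is a minimal sketch, not the module's actual API: it assumes each backend is a plain callable that takes the prompt and returns the raw reply text (a hypothetical interface), and that the Tier 2 reply passes through the same 0.5 confidence check, which this doc does not state explicitly. The tier key is illustrative only.

```python
import json
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.5  # value from the decisions table

# Hypothetical interface: a backend maps a prompt to the raw model reply.
Backend = Callable[[str], str]


def parse_reply(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if it is not a JSON object
    carrying a numeric confidence field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    return data


def extract_with_fallback(prompt: str, primary: Backend, tier2: Backend) -> Optional[dict]:
    """Try the primary backend; escalate once to Tier 2 on low
    confidence or unparseable JSON. None means log + skip."""
    for tier, backend in ((1, primary), (2, tier2)):
        result = parse_reply(backend(prompt))
        if result is not None and result["confidence"] >= CONFIDENCE_THRESHOLD:
            result["tier"] = tier  # illustrative marker for which backend answered
            return result
    return None
```

A caller would treat a None return as a skipped post and bump the failure counter rather than insert a row.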
Architecture
┌─────────────────────────────────────────────────────────────┐
│ K8s Job (bulk one-shot) + K8s CronJob (weekly delta) │
│ ↓ │
│ fire_planner.examples.cli │
│ ├─→ praw_source.py (async PRAW, 12 subs in parallel) │
│ │ gather() top-of-all-time + top-of-year │
│ ├─→ filters.py (cheap regex pre-filter) │
│ ├─→ llm_extract.py (qwen3-8b primary, schema-locked) │
│ │ └─ Tier 2 fallback: claude-agent-service │
│ └─→ service.py (upsert into fire_planner. │
│ fire_example, dedupe by reddit_id)│
│ ↓ │
│ Postgres (pg-cluster-rw, schema=fire_planner) │
│ ↓ │
│ fire_planner.api.examples (FastAPI router) │
│ ├─→ GET /examples?country=PH&fi_status=FIRE │
│ └─→ GET /examples/summary?country=PH │
│ { median, p25, p75, count, sample_links[] } │
│ ↓ │
│ Simulator response gains `examples_overlay` per scenario │
└─────────────────────────────────────────────────────────────┘
Module layout
fire_planner/examples/
    __init__.py
    models.py         # SQLAlchemy ORM + Pydantic schemas
    praw_source.py    # async PRAW wrapper → RawPost
    filters.py        # MONEY_RE + LOCATION_RE keyword pre-filter
    llm_extract.py    # qwen3-8b call → ExtractedExample + confidence
                      # with claude-agent-service Tier 2 fallback
    service.py        # upsert, dedupe, summary queries
    cli.py            # `python -m fire_planner.examples ingest …`
fire_planner/api/examples.py    # FastAPI router
Data model
Migration alembic/versions/0006_fire_examples.py:
CREATE TABLE fire_planner.fire_example (
    id              SERIAL PRIMARY KEY,
    reddit_id       VARCHAR(16) NOT NULL UNIQUE,  -- e.g. "abc123"
    source_sub      VARCHAR(64) NOT NULL,
    post_url        TEXT NOT NULL,
    post_date       DATE NOT NULL,
    post_title      TEXT NOT NULL,
    -- extracted fields
    country         VARCHAR(64),                  -- ISO country name or "unknown"
    city            VARCHAR(128),
    portfolio_gbp   NUMERIC(14,2),
    annual_exp_gbp  NUMERIC(12,2),
    age             SMALLINT,
    family_size     SMALLINT,
    fi_status       VARCHAR(24),                  -- accumulating|coastFIRE|baristaFIRE|leanFIRE|FIRE|fatFIRE|unknown
    is_retired      BOOLEAN,
    raw_currency    CHAR(3),                      -- pre-normalisation currency
    -- extraction metadata
    raw_excerpt     TEXT,                         -- ~500-char window that produced the data
    llm_model       VARCHAR(64) NOT NULL,         -- "qwen3-8b" | "claude-…"
    llm_confidence  NUMERIC(3,2),                 -- 0.00–1.00
    extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- audit
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_fire_example_country ON fire_planner.fire_example(country);
CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);
Idempotent re-ingest via reddit_id UNIQUE. ON CONFLICT DO NOTHING by
default; --reextract CLI flag re-runs the LLM and overwrites.
Currency normalisation at extraction time via existing fire_planner/fx.py.
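The two upsert modes can be illustrated with a small sketch. It uses an in-memory SQLite table as a stand-in for Postgres (an assumption made so the example runs anywhere; SQLite 3.24+ accepts the same ON CONFLICT clauses), a cut-down three-column table, and a hypothetical upsert helper that is not the real service.py API.

```python
import sqlite3

# SQLite stands in for Postgres; the table is a cut-down illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fire_example ("
    " reddit_id TEXT NOT NULL UNIQUE,"
    " country TEXT,"
    " portfolio_gbp REAL)"
)

# Default mode: a second ingest of the same post is a no-op.
INSERT_SKIP = (
    "INSERT INTO fire_example (reddit_id, country, portfolio_gbp) "
    "VALUES (?, ?, ?) "
    "ON CONFLICT (reddit_id) DO NOTHING"
)
# --reextract mode: the fresh extraction overwrites the stored fields.
INSERT_OVERWRITE = (
    "INSERT INTO fire_example (reddit_id, country, portfolio_gbp) "
    "VALUES (?, ?, ?) "
    "ON CONFLICT (reddit_id) DO UPDATE SET "
    "country = excluded.country, portfolio_gbp = excluded.portfolio_gbp"
)


def upsert(row: tuple, reextract: bool = False) -> None:
    """Insert one example; overwrite an existing row only when reextract is set."""
    conn.execute(INSERT_OVERWRITE if reextract else INSERT_SKIP, row)
    conn.commit()


def fetch(reddit_id: str):
    """Return (country, portfolio_gbp) for a post id, or None."""
    cur = conn.execute(
        "SELECT country, portfolio_gbp FROM fire_example WHERE reddit_id = ?",
        (reddit_id,),
    )
    return cur.fetchone()
```

Against Postgres the statements keep the same shape; only the placeholder style depends on the driver in use.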
Data flow
- CLI/Job spawns: reads target sub list (default 12) and top modes from config / flags
- Fan-out: asyncio.gather() with one coroutine per subreddit; each fetches the requested PRAW listings, dedupes by id within the sub, returns ~1100 raw posts per sub for the bulk run
- filters.py: keep posts whose title+body match BOTH MONEY_RE ($|£|€|GBP|USD|EUR|million|net worth|portfolio) AND LOCATION_RE (country/city keyword list). Expected survival ~10–30 %
- llm_extract.py: POST {title, body, source_sub} to llama-cpp endpoint with strict JSON-schema prompt. Returns ExtractedExample(country, city, portfolio_native, annual_exp_native, raw_currency, age, family_size, fi_status, is_retired, confidence)
- Confidence gate: confidence < 0.5 OR JSON parse failure → retry once via claude-agent-service with the same prompt. If still fails, log + skip (counted, never inserted)
- Currency normalisation: fx.py → portfolio_gbp, annual_exp_gbp. Spot rate at post_date if available, else today
- Upsert by reddit_id (ON CONFLICT DO NOTHING; --reextract forces UPDATE)
- Prometheus counters:
  - fire_examples_scraped_total{sub}
  - fire_examples_extracted_total{sub,confidence_bucket}
  - fire_examples_llm_fallback_total
  - fire_examples_extract_failed_total{reason}
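The regex pre-filter step is the cheapest part of the pipeline to illustrate. The sketch below takes the money alternation from the list above; the location keyword list is a four-entry placeholder (the real country/city list is not given in this doc), and passes_prefilter is a hypothetical name.

```python
import re

# Money alternation as listed in the data flow; word boundaries keep
# "EUR" from matching inside unrelated words.
MONEY_RE = re.compile(
    r"\$|£|€|\b(?:GBP|USD|EUR|million|net worth|portfolio)\b",
    re.IGNORECASE,
)

# Placeholder keywords only; the real list would be much longer.
LOCATION_KEYWORDS = ["Philippines", "Bali", "Lisbon", "Chiang Mai"]
LOCATION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in LOCATION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention BOTH money and a location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

Requiring both patterns trades recall for cost: a post that names a portfolio but no place never reaches the LLM.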
Error handling
| Failure | Behaviour |
|---|---|
| PRAW rate limit | Built-in PRAW exponential back-off; emit fire_examples_rate_limited_total |
| llama-cpp down | Fall through to claude-agent-service Tier 2 |
| claude-agent-service down | Log + skip; record fire_examples_extract_failed_total{reason="llm_unavailable"} for later replay |
| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
| One subreddit fails entirely | Other 11 still complete (gather(..., return_exceptions=True)). Job exits 0 if ≥half succeed; otherwise exit 2 |
| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
| FX rate lookup fails | Insert row with NULL portfolio_gbp / annual_exp_gbp; record raw_currency always |
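The "one subreddit fails entirely" row can be sketched as follows. This assumes the synchronous praw package, so each listing call is pushed to a worker thread with asyncio.to_thread (with the separate Async PRAW package that wrapper would not be needed). fetch_sub_blocking is a stand-in that returns fake post ids, not a real PRAW call, and the function names are hypothetical.

```python
import asyncio


def fetch_sub_blocking(sub: str) -> list[str]:
    """Stand-in for a blocking PRAW listing fetch; returns fake post ids."""
    if sub == "broken_sub":
        raise RuntimeError("simulated subreddit failure")
    return [f"{sub}_post_{i}" for i in range(3)]


async def fetch_sub(sub: str) -> list[str]:
    # Run the blocking call in a worker thread so subs overlap.
    return await asyncio.to_thread(fetch_sub_blocking, sub)


async def ingest(subs: list[str]):
    """Fan out one task per sub; a failing sub does not cancel the others."""
    results = await asyncio.gather(
        *(fetch_sub(s) for s in subs), return_exceptions=True
    )
    posts: dict[str, list[str]] = {}
    failed: list[str] = []
    for sub, result in zip(subs, results):
        if isinstance(result, BaseException):
            failed.append(sub)  # exception object returned in place of a result
        else:
            posts[sub] = result
    succeeded = len(subs) - len(failed)
    # Exit 0 when at least half of the subs succeeded, otherwise 2.
    exit_code = 0 if succeeded * 2 >= len(subs) else 2
    return posts, failed, exit_code
```

With return_exceptions=True, gather returns results in the same order as its inputs, which is what lets the loop pair each outcome with its subreddit.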
API surface
GET /api/examples?country=PH&fi_status=FIRE&limit=50
  → list of FireExample objects (post_url, portfolio_gbp, ...)

GET /api/examples/summary?country=PH
  → {
      country: "PH",
      count: 47,
      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
      sample_links: ["https://reddit.com/...", ...]  // top 5
    }
Simulator response gains an examples_overlay block keyed by the
scenario's target country, calling the same summary query under the
hood. No new auth — same FastAPI router and auth dependency as the
rest of the API.
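The per-field summary statistics can be sketched in plain Python. This is an illustration under assumptions: the real summary query presumably runs in SQL, summarise is a hypothetical helper, and the "inclusive" quantile method is chosen here because it interpolates linearly between neighbouring values. NULL amounts (for example rows where the FX lookup failed) are dropped before computing.

```python
from statistics import median, quantiles
from typing import Optional, Sequence


def summarise(values: Sequence[Optional[float]]) -> dict:
    """Return count, median, p25 and p75 over the non-NULL values."""
    clean = sorted(v for v in values if v is not None)
    if not clean:
        return {"count": 0, "median": None, "p25": None, "p75": None}
    if len(clean) == 1:
        # A single data point is its own median and quartiles.
        only = clean[0]
        return {"count": 1, "median": only, "p25": only, "p75": only}
    # n=4 yields the three quartile cut points; the middle one is the median.
    p25, _, p75 = quantiles(clean, n=4, method="inclusive")
    return {"count": len(clean), "median": median(clean), "p25": p25, "p75": p75}
```

Running the same helper over portfolio_gbp and annual_exp_gbp would produce the two nested blocks of the summary response.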
Testing
- Unit
  - filters.py regex coverage (money + location keywords; positives + negatives)
  - llm_extract.py with respx mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
  - Currency normalisation paths via fx.py (incl. FX-fetch failure)
- Integration (against test PG, like existing test_ingest_wealthfolio_pg.py)
  - service.py upsert / dedupe / --reextract paths
  - Summary query: median / p25 / p75 over realistic mixed dataset
- Fixture-driven regression suite
  - ~20 hand-picked real Reddit posts → JSON fixtures in tests/fixtures/reddit/ with expected ExtractedExample per fixture
  - Lets us regression-test prompt changes against ground truth
- E2E with mocked PRAW + LLM
  - respx mocks for both; full pipeline → assert DB rows
- No live Reddit hits in CI: opt-in via LIVE_REDDIT=1 for local runs only
Deployment
- Image: extend existing fire-planner Docker image (already has alembic + CLI)
- Bulk one-shot: K8s Job running python -m fire_planner.examples ingest --all --top=all,year
- Recurring delta: K8s CronJob (weekly, e.g. Sundays 04:00 UTC) running with --top=week
- Vault refs via ESO:
  - secret/viktor:trading_bot_reddit_client_id → env REDDIT_CLIENT_ID
  - secret/viktor:trading_bot_reddit_client_secret → env REDDIT_CLIENT_SECRET
- Plain env vars (Terraform configmap):
  - LLAMA_CPP_BASE_URL (same value as recruiter-responder)
  - CLAUDE_AGENT_SERVICE_URL (Tier 2 fallback)
  - REDDIT_USER_AGENT="fire-planner/0.1"
- Terraform: extend infra/stacks/fire-planner/ with the Job + CronJob resources
Out of scope (explicit YAGNI)
- Comments scraping
- Driving COL ratios or scenario priors from scraped data
- Continuous streaming
- Multi-language
- Re-running tax / withdrawal sims against scraped examples
- Live continuous PRAW IDLE-like stream
- UI/frontend (a single React page can come later as a separate spec)
Open considerations (revisit if signal is bad)
- Confidence threshold 0.5 is a guess. May tune after seeing real qwen3-8b output on a 200-post sample.
- "≥half subs succeed = exit 0": if real failure modes correlate (e.g. PRAW outage), this won't help. Alert on fire_examples_extract_failed_total ratio instead.
- Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned by historical karma. Top-of-year + weekly delta provide freshness.
Migration / rollout
- Land alembic 0006_fire_examples migration on pg-cluster-rw (cluster DB; no downtime, additive only)
- Land fire_planner/examples/ module + tests; CI green
- Land Terraform changes (Job + CronJob): Job runs once on apply, bulk-populates the table
- Add /api/examples router; bump fire-planner image
- Add examples_overlay to simulator response (last; previous steps are independent)
Each step is independently revertable.