fire-planner/docs/plans/2026-05-28-reddit-examples-design.md
Viktor Barzin 0907a31e0c
fire-planner: design — Reddit FIRE examples ingest
Brainstorm + design doc for a new fire_planner/examples/ module that
scrapes 12 FIRE subreddits via PRAW, extracts structured fields with a
local qwen3-8b LLM (claude-agent-service Tier 2 fallback), and exposes
the data as an informational overlay in the simulator response.

Locked decisions: PRAW + asyncio, hybrid regex+LLM extraction,
informational overlay only (no scenario-priors coupling), new
fire_planner/examples/ module mirroring col/ shape.
2026-05-28 21:32:32 +00:00


# FIRE Reddit Examples Ingest — Design
**Status**: approved 2026-05-28
**Owner**: Viktor
**Module**: `fire_planner/examples/`
## Goal
Populate the FIRE planner DB with real-world examples of people pursuing
or living FIRE across different countries, so the planner can answer
questions like:
- *"What's the median portfolio for people FIRE'd in the Philippines?"*
- *"For a £400k portfolio, where are people living?"*
- *"What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"*
The data is **informational only** — it feeds a new `/api/examples`
endpoint and a small overlay block in the simulator response. It does
**not** drive scenario inputs or COL ratios.
## Non-goals
- Comments scraping (posts only — comments are ~10× noisier)
- Driving COL ratios or scenario priors from scraped data (overlay only)
- Continuous streaming (weekly CronJob is enough)
- Multi-language (English subs only)
- Re-running tax / withdrawal logic on the scraped examples
## Decisions (locked from brainstorming session)
| Axis | Decision |
|---|---|
| Fields per example | Country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coast / barista / lean / FIRE / fat), is_retired, post_url, source_sub, post_date |
| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
| Reddit access | PRAW with existing app creds in Vault `secret/viktor:trading_bot_reddit_client_id` / `_secret`. user-agent: `fire-planner/0.1` |
| Parallelism | Python asyncio (`asyncio.gather` across subs); not literal subagent dispatch |
| Module layout | New `fire_planner/examples/` mirroring `col/` shape |
| Simulator integration | Informational overlay only: the simulator response gains `examples_overlay {median, p25, p75, count, sample_links[]}` keyed by target country |
## Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│ K8s Job (bulk one-shot) + K8s CronJob (weekly delta)              │
│            ↓                                                      │
│ fire_planner.examples.cli                                         │
│   ├─→ praw_source.py   (async PRAW, 12 subs in parallel)          │
│   │                     gather() top-of-all-time + top-of-year    │
│   ├─→ filters.py       (cheap regex pre-filter)                   │
│   ├─→ llm_extract.py   (qwen3-8b primary, schema-locked)          │
│   │                     └─ Tier 2 fallback: claude-agent-service  │
│   └─→ service.py       (upsert into fire_planner.                 │
│                         fire_example, dedupe by reddit_id)        │
│            ↓                                                      │
│ Postgres (pg-cluster-rw, schema=fire_planner)                     │
│            ↓                                                      │
│ fire_planner.api.examples (FastAPI router)                        │
│   ├─→ GET /examples?country=PH&fi_status=FIRE                     │
│   └─→ GET /examples/summary?country=PH                            │
│           { median, p25, p75, count, sample_links[] }             │
│            ↓                                                      │
│ Simulator response gains `examples_overlay` per scenario          │
└───────────────────────────────────────────────────────────────────┘
```
## Module layout
```
fire_planner/examples/
    __init__.py
    models.py          # SQLAlchemy ORM + Pydantic schemas
    praw_source.py     # async PRAW wrapper → RawPost
    filters.py         # MONEY_RE + LOCATION_RE keyword pre-filter
    llm_extract.py     # qwen3-8b call → ExtractedExample + confidence
                       # with claude-agent-service Tier 2 fallback
    service.py         # upsert, dedupe, summary queries
    cli.py             # `python -m fire_planner.examples ingest …`
fire_planner/api/examples.py  # FastAPI router
```
## Data model
Migration `alembic/versions/0006_fire_examples.py`:
```sql
CREATE TABLE fire_planner.fire_example (
    id              SERIAL PRIMARY KEY,
    reddit_id       VARCHAR(16) NOT NULL UNIQUE,  -- e.g. "abc123"
    source_sub      VARCHAR(64) NOT NULL,
    post_url        TEXT NOT NULL,
    post_date       DATE NOT NULL,
    post_title      TEXT NOT NULL,
    -- extracted fields
    country         VARCHAR(64),                  -- ISO country name or "unknown"
    city            VARCHAR(128),
    portfolio_gbp   NUMERIC(14,2),
    annual_exp_gbp  NUMERIC(12,2),
    age             SMALLINT,
    family_size     SMALLINT,
    fi_status       VARCHAR(24),                  -- accumulating|coastFIRE|
                                                  -- baristaFIRE|leanFIRE|
                                                  -- FIRE|fatFIRE|unknown
    is_retired      BOOLEAN,
    raw_currency    CHAR(3),                      -- pre-normalisation currency
    -- extraction metadata
    raw_excerpt     TEXT,                         -- ~500-char window
                                                  -- that produced the data
    llm_model       VARCHAR(64) NOT NULL,         -- "qwen3-8b" | "claude-…"
    llm_confidence  NUMERIC(3,2),                 -- 0.00-1.00
    extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- audit
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_fire_example_country ON fire_planner.fire_example(country);
CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);
```
Idempotent re-ingest via `reddit_id` UNIQUE. ON CONFLICT DO NOTHING by
default; `--reextract` CLI flag re-runs the LLM and overwrites.
Currency normalisation at extraction time via existing `fire_planner/fx.py`.
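The two conflict behaviours above can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module as a stand-in for Postgres (SQLite 3.24+ accepts the same `ON CONFLICT` clause). The trimmed table, the `upsert_example` helper, and its `reextract` parameter are assumptions made for the example, not the module's actual API.

```python
import sqlite3

# Trimmed stand-in for fire_planner.fire_example (assumption: three columns only).
DDL = """
CREATE TABLE fire_example (
    id            INTEGER PRIMARY KEY,
    reddit_id     TEXT NOT NULL UNIQUE,
    country       TEXT,
    portfolio_gbp NUMERIC
)
"""

# Default path: a second write for the same reddit_id is silently skipped.
INSERT_SKIP = """
INSERT INTO fire_example (reddit_id, country, portfolio_gbp)
VALUES (:reddit_id, :country, :portfolio_gbp)
ON CONFLICT (reddit_id) DO NOTHING
"""

# --reextract path: the existing row is overwritten with the new extraction.
INSERT_OVERWRITE = """
INSERT INTO fire_example (reddit_id, country, portfolio_gbp)
VALUES (:reddit_id, :country, :portfolio_gbp)
ON CONFLICT (reddit_id) DO UPDATE SET
    country = excluded.country,
    portfolio_gbp = excluded.portfolio_gbp
"""


def upsert_example(conn, row, reextract=False):
    """Insert a row; on a reddit_id clash, skip it or overwrite when reextract is set."""
    conn.execute(INSERT_OVERWRITE if reextract else INSERT_SKIP, row)


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
upsert_example(conn, {"reddit_id": "abc123", "country": "PH", "portfolio_gbp": 400000})
upsert_example(conn, {"reddit_id": "abc123", "country": "TH", "portfolio_gbp": 1})
upsert_example(
    conn, {"reddit_id": "abc123", "country": "PT", "portfolio_gbp": 420000}, reextract=True
)
```

Keeping the default as `DO NOTHING` means a weekly delta run never spends LLM calls on, or silently changes, rows that were already extracted.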
## Data flow
1. **CLI/Job spawns**: reads the target sub list (default 12) and `top`
   modes from config / flags
2. **Fan-out**: `asyncio.gather()` runs one coroutine per subreddit; each
   fetches the requested PRAW listings, dedupes by id within the sub, and
   returns ~1100 raw posts per sub for the bulk run
3. **`filters.py`**: keep posts whose title+body match BOTH
   `MONEY_RE` (`$|£|€|GBP|USD|EUR|million|net worth|portfolio`) AND
   `LOCATION_RE` (country/city keyword list). Expected survival ~10-30 %
4. **`llm_extract.py`**: POST `{title, body, source_sub}` to the
   llama-cpp endpoint with a strict JSON-schema prompt. Returns
   `ExtractedExample(country, city, portfolio_native, annual_exp_native,
   raw_currency, age, family_size, fi_status, is_retired, confidence)`
5. **Confidence gate**: on `confidence < 0.5` OR JSON parse failure,
   retry once via claude-agent-service with the same prompt. If that also
   fails, log + skip (counted, never inserted)
6. **Currency normalisation**: `fx.py` converts the native amounts to
   `portfolio_gbp` and `annual_exp_gbp`, using the spot rate at
   `post_date` if available, else today's rate
7. **Upsert** by `reddit_id` (ON CONFLICT DO NOTHING; `--reextract`
   forces UPDATE)
8. **Prometheus counters**:
   - `fire_examples_scraped_total{sub}`
   - `fire_examples_extracted_total{sub,confidence_bucket}`
   - `fire_examples_llm_fallback_total`
   - `fire_examples_extract_failed_total{reason}`
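Steps 3 and 5 above can be sketched in a few lines. This is an illustration under stated assumptions, not the planned implementation: the location keyword list is a hypothetical, heavily shortened one, case-insensitive matching is a choice made here, the LLM backends are passed in as plain callables returning raw JSON text, and the same confidence threshold is assumed to apply to the Tier 2 answer.

```python
import json
import re
from typing import Callable, Optional

# Money pattern as listed in step 3; `$` is escaped because it is a regex metacharacter.
MONEY_RE = re.compile(r"\$|£|€|GBP|USD|EUR|million|net worth|portfolio", re.IGNORECASE)
# Hypothetical, shortened keyword list; the real list is not specified in this doc.
LOCATION_RE = re.compile(r"\b(?:philippines|bali|lisbon|chiang mai)\b", re.IGNORECASE)

CONFIDENCE_THRESHOLD = 0.5


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body match BOTH the money and location patterns."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text) and LOCATION_RE.search(text))


def parse_if_confident(raw: str, threshold: float = CONFIDENCE_THRESHOLD) -> Optional[dict]:
    """Return the parsed object, or None on malformed JSON or low confidence."""
    try:
        parsed = json.loads(raw)
        confidence = float(parsed["confidence"])
    except (ValueError, KeyError, TypeError):
        return None
    return parsed if confidence >= threshold else None


def extract(
    post: dict,
    primary: Callable[[dict], str],
    tier2: Callable[[dict], str],
) -> Optional[dict]:
    """Try the primary backend, then Tier 2; None means log + skip."""
    for call in (primary, tier2):
        try:
            result = parse_if_confident(call(post))
        except ConnectionError:
            result = None  # backend unavailable: fall through to the next tier
        if result is not None:
            return result
    return None
```

One caveat of the literal pattern: without word boundaries, case-insensitive `EUR` also matches inside words such as "Europe", so the pre-filter errs on the side of letting posts through to the LLM.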
## Error handling
| Failure | Behaviour |
|---|---|
| PRAW rate limit | Built-in PRAW exponential back-off; emit `fire_examples_rate_limited_total` |
| llama-cpp down | Fall through to claude-agent-service Tier 2 |
| claude-agent-service down | Log + skip; record `fire_examples_extract_failed_total{reason="llm_unavailable"}` for later replay |
| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
| One subreddit fails entirely | Other 11 still complete (`gather(..., return_exceptions=True)`). Job exits 0 if at least half of the subs succeed; otherwise exit 2 |
| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
| FX rate lookup fails | Insert row with NULL `portfolio_gbp` / `annual_exp_gbp`; record `raw_currency` always |
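The "one subreddit fails" row can be sketched like this. The `fake_fetch` coroutine is a hypothetical stand-in for the real per-subreddit fetch, and `ingest_all` is an illustrative name. Note that PRAW itself is synchronous, so a real wrapper would presumably use Async PRAW or `asyncio.to_thread`; the sketch sidesteps that.

```python
import asyncio


async def fake_fetch(sub: str) -> list[str]:
    """Hypothetical stand-in for the per-subreddit fetch; 'down_sub' simulates an outage."""
    await asyncio.sleep(0)
    if sub == "down_sub":
        raise RuntimeError(f"{sub}: simulated failure")
    return [f"{sub}-1", f"{sub}-2"]


async def ingest_all(subs, fetch=fake_fetch):
    """Fetch all subs concurrently; one failure must not cancel the others."""
    results = await asyncio.gather(*(fetch(s) for s in subs), return_exceptions=True)
    posts, failed = {}, []
    for sub, result in zip(subs, results):
        if isinstance(result, BaseException):
            failed.append(sub)  # exception object returned in place of a result
        else:
            posts[sub] = result
    # Exit 0 when at least half of the subs succeeded, otherwise 2.
    exit_code = 0 if 2 * len(posts) >= len(subs) else 2
    return posts, failed, exit_code


if __name__ == "__main__":
    posts, failed, code = asyncio.run(ingest_all(["leanfire", "fatFIRE", "down_sub"]))
    print(sorted(posts), failed, code)
```

With `return_exceptions=True`, `gather` returns exception objects in the result list instead of raising the first one, which is what lets the remaining subs finish.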
## API surface
```
GET /api/examples?country=PH&fi_status=FIRE&limit=50
  → list of FireExample objects (post_url, portfolio_gbp, ...)
GET /api/examples/summary?country=PH
  → {
      country: "PH",
      count: 47,
      portfolio_gbp: { median: 420000, p25: 180000, p75: 740000 },
      annual_exp_gbp: { median: 14400, p25: 9000, p75: 22000 },
      sample_links: ["https://reddit.com/...", ...]  // top 5
    }
```
Simulator response gains an `examples_overlay` block keyed by the
scenario's target country, calling the same summary query under the
hood. No new auth: same FastAPI router and auth dependency as the
rest of the API.
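As a rough sketch of what the summary shape implies, the following computes the same fields in plain Python. The function names, the in-memory `rows` input, linear ("inclusive") interpolation for the quartiles, and taking the first five links as the sample are all assumptions; the real query would presumably run in Postgres.

```python
from statistics import quantiles


def summarise(values):
    """Return median/p25/p75 over non-NULL values, or None if there are none."""
    vals = sorted(v for v in values if v is not None)
    if not vals:
        return None
    if len(vals) == 1:
        return {"median": vals[0], "p25": vals[0], "p75": vals[0]}
    # n=4 yields the three quartile cut points in ascending order.
    p25, median, p75 = quantiles(vals, n=4, method="inclusive")
    return {"median": median, "p25": p25, "p75": p75}


def build_summary(country, rows, max_links=5):
    """Build the /examples/summary payload for one country from row dicts."""
    matching = [r for r in rows if r["country"] == country]
    return {
        "country": country,
        "count": len(matching),
        "portfolio_gbp": summarise(r["portfolio_gbp"] for r in matching),
        "annual_exp_gbp": summarise(r["annual_exp_gbp"] for r in matching),
        # Assumption: "top 5" is taken here as simply the first five matches.
        "sample_links": [r["post_url"] for r in matching[:max_links]],
    }
```

Rows whose FX lookup failed carry NULL amounts; the sketch counts them in `count` but leaves them out of the percentiles, which is one possible reading of the error-handling table.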
## Testing
- **Unit**
  - `filters.py` regex coverage (money + location keywords; positives + negatives)
  - `llm_extract.py` with `respx` mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
  - Currency normalisation paths via `fx.py` (incl. FX-fetch failure)
- **Integration** (against test PG, like existing `test_ingest_wealthfolio_pg.py`)
  - `service.py` upsert / dedupe / `--reextract` paths
  - Summary query: median / p25 / p75 over realistic mixed dataset
- **Fixture-driven regression suite**
  - ~20 hand-picked real Reddit posts as JSON fixtures in
    `tests/fixtures/reddit/`, with an expected `ExtractedExample` per fixture
  - Lets us regression-test prompt changes against ground truth
- **E2E with mocked PRAW + LLM**
  - `respx` mocks for both; run the full pipeline, then assert on DB rows
- **No live Reddit hits in CI**: opt-in via `LIVE_REDDIT=1` for local runs only
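The fixture-driven suite could compare each extraction against its ground truth with a helper along these lines. The fixture layout (a JSON object with `post` and `expected` keys), the `mismatched_fields` name, and the 5 % tolerance on `*_gbp` amounts are assumptions made for illustration only.

```python
import json

# Hypothetical fixture layout: the raw post plus the expected extraction.
FIXTURE = json.loads("""
{
  "post": {"title": "Retired in Cebu at 45", "body": "..."},
  "expected": {"country": "PH", "portfolio_gbp": 400000, "is_retired": true}
}
""")


def mismatched_fields(expected: dict, actual: dict, money_tolerance: float = 0.05) -> list:
    """Return the names of expected fields that the extraction got wrong."""
    bad = []
    for field, want in expected.items():
        got = actual.get(field)
        if field.endswith("_gbp") and isinstance(want, (int, float)):
            # Money fields: allow a small relative error (rounding, FX drift).
            ok = isinstance(got, (int, float)) and abs(got - want) <= money_tolerance * abs(want)
        else:
            ok = got == want
        if not ok:
            bad.append(field)
    return bad
```

A test would then assert `mismatched_fields(FIXTURE["expected"], extracted) == []` for each fixture, so a prompt change that degrades one field shows up by name in the failure.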
## Deployment
- **Image**: extend existing `fire-planner` Docker image (already has alembic + CLI)
- **Bulk one-shot**: K8s `Job` running
`python -m fire_planner.examples ingest --all --top=all,year`
- **Recurring delta**: K8s `CronJob` (weekly, e.g. Sundays 04:00 UTC)
running with `--top=week`
- **Vault refs via ESO**:
  - `secret/viktor:trading_bot_reddit_client_id` → env `REDDIT_CLIENT_ID`
  - `secret/viktor:trading_bot_reddit_client_secret` → env `REDDIT_CLIENT_SECRET`
- **Plain env vars** (Terraform configmap):
  - `LLAMA_CPP_BASE_URL` (same value as recruiter-responder)
  - `CLAUDE_AGENT_SERVICE_URL` (Tier 2 fallback)
  - `REDDIT_USER_AGENT="fire-planner/0.1"`
- **Terraform**: extend `infra/stacks/fire-planner/` with the Job + CronJob resources
## Out of scope (explicit YAGNI)
- Comments scraping
- Driving COL ratios or scenario priors from scraped data
- Continuous streaming
- Multi-language
- Re-running tax / withdrawal sims against scraped examples
- Live continuous PRAW IDLE-like stream
- UI/frontend (a single React page can come later as a separate spec)
## Open considerations (revisit if signal is bad)
- **Confidence threshold 0.5** is a guess. May tune after seeing real
qwen3-8b output on a 200-post sample.
- **"≥half subs succeed = exit 0"**: if real failure modes are correlated
(e.g. a PRAW outage takes out every sub at once), this rule won't help. Alert on the
`fire_examples_extract_failed_total` ratio instead.
- **Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned
by historical karma.** Top-of-year + weekly delta provide freshness.
## Migration / rollout
1. Land alembic 0006_fire_examples migration on `pg-cluster-rw`
(cluster DB; no downtime, additive only)
2. Land `fire_planner/examples/` module + tests; CI green
3. Land Terraform changes (Job + CronJob); the Job runs once on apply and
bulk-populates the table
4. Add `/api/examples` router; bump fire-planner image
5. Add `examples_overlay` to simulator response (last; previous steps
are independent)
Each step is independently revertable.