# FIRE Reddit Examples Ingest — Design

**Status**: approved 2026-05-28
**Owner**: Viktor
**Module**: `fire_planner/examples/`

## Goal

Populate the FIRE planner DB with real-world examples of people pursuing
or living FIRE across different countries, so the planner can answer
questions like:

- *"What's the median portfolio for people FIRE'd in the Philippines?"*
- *"For a £400k portfolio, where are people living?"*
- *"What annual spend do people report in Bali vs Lisbon vs Chiang Mai?"*

The data is **informational only** — it feeds a new `/api/examples`
endpoint and a small overlay block in the simulator response. It does
**not** drive scenario inputs or COL ratios.

## Non-goals

- Comments scraping (posts only — comments are ~10× noisier)
- Driving COL ratios or scenario priors from scraped data (overlay only)
- Continuous streaming (weekly CronJob is enough)
- Multi-language (English subs only)
- Re-running tax / withdrawal logic on the scraped examples

## Decisions (locked from brainstorming session)

| Axis | Decision |
|---|---|
| Fields per example | Country, city, portfolio_gbp, annual_exp_gbp, age, family_size, fi_status (accumulating / coast / barista / lean / FIRE / fat), is_retired, post_url, source_sub, post_date |
| Extraction | Hybrid: cheap regex pre-filter → LLM JSON-schema extract |
| LLM backend | Primary: local llama-cpp (qwen3-8b, same instance as recruiter-responder). Tier 2 fallback: claude-agent-service when qwen3 confidence < 0.5 or JSON unparseable |
| Subreddits | r/financialindependence, r/leanfire, r/fatFIRE, r/coastFIRE, r/baristaFIRE, r/ExpatFIRE, r/EuropeFIRE, r/FIRE_Ind, r/AusFinance, r/CanadianFIRE, r/UKPersonalFinance, r/financialindependence_UK (12 subs) |
| Post selection | top-of-all-time (1000 cap) + top-of-year (~200) per sub. Weekly CronJob delta uses top-of-week |
| Reddit access | PRAW with existing app creds in Vault: `secret/viktor:trading_bot_reddit_client_id` and `secret/viktor:trading_bot_reddit_client_secret`. User-agent: `fire-planner/0.1` |
| Parallelism | Python asyncio (`asyncio.gather` across subs); not literal subagent dispatch |
| Module layout | New `fire_planner/examples/` mirroring `col/` shape |
| Simulator integration | Informational overlay only — simulator response gains `examples_overlay {median, p25, p75, count, sample_links[]}` keyed by target country |

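
The two-tier LLM backend in the table can be read as a small gate around two interchangeable callables. The sketch below is only an illustration under assumptions: `primary` and `fallback` are hypothetical stand-ins for the HTTP calls to llama-cpp and claude-agent-service, the `"primary"` / `"fallback"` tier labels are invented for the example, and applying the same 0.5 threshold to the Tier 2 answer is a guess the table does not confirm.

```python
import json
from typing import Callable, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.5  # value from the Decisions table

# Hypothetical type: a backend takes a prompt and returns the raw model text.
Backend = Callable[[str], str]


def parse_extraction(raw: str) -> Optional[dict]:
    """Return the decoded JSON object if it carries a numeric confidence, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    confidence = data.get("confidence")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    return data


def extract_with_fallback(
    prompt: str, primary: Backend, fallback: Backend
) -> Optional[Tuple[str, dict]]:
    """Try the primary model, then escalate once on bad JSON or low confidence."""
    for tier, backend in (("primary", primary), ("fallback", fallback)):
        result = parse_extraction(backend(prompt))
        if result is not None and result["confidence"] >= CONFIDENCE_THRESHOLD:
            return tier, result
    return None  # caller logs and skips; nothing is inserted
```

Passing the backends in as callables keeps the gate testable without any network access, which fits the mocked-endpoint approach in the Testing section.
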
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│  K8s Job (bulk one-shot) + K8s CronJob (weekly delta)       │
│         ↓                                                   │
│  fire_planner.examples.cli                                  │
│    ├─→ praw_source.py   (async PRAW, 12 subs in parallel)   │
│    │     gather() top-of-all-time + top-of-year             │
│    ├─→ filters.py       (cheap regex pre-filter)            │
│    ├─→ llm_extract.py   (qwen3-8b primary, schema-locked)   │
│    │     └─ Tier 2 fallback: claude-agent-service           │
│    └─→ service.py       (upsert into fire_planner.          │
│                          fire_example, dedupe by reddit_id) │
│         ↓                                                   │
│  Postgres (pg-cluster-rw, schema=fire_planner)              │
│         ↓                                                   │
│  fire_planner.api.examples  (FastAPI router)                │
│    ├─→ GET /examples?country=PH&fi_status=FIRE              │
│    └─→ GET /examples/summary?country=PH                     │
│          { median, p25, p75, count, sample_links[] }        │
│         ↓                                                   │
│  Simulator response gains `examples_overlay` per scenario   │
└─────────────────────────────────────────────────────────────┘
```

## Module layout

```
fire_planner/examples/
  __init__.py
  models.py        # SQLAlchemy ORM + Pydantic schemas
  praw_source.py   # async PRAW wrapper → RawPost
  filters.py       # MONEY_RE + LOCATION_RE keyword pre-filter
  llm_extract.py   # qwen3-8b call → ExtractedExample + confidence
                   # with claude-agent-service Tier 2 fallback
  service.py       # upsert, dedupe, summary queries
  cli.py           # `python -m fire_planner.examples ingest …`
fire_planner/api/examples.py   # FastAPI router
```

## Data model

Migration `alembic/versions/0006_fire_examples.py`:

```sql
CREATE TABLE fire_planner.fire_example (
    id              SERIAL PRIMARY KEY,
    reddit_id       VARCHAR(16) NOT NULL UNIQUE,   -- e.g. "abc123"
    source_sub      VARCHAR(64) NOT NULL,
    post_url        TEXT NOT NULL,
    post_date       DATE NOT NULL,
    post_title      TEXT NOT NULL,
    -- extracted fields
    country         VARCHAR(64),                   -- ISO country name or "unknown"
    city            VARCHAR(128),
    portfolio_gbp   NUMERIC(14,2),
    annual_exp_gbp  NUMERIC(12,2),
    age             SMALLINT,
    family_size     SMALLINT,
    fi_status       VARCHAR(24),                   -- accumulating|coastFIRE|
                                                   -- baristaFIRE|leanFIRE|
                                                   -- FIRE|fatFIRE|unknown
    is_retired      BOOLEAN,
    raw_currency    CHAR(3),                       -- pre-normalisation currency
    -- extraction metadata
    raw_excerpt     TEXT,                          -- ~500-char window
                                                   -- that produced the data
    llm_model       VARCHAR(64) NOT NULL,          -- "qwen3-8b" | "claude-…"
    llm_confidence  NUMERIC(3,2),                  -- 0.00–1.00
    extracted_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    -- audit
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ix_fire_example_country   ON fire_planner.fire_example(country);
CREATE INDEX ix_fire_example_fi_status ON fire_planner.fire_example(fi_status);
CREATE INDEX ix_fire_example_post_date ON fire_planner.fire_example(post_date);
```

Idempotent re-ingest via `reddit_id` UNIQUE. ON CONFLICT DO NOTHING by
default; `--reextract` CLI flag re-runs the LLM and overwrites.
Currency normalisation at extraction time via existing `fire_planner/fx.py`.

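
The two conflict behaviours can be tried out without a Postgres instance. The following is a rough sketch that uses an in-memory SQLite database as a stand-in (SQLite 3.24+ accepts the same `ON CONFLICT` clause shape); the three-column table and the `upsert` helper are simplifications invented for the example, not the schema or `service.py` API above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for fire_planner.fire_example: only the columns the demo needs.
conn.execute(
    "CREATE TABLE fire_example ("
    " reddit_id TEXT NOT NULL UNIQUE,"
    " country TEXT,"
    " llm_model TEXT NOT NULL)"
)

# Default path: a second ingest of the same post is silently ignored.
INSERT_DEFAULT = (
    "INSERT INTO fire_example (reddit_id, country, llm_model) VALUES (?, ?, ?) "
    "ON CONFLICT (reddit_id) DO NOTHING"
)
# --reextract path: the freshly extracted values overwrite the stored row.
INSERT_REEXTRACT = (
    "INSERT INTO fire_example (reddit_id, country, llm_model) VALUES (?, ?, ?) "
    "ON CONFLICT (reddit_id) DO UPDATE SET "
    "country = excluded.country, llm_model = excluded.llm_model"
)


def upsert(row: tuple, reextract: bool = False) -> None:
    """Insert one row, choosing the conflict behaviour from the flag."""
    conn.execute(INSERT_REEXTRACT if reextract else INSERT_DEFAULT, row)
    conn.commit()


upsert(("abc123", "Philippines", "qwen3-8b"))
upsert(("abc123", "Thailand", "qwen3-8b"))  # ignored: reddit_id already present
first = conn.execute("SELECT country FROM fire_example").fetchall()

upsert(("abc123", "Thailand", "qwen3-8b"), reextract=True)  # overwrites
second = conn.execute("SELECT country FROM fire_example").fetchall()
```

A real `--reextract` would presumably also refresh `extracted_at` and the remaining extracted columns; that detail is left out here.
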
## Data flow

1. **CLI/Job spawns** — reads target sub list (default 12) and `top`
   modes from config / flags
2. **Fan-out** — `asyncio.gather()` one coroutine per subreddit; each
   fetches the requested PRAW listings, dedupes by id within the sub,
   returns ~1100 raw posts per sub for the bulk run
3. **`filters.py`** — keep posts whose title+body match BOTH
   `MONEY_RE` (`$|£|€|GBP|USD|EUR|million|net worth|portfolio`) AND
   `LOCATION_RE` (country/city keyword list). Expected survival ~10–30 %
4. **`llm_extract.py`** — POST `{title, body, source_sub}` to
   llama-cpp endpoint with strict JSON-schema prompt. Returns
   `ExtractedExample(country, city, portfolio_native, annual_exp_native,
   raw_currency, age, family_size, fi_status, is_retired, confidence)`
5. **Confidence gate** — `confidence < 0.5` OR JSON parse failure →
   retry once via claude-agent-service with the same prompt. If that
   also fails, log + skip (counted, never inserted)
6. **Currency normalisation** — `fx.py` → `portfolio_gbp`,
   `annual_exp_gbp`. Spot rate at `post_date` if available, else
   today's rate
7. **Upsert** by `reddit_id` (ON CONFLICT DO NOTHING; `--reextract`
   forces UPDATE)
8. **Prometheus counters**:
   - `fire_examples_scraped_total{sub}`
   - `fire_examples_extracted_total{sub,confidence_bucket}`
   - `fire_examples_llm_fallback_total`
   - `fire_examples_extract_failed_total{reason}`

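
Step 3 is the cheapest stage to prototype. A minimal sketch follows, assuming the money alternatives are matched case-insensitively and that the location list is much longer in practice; the six place names and the `passes_prefilter` helper are placeholders invented here. Note that `$` has to be escaped when the alternation is written as a real pattern, because a bare `$` is an end-of-string anchor and would match every post.

```python
import re

# Alternatives taken from the MONEY_RE description in step 3, with "$" escaped.
MONEY_RE = re.compile(
    r"\$|£|€|\b(?:GBP|USD|EUR|million|net worth|portfolio)\b",
    re.IGNORECASE,
)

# Hypothetical, deliberately tiny keyword list; the real one is not given in this doc.
LOCATION_RE = re.compile(
    r"\b(?:Philippines|Bali|Lisbon|Chiang Mai|Portugal|Thailand)\b",
    re.IGNORECASE,
)


def passes_prefilter(title: str, body: str) -> bool:
    """Keep a post only if title+body mention both money and a known location."""
    text = f"{title}\n{body}"
    return bool(MONEY_RE.search(text)) and bool(LOCATION_RE.search(text))
```

For example, `passes_prefilter("FIRE'd in Chiang Mai", "Portfolio is $600k")` is true, while a post that mentions Lisbon but no money terms is dropped before it ever reaches the LLM.
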
## Error handling

| Failure | Behaviour |
|---|---|
| PRAW rate limit | Built-in PRAW exponential back-off; emit `fire_examples_rate_limited_total` |
| llama-cpp down | Fall through to claude-agent-service Tier 2 |
| claude-agent-service down | Log + skip; record `fire_examples_extract_failed_total{reason="llm_unavailable"}` for later replay |
| LLM returns malformed JSON | One retry with stricter "ONLY JSON, no prose" prompt, then Tier 2 |
| One subreddit fails entirely | Other 11 still complete (`gather(..., return_exceptions=True)`). Job exits 0 if ≥half succeed; otherwise exit 2 |
| reddit_id collision | ON CONFLICT DO NOTHING (idempotent re-run) |
| FX rate lookup fails | Insert row with NULL `portfolio_gbp` / `annual_exp_gbp`; record `raw_currency` always |

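
The "one subreddit fails entirely" row maps fairly directly onto `asyncio.gather(..., return_exceptions=True)`. The sketch below is an assumption-laden illustration: `fetch` is a hypothetical coroutine standing in for the PRAW wrapper (how the real fetcher becomes awaitable, for example via the separate Async PRAW package or a thread offload, is not specified here), and `ingest_all` and `fake_fetch` are names made up for the example.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fetcher signature: subreddit name in, list of raw posts out.
Fetcher = Callable[[str], Awaitable[list]]


async def ingest_all(subs: list[str], fetch: Fetcher) -> tuple[dict[str, list], int]:
    """Fetch every sub concurrently; one failure must not cancel the others."""
    results = await asyncio.gather(
        *(fetch(sub) for sub in subs), return_exceptions=True
    )
    posts: dict[str, list] = {}
    failed: list[str] = []
    for sub, result in zip(subs, results):
        if isinstance(result, BaseException):
            failed.append(sub)  # a real job would also log and count this
        else:
            posts[sub] = result
    succeeded = len(subs) - len(failed)
    # Exit 0 when at least half of the subs succeeded, otherwise 2.
    exit_code = 0 if succeeded * 2 >= len(subs) else 2
    return posts, exit_code


async def fake_fetch(sub: str) -> list:
    """Stand-in fetcher that simulates an outage for a single subreddit."""
    if sub == "r/leanfire":
        raise RuntimeError("simulated outage")
    return [f"{sub}-post"]


posts, code = asyncio.run(
    ingest_all(["r/leanfire", "r/fatFIRE", "r/ExpatFIRE"], fake_fetch)
)
```

Here two of three subs succeed, so `code` is 0 and `posts` holds only the two healthy subs.
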
## API surface

```
GET /api/examples?country=PH&fi_status=FIRE&limit=50
  → list of FireExample objects (post_url, portfolio_gbp, ...)

GET /api/examples/summary?country=PH
  → {
      country: "PH",
      count: 47,
      portfolio_gbp:  { median: 420000, p25: 180000, p75: 740000 },
      annual_exp_gbp: { median: 14400,  p25: 9000,   p75: 22000 },
      sample_links: ["https://reddit.com/...", ...]   // top 5
    }
```

Simulator response gains an `examples_overlay` block keyed by the
scenario's target country, calling the same summary query under the
hood. No new auth — same FastAPI router and auth dependency as the
rest of the API.

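
The `{ median, p25, p75 }` blocks can be illustrated in plain Python. This is a sketch only: the doc does not say how the summary query computes percentiles, so the linear-interpolation ("inclusive") method and the handling of NULLs and single-row groups below are assumptions, and `summarise` is a made-up helper name.

```python
from statistics import median, quantiles
from typing import Iterable, Optional


def summarise(values: Iterable[Optional[float]]) -> dict:
    """Median and quartiles over the non-NULL values of one column."""
    clean = [float(v) for v in values if v is not None]
    if not clean:
        return {"count": 0, "median": None, "p25": None, "p75": None}
    if len(clean) == 1:
        # Guard the single-row case, where a quartile split is not meaningful.
        only = clean[0]
        return {"count": 1, "median": only, "p25": only, "p75": only}
    # n=4 yields the three quartile cut points; "inclusive" interpolates
    # linearly between the observed minimum and maximum.
    p25, _, p75 = quantiles(clean, n=4, method="inclusive")
    return {"count": len(clean), "median": median(clean), "p25": p25, "p75": p75}
```

Rows whose GBP amounts are NULL (for instance after a failed FX lookup) are skipped, so `count` here reflects usable values rather than total rows; whether the real endpoint counts the same way is an open choice.
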
## Testing

- **Unit**
  - `filters.py` regex coverage (money + location keywords; positives + negatives)
  - `llm_extract.py` with `respx` mocking llama-cpp + claude-agent-service endpoints; JSON parsing + confidence gate + Tier 2 escalation
  - Currency normalisation paths via `fx.py` (incl. FX-fetch failure)
- **Integration** (against test PG, like existing `test_ingest_wealthfolio_pg.py`)
  - `service.py` upsert / dedupe / `--reextract` paths
  - Summary query: median / p25 / p75 over realistic mixed dataset
- **Fixture-driven regression suite**
  - ~20 hand-picked real Reddit posts → JSON fixtures in
    `tests/fixtures/reddit/` with expected `ExtractedExample` per fixture
  - Lets us regression-test prompt changes against ground truth
- **E2E with mocked PRAW + LLM**
  - `respx` mocks for both; full pipeline → assert DB rows
- **No live Reddit hits in CI** — opt-in via `LIVE_REDDIT=1` for local runs only

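
The currency-normalisation unit tests, including the FX-failure path, could target something shaped like the sketch below. Everything in it is hypothetical: `fx.py`'s real interface is not shown in this doc, so `RateLookup`, `Normalised`, and `normalise` are invented names, the lookup is assumed to raise `LookupError` when no rate is available, and the 0.80 rate in the usage line is a made-up number.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable, Optional

# Hypothetical stand-in for fx.py: GBP per one unit of the given currency.
RateLookup = Callable[[str], Decimal]


@dataclass
class Normalised:
    raw_currency: str
    portfolio_gbp: Optional[Decimal]
    annual_exp_gbp: Optional[Decimal]


def _to_gbp(amount: Optional[Decimal], rate: Decimal) -> Optional[Decimal]:
    """Convert and round to pence; missing amounts stay missing."""
    if amount is None:
        return None
    return (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def normalise(
    portfolio_native: Optional[Decimal],
    annual_exp_native: Optional[Decimal],
    raw_currency: str,
    lookup: RateLookup,
) -> Normalised:
    """Convert native amounts to GBP; on FX failure keep only raw_currency."""
    try:
        rate = Decimal("1") if raw_currency == "GBP" else lookup(raw_currency)
    except LookupError:
        # Mirrors the error-handling table: row is still inserted, GBP fields NULL.
        return Normalised(raw_currency, None, None)
    return Normalised(
        raw_currency,
        _to_gbp(portfolio_native, rate),
        _to_gbp(annual_exp_native, rate),
    )


rates = {"USD": Decimal("0.80")}  # illustrative rate only
ok = normalise(Decimal("500000"), Decimal("30000"), "USD", rates.__getitem__)
missing = normalise(Decimal("1000000"), None, "PHP", rates.__getitem__)
```

Using `Decimal` rather than `float` keeps the values compatible with the `NUMERIC` columns in the data model.
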
## Deployment

- **Image**: extend existing `fire-planner` Docker image (already has alembic + CLI)
- **Bulk one-shot**: K8s `Job` running
  `python -m fire_planner.examples ingest --all --top=all,year`
- **Recurring delta**: K8s `CronJob` (weekly, e.g. Sundays 04:00 UTC)
  running with `--top=week`
- **Vault refs via ESO**:
  - `secret/viktor:trading_bot_reddit_client_id` → env `REDDIT_CLIENT_ID`
  - `secret/viktor:trading_bot_reddit_client_secret` → env `REDDIT_CLIENT_SECRET`
- **Plain env vars** (Terraform configmap):
  - `LLAMA_CPP_BASE_URL` (same value as recruiter-responder)
  - `CLAUDE_AGENT_SERVICE_URL` (Tier 2 fallback)
  - `REDDIT_USER_AGENT="fire-planner/0.1"`
- **Terraform**: extend `infra/stacks/fire-planner/` with the Job + CronJob resources

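
The Job and CronJob command lines suggest a small `argparse` surface. The following is a sketch of how those flags might be parsed, not the actual `cli.py`: the flag names `--all`, `--top`, and `--reextract` come from this doc, while the set of valid modes, the `week` default, and the help strings are assumptions.

```python
import argparse

# Listing modes mentioned in this doc: all-time, year, and week.
VALID_TOP_MODES = {"all", "year", "week"}


def parse_top(value: str) -> list[str]:
    """Turn "all,year" into ["all", "year"], rejecting unknown modes."""
    modes = [m.strip() for m in value.split(",") if m.strip()]
    unknown = [m for m in modes if m not in VALID_TOP_MODES]
    if not modes or unknown:
        raise argparse.ArgumentTypeError(f"invalid --top value: {value!r}")
    return modes


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m fire_planner.examples")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest = sub.add_parser("ingest", help="fetch, filter, extract and upsert posts")
    ingest.add_argument("--all", action="store_true",
                        help="ingest every configured subreddit")
    ingest.add_argument("--top", type=parse_top, default=["week"],
                        help="comma-separated listing modes (assumed default: week)")
    ingest.add_argument("--reextract", action="store_true",
                        help="re-run the LLM and overwrite existing rows")
    return parser


bulk = build_parser().parse_args(["ingest", "--all", "--top=all,year"])
weekly = build_parser().parse_args(["ingest", "--top=week"])
```

Validating `--top` at parse time means a typo in the Terraform-managed command fails fast instead of silently fetching nothing.
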
## Out of scope (explicit YAGNI)

- Comments scraping
- Driving COL ratios or scenario priors from scraped data
- Continuous streaming
- Multi-language
- Re-running tax / withdrawal sims against scraped examples
- Live continuous PRAW IDLE-like stream
- UI/frontend (a single React page can come later as a separate spec)

## Open considerations (revisit if signal is bad)

- **Confidence threshold 0.5** is a guess. May tune after seeing real
  qwen3-8b output on a 200-post sample.
- **"≥half subs succeed = exit 0"** — if real failure modes correlate
  (e.g. PRAW outage), this won't help. Alert on
  `fire_examples_extract_failed_total` ratio instead.
- **Top-of-all-time per sub plateaus on whichever ~1000 posts are pinned
  by historical karma.** Top-of-year + weekly delta provide freshness.

## Migration / rollout

1. Land alembic 0006_fire_examples migration on `pg-cluster-rw`
   (cluster DB; no downtime — additive only)
2. Land `fire_planner/examples/` module + tests; CI green
3. Land Terraform changes (Job + CronJob) — Job runs once on apply,
   bulk-populates the table
4. Add `/api/examples` router; bump fire-planner image
5. Add `examples_overlay` to simulator response (last; previous steps
   are independent)

Each step is independently revertable.