Dashboard now shows two 'Me' bars: realized gross (~£409k, from SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k, levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
317 lines
16 KiB
Markdown
317 lines
16 KiB
Markdown
# Runbook: job-hunter — passive job + comp scraper
|
|
|
|
Last updated: 2026-06-02
|
|
|
|
`job-hunter` is a passive job-market + compensation scraper in the `job-hunter`
|
|
namespace. It pulls open roles from ATS boards (Greenhouse / Lever / Ashby),
|
|
HN "Who is hiring", and levels.fyi comp medians into a CNPG Postgres DB, and
|
|
serves agent-friendly CLI queries (used by the `job-hunter` Claude skill). As
|
|
of 2026-06-02 it also accumulates **dated snapshots** so comp and hiring-volume
|
|
trends can be tracked over time.
|
|
|
|
## Where things live
|
|
|
|
| Thing | Location |
|
|
|---|---|
|
|
| Source code | Forgejo `https://forgejo.viktorbarzin.me/viktor/job-hunter` (NOT in the monorepo) |
|
|
| Image | `forgejo.viktorbarzin.me/viktor/job-hunter:latest` (CI builds on push; Keel rolls the Deployment) |
|
|
| Terraform stack | `infra/stacks/job-hunter/` (`main.tf` = Deployment/Service/ESO; `cronjob.tf` = weekly refresh) |
|
|
| Database | `pg-cluster-rw.dbaas.svc.cluster.local:5432/job_hunter`, role `job_hunter` (Vault `static-creds/pg-job-hunter`, 7d rotation) |
|
|
| App secrets | Vault `secret/job-hunter` → `webhook_bearer_token`, `cdio_api_key`, `smtp_username/password`, `digest_to/from_address` |
|
|
| Grafana | `https://grafana.viktorbarzin.me` → datasource **Job Hunter** (PG, read-only) |
|
|
| Claude skill | `~/.claude/skills/job-hunter/SKILL.md` |
|
|
| Weekly scrape | CronJob `job-hunter-refresh`, **Sundays 04:00 UTC** |
|
|
|
|
## Architecture
|
|
|
|
- **Sources** (`job_hunter/sources/`): `ats` (Greenhouse/Lever/Ashby JSON APIs, ~35 companies in `config/companies.yaml`), `hn` (Algolia), `levels_fyi` (comp medians), `linkedin_guest` (opt-in), `changedetection` (`/webhook/cdio` for non-ATS careers pages in `config/cdio_watches.yaml`).
|
|
- **Tables**: `companies`, `roles`, `comp_points`, `levels`, `fx_rates` (upsert-in-place, "current state"); `comp_snapshots`, `roles_snapshots` (append-only, one row per source-row per `snapshot_date` — the dated series). Snapshots are written as a side-effect of every upsert during a refresh.
|
|
- **The ATS fetch is resilient**: a board returning a permanent 4xx (404/410/403) is skipped with a warning; 5xx/network errors retry once then skip. One dead board cannot abort the whole run (regression fixed 2026-06-02 — Elastic's 404 had been taking down every refresh). Boards are fetched concurrently (bounded semaphore, default 8 in-flight).
|
|
|
|
---
|
|
|
|
## OPS
|
|
|
|
### Is it healthy?
|
|
|
|
```bash
|
|
# CronJob exists + last schedule/success
|
|
kubectl -n job-hunter get cronjob job-hunter-refresh
|
|
# Most recent run's pods + logs
|
|
kubectl -n job-hunter get jobs -l app=job-hunter --sort-by=.metadata.creationTimestamp
|
|
kubectl -n job-hunter logs -l job-name=$(kubectl -n job-hunter get jobs -o jsonpath='{.items[-1:].metadata.name}')
|
|
# Deployment (serves the CLI / webhook) is up
|
|
kubectl -n job-hunter get deploy job-hunter
|
|
# Data freshness — newest snapshot date should advance weekly
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter report --days 7 | jq '.source_mix'
|
|
```
|
|
|
|
Row-count sanity (via the read-only Grafana datasource or a direct exec):
|
|
|
|
```bash
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -c "import job_hunter" # smoke
|
|
```
|
|
|
|
### Manual refresh (off-schedule)
|
|
|
|
```bash
|
|
kubectl -n job-hunter exec deploy/job-hunter -- \
|
|
python -m job_hunter refresh --source ats --source hn --source levels_fyi
|
|
```
|
|
|
|
Or trigger the CronJob immediately:
|
|
|
|
```bash
|
|
kubectl -n job-hunter create job --from=cronjob/job-hunter-refresh jh-manual-$(date +%s)
|
|
```
|
|
|
|
### Seed / re-snapshot the dated series
|
|
|
|
Snapshots are written automatically on every refresh. To seed a baseline from
|
|
the current tables (idempotent — one row per source-row per day):
|
|
|
|
```bash
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot
|
|
# back-date a snapshot if needed:
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot --date 2026-06-01
|
|
```
|
|
|
|
### Add an ATS company
|
|
|
|
ATS companies are scraped from `config/companies.yaml` in the **Forgejo repo**
|
|
(not the monorepo). To add one:
|
|
|
|
1. Live-probe the slug returns HTTP 200 with London roles before adding it:
|
|
```bash
|
|
curl -s "https://boards-api.greenhouse.io/v1/boards/<slug>/jobs?content=true" -o /dev/null -w '%{http_code}\n'
|
|
# Lever: https://api.lever.co/v0/postings/<slug>?mode=json
|
|
# Ashby: https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true
|
|
```
|
|
2. Add a `{slug, display_name, ats_type, ats_id, careers_url}` block to `config/companies.yaml`, commit, push.
|
|
3. CI builds the image; Keel rolls the Deployment. The next refresh picks it up. (No Terraform change — config ships in the image.)
|
|
|
|
A board that later starts 404ing is skipped automatically; remove its entry
|
|
when the 404 is permanent (keeps logs clean).
|
|
|
|
### Add a changedetection.io watch (non-ATS firms)
|
|
|
|
Firms without a public ATS JSON API (Citadel, Two Sigma, G-Research, HRT, xAI,
|
|
Wise, Revolut, …) are diff-monitored via CDIO. Add to `config/cdio_watches.yaml`
|
|
in the Forgejo repo, then reconcile:
|
|
|
|
```bash
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed --dry-run # preview
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed # create
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconcile # list
|
|
```
|
|
|
|
Changes hit `/webhook/cdio`; comp/role extraction from the diff is manual or
|
|
LLM-side (CDIO only captures the changed text).
|
|
|
|
### Deploying (build triggers the rollout)
|
|
|
|
Deploys are **automatic on push to master** — we build the image, so CI also
|
|
drives the rollout (`.woodpecker.yml`: `build-and-push` tags `latest` +
|
|
`${CI_COMMIT_SHA:0:8}`, then a `deploy` step runs
|
|
`kubectl set image deployment/job-hunter ...:${SHA}` + `rollout status`). The
|
|
woodpecker-agent SA is cluster-admin, so no kubeconfig/RBAC is wired into the
|
|
step. Keel stays enrolled in parallel as a redundant net (finds the SHA already
|
|
running → no-op). So to ship code:
|
|
|
|
```bash
|
|
# in the job-hunter source repo (forgejo viktor/job-hunter)
|
|
git push origin master # → lint+test → build (latest + :<sha>) → set image → rollout
|
|
```
|
|
|
|
The **Deployment** rolls to the just-built `:<sha>`. The **CronJob** runs
|
|
`:latest` with `imagePullPolicy: Always`, so its next scheduled pod pulls the
|
|
newest image (no rollout needed for a CronJob). `image_tag = "latest"` in
|
|
`terragrunt.hcl` is just the TF baseline; the running Deployment digest is
|
|
whatever CI last set (`kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`).
|
|
|
|
**Versioning** is still semver — bump `pyproject.toml` and cut a `git tag
|
|
vX.Y.Z` to mark a release; that's the human version record, independent of the
|
|
`:<sha>` deploy tag (map a running SHA back to a version with `git describe`).
|
|
|
|
**Rollback**: `kubectl -n job-hunter rollout undo deployment/job-hunter` (last
|
|
ReplicaSet), or push a revert commit (CI redeploys the reverted SHA).
|
|
|
|
### Applying the Terraform stack
|
|
|
|
```bash
|
|
cd infra/stacks/job-hunter
|
|
scripts/tg plan # vault login -method=oidc first
|
|
scripts/tg apply
|
|
```
|
|
|
|
The DB password rotates every 7 days (Vault static role `pg-job-hunter`);
|
|
Reloader restarts the Deployment when the ESO-synced secret changes. The
|
|
Grafana datasource password is mirrored via a second ExternalSecret in the
|
|
`monitoring` namespace.
|
|
|
|
### Common failures
|
|
|
|
| Symptom | Cause | Fix |
|
|
|---|---|---|
|
|
| Refresh job `Error`, log shows `ats: skipping company=X — HTTP 404` | A board slug was renamed/removed | Expected — the run continues. Remove the dead slug from `companies.yaml` if permanent. |
|
|
| Refresh aborts with a traceback before any company | Pre-2026-06-02 image (no skip-on-404) | Confirm Keel rolled the new image: `kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'`. |
|
|
| `snapshot` / refresh fails: `relation "job_hunter.comp_snapshots" does not exist` | Migration 0004 not applied | The CronJob + Deployment run `migrate` on start. Run `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter migrate`. |
|
|
| `/webhook/cdio` returns 401 | `webhook_bearer_token` mismatch between Vault and the CDIO notification URL | Re-run `cdio-seed` after rotating the token; it rebuilds the `jsons://...?+Authorization=` URL. |
|
|
| Non-GBP comp looks wrong / NULL | `fx_rates` gap for the role's `posted_at` date | `kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter backfill-fx --days 30` |
|
|
| Job OOMKilled | levels.fyi HTML parse spike across many companies | Bump the CronJob container memory limit in `cronjob.tf` (currently 1Gi). |
|
|
|
|
---
|
|
|
|
## ANALYST
|
|
|
|
### Weekly above-target Slack alert
|
|
|
|
The `job-hunter-alert` CronJob (Sundays 05:00 UTC, an hour after the refresh)
|
|
posts to Slack the companies whose London p50 total comp **≥ £500k**, flagging
|
|
any that **newly crossed** since last week's snapshot. Threshold is the
|
|
`--threshold` arg in `cronjob.tf` (default 500000 — well above the ~£267k move
|
|
floor, so only clearly-exceptional comp pings). Slack webhook comes from Vault
|
|
`secret/job-hunter` → `slack_webhook_url` (seeded from the shared workspace
|
|
webhook → currently posts to the same channel as Keel; repoint to a dedicated
|
|
channel by `vault kv patch secret/job-hunter slack_webhook_url=<url>`).
|
|
|
|
```bash
|
|
# Preview the message without posting
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter alert --stdout
|
|
# Different bar / location
|
|
kubectl -n job-hunter exec deploy/job-hunter -- \
|
|
python -m job_hunter alert --threshold 350000 --location london --stdout
|
|
# Fire it now (posts to Slack)
|
|
kubectl -n job-hunter create job --from=cronjob/job-hunter-alert jh-alert-manual
|
|
```
|
|
|
|
`newly_crossed` needs ≥2 snapshot dates — it's empty until the second weekly
|
|
run accumulates. To change the standing threshold, edit `--threshold` in
|
|
`infra/stacks/job-hunter/cronjob.tf` and apply.
|
|
|
|
### The periodic "market leaders in comp" report
|
|
|
|
This is the headline command — current leaders by p50 total comp, week-over-week
|
|
movers, new entrants, open-role counts, and sample-size caveats:
|
|
|
|
```bash
|
|
# London senior leaders, human-readable
|
|
kubectl -n job-hunter exec deploy/job-hunter -- \
|
|
python -m job_hunter analyze --level senior --top-n 10
|
|
# All levels, JSON for downstream tools
|
|
kubectl -n job-hunter exec deploy/job-hunter -- \
|
|
python -m job_hunter analyze --format json
|
|
```
|
|
|
|
`--trend-weeks N` sets the movers comparison window (default 12). Movers report
|
|
`available: false` until at least two snapshot dates spanning the window exist —
|
|
the series starts accumulating from the first refresh after 2026-06-02, so
|
|
12-week movers become meaningful around late August 2026.
|
|
|
|
### Query recipes
|
|
|
|
```bash
|
|
# Salary band for a slice
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter bands --title 'staff'
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --level senior
|
|
# Per-(company, level) comp table
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-table --location london
|
|
# Open roles, highest-confidence comp first
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter query --title sre --with-salary --limit 20
|
|
# Compare two firms
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company janestreet
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company optiver
|
|
```
|
|
|
|
### Trend queries (Grafana or psql against the snapshot tables)
|
|
|
|
The dated series lives in `comp_snapshots` / `roles_snapshots`. Examples (run in
|
|
Grafana's "Job Hunter" datasource, or `psql` as the `job_hunter` role):
|
|
|
|
```sql
|
|
-- Comp trend: median total comp per company over time (London)
|
|
SELECT s.snapshot_date, c.display_name,
|
|
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50_gbp
|
|
FROM job_hunter.comp_snapshots s
|
|
JOIN job_hunter.companies c ON c.id = s.company_id
|
|
WHERE s.location_bucket = 'london'
|
|
GROUP BY s.snapshot_date, c.display_name
|
|
ORDER BY s.snapshot_date, p50_gbp DESC;
|
|
|
|
-- Hiring-volume trend: open London roles per company per snapshot
|
|
SELECT s.snapshot_date, c.display_name, COUNT(*) AS open_roles
|
|
FROM job_hunter.roles_snapshots s
|
|
JOIN job_hunter.companies c ON c.id = s.company_id
|
|
WHERE s.primary_location = 'london'
|
|
GROUP BY s.snapshot_date, c.display_name
|
|
ORDER BY s.snapshot_date, open_roles DESC;
|
|
|
|
-- Two-snapshot diff: p50 change for one company between two dates
|
|
SELECT c.display_name, s.snapshot_date,
|
|
percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50
|
|
FROM job_hunter.comp_snapshots s
|
|
JOIN job_hunter.companies c ON c.id = s.company_id
|
|
WHERE c.slug = 'janestreet' AND s.snapshot_date IN ('2026-06-02', '2026-08-30')
|
|
GROUP BY c.display_name, s.snapshot_date;
|
|
```
|
|
|
|
### "Your comp vs the market" dashboard panel + your baselines
|
|
|
|
The Job Hunter Grafana dashboard (`grafana.viktorbarzin.me` → Job Hunter) has a
|
|
bar chart **"Your comp vs the market — London p50 total comp"** ranking every
|
|
company's London median TC with your comp shown in line. Your figures are
|
|
deliberately **not hardcoded in the committed dashboard JSON** — they live in
|
|
the DB as labeled comp_points with `source='self'` (the panel tags any
|
|
`source='self'` row as "You" and renders one bar each). There are **two**, by
|
|
design:
|
|
|
|
- `self-realized` — **"Me - realized gross" ≈ £409k**: your actual P60 gross
|
|
for the current tax year. **Source = `SUM(payslip_ingest.payslip.taxable_pay)`**
|
|
for the tax year (this equals the P60 "pay for tax"; do NOT use
|
|
`salary+bonus+rsu_vest`, where `rsu_vest` is net/partial and understates RSU
|
|
income by ~half). Inflated by concurrent stacked RSU vests + META price.
|
|
- `self-current` — **"Me - package (grant TC)" ≈ £267k**: base + bonus +
|
|
current-year RSU refresher *grant face* (£117,927). This is the basis
|
|
**levels.fyi uses for the company bars**, so it's the apples-to-apples figure
|
|
for comparing a job *offer*.
|
|
|
|
Both sit below the £500k alert bar (never ping Slack). Re-seed when comp changes
|
|
(realized: re-pull `taxable_pay`; grant-value: from the YE letter). The
|
|
grant-value seed (run the realized one the same way with `company_slug='self-realized'`,
|
|
`company_display_name='Me - realized gross'`, `total_value=<taxable_pay sum>`):
|
|
|
|
```bash
|
|
kubectl -n job-hunter exec deploy/job-hunter -- python -c "
|
|
import asyncio; from decimal import Decimal; from datetime import date
|
|
from job_hunter.db import create_engine_from_env, make_session_factory
|
|
from job_hunter.sources.comp.base import CompPoint
|
|
from job_hunter.storage_comp import upsert_comp_point
|
|
async def m():
|
|
e=create_engine_from_env(); sf=make_session_factory(e)
|
|
async with sf() as s:
|
|
# total_value is what the comparison/bar uses — it MUST be full TC
|
|
# (base + bonus + RSU). Store the components too for transparency.
|
|
await upsert_comp_point(s, CompPoint(source='self', external_id='self-current',
|
|
company_slug='self-current', company_display_name='Me (Meta IC5)',
|
|
level_slug='senior', location_bucket='london',
|
|
base_value=Decimal('123682'), bonus_value=Decimal('25734'),
|
|
rsu_grant_value=Decimal('117927'), rsu_vesting_years=1,
|
|
total_value=Decimal('267343'), currency='GBP', effective_date=date.today()))
|
|
await s.commit()
|
|
await e.dispose()
|
|
asyncio.run(m())"
|
|
```
|
|
|
|
### Interpreting the numbers — caveats
|
|
|
|
- **Sample size**: `analyze` flags companies with `n < 3` as `low_confidence`. A single self-reported datapoint is anecdote, not a band — chase the p50 only where n is healthy.
|
|
- **levels.fyi bias**: comp_points are self-reported medians; they skew toward people who report (often higher earners) and lag the market by a quarter or two.
|
|
- **HFT/quant**: base comp is the disclosed figure; bonus (often the larger half) is variable and usually absent from postings. Treat HFT base as a floor, not total.
|
|
- **Currency**: all figures are GBP-normalised via ECB rates looked up by `posted_at` (7-day fallback). A FX gap shows as NULL comp, not a wrong number.
|
|
- **Movers need history**: a delta is only as good as the two snapshot dates behind it; early deltas (< full `trend_weeks` of data) compare against the earliest available snapshot and are noted as such.
|
|
|
|
## Related
|
|
|
|
- Skill: `~/.claude/skills/job-hunter/SKILL.md` (agent invocation patterns)
|
|
- Beads epic: `code-snp`
|
|
- Storage / backup context: this DB is on the shared CNPG cluster (`dbaas`), backed up by the per-db `postgresql-backup-per-db` CronJob.
|