infra/docs/runbooks/job-hunter.md
Viktor Barzin cda858d560
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
job-hunter: weekly refresh CronJob + ops/analyst runbook
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.

Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:37:57 +00:00

11 KiB

Runbook: job-hunter — passive job + comp scraper

Last updated: 2026-06-02

job-hunter is a passive job-market + compensation scraper in the job-hunter namespace. It pulls open roles from ATS boards (Greenhouse / Lever / Ashby), HN "Who is hiring", and levels.fyi comp medians into a CNPG Postgres DB, and serves agent-friendly CLI queries (used by the job-hunter Claude skill). As of 2026-06-02 it also accumulates dated snapshots so comp and hiring-volume trends can be tracked over time.

Where things live

Thing Location
Source code Forgejo https://forgejo.viktorbarzin.me/viktor/job-hunter (NOT in the monorepo)
Image forgejo.viktorbarzin.me/viktor/job-hunter:latest (CI builds on push; Keel rolls the Deployment)
Terraform stack infra/stacks/job-hunter/ (main.tf = Deployment/Service/ESO; cronjob.tf = weekly refresh)
Database pg-cluster-rw.dbaas.svc.cluster.local:5432/job_hunter, role job_hunter (Vault static-creds/pg-job-hunter, 7d rotation)
App secrets Vault secret/job-hunterwebhook_bearer_token, cdio_api_key, smtp_username/password, digest_to/from_address
Grafana https://grafana.viktorbarzin.me → datasource Job Hunter (PG, read-only)
Claude skill ~/.claude/skills/job-hunter/SKILL.md
Weekly scrape CronJob job-hunter-refresh, Sundays 04:00 UTC

Architecture

  • Sources (job_hunter/sources/): ats (Greenhouse/Lever/Ashby JSON APIs, ~35 companies in config/companies.yaml), hn (Algolia), levels_fyi (comp medians), linkedin_guest (opt-in), changedetection (/webhook/cdio for non-ATS careers pages in config/cdio_watches.yaml).
  • Tables: companies, roles, comp_points, levels, fx_rates (upsert-in-place, "current state"); comp_snapshots, roles_snapshots (append-only, one row per source-row per snapshot_date — the dated series). Snapshots are written as a side-effect of every upsert during a refresh.
  • The ATS fetch is resilient: a board returning a permanent 4xx (404/410/403) is skipped with a warning; 5xx/network errors retry once then skip. One dead board cannot abort the whole run (regression fixed 2026-06-02 — Elastic's 404 had been taking down every refresh). Boards are fetched concurrently (bounded semaphore, default 8 in-flight).

OPS

Is it healthy?

# CronJob exists + last schedule/success
kubectl -n job-hunter get cronjob job-hunter-refresh
# Most recent run's pods + logs
kubectl -n job-hunter get jobs -l app=job-hunter --sort-by=.metadata.creationTimestamp
kubectl -n job-hunter logs -l job-name=$(kubectl -n job-hunter get jobs -o jsonpath='{.items[-1:].metadata.name}')
# Deployment (serves the CLI / webhook) is up
kubectl -n job-hunter get deploy job-hunter
# Data freshness — newest snapshot date should advance weekly
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter report --days 7 | jq '.source_mix'

Row-count sanity (via the read-only Grafana datasource or a direct exec):

kubectl -n job-hunter exec deploy/job-hunter -- python -c "import job_hunter"  # smoke

Manual refresh (off-schedule)

kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter refresh --source ats --source hn --source levels_fyi

Or trigger the CronJob immediately:

kubectl -n job-hunter create job --from=cronjob/job-hunter-refresh jh-manual-$(date +%s)

Seed / re-snapshot the dated series

Snapshots are written automatically on every refresh. To seed a baseline from the current tables (idempotent — one row per source-row per day):

kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot
# back-date a snapshot if needed:
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot --date 2026-06-01

Add an ATS company

ATS companies are scraped from config/companies.yaml in the Forgejo repo (not the monorepo). To add one:

  1. Live-probe the slug returns HTTP 200 with London roles before adding it:
    curl -s "https://boards-api.greenhouse.io/v1/boards/<slug>/jobs?content=true" -o /dev/null -w '%{http_code}\n'
    # Lever:  https://api.lever.co/v0/postings/<slug>?mode=json
    # Ashby:  https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true
    
  2. Add a {slug, display_name, ats_type, ats_id, careers_url} block to config/companies.yaml, commit, push.
  3. CI builds the image; Keel rolls the Deployment. The next refresh picks it up. (No Terraform change — config ships in the image.)

A board that later starts 404ing is skipped automatically; remove its entry when the 404 is permanent (keeps logs clean).

Add a changedetection.io watch (non-ATS firms)

Firms without a public ATS JSON API (Citadel, Two Sigma, G-Research, HRT, xAI, Wise, Revolut, …) are diff-monitored via CDIO. Add to config/cdio_watches.yaml in the Forgejo repo, then reconcile:

kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed --dry-run  # preview
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed            # create
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconcile       # list

Changes hit /webhook/cdio; comp/role extraction from the diff is manual or LLM-side (CDIO only captures the changed text).

Applying the Terraform stack

cd infra/stacks/job-hunter
scripts/tg plan      # vault login -method=oidc first
scripts/tg apply

The DB password rotates every 7 days (Vault static role pg-job-hunter); Reloader restarts the Deployment when the ESO-synced secret changes. The Grafana datasource password is mirrored via a second ExternalSecret in the monitoring namespace.

Common failures

Symptom Cause Fix
Refresh job Error, log shows ats: skipping company=X — HTTP 404 A board slug was renamed/removed Expected — the run continues. Remove the dead slug from companies.yaml if permanent.
Refresh aborts with a traceback before any company Pre-2026-06-02 image (no skip-on-404) Confirm Keel rolled the new image: kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'.
snapshot / refresh fails: relation "job_hunter.comp_snapshots" does not exist Migration 0004 not applied The CronJob + Deployment run migrate on start. Run kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter migrate.
/webhook/cdio returns 401 webhook_bearer_token mismatch between Vault and the CDIO notification URL Re-run cdio-seed after rotating the token; it rebuilds the jsons://...?+Authorization= URL.
Non-GBP comp looks wrong / NULL fx_rates gap for the role's posted_at date kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter backfill-fx --days 30
Job OOMKilled levels.fyi HTML parse spike across many companies Bump the CronJob container memory limit in cronjob.tf (currently 1Gi).

ANALYST

The periodic "market leaders in comp" report

This is the headline command — current leaders by p50 total comp, week-over-week movers, new entrants, open-role counts, and sample-size caveats:

# London senior leaders, human-readable
kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter analyze --level senior --top-n 10
# All levels, JSON for downstream tools
kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter analyze --format json

--trend-weeks N sets the movers comparison window (default 12). Movers report available: false until at least two snapshot dates spanning the window exist — the series starts accumulating from the first refresh after 2026-06-02, so 12-week movers become meaningful around late August 2026.

Query recipes

# Salary band for a slice
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter bands --title 'staff'
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --level senior
# Per-(company, level) comp table
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-table --location london
# Open roles, highest-confidence comp first
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter query --title sre --with-salary --limit 20
# Compare two firms
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company janestreet
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company optiver

Trend queries (Grafana or psql against the snapshot tables)

The dated series lives in comp_snapshots / roles_snapshots. Examples (run in Grafana's "Job Hunter" datasource, or psql as the job_hunter role):

-- Comp trend: median total comp per company over time (London)
SELECT s.snapshot_date, c.display_name,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50_gbp
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.location_bucket = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, p50_gbp DESC;

-- Hiring-volume trend: open London roles per company per snapshot
SELECT s.snapshot_date, c.display_name, COUNT(*) AS open_roles
FROM job_hunter.roles_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.primary_location = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, open_roles DESC;

-- Two-snapshot diff: p50 change for one company between two dates
SELECT c.display_name, s.snapshot_date,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE c.slug = 'janestreet' AND s.snapshot_date IN ('2026-06-02', '2026-08-30')
GROUP BY c.display_name, s.snapshot_date;

Interpreting the numbers — caveats

  • Sample size: analyze flags companies with n < 3 as low_confidence. A single self-reported datapoint is anecdote, not a band — chase the p50 only where n is healthy.
  • levels.fyi bias: comp_points are self-reported medians; they skew toward people who report (often higher earners) and lag the market by a quarter or two.
  • HFT/quant: base comp is the disclosed figure; bonus (often the larger half) is variable and usually absent from postings. Treat HFT base as a floor, not total.
  • Currency: all figures are GBP-normalised via ECB rates looked up by posted_at (7-day fallback). A FX gap shows as NULL comp, not a wrong number.
  • Movers need history: a delta is only as good as the two snapshot dates behind it; early deltas (< full trend_weeks of data) compare against the earliest available snapshot and are noted as such.
  • Skill: ~/.claude/skills/job-hunter/SKILL.md (agent invocation patterns)
  • Beads epic: code-snp
  • Storage / backup context: this DB is on the shared CNPG cluster (dbaas), backed up by the per-db postgresql-backup-per-db CronJob.