infra/docs/runbooks/job-hunter.md
Viktor Barzin aa0d6511b2
job-hunter runbook: document two self baselines + taxable_pay gotcha
Dashboard now shows two 'Me' bars: realized gross (~£409k, from
SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k,
levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT
salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 23:13:35 +00:00


Runbook: job-hunter — passive job + comp scraper

Last updated: 2026-06-02

job-hunter is a passive job-market + compensation scraper in the job-hunter namespace. It pulls open roles from ATS boards (Greenhouse / Lever / Ashby), HN "Who is hiring", and levels.fyi comp medians into a CNPG Postgres DB, and serves agent-friendly CLI queries (used by the job-hunter Claude skill). As of 2026-06-02 it also accumulates dated snapshots so comp and hiring-volume trends can be tracked over time.

Where things live

| Thing | Location |
| --- | --- |
| Source code | Forgejo https://forgejo.viktorbarzin.me/viktor/job-hunter (NOT in the monorepo) |
| Image | forgejo.viktorbarzin.me/viktor/job-hunter:latest plus a :<sha> tag per build (CI builds on push and rolls the Deployment; Keel is a redundant fallback) |
| Terraform stack | infra/stacks/job-hunter/ (main.tf = Deployment/Service/ESO; cronjob.tf = weekly refresh and alert CronJobs) |
| Database | pg-cluster-rw.dbaas.svc.cluster.local:5432/job_hunter, role job_hunter (Vault static-creds/pg-job-hunter, 7d rotation) |
| App secrets | Vault secret/job-hunter (keys webhook_bearer_token, cdio_api_key, smtp_username/password, digest_to/from_address, slack_webhook_url) |
| Grafana | https://grafana.viktorbarzin.me → datasource Job Hunter (PG, read-only) |
| Claude skill | ~/.claude/skills/job-hunter/SKILL.md |
| Weekly scrape | CronJob job-hunter-refresh, Sundays 04:00 UTC |
| Weekly alert | CronJob job-hunter-alert, Sundays 05:00 UTC |

Architecture

  • Sources (job_hunter/sources/): ats (Greenhouse/Lever/Ashby JSON APIs, ~35 companies in config/companies.yaml), hn (Algolia), levels_fyi (comp medians), linkedin_guest (opt-in), changedetection (/webhook/cdio for non-ATS careers pages in config/cdio_watches.yaml).
  • Tables: companies, roles, comp_points, levels, fx_rates (upsert-in-place, "current state"); comp_snapshots, roles_snapshots (append-only, one row per source-row per snapshot_date — the dated series). Snapshots are written as a side-effect of every upsert during a refresh.
  • The ATS fetch is resilient: a board returning a permanent 4xx (404/410/403) is skipped with a warning; 5xx/network errors retry once then skip. One dead board cannot abort the whole run (regression fixed 2026-06-02 — Elastic's 404 had been taking down every refresh). Boards are fetched concurrently (bounded semaphore, default 8 in-flight).
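The skip-and-retry policy above can be sketched in Python. This is an illustration only, not the code in job_hunter/sources/: the fetch callable, the BoardError exception and the fetch_board/fetch_all names are hypothetical. Only the permanent status codes, the single retry and the default of 8 in-flight boards come from this runbook.

```python
import asyncio
import logging

log = logging.getLogger("ats-sketch")

# Statuses the runbook treats as "this board is gone": skip, never retry.
PERMANENT_4XX = {403, 404, 410}


class BoardError(Exception):
    """Hypothetical fetch error: status is the HTTP code, or None for a network error."""

    def __init__(self, status=None):
        super().__init__(f"status={status}")
        self.status = status


async def fetch_board(slug, fetch, sem):
    """Fetch one board; return its jobs, or None if the board is skipped."""
    async with sem:  # caps how many boards are fetched at the same time
        for attempt in (1, 2):  # first try plus exactly one retry
            try:
                return await fetch(slug)
            except BoardError as exc:
                if exc.status in PERMANENT_4XX:
                    log.warning("skipping %s: HTTP %s", slug, exc.status)
                    return None
                # Anything else (5xx, network, other 4xx in this sketch) gets one retry.
                log.warning("attempt %d for %s failed (%s)", attempt, slug, exc)
        return None  # failed on both attempts: skip this board


async def fetch_all(slugs, fetch, max_in_flight=8):
    """Fetch every board concurrently; a failing board never aborts the run."""
    sem = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(*(fetch_board(s, fetch, sem) for s in slugs))
    return {slug: jobs for slug, jobs in zip(slugs, results) if jobs is not None}
```

A semaphore rather than a fixed worker pool keeps the sketch short: every board gets its own task, and the semaphore alone caps how many are on the network at once.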

OPS

Is it healthy?

# CronJob exists + last schedule/success
kubectl -n job-hunter get cronjob job-hunter-refresh
# Most recent run's pods + logs
kubectl -n job-hunter get jobs -l app=job-hunter --sort-by=.metadata.creationTimestamp
kubectl -n job-hunter logs -l job-name="$(kubectl -n job-hunter get jobs --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1:].metadata.name}')"
# Deployment (serves the CLI / webhook) is up
kubectl -n job-hunter get deploy job-hunter
# Data freshness — newest snapshot date should advance weekly
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter report --days 7 | jq '.source_mix'

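Row-count sanity is easiest via the read-only Grafana datasource (the queries under "Trend queries" work there). From the cluster, a direct exec at least confirms the package imports: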

kubectl -n job-hunter exec deploy/job-hunter -- python -c "import job_hunter"  # smoke

Manual refresh (off-schedule)

kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter refresh --source ats --source hn --source levels_fyi

Or trigger the CronJob immediately:

kubectl -n job-hunter create job --from=cronjob/job-hunter-refresh jh-manual-$(date +%s)

Seed / re-snapshot the dated series

Snapshots are written automatically on every refresh. To seed a baseline from the current tables (idempotent — one row per source-row per day):

kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot
# back-date a snapshot if needed:
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter snapshot --date 2026-06-01
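A minimal model of the snapshot rule (one row per source-row per snapshot_date, safe to re-run). It uses an in-memory sqlite3 database as a stand-in for the CNPG Postgres tables, with a hypothetical two-table schema; Postgres accepts the same INSERT ... ON CONFLICT ... DO NOTHING clause. Whether the real code keeps the first or the last write of a day is an assumption here (first write wins).

```python
import sqlite3
from datetime import date

# Hypothetical, simplified schema; the real tables live in CNPG Postgres.
SCHEMA = """
CREATE TABLE roles (id TEXT PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE roles_snapshots (
    role_id TEXT NOT NULL,
    snapshot_date TEXT NOT NULL,
    title TEXT NOT NULL,
    PRIMARY KEY (role_id, snapshot_date)  -- one row per source-row per day
);
"""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def upsert_role(conn, role_id, title, snapshot_date: date):
    """Update the current-state row, then append the dated snapshot row."""
    conn.execute(
        "INSERT INTO roles (id, title) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title",
        (role_id, title),
    )
    # Same-day re-runs are no-ops: the first write of the day wins (assumption).
    conn.execute(
        "INSERT INTO roles_snapshots (role_id, snapshot_date, title) VALUES (?, ?, ?) "
        "ON CONFLICT(role_id, snapshot_date) DO NOTHING",
        (role_id, snapshot_date.isoformat(), title),
    )


def snapshot_all(conn, snapshot_date: date):
    """Seed a snapshot from the current table (roughly what `snapshot --date` is described as doing)."""
    # SQLite needs a WHERE clause on INSERT ... SELECT ... ON CONFLICT to parse it.
    conn.execute(
        "INSERT INTO roles_snapshots (role_id, snapshot_date, title) "
        "SELECT id, ?, title FROM roles WHERE 1 "
        "ON CONFLICT(role_id, snapshot_date) DO NOTHING",
        (snapshot_date.isoformat(),),
    )
```

Because the composite key does the deduplication, a refresh and a manual snapshot on the same day cannot double-count a row.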

Add an ATS company

ATS companies are scraped from config/companies.yaml in the Forgejo repo (not the monorepo). To add one:

  1. Before adding it, live-probe the slug to confirm it returns HTTP 200 and has London roles:
    curl -s "https://boards-api.greenhouse.io/v1/boards/<slug>/jobs?content=true" -o /dev/null -w '%{http_code}\n'
    # Lever:  https://api.lever.co/v0/postings/<slug>?mode=json
    # Ashby:  https://api.ashbyhq.com/posting-api/job-board/<slug>?includeCompensation=true
    
  2. Add a {slug, display_name, ats_type, ats_id, careers_url} block to config/companies.yaml, commit, push.
  3. CI builds the image and rolls the Deployment (Keel is a redundant fallback). The next refresh picks it up. (No Terraform change is needed: the config ships in the image.)

A board that later starts 404ing is skipped automatically; remove its entry when the 404 is permanent (keeps logs clean).
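The probe step can also be scripted. This sketch builds the three probe URLs listed above and counts London roles in a fetched payload. The JSON field names it reads (Greenhouse location.name, Lever categories.location, Ashby location) and the helper names are assumptions, so check them against a live response before trusting the count.

```python
import json
import urllib.request
from urllib.parse import quote

# Probe endpoints, copied from the curl examples above.
PROBE_URLS = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{slug}/jobs?content=true",
    "lever": "https://api.lever.co/v0/postings/{slug}?mode=json",
    "ashby": "https://api.ashbyhq.com/posting-api/job-board/{slug}?includeCompensation=true",
}


def probe_url(ats_type, slug):
    return PROBE_URLS[ats_type].format(slug=quote(slug, safe=""))


def fetch_json(url, timeout=15):
    """GET and parse JSON; urllib raises HTTPError on 4xx/5xx responses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def location_names(ats_type, payload):
    """Pull location strings out of a board payload (field names are assumptions)."""
    if ats_type == "greenhouse":  # {"jobs": [{"location": {"name": ...}}, ...]}
        return [(j.get("location") or {}).get("name") for j in payload.get("jobs", [])]
    if ats_type == "lever":  # [{"categories": {"location": ...}}, ...]
        return [(p.get("categories") or {}).get("location") for p in payload]
    if ats_type == "ashby":  # {"jobs": [{"location": ...}, ...]}
        return [j.get("location") for j in payload.get("jobs", [])]
    raise ValueError(f"unknown ats_type: {ats_type}")


def london_role_count(ats_type, payload):
    return sum("london" in (name or "").lower() for name in location_names(ats_type, payload))
```

Usage: london_role_count("greenhouse", fetch_json(probe_url("greenhouse", "<slug>"))). A non-zero count plus no HTTPError covers both checks from step 1.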

Add a changedetection.io watch (non-ATS firms)

Firms without a public ATS JSON API (Citadel, Two Sigma, G-Research, HRT, xAI, Wise, Revolut, …) are diff-monitored via changedetection.io (CDIO). Add the watch to config/cdio_watches.yaml in the Forgejo repo, then reconcile:

kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed --dry-run  # preview
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-seed            # create
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter cdio-reconcile       # list

Changes hit /webhook/cdio; comp/role extraction from the diff is manual or LLM-side (CDIO only captures the changed text).
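A 401 from /webhook/cdio comes down to a token comparison. A hedged sketch of such a check, assuming the notification carries the header Authorization: Bearer <webhook_bearer_token>; the authorized/handle_cdio names and the 200/401 responses are illustrative, not the service's actual handler.

```python
import hmac


def authorized(headers, expected_token):
    """True only for 'Authorization: Bearer <expected_token>' (header shape assumed)."""
    # Plain dict lookup is case-sensitive; real web frameworks normalise header names.
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # Constant-time comparison, so response timing does not leak the token.
    return hmac.compare_digest(token.encode(), expected_token.encode())


def handle_cdio(headers, body, expected_token):
    """Hypothetical handler shape: reject a stale token before touching the diff."""
    if not authorized(headers, expected_token):
        return 401, {"error": "unauthorized"}
    return 200, {"received_bytes": len(body)}
```

This is why rotating the token without re-running cdio-seed breaks delivery: CDIO keeps sending the old value baked into its notification URL.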

Deploying (build triggers the rollout)

Deploys are automatic on push to master — we build the image, so CI also drives the rollout (.woodpecker.yml: build-and-push tags latest + ${CI_COMMIT_SHA:0:8}, then a deploy step runs kubectl set image deployment/job-hunter ...:${SHA} + rollout status). The woodpecker-agent SA is cluster-admin, so no kubeconfig/RBAC is wired into the step. Keel stays enrolled in parallel as a redundant net (finds the SHA already running → no-op). So to ship code:

# in the job-hunter source repo (forgejo viktor/job-hunter)
git push origin master      # → lint+test → build (latest + :<sha>) → set image → rollout

The Deployment rolls to the just-built :<sha>. The CronJob runs :latest with imagePullPolicy: Always, so its next scheduled pod pulls the newest image (no rollout needed for a CronJob). image_tag = "latest" in terragrunt.hcl is only the Terraform baseline; the Deployment actually runs whatever image tag CI last set (kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}').

Versioning is still semver — bump pyproject.toml and cut a git tag vX.Y.Z to mark a release; that's the human version record, independent of the :<sha> deploy tag (map a running SHA back to a version with git describe).
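Mapping a running image back to a release can be scripted. The sketch below (hypothetical helper names) takes the image reference printed by the jsonpath='{..image}' query, keeps the tag only if it looks like the 8-character short SHA that CI pushes, and returns the git describe command to run in the job-hunter repo. --tags is included so lightweight vX.Y.Z tags count as well as annotated ones.

```python
import re

# CI tags builds with ${CI_COMMIT_SHA:0:8}: eight lowercase hex characters.
SHA_TAG = re.compile(r"[0-9a-f]{8}")


def split_image(ref):
    """Split 'registry/repo:tag' into (repo, tag); a ':' before the last '/' is a port."""
    repo, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        return ref, "latest"  # no explicit tag: container runtimes default to latest
    return repo, tag


def describe_cmd(image_ref):
    """Return the git command mapping a deployed :<sha> tag to a version, or None."""
    _, tag = split_image(image_ref)
    if not SHA_TAG.fullmatch(tag):
        return None  # e.g. :latest, which carries no commit information
    return ["git", "describe", "--tags", tag]
```

Pass the returned list to subprocess.run(..., cwd=<job-hunter checkout>) to get something like vX.Y.Z-N-g<sha>.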

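Rollback: kubectl -n job-hunter rollout undo deployment/job-hunter (reverts to the previous ReplicaSet), or push a revert commit (CI redeploys the reverted SHA). rollout undo only touches the Deployment: the CronJob keeps pulling :latest, so only a revert commit (which rebuilds :latest) also rolls back the scheduled refresh.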

Applying the Terraform stack

cd infra/stacks/job-hunter
scripts/tg plan      # vault login -method=oidc first
scripts/tg apply

The DB password rotates every 7 days (Vault static role pg-job-hunter); Reloader restarts the Deployment when the ESO-synced secret changes. The Grafana datasource password is mirrored via a second ExternalSecret in the monitoring namespace.

Common failures

| Symptom | Cause | Fix |
| --- | --- | --- |
| Refresh log shows the warning ats: skipping company=X — HTTP 404 | A board slug was renamed/removed | Expected; the run continues. Remove the dead slug from companies.yaml if the 404 is permanent. |
| Refresh aborts with a traceback before any company | Pre-2026-06-02 image (no skip-on-404) | Confirm the new image was rolled out (CI sets it on push; Keel is the fallback): kubectl -n job-hunter get deploy job-hunter -o jsonpath='{..image}'. |
| snapshot / refresh fails: relation "job_hunter.comp_snapshots" does not exist | Migration 0004 not applied | The CronJob and Deployment normally run migrate on start; to apply it by hand, run kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter migrate. |
| /webhook/cdio returns 401 | webhook_bearer_token mismatch between Vault and the CDIO notification URL | Re-run cdio-seed after rotating the token; it rebuilds the jsons://...?+Authorization= URL. |
| Non-GBP comp looks wrong / NULL | fx_rates gap for the role's posted_at date | kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter backfill-fx --days 30 |
| Job OOMKilled | levels.fyi HTML parse spike across many companies | Bump the CronJob container memory limit in cronjob.tf (currently 1Gi). |
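The FX row follows from how rates are looked up. A sketch of a posted_at lookup with the 7-day fallback, assuming rates are stored as GBP per unit of foreign currency keyed by (currency, date) and that the fallback only looks backward (ECB publishes no reference rates on weekends and TARGET holidays). Both are assumptions; the real fx_rates schema may differ.

```python
from datetime import date, timedelta
from decimal import Decimal

FALLBACK_DAYS = 7  # the runbook's 7-day fallback window


def rate_for(rates, currency, posted_at):
    """GBP per unit of currency on posted_at, else the nearest earlier day within 7 days."""
    if currency == "GBP":
        return Decimal(1)
    for back in range(FALLBACK_DAYS + 1):  # day 0 (exact match) through day 7
        rate = rates.get((currency, posted_at - timedelta(days=back)))
        if rate is not None:
            return rate
    return None


def to_gbp(amount, currency, posted_at, rates):
    """Convert to GBP; a gap beyond the fallback yields None (stored as NULL), not a guess."""
    rate = rate_for(rates, currency, posted_at)
    return None if rate is None else amount * rate
```

In this model, filling the missing dates (the job backfill-fx does) makes the same lookup succeed on the next pass.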

ANALYST

Weekly above-target Slack alert

The job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh) posts to Slack the companies whose London p50 total comp is ≥ £500k, flagging any that newly crossed the threshold since last week's snapshot. The threshold is the --threshold arg in cronjob.tf (default 500000, well above the ~£267k move floor, so only clearly exceptional comp pings). The Slack webhook comes from Vault secret/job-hunter, key slack_webhook_url. It was seeded from the shared workspace webhook, so it currently posts to the same channel as Keel; repoint it to a dedicated channel with vault kv patch secret/job-hunter slack_webhook_url=<url>.

# Preview the message without posting
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter alert --stdout
# Different bar / location
kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter alert --threshold 350000 --location london --stdout
# Fire it now (posts to Slack)
kubectl -n job-hunter create job --from=cronjob/job-hunter-alert jh-alert-manual-$(date +%s)

newly_crossed needs ≥2 snapshot dates — it's empty until the second weekly run accumulates. To change the standing threshold, edit --threshold in infra/stacks/job-hunter/cronjob.tf and apply.
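As described above, the alert's logic reduces to two set operations. A sketch over a hypothetical {snapshot_date: {company: p50_gbp}} mapping (the real alert reads the snapshot tables; the function names are illustrative). It shows why newly_crossed stays empty until a second snapshot date exists.

```python
def above_threshold(p50_by_company, threshold):
    """Companies whose p50 is at or above the threshold (None = no data)."""
    return {c for c, p50 in p50_by_company.items() if p50 is not None and p50 >= threshold}


def alert_payload(snapshots, threshold=500_000):
    """snapshots: {snapshot_date: {company: p50_gbp}}, dates sortable (ISO strings or date)."""
    dates = sorted(snapshots)
    if not dates:
        return {"above": [], "newly_crossed": []}
    current = above_threshold(snapshots[dates[-1]], threshold)
    if len(dates) < 2:
        newly = set()  # nothing to compare against yet
    else:
        newly = current - above_threshold(snapshots[dates[-2]], threshold)
    return {"above": sorted(current), "newly_crossed": sorted(newly)}
```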

The periodic "market leaders in comp" report

This is the headline command — current leaders by p50 total comp, week-over-week movers, new entrants, open-role counts, and sample-size caveats:

# London senior leaders, human-readable
kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter analyze --level senior --top-n 10
# All levels, JSON for downstream tools
kubectl -n job-hunter exec deploy/job-hunter -- \
  python -m job_hunter analyze --format json

--trend-weeks N sets the movers comparison window (default 12). Movers report available: false until at least two snapshot dates exist. The series starts accumulating from the first refresh after 2026-06-02; until it spans the full window, deltas compare against the earliest available snapshot, so 12-week movers only become meaningful around late August 2026.
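A sketch of the movers comparison under the rule above: compare the latest snapshot against the earliest one inside the trend_weeks window, report available: False with fewer than two dates, and mark whether the baseline really spans the full window. The input shape and the full_window field are hypothetical; analyze's actual JSON may differ.

```python
from datetime import date, timedelta


def movers(series, latest, trend_weeks=12):
    """series: {snapshot_date: {company: p50}}; compare latest against the window's start."""
    window_start = latest - timedelta(weeks=trend_weeks)
    dates = sorted(d for d in series if window_start <= d <= latest)
    if len(dates) < 2 or dates[-1] != latest:
        return {"available": False}
    base_date = dates[0]  # earliest snapshot in the window, possibly < trend_weeks back
    base, now = series[base_date], series[latest]
    deltas = {c: now[c] - base[c] for c in now.keys() & base.keys()}
    # Biggest risers first; ties broken by name so the order is deterministic.
    ranked = sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "available": True,
        "baseline_date": base_date,
        "full_window": latest - base_date >= timedelta(weeks=trend_weeks),
        "movers": ranked,
    }
```

Assuming the first scheduled refresh ran on Sunday 2026-06-07, full_window first becomes true on the 2026-08-30 run, the second date used in the two-snapshot diff query under "Trend queries".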

Query recipes

# Salary band for a slice
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter bands --title 'staff'
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --level senior
# Per-(company, level) comp table
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-table --location london
# Open roles, highest-confidence comp first
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter query --title sre --with-salary --limit 20
# Compare two firms
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company janestreet
kubectl -n job-hunter exec deploy/job-hunter -- python -m job_hunter comp-band --company optiver

Trend queries (Grafana or psql against the snapshot tables)

The dated series lives in comp_snapshots / roles_snapshots. Examples (run in Grafana's "Job Hunter" datasource, or psql as the job_hunter role):

-- Comp trend: median total comp per company over time (London)
SELECT s.snapshot_date, c.display_name,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50_gbp
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.location_bucket = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, p50_gbp DESC;

-- Hiring-volume trend: open London roles per company per snapshot
SELECT s.snapshot_date, c.display_name, COUNT(*) AS open_roles
FROM job_hunter.roles_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE s.primary_location = 'london'
GROUP BY s.snapshot_date, c.display_name
ORDER BY s.snapshot_date, open_roles DESC;

-- Two-snapshot diff: p50 change for one company between two dates
SELECT c.display_name, s.snapshot_date,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY COALESCE(s.total_gbp, s.base_gbp)) AS p50
FROM job_hunter.comp_snapshots s
JOIN job_hunter.companies c ON c.id = s.company_id
WHERE c.slug = 'janestreet' AND s.snapshot_date IN ('2026-06-02', '2026-08-30')
GROUP BY c.display_name, s.snapshot_date;
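For checking the SQL above against a handful of rows, here is the same aggregation in Python. COALESCE(total_gbp, base_gbp) becomes a None check, NULL inputs are skipped as percentile_cont skips them, and statistics.median matches percentile_cont(0.5) (both interpolate between the two middle values for an even count). The n < 3 low_confidence flag mirrors the analyze caveat under "Interpreting the numbers"; the tuple row shape and function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def p50_by_company(rows, min_n=3):
    """rows: iterable of (snapshot_date, company, total_gbp, base_gbp)."""
    groups = defaultdict(list)
    for snapshot_date, company, total_gbp, base_gbp in rows:
        value = total_gbp if total_gbp is not None else base_gbp  # COALESCE
        if value is not None:  # percentile_cont ignores NULL inputs
            groups[(snapshot_date, company)].append(value)
    return {
        key: {"p50": median(vals), "n": len(vals), "low_confidence": len(vals) < min_n}
        for key, vals in groups.items()
    }
```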

"Your comp vs the market" dashboard panel + your baselines

The Job Hunter Grafana dashboard (grafana.viktorbarzin.me → Job Hunter) has a bar chart "Your comp vs the market — London p50 total comp" ranking every company's London median TC, with your own figures ranked in line as extra bars. Your figures are deliberately not hardcoded in the committed dashboard JSON: they live in the DB as labeled comp_points with source='self' (the panel tags any source='self' row as "You" and renders one bar per row). There are two, by design:

  • self-realized ("Me - realized gross") ≈ £409k: your actual P60 gross for the current tax year. Source = SUM(payslip_ingest.payslip.taxable_pay) for the tax year (this equals the P60 "pay for tax"; do NOT use salary+bonus+rsu_vest, where rsu_vest is net/partial and understates RSU income by ~half). It is inflated by several stacked RSU vests landing concurrently and by the META share price.
  • self-current ("Me - package (grant TC)") ≈ £267k: base + bonus + current-year RSU refresher grant face (£117,927). This is the basis levels.fyi uses for the company bars, so it's the apples-to-apples figure for comparing a job offer.

Both sit below the £500k alert threshold, so neither ever triggers the Slack alert. Re-seed when comp changes (realized: re-pull the taxable_pay sum; grant value: take the figures from the YE letter). The grant-value seed is below; seed the realized baseline the same way with external_id and company_slug both set to 'self-realized', company_display_name='Me - realized gross' and total_value=<taxable_pay sum>:

kubectl -n job-hunter exec deploy/job-hunter -- python -c "
import asyncio; from decimal import Decimal; from datetime import date
from job_hunter.db import create_engine_from_env, make_session_factory
from job_hunter.sources.comp.base import CompPoint
from job_hunter.storage_comp import upsert_comp_point
async def m():
    e=create_engine_from_env(); sf=make_session_factory(e)
    async with sf() as s:
        # total_value is what the comparison/bar uses — it MUST be full TC
        # (base + bonus + RSU). Store the components too for transparency.
        await upsert_comp_point(s, CompPoint(source='self', external_id='self-current',
            company_slug='self-current', company_display_name='Me - package (grant TC)',
            level_slug='senior', location_bucket='london',
            base_value=Decimal('123682'), bonus_value=Decimal('25734'),
            rsu_grant_value=Decimal('117927'), rsu_vesting_years=1,
            total_value=Decimal('267343'), currency='GBP', effective_date=date.today()))
        await s.commit()
    await e.dispose()
asyncio.run(m())"
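The arithmetic behind the two baselines, as a sketch with hypothetical helper names. package_tc treats rsu_grant_value / rsu_vesting_years as the annual RSU figure (an assumption consistent with the seed above, where rsu_vesting_years=1 and total_value = base + bonus + grant). realized_gross sums taxable_pay over the UK tax year (6 April to 5 April); the pay_date key is hypothetical.

```python
from datetime import date
from decimal import Decimal


def package_tc(base, bonus, rsu_grant_value, rsu_vesting_years=1):
    """Grant-value TC: base + bonus + RSU grant face annualised over the vesting years."""
    return base + bonus + rsu_grant_value / rsu_vesting_years


def uk_tax_year(d):
    """UK tax year containing d: 6 April to 5 April inclusive."""
    start_year = d.year if d >= date(d.year, 4, 6) else d.year - 1
    return date(start_year, 4, 6), date(start_year + 1, 4, 5)


def realized_gross(payslips, today):
    """Sum taxable_pay over payslips dated in today's tax year (the P60 basis)."""
    start, end = uk_tax_year(today)
    return sum(
        (p["taxable_pay"] for p in payslips if start <= p["pay_date"] <= end),
        Decimal(0),
    )
```

With the seed's figures, package_tc(Decimal('123682'), Decimal('25734'), Decimal('117927')) reproduces total_value=Decimal('267343').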

Interpreting the numbers — caveats

  • Sample size: analyze flags companies with n < 3 as low_confidence. A single self-reported datapoint is anecdote, not a band; rely on the p50 only where n is healthy.
  • levels.fyi bias: comp_points are self-reported medians; they skew toward people who report (often higher earners) and lag the market by a quarter or two.
  • HFT/quant: base comp is the disclosed figure; bonus (often the larger half) is variable and usually absent from postings. Treat HFT base as a floor, not total.
  • Currency: all figures are GBP-normalised via ECB rates looked up by posted_at (7-day fallback). An FX gap shows as NULL comp, not a wrong number.
  • Movers need history: a delta is only as good as the two snapshot dates behind it; early deltas (< full trend_weeks of data) compare against the earliest available snapshot and are noted as such.

See also

  • Skill: ~/.claude/skills/job-hunter/SKILL.md (agent invocation patterns)
  • Beads epic: code-snp
  • Storage / backup context: this DB is on the shared CNPG cluster (dbaas), backed up by the per-db postgresql-backup-per-db CronJob.