Commit graph

2995 commits

Author SHA1 Message Date
Viktor Barzin
7dfe89a6e0 [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes
Five compounding factors produced the 2026-04-22 flap cascade: soft
anti-affinity let 2/3 pods co-locate on k8s-node3 (which bounced
NotReady→Ready at 11:42Z and took quorum), aggressive sentinel/probe
timing amplified LUKS-encrypted LVM I/O stalls into spurious
+switch-master loops, HAProxy's 1s polling raced sentinel failovers
and routed writes to demoted masters, publish_not_ready_addresses=true
fed not-yet-ready pods into HAProxy DNS, and realestate-crawler-celery
CrashLoopBackOff closed the feedback loop.

Changes:
- Anti-affinity: preferred → required (one redis pod per node, hard)
- Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000
- Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5
- HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s
- Headless svc: publish_not_ready_addresses true→false

Post-rollout verification clean: 0 flaps, 0 +switch-master events,
0 celery ReadOnlyError in the 60s window after settle. Docs updated.
2026-04-22 15:59:00 +00:00
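The timing changes in the commit above can be put side by side with a little arithmetic. The sketch below is illustrative only: it approximates HAProxy's down-detection window as `inter * fall` (ignoring check timeouts) and compares it with Sentinel's `down-after-milliseconds`. The function name is hypothetical and not taken from the repository.

```python
def haproxy_down_detection_s(inter_s: float, fall: int) -> float:
    """Approximate seconds of consecutive failed checks before a server is marked down."""
    return inter_s * fall


# Values from the commit message: check inter 1s -> 2s, fall 2 -> 3,
# sentinel down-after-ms 5000 -> 15000.
before = {"haproxy_s": haproxy_down_detection_s(1, 2), "sentinel_s": 5000 / 1000}
after = {"haproxy_s": haproxy_down_detection_s(2, 3), "sentinel_s": 15000 / 1000}

for label, timing in (("before", before), ("after", after)):
    print(f"{label}: HAProxy ~{timing['haproxy_s']:.0f}s, Sentinel {timing['sentinel_s']:.0f}s")
```

Under this approximation both windows grow by the same factor, so short I/O stalls are less likely to trip either mechanism.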
Viktor Barzin
fdced7577b [monitoring] HomeAssistantCriticalSensorUnavailable alert
2026-04-22 14:52:23 +00:00
Viktor Barzin
dc05c440bc [hermes-agent] disable deployment — PVC permission mismatch
Main container crashes with "mkdir: cannot create directory '/opt/data':
Permission denied". Init container writes fine but main container runs
with different fsGroup/runAsUser. Scaling to 0 until the PVC permission
model is reworked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 14:31:50 +00:00
Viktor Barzin
a4eafafe49 [monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned
After k8s-node1 was silently cordoned and broke Frigate camera streams,
existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the
root cause proactively. This alert fires within 5m of the GPU node being
cordoned, before any pod restart attempts to schedule and fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 14:05:12 +00:00
Viktor Barzin
e2146e6916 gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
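The MySQL anti-affinity swap in the commit above relies on `NotIn` also matching nodes that lack the label entirely. A minimal sketch of that matching rule follows, assuming a three-node cluster in which only k8s-node1 carries the GPU label. The node set and the helper name are assumptions for illustration.

```python
def not_in(node_labels: dict, key: str, values: list) -> bool:
    """NotIn matches when the label is absent or its value is not in the list."""
    return node_labels.get(key) not in values


# Hypothetical label sets; only the GPU node carries the NFD-derived label.
nodes = {
    "k8s-node1": {"kubernetes.io/hostname": "k8s-node1", "nvidia.com/gpu.present": "true"},
    "k8s-node2": {"kubernetes.io/hostname": "k8s-node2"},
    "k8s-node3": {"kubernetes.io/hostname": "k8s-node3"},
}

old_rule = [n for n, labels in nodes.items()
            if not_in(labels, "kubernetes.io/hostname", ["k8s-node1"])]
new_rule = [n for n, labels in nodes.items()
            if not_in(labels, "nvidia.com/gpu.present", ["true"])]

print(old_rule, new_rule)
```

Both rules select the same nodes today, which matches the "verified no-op" claim, while the new rule keeps working if the label moves to another node.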
Viktor Barzin
134d6b9a82 vault runbook + raft/HA stuck-leader alerts
Post-2026-04-22 Step 5 deliverables:
- docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart
  sequence that avoids zombie containerd-shim + kernel NFS
  corruption, qm reset no-op gotcha, boot-order gotcha.
- prometheus_chart_values.tpl — VaultRaftLeaderStuck +
  VaultHAStatusUnavailable. Silent until vault telemetry
  scraping lands (tracked as beads code-vkpn).

Epic for moving vault off NFS tracked as beads code-gy7h.
2026-04-22 12:44:46 +00:00
Viktor Barzin
4cb2c157da post-mortem 2026-04-22: full timeline — second regression + node4 reboot
The initial recovery at 11:03 was premature; vault-1's audit writes over
NFS started hanging ~15 min later and the cluster regressed to 503.
Full recovery required rebooting node4 (to free vault-0's stuck NFS
mount and shed PVE NFS thread contention) and a second reboot of node3
(to clear another round of kernel NFS client degradation). Final
recovery at 11:43:28 UTC with vault-2 as active leader on the quorum
vault-0 + vault-2.

vault-1 remains stuck in ContainerCreating on node2 — a third node2
reboot is required for full 3/3 quorum, but 2/3 is operationally
sufficient, so that's deferred.
2026-04-22 11:44:56 +00:00
Viktor Barzin
2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
Viktor Barzin
6a4a477336 [infra] Update RPi Sofia DNS: 192.168.1.16 → 192.168.1.10
RPi now uses USB Ethernet (eth1) as primary uplink at .10 instead of
the old wlan1 address at .16. Camera namespace and DNAT updated to
use eth1 with systemd persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 10:55:34 +00:00
Viktor Barzin
d39770b30d monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection
Threshold was 48h plus a 30m `for:` on a job that runs daily. We don't
need to wait more than 2 days to detect a broken timer, so bring it down
to 30h + 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.

Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00
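The threshold figures in the commit above can be checked with a short calculation. This is a sketch under the assumption that worst-case detection, measured from the last successful snapshot, is the age threshold plus the `for:` duration. The helper name is hypothetical.

```python
from datetime import timedelta


def detection_latency(threshold: timedelta, for_duration: timedelta) -> timedelta:
    """Time from the last good snapshot until the alert fires."""
    return threshold + for_duration


old_threshold_s = int(timedelta(hours=48).total_seconds())
new_threshold_s = int(timedelta(hours=30).total_seconds())

old_latency = detection_latency(timedelta(hours=48), timedelta(minutes=30))
new_latency = detection_latency(timedelta(hours=30), timedelta(minutes=30))

print(old_threshold_s, new_threshold_s)
print(old_latency, new_latency)
```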
Viktor Barzin
3eb8b9a4ea ci: add vault CLI to infra-ci image + surface real errors in scripts/tg
The Woodpecker CI pipeline has been silently failing to apply Tier 1
stacks since the state-migration commit e80b2f02 because the Alpine
CI image never had the vault CLI. `scripts/tg` swallowed stderr with
`2>/dev/null` and surfaced a misleading "Cannot read PG credentials
from Vault" message — the real error was `sh: vault: not found`.

Verified with an in-cluster probe: woodpecker/default SA + role=ci
already gets the terraform-state policy and has read capability on
database/static-creds/pg-terraform-state. Auth was never the problem;
the vault binary just wasn't there.

- ci/Dockerfile: pin vault v1.18.1 (matches server) and install
- scripts/tg: pre-flight check + surface real vault output on failure
- Next build-ci-image.yml run rebuilds :latest with vault included;
  subsequent default.yml runs unblock monitoring apply (code-aoxk)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:46:50 +00:00
Viktor Barzin
4a343c33f0 monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m
Doc claimed >40m; actual fire time is 80m (60m last-success threshold +
20m 'for'). The doc was already stale under the previous config and
drifted further when 9b4970da raised 'for' from 10m to 20m. Only this
one alert row was out of sync.
2026-04-21 22:39:46 +00:00
Viktor Barzin
9b4970da61 monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits
- HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775)
  labels so the two same-named alerts are distinguishable in routing.
- HeadscaleDown (deployment-replicas flavor, line 1414) → rename to
  HeadscaleReplicasMismatch. Line 2039 keeps HeadscaleDown as the real
  up-metric critical check. NodeDown inhibit rule updated to suppress
  the renamed alert too.
- EmailRoundtripStale (line 1816): for 10m → 20m. Survives one missed
  20-min probe cycle before firing, cuts flapping (12 short-burst fires
  over last 24h).

ATSOverload tuning skipped: the 24h fire-count is 0 because the alert is
continuously firing rather than flapping. This is the already-known
sustained 83% ATS load, so tuning would not change behavior.

8 backup *NeverSucceeded rules audited: all 7 using
kube_cronjob_status_last_successful_time target real K8s CronJobs with
active metrics (not Pushgateway-sourced). PrometheusBackupNeverRun
already uses absent() correctly. No fixes needed.
2026-04-21 22:29:15 +00:00
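The effect of raising `for` on EmailRoundtripStale can be illustrated with a simplified model of how a `for` clause suppresses short bursts. The sketch below samples the alert condition once per minute and is not a faithful reimplementation of Prometheus rule evaluation. The 15-minute burst length is a made-up example.

```python
def fires(condition_by_minute: list, for_minutes: int) -> bool:
    """True if the condition stays active for at least for_minutes consecutive samples."""
    run = 0
    for active in condition_by_minute:
        run = run + 1 if active else 0
        if run >= for_minutes:
            return True
    return False


# Hypothetical short burst: the condition is true for 15 minutes, then clears.
burst = [False] * 5 + [True] * 15 + [False] * 5

fires_with_10m = fires(burst, 10)
fires_with_20m = fires(burst, 20)
print(fires_with_10m, fires_with_20m)
```

In this model a burst shorter than the `for` window never reaches the firing state, which is the flapping reduction the commit describes.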
Viktor Barzin
ac695dea38 [registry] bulk-clean 34 orphan manifests + beads-server image bump
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:34 +00:00
Viktor Barzin
9041f52b05 monitoring: TechnitiumZoneCountMismatch — compare replicas only, exclude primary
Primary has only the Primary-type zones it owns (10). Replicas have those
+ built-in zones (localhost, in-addr.arpa reverse, etc.), so their count
(14) can never match primary. Alert expr compared max-min across all
instances, making it chronically firing.

Fix: instance!="primary" filter. The real signal this alert wants is
"did one replica drift from the others" — replica-to-replica comparison
captures that; primary was never comparable.
2026-04-19 22:15:55 +00:00
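The reason the old expression could never clear is visible in a small calculation. The sketch below uses the zone counts from the commit message (10 on the primary, 14 on replicas). The instance names and the two-replica layout are assumptions.

```python
def spread(counts: dict) -> int:
    """Difference between the largest and smallest zone count."""
    return max(counts.values()) - min(counts.values())


# Counts from the commit message; instance names are hypothetical.
zone_counts = {"primary": 10, "replica-1": 14, "replica-2": 14}

all_spread = spread(zone_counts)
replicas_only = {name: n for name, n in zone_counts.items() if name != "primary"}
replica_spread = spread(replicas_only)

# A genuine drift on one replica is still detected after the filter.
drifted_spread = spread({"replica-1": 14, "replica-2": 13})

print(all_spread, replica_spread, drifted_spread)
```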
Viktor Barzin
4bedabb9e8 healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep)
- HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when
  HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL =
  https://ha-sofia.viktorbarzin.me.
- cert-manager: add cert_manager_installed() probe (kubectl get crd
  certificates.cert-manager.io). When not installed — which is our current
  state — report PASS "N/A" instead of noisy WARN "CRDs unavailable".
- LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use
  underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check
  always WARN'd. Fixed to `grep _snap`.

After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real
HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).
2026-04-19 22:13:32 +00:00
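The LVM snapshot grep fix in the commit above comes down to a one-character difference in the pattern. A minimal sketch with made-up LV names follows. Only the `_snap_` naming shape comes from the commit message, and Python's `re` stands in for `grep` here.

```python
import re

# Hypothetical LV names; the snapshot follows the foo_snap_<timestamp> shape.
lv_names = ["foo", "foo_snap_20260419", "vm-100-disk-0"]

old_matches = [name for name in lv_names if re.search(r"-snap", name)]
new_matches = [name for name in lv_names if re.search(r"_snap", name)]

print(old_matches)
print(new_matches)
```

With the hyphenated pattern nothing matches, so a freshness check built on it would always report missing snapshots.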
Viktor Barzin
e092f159b3 monitoring: drop MAM Mouse-class + qBittorrent-unsatisfied alerts
Both alerts fired as expected noise while the MAM account is in new-member
Mouse class — tracker refuses announces and the 72h seed-gate can't be met
until ratio recovers. Keeping the rest of the MAM rules (cookie expiry,
ratio, farming/janitor stalls, qbt disconnect) which still signal real
pipeline failures.

Firing count drops from 7 → 3 in healthcheck.
2026-04-19 21:24:46 +00:00
Viktor Barzin
68a10905e0 [monitoring] uk-payslip Panel 13: stacked bars + sum-in-legend
"Monthly cash flow — tax impact (RSU excluded)" was already stacking
group A in normal mode but rendered as 70%-opacity filled lines — the
overlap made the total-per-month figure visually inaccessible.

Switch drawStyle to bars (100% fill, 0-width lineWidth, no per-point
markers) so each month reads as a single stacked bar whose top edge is
the total cash-side deduction. Add "sum" to legend.calcs so the
tax-year totals per series show in the legend table alongside last and
max.

Panel 11 (Tax & pension — monthly, RSU-inclusive) retains the line/
area style so the two panels remain visually distinct.
2026-04-19 20:31:53 +00:00
Viktor Barzin
2224a6b2cc [job-hunter] Bump image to 92afc38d — Frankfurter FX + comp_table COALESCE
2026-04-19 19:09:54 +00:00
Viktor Barzin
e813170960 [job-hunter] Bump image to 99ab188f — levels.fyi per-level + comp_points
99ab188f adds the structured-comp pipeline: levels.fyi __NEXT_DATA__
scraper, Robert Walters + Hays PDF parser, comp_points/levels tables
(alembic 0003), CLI comp/comp-table/comp-band/backfill-levels, and
Grafana panels 6-9. Alembic 0003 runs via the existing init container.

After apply, exec:
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter backfill-levels
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter refresh --source levels_fyi
  kubectl -n job-hunter exec deploy/job-hunter -c job-hunter -- \
    python -m job_hunter refresh --source uk_surveys
2026-04-19 18:56:20 +00:00
Viktor Barzin
3f6dfb10aa [monitoring] job-hunter: panels 6-9 for comp_points tables + trends
Append the structured-comp dashboard surface to the job-hunter
dashboard:

Panel 6 — Per-company salary by level (p50 base, GBP table).
Panel 7 — Total-comp heatmap per (company, level), p50 GBP.
Panel 8 — Comp-point volume by source (daily time-series).
Panel 9 — Base-salary trend (p50) over time for the top 5 companies.

Adds templating: $location (multi, default london), $level (single,
default senior), $company (multi, default all) — populated from
comp_points + levels metadata so the selection reflects what was
actually ingested.

Closes: code-5ph
2026-04-19 18:50:48 +00:00
Viktor Barzin
a8280e77b6 [broker-sync] unsuspend IMAP + Panel 15 RSU vest reconciliation (Phase D)
Activates the Schwab/InvestEngine IMAP ingest CronJob that's been
scaffolded-but-suspended since Phase 2 of broker-sync, now that the
Schwab parser can detect vest-confirmation emails. Runs nightly 02:30 UK.

Current behaviour once deployed:
  - Trade confirmations (Schwab sell-to-cover, InvestEngine orders) →
    Activity rows posted to Wealthfolio. Unchanged.
  - Release Confirmations (Schwab RSU vests) → parser returns gross-vest
    BUY + sell-to-cover SELL Activities (to Wealthfolio) and a VestEvent
    object (NOT YET persisted — Postgres sink + DB grant pending; see
    follow-up under code-860). Vest detection uses a subject/body
    heuristic that will need tightening against a real email fixture.

Panel 15 of the UK payslip dashboard added: per-vest-month join of
payslip.rsu_vest vs rsu_vest_events (gross_value_gbp, tax_withheld_gbp)
with delta columns. Tax-delta-percent coloured green/orange/red at
0/2%/5% thresholds. Table is empty until broker-sync starts persisting
VestEvents — harmless until then.

Before applying:
  - Verify IMAP creds in Vault (secret/broker-sync: imap_host,
    imap_user, imap_password, imap_directory) are still valid.
  - Empty vest-event table is expected; delta columns show NULL until
    the postgres sink lands.

Part of: code-860
2026-04-19 18:29:01 +00:00
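The Panel 15 colouring described in the commit above maps a percentage to one of three bands. The sketch below shows one way to express the 0/2%/5% thresholds. It assumes the thresholds apply to the absolute value of the delta, which the commit message does not state.

```python
def delta_colour(tax_delta_percent: float) -> str:
    """Map a tax delta percentage onto the green/orange/red bands."""
    magnitude = abs(tax_delta_percent)  # assumption: sign is ignored
    if magnitude >= 5:
        return "red"
    if magnitude >= 2:
        return "orange"
    return "green"


examples = {value: delta_colour(value) for value in (0.0, 1.9, 2.0, 4.9, 5.0, -7.5)}
print(examples)
```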
Viktor Barzin
1c0e1bcdde [payslip-ingest] ActualBudget payroll sync CronJob + Panel 14 (Phase C)
Wires the daily ActualBudget deposit sync from the payslip-ingest app into
K8s as a CronJob, and adds dashboard Panel 14 to overlay bank deposits
against payslip net_pay.

CronJob: actualbudget-payroll-sync in payslip-ingest namespace, runs
02:00 UTC. Calls `python -m payslip_ingest sync-meta-deposits`, which
hits budget-http-api-viktor in the actualbudget namespace and upserts
matching Meta payroll deposits into payslip_ingest.external_meta_deposits.

ExternalSecret extended with three new Vault keys:
  - ACTUALBUDGET_API_KEY (same as actualbudget-http-api-viktor's env API_KEY)
  - ACTUALBUDGET_ENCRYPTION_PASSWORD (Viktor's budget password)
  - ACTUALBUDGET_BUDGET_SYNC_ID (Viktor's sync_id)

These must be seeded at secret/payslip-ingest in Vault before the
CronJob will run — it'll CrashLoop on missing env vars otherwise. First
run can be triggered on demand via `kubectl -n payslip-ingest create
job --from=cronjob/actualbudget-payroll-sync initial-sync`.

Panel 14 plots monthly SUM(external_meta_deposits.amount) vs
SUM(payslip.net_pay), plus a delta bar series — |delta| > £50 flags
likely parser drift on net_pay.

Part of: code-860
2026-04-19 18:21:20 +00:00
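The Panel 14 drift check in the commit above is a per-month comparison against a £50 tolerance. A small sketch with invented monthly totals follows. The real panel computes this in SQL over `external_meta_deposits` and `payslip`, so the data and names here are hypothetical.

```python
TOLERANCE_GBP = 50.0

# Hypothetical monthly totals: (sum of bank deposits, sum of payslip net_pay).
monthly = {
    "2026-01": (5000.00, 5000.00),
    "2026-02": (5120.00, 5000.00),
    "2026-03": (4980.00, 5000.00),
}

deltas = {month: deposits - net_pay for month, (deposits, net_pay) in monthly.items()}
flagged = [month for month, delta in deltas.items() if abs(delta) > TOLERANCE_GBP]

print(deltas)
print(flagged)
```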
Viktor Barzin
ef53053ae6 [job-hunter] Bump image to 48f8615d — London filter + AI CLI
New image adds Alembic 0002 (primary_location column), London-default
query/bands/report commands, and FX-priming on refresh so USD/EUR
salaries convert correctly. Applied live; 5826 rows backfilled.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:13:26 +00:00
Viktor Barzin
fca3dd4976 [monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL
Phase A of RSU tax spike fix. Two changes:

1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite
   the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart
   is honest once the Phase B back-fill populates cash_income_tax on
   variant-A slips. For slips where cash_income_tax is already populated
   (variant B, 2024+) the spike is removed immediately.

2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL
   on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange)
   highlights the back-fill remaining population — expected to drop to 0
   after Phase B lands.

Part of: code-860
2026-04-19 18:04:05 +00:00
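Both changes in the commit above can be mimicked on a few in-memory rows. The sketch below reproduces the `COALESCE(cash_income_tax, income_tax)` fallback for Panel 2 and the NULL-on-vest-month filter for Panel 4. The row values are invented for illustration.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


# Hypothetical payslip rows.
slips = [
    {"month": "2025-02", "cash_income_tax": 1200.0, "income_tax": 9000.0, "rsu_vest": 20000.0},
    {"month": "2023-03", "cash_income_tax": None, "income_tax": 1500.0, "rsu_vest": 0.0},
    {"month": "2023-05", "cash_income_tax": None, "income_tax": 8000.0, "rsu_vest": 15000.0},
]

# Panel 2: plot the cash figure when present, else fall back to income_tax.
plotted = [coalesce(s["cash_income_tax"], s["income_tax"]) for s in slips]

# Panel 4: vest months still missing cash_income_tax (status NULL_CASH_TAX).
null_cash_tax = [s["month"] for s in slips
                 if s["cash_income_tax"] is None and s["rsu_vest"] > 0]

print(plotted)
print(null_cash_tax)
```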
Viktor Barzin
7e34b67f24 [docs] Architecture docs: registry integrity probe, pin, new CI pipelines
Bring the architecture set in line with what's actually deployed after
today's registry reliability work (commits 7cb44d72..42961a5f):

- docs/architecture/ci-cd.md: expand Infra Pipelines table with
  build-ci-image (+ verify-integrity step), registry-config-sync,
  pve-nfs-exports-sync, postmortem-todos, drift-detection,
  issue-automation, provision-user. Note registry:2.8.3 pin +
  integrity probe in the image-registry flow section.
- docs/architecture/monitoring.md: add Registry Integrity Probe to
  components table; add 3-alert section (Manifest Integrity Failure /
  Probe Stale / Catalog Inaccessible).
- .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the
  revision-link-not-blob rule so the next agent knows the right check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:51:26 +00:00
Viktor Barzin
fec0bbb7dd [job-hunter] Pin to first built image tag 9c42eac9
Locally-built image pushed to registry.viktorbarzin.me/job-hunter:9c42eac9
after Woodpecker v3.13 Forgejo webhook parsing bug left CI unable to build
the initial image (server/forge/forgejo/helper.go:57 nil pointer panic on
parse — see repaired webhooks still not triggering pipelines).

Unblocks code-97n (TF apply) without waiting for CI recovery.

Refs: code-snp, code-0c6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:48:16 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
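The difference between the two checks in the commit above can be reproduced on a synthetic directory tree. The sketch below follows the simplified paths quoted in the commit message rather than the full on-disk layout of a real registry. The function names and the `infra-ci` repo directory are assumptions.

```python
import tempfile
from pathlib import Path


def blob_present(root: Path, digest: str) -> bool:
    """Old check: does the blob data file exist?"""
    return (root / "blobs" / "sha256" / digest / "data").is_file()


def revision_linked(root: Path, repo: str, digest: str) -> bool:
    """New check: does the per-repo revision link exist?"""
    return (root / repo / "_manifests" / "revisions" / "sha256" / digest / "link").is_file()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    child = "98f718c8"  # shortened digest, used only as a label here

    # Simulate the failure mode: blob data survives, revision link is gone.
    blob = root / "blobs" / "sha256" / child / "data"
    blob.parent.mkdir(parents=True)
    blob.write_text("{}")

    old_check_ok = blob_present(root, child)
    new_check_ok = revision_linked(root, "infra-ci", child)

print(old_check_ok, new_check_ok)
```

The old check passes on exactly the state the registry would answer with a 404, which is the false negative the commit describes.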
Viktor Barzin
f4d3fdb2e3 [monitoring] uk-payslip: drop RSU-vest annotations
Vertical orange markers at every vest month added more visual noise
than signal. Panel 13 (cash-only) already conveys the "no spike on
vest months" story without needing markers across panels 1/2/3/7/11/12.
2026-04-19 17:32:49 +00:00
Viktor Barzin
34ee282d88 [ci] Auto-sync modules/docker-registry/* to registry VM + runbook docs
Replaces the manual scp+bounce sequence that landed registry:2.8.3 on
10.0.20.10 today (see commit 7cb44d72 + nginx-DNS-trap in runbook).
Addresses the "no repeat manual fixes" preference — future changes to
docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf /
config-private.yml / cleanup-tags.sh now deploy through CI.

Pipeline (.woodpecker/registry-config-sync.yml) mirrors
pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set,
bounce compose only when compose-visible files changed, always restart
nginx after a compose bounce (critical — nginx caches upstream DNS), end
with a dry-run fix-broken-blobs.sh to catch regressions.

Credentials:
 - Woodpecker repo-secret `registry_ssh_key` (events: push, manual)
 - Mirror at Vault `secret/woodpecker/registry_ssh_key`
   (private_key / public_key / known_hosts_entry)
 - Public key on /root/.ssh/authorized_keys on 10.0.20.10
 - Key label: woodpecker-registry-config-sync

Runbook updated with "Auto-sync pipeline" section pointing at the new
flow + manual override command.

Closes: code-3vl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:32:12 +00:00
Viktor Barzin
a641dc744f [monitoring] uk-payslip: RSU vest annotations + cash-only tax panel
Panel 11 stacks RSU-attributed income tax on top of cash PAYE, which
is mathematically correct but emotionally misleading since RSU tax is
withheld at source via sell-to-cover and never hits the bank. Adopts
the two-view convention: Panel 11 keeps the full PAYE picture; new
Panel 13 shows cash-only deductions. Dashboard-level "RSU vests"
annotation paints orange markers on every vest month across all
timeseries panels, with tooltips like "RSU vest: £31232 gross /
£15257 tax withheld".

Shifts Panels 4/5/6/8/9/10 down by 9 rows to make room for Panel 13
at y=29.
2026-04-19 17:24:35 +00:00
Viktor Barzin
6e96b436b1 [docs] Capture nginx stale-DNS trap in registry-vm runbook
Discovered during the 2026-04-19 registry:2.8.3 pin deploy: nginx caches
its upstream DNS at startup and does NOT re-resolve after registry-*
containers are recreated. Symptom was /v2/_catalog returning
{"repositories": []} and /v2/ returning 200 without auth — nginx was
forwarding to a stale IP that a different backend container now owns.

Fix is always 'docker restart registry-nginx' after any registry-*
bounce. Captured in registry-vm.md so future manual operators and the
coming auto-sync pipeline (beads code-3vl) both encode the step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:24:09 +00:00
Viktor Barzin
c9d6343a9b [job-hunter] Switch ExternalSecret to explicit UPPERCASE data mappings
Replaces dataFrom.extract with per-key `data` entries so the Secret
keys in K8s (and therefore env vars in the pod) are always UPPERCASE:
WEBHOOK_BEARER_TOKEN, CDIO_API_KEY, SMTP_USERNAME, SMTP_PASSWORD,
DIGEST_TO_ADDRESS, DIGEST_FROM_ADDRESS. Vault KV keys at
secret/job-hunter stay lowercase (webhook_bearer_token etc.).

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:28 +00:00
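The explicit mapping in the commit above pairs each UPPERCASE Secret key with a lowercase Vault property. The sketch below generates such `data` entries in the shape used by ExternalSecret resources (`secretKey` plus `remoteRef`). It assumes every Vault property is the lowercased env name, and the exact `remoteRef.key` depends on how the SecretStore is mounted, so both are assumptions.

```python
ENV_KEYS = [
    "WEBHOOK_BEARER_TOKEN",
    "CDIO_API_KEY",
    "SMTP_USERNAME",
    "SMTP_PASSWORD",
    "DIGEST_TO_ADDRESS",
    "DIGEST_FROM_ADDRESS",
]


def data_entries(vault_key: str, env_keys: list) -> list:
    """Build one ExternalSecret data entry per env var."""
    return [
        {"secretKey": name, "remoteRef": {"key": vault_key, "property": name.lower()}}
        for name in env_keys
    ]


entries = data_entries("secret/job-hunter", ENV_KEYS)
for entry in entries:
    print(entry["secretKey"], "<-", entry["remoteRef"]["property"])
```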
Viktor Barzin
9f9d7d10ff [registry] Scope OCI-index scan to private registry only
Live run on the registry VM surfaced 632 "orphaned" index children across
156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden,
openclaw). These aren't bugs — pull-through caches only fetch what's been
requested, so missing arm64 / arm / attestation children are normal partial
state. Scanning them generates noise that would mask the real signal from
the private registry (where we push full manifests ourselves and a missing
child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode).

Change: index-child scan is now gated on registry_name == "private". Layer-
link scan still runs across all registries (missing blob under a live link
is always a bug, regardless of pull-through semantics).

Verified: live run now reports 0 orphans in private registry — consistent
with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan
still inspects 425 links across all registries and finds 0 orphans.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:23:04 +00:00
Viktor Barzin
e7ce545da2 [job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow
New service stack at stacks/job-hunter/ mirroring the payslip-ingest
pattern: per-service CNPG database + role (via dbaas null_resource),
Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app
secrets and DB creds, Deployment with alembic-migrate init container,
ClusterIP Service, Grafana datasource ConfigMap.

Grafana dashboard job-hunter.json in Finance folder: new roles per
day, source breakdown, top companies, GBP salary distribution, recent
roles table (sorted by parse confidence then salary).

n8n weekly-digest workflow calls POST /digest/generate with bearer
auth every Monday 07:00 London; digest_runs table provides
idempotency.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:09:29 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
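The verify-integrity walk described in Phase 1 of the commit above can be sketched as a recursive traversal of an image index. The code below is not the pipeline's actual step. It takes two injected callables, `fetch_manifest` and `head_status`, which are hypothetical stand-ins for registry HTTP calls, and exercises them against in-memory data.

```python
INDEX_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def digests_to_check(manifest: dict, fetch_manifest) -> list:
    """List every digest a pull would need: children, then config and layers."""
    if manifest.get("mediaType") in INDEX_TYPES:
        found = []
        for child in manifest["manifests"]:
            found.append(child["digest"])
            found.extend(digests_to_check(fetch_manifest(child["digest"]), fetch_manifest))
        return found
    return [manifest["config"]["digest"]] + [layer["digest"] for layer in manifest["layers"]]


def verify(manifest: dict, fetch_manifest, head_status) -> list:
    """Return the digests whose HEAD status is not 200."""
    return [d for d in digests_to_check(manifest, fetch_manifest) if head_status(d) != 200]


# In-memory stand-ins for the registry; all digests are made up.
store = {
    "sha256:child-amd64": {
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": {"digest": "sha256:cfg"},
        "layers": [{"digest": "sha256:layer1"}],
    },
}
index = {
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [{"digest": "sha256:child-amd64"}],
}
statuses = {"sha256:child-amd64": 200, "sha256:cfg": 200, "sha256:layer1": 404}

missing = verify(index, store.__getitem__, lambda digest: statuses.get(digest, 404))
print(missing)
```

A real probe would also need to handle a child manifest that cannot be fetched at all. This sketch assumes every child listed in the index is retrievable.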
Viktor Barzin
df2c53db8d [infra] TrueNAS decommission — remove active references from Terraform + configs
TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13.
The subagent-driven doc sweep in 5a0b24f5 covered the prose. This commit
removes the remaining in-code references:

- reverse-proxy: drop truenas Traefik ingress + Cloudflare record
  (truenas.viktorbarzin.me was 502-ing since the VM stopped), drop
  truenas_homepage_token variable.
- config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME
  truenas`, and the commented-out `iscsi`/`zabbix` A records.
- dashy/conf.yml: remove Truenas dashboard entry (&ref_28).
- monitoring/loki.yaml: change storageClass from the decommissioned
  `iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC
  (Loki is currently disabled).
- actualbudget/main.tf + freedify/main.tf: update new-deployment
  docstrings to cite Proxmox host NFS instead of TrueNAS.
- nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass
  noting the name is historical — 48 bound PVs reference it, SC names
  are immutable on PVs, rename not worth the churn.

Also cleaned out-of-band:
- Technitium DNS: deleted `truenas.viktorbarzin.lan` A and
  `iscsi.viktorbarzin.lan` CNAME records.
- Vault: `secret/viktor` → removed `truenas_api_key` and
  `truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed.
- Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas`
  destroyed the 3 K8s/Cloudflare resources cleanly.

Deferred:
- VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit
  user go-ahead.
- `nfs-truenas` StorageClass name retained (see nfs-csi comment above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:57:05 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
5f832e37d0 [monitoring] UK Payslip — add tax & pension breakdown panels
New Panel 11 (monthly) + Panel 12 (YTD cumulative), side-by-side at
y=19. Six series each: cash income tax, RSU-attributed income tax, NI,
student loan, employee pension, employer pension. Employer pension
included to show full retirement contribution picture (paid on top of
salary, not deducted from take-home). Downstream panels shifted down
by 10.
2026-04-19 16:53:32 +00:00
Viktor Barzin
ab402b3421 [monitoring] UK Payslip Panel 7 — trim to 5 semantic layers
Drop ytd_student_loan (~£200-300/mo noise) and ytd_rsu_offset (always
£0 on post-2024 Meta variant-B payslips) from the YTD uses stack. Now
mirrors Panel 1's 4-way source breakdown clarity: take-home, cash PAYE,
RSU PAYE, NI, pension. Student loan + RSU offset still surface on
Panel 8 Sankey.

Title: "YTD uses — where gross went" (mirrors Panel 1 label pattern).
2026-04-19 16:37:12 +00:00
Viktor Barzin
e55c549c9a [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.

 - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
   destroy cleaned up the StatefulSet redis-node (already scaled to 0),
   ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
   ClusterIP services that the chart owned.
 - Removed `null_resource.patch_redis_service` — the kubectl-patch hack
   that worked around the Bitnami chart's broken service selector. No
   Helm chart, no patch needed.
 - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
   deployment.
 - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
   orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
 - Simplified the top-of-file comment and the redis-v2 architecture
   comment — they talked about the parallel-cluster migration state that
   no longer exists. Folded in the sentinel hostname gotcha, the redis
   8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
   so the rationale survives in the code rather than only in beads.
 - `RedisDown` alert no longer matches `redis-node|redis-v2` — just
   `redis-v2` since that's the only StatefulSet now. Kept the `or on()
   vector(0)` so the alert fires when kube_state_metrics has no sample
   (e.g. after accidental delete).
 - `docs/architecture/databases.md` trimmed: no more "pending TF removal"
   or "cold rollback for 24h" language.

Verification after apply:
 - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
   (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
 - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
 - Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
 - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
 - Prometheus: 0 firing redis alerts.

Closes: code-v2b
Closes: code-2mw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00
Viktor Barzin
c113be4d5e [ci] Retrigger default workflow — new infra-ci image now in registry
P380/build-ci-image pushed a fresh infra-ci image with valid manifest
(sha256:d21c47c9 for amd64). Default workflow raced build-ci-image on
that pipeline and pulled the stale broken manifest. This empty commit
runs default only (build-ci-image path filter doesn't match).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:31:44 +00:00
Viktor Barzin
6371e75ef9 [ci] Rebuild infra-ci image — registry index referenced missing blobs
The infra-ci :latest (and :5319f03e) tags in the private registry resolved
to an OCI image index (sha256:7235cba7...) whose referenced amd64 manifest
(98f718c8) and attestation (27d5ab83) blobs returned 404 — either never
uploaded or garbage-collected. Every pipeline since P366 exited 126 on
image pull.

This comment-only Dockerfile change triggers build-ci-image.yml's path
filter, which rebuilds + pushes a fresh image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:29:20 +00:00
Viktor Barzin
b6cd83f85a [redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only
Phase 3 — replication chain (old → v2):
 - Discovered the v2 cluster was running redis:7.4-alpine, but the
   Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
   the 7.4 replicas rejected the stream with "Can't handle RDB format
   version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
   restore PSYNC compatibility.
 - Discovered that sentinel on BOTH v2 and old Bitnami clusters
   auto-discovered the cross-cluster replication chain when v2-0
   REPLICAOF'd the old master, triggering a failover that reparented
   old-master to a v2 replica and took HAProxy's backend offline.
   Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
   clusters) during the REPLICAOF surgery, then re-MONITOR after
   cutover. This must be done on the OLD sentinels too, not just v2 —
   they're the ones that kept fighting our REPLICAOF.
 - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
   All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:*`
   BullMQ queues and `_kombu.*` Celery queues — the user-stated
   must-survive data class.

Phase 4 — HAProxy cutover:
 - Updated `kubernetes_config_map.haproxy` to point at
   `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
   redis_sentinel backends (removed redis-node-{0,1}).
 - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
   ConfigMap apply so HAProxy's 1s health-check interval found a
   role:master within a few seconds. Cutover disruption on HAProxy
   rollout was brief; old clients naturally moved to new HAProxy pods
   within the rolling update window.
 - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
   mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
   + `announce-hostnames yes` were active — this ensures sentinel
   stores the hostname (not resolved IP) in its rewritten config, so
   pod-IP churn on restart doesn't break failover.

Phase 5 — chaos:
 - Round 1: killed master v2-0 mid-probe. First run exposed the
   sentinel IP-storage issue (stored 10.10.107.222, went stale on
   restart) — ~12s probe disruption. Fixed hostname persistence and
   re-MONITORed.
 - Round 2: killed new master v2-2 with hostnames correctly stored.
   Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
   60s, against a target of <3s of actual user-visible disruption.
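
A rough way to relate the Round 2 probe count to the <3s target, assuming
the 40 probes were evenly spaced across the 60s window (the commit does not
state the spacing). The function name is hypothetical. With N consecutive
failed probes, the outage is longer than N-1 probe intervals and shorter
than N+1.

```python
def disruption_bounds(failures: int, probes: int, window_s: float) -> tuple:
    """Bound the outage duration implied by consecutive failed probes.

    Assumes evenly spaced probes; returns (lower, upper) in seconds.
    """
    interval = window_s / probes
    lower = max(failures - 1, 0) * interval
    upper = (failures + 1) * interval
    return lower, upper


if __name__ == "__main__":
    low, high = disruption_bounds(failures=1, probes=40, window_s=60.0)
    print(f"outage between {low:.1f}s and {high:.1f}s")
```

Under that assumption a single failed probe at a 1.5s interval bounds the
outage strictly below 3s.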

Phase 6 — Nextcloud simplification:
 - `zzz-redis.config.php` no longer queries sentinel in-process —
   just points at `redis-master.redis.svc.cluster.local`. Removed 20
   lines of PHP. HAProxy handles master tracking transparently now
   that it's scaled to 3 + PDB minAvailable=2.

Phase 7 step 1:
 - `kubectl scale statefulset/redis-node --replicas=0` (transient —
   TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
   preserved as cold rollback.

Docs:
 - Rewrote `databases.md` Redis section to reflect post-cutover reality
   and the sentinel hostname gotcha (so future sessions don't relearn it).
 - `.claude/reference/service-catalog.md` entry updated.

The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
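
A minimal sketch of the bootstrap rule described above, not the actual init
container script. It assumes each peer's `INFO replication` reply is
available as text (or None if the peer is unreachable); the function names
and the `redis-v2-0` default are illustrative.

```python
from typing import Dict, Optional


def parse_info_replication(text: str) -> Dict[str, str]:
    """Parse the key:value lines of a Redis INFO replication reply."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def pick_master(peer_info: Dict[str, Optional[str]],
                default: str = "redis-v2-0") -> str:
    """Return the peer that is a master with attached replicas, else the default."""
    for pod in sorted(peer_info):
        raw = peer_info[pod]
        if raw is None:  # peer unreachable during bootstrap
            continue
        info = parse_info_replication(raw)
        is_master = info.get("role") == "master"
        has_replicas = int(info.get("connected_slaves", "0")) > 0
        if is_master and has_replicas:
            return pod
    return default
```

On a fresh boot every pod reports `role:master` with `connected_slaves:0`,
so no peer qualifies and the default (pod-0) is chosen.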

Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:13:43 +00:00
Viktor Barzin
f6685a23a9 [dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)
Workstream E of the DNS hardening push. Two independent pfSense-side
changes to eliminate single-point DNS failures and the unauthenticated
RFC 2136 update vector.

Part 1 — Multi-IP DHCP option 6
- Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24
  got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark.
- After:
  - 10.0.10/24 -> [10.0.10.1, 94.140.14.14]
  - 10.0.20/24 -> [10.0.20.1, 94.140.14.14]
- 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense
  Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2,
  94.140.14.14] so the end state is consistent across all three subnets.
- Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and
  $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's
  services_kea4_configure() implodes the array into "data: a, b" on the
  "domain-name-servers" option-data entry (services.inc L1214).
- Verified:
  - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" +
    "nameserver 94.140.14.14" after networkd renew.
  - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved
    restart.
  - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`;
    dig @10.0.20.1 google.com -> "no servers could be reached"; dig
    @94.140.14.14 google.com -> 216.58.204.110; system resolver
    (getent hosts) succeeds via the fallback IP. Blackhole route removed.
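
For illustration, the array-to-string step can be sketched in Python. This
mirrors the behaviour described above rather than pfSense's PHP code; the
function name is hypothetical, and the entry shape follows Kea's
`option-data` format (`name` plus a comma-separated `data` string).

```python
import json
from typing import Dict, List


def dns_option_data(servers: List[str]) -> Dict[str, str]:
    """Build a Kea option-data entry for DHCP option 6 from a server list."""
    return {"name": "domain-name-servers", "data": ", ".join(servers)}


if __name__ == "__main__":
    # Values taken from the "After" state for 10.0.10/24.
    print(json.dumps(dns_option_data(["10.0.10.1", "94.140.14.14"])))
```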

Part 2 — TSIG-signed Kea DHCP-DDNS
- Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and
  Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated
  update vector was latent (DDNS wiring in Kea DHCP4 is actually off
  today — "DDNS: disabled" in dhcpd.log) but would activate as soon as
  anyone turned on ddnsupdate on LAN/OPT1.
- Generated HMAC-SHA256 secret, base64-encoded 32 random bytes.
- Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27).
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium
  instances via /api/settings/set (tsigKeys[]).
- Updated kea-dhcp-ddns.conf on pfSense with
  tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …}
  and key-name: kea-ddns on each forward-ddns / reverse-ddns domain.
  Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- Configured viktorbarzin.lan + 10.0.10.in-addr.arpa +
  20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary:
  - update = UseSpecifiedNetworkACL
  - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2]
  - updateSecurityPolicies = [{tsigKeyName: kea-ddns,
                               domain: "*.<zone>", allowedTypes: [ANY]}]
  Technitium requires BOTH a source-IP match AND a valid TSIG signature.
- Verified TSIG end-to-end:
  - Signed A-record update from pfSense -> "successfully processed",
    dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo:
    hmac-sha256; TSIG Error: NoError; RCODE: NoError").
  - Signed PTR update same zone pattern -> dig -x returns tsig-test
    FQDN.
  - Unsigned update from pfSense IP (in ACL) -> "update failed:
    REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic
    Updates Security Policy").
  - Test records cleaned up via signed nsupdate.
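
One way to produce a secret of the shape described above (32 random bytes,
base64-encoded) is sketched below. The commit does not say which tool was
actually used, so treat this as an equivalent, not the original command;
the function name is hypothetical.

```python
import base64
import secrets


def generate_tsig_secret(num_bytes: int = 32) -> str:
    """Return a base64-encoded random secret suitable for an HMAC-SHA256 TSIG key."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Print only the length; the secret itself belongs in Vault, not in logs.
    secret = generate_tsig_secret()
    print(len(secret), "base64 characters")
```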

Safety
- pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip
  (145898 bytes, pre-change snapshot — keep 30d).
- DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig.
- TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on
  pfSense; not committed to git.

Docs
- architecture/dns.md: zone dynamic-updates section records the TSIG
  policy; Incident History gets a WS E entry.
- architecture/networking.md: DHCP Coverage table now shows the DNS
  option 6 values per subnet; pfSense block notes the TSIG-signed DDNS
  and config backup path.
- runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers
  key rotation, emergency bypass, and enforcement-verification.

Closes: code-o6j

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:12:23 +00:00
Viktor Barzin
a05d63eefb [ci] Fix infra pipeline image-pull — drop :5050 from infra-ci image URL
P366-P374 default workflow failed with exit 126 "image can't be pulled" — containerd
hosts.toml has a mirror entry for `registry.viktorbarzin.me` but NOT for
`registry.viktorbarzin.me:5050`, so pulls fell through to direct HTTPS on :5050
(which isn't exposed externally). Convention per infra/.claude/CLAUDE.md is the
no-port form; :5050 was an anomaly introduced by the 2026-04-15 CI perf overhaul.

build-cli/build-ci-image push paths still use :5050 and work fine — they go through
the buildx plugin (pod DNS, not node containerd). Only `image:` fields on a step
hit the broken path. Normalizing push URLs left for a follow-up.
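
The mismatch can be illustrated with a small sketch. It assumes mirror
lookup is keyed on the exact `host[:port]` string of the image reference,
which is how the failure above behaved; the helper names and the
`infra-ci:latest` path are illustrative.

```python
def registry_host(image: str) -> str:
    """Return the registry host[:port] part of a fully qualified image reference."""
    return image.split("/", 1)[0]


def has_mirror(image: str, mirrored_hosts: set) -> bool:
    """True if the image's registry host (including any port) has a mirror entry."""
    return registry_host(image) in mirrored_hosts


# Only the no-port form has a mirror entry, per the commit message.
MIRRORED = {"registry.viktorbarzin.me"}
```

With this table, `registry.viktorbarzin.me/infra-ci:latest` matches the
mirror while `registry.viktorbarzin.me:5050/infra-ci:latest` does not.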

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:00:58 +00:00
Viktor Barzin
b7ea122355 payslip-ingest: pin image_tag=4f70681d — includes migrations 0004+0005
Aligns the stack with the repo HEAD carrying migration 0004
(cash_income_tax + ytd_rsu_* columns), migration 0005 (p60_reference
table), the bonus-dedup logic, and the Woodpecker path-filter fix.

Applied + verified:
- pod rolled out with the new image, Alembic ran 0003→0004→0005
- cash_income_tax backfilled on 71/71 existing rows
- dashboard Panel 7 YTD split query returns real numbers
- no existing (tax_year, bonus) duplicates found; guard ships for the future

Closes: code-7z0
2026-04-19 15:54:24 +00:00
Viktor Barzin
33d934c32f [dns] pfSense: Unbound replaces dnsmasq (WS D)
Replace pfSense dnsmasq (DNS Forwarder) with Unbound (DNS Resolver) so
LAN-side .viktorbarzin.lan resolution survives a full Kubernetes outage.

Out-of-band pfSense changes (not in Terraform; pfSense config.xml is
VM-managed). Backup at /cf/conf/config.xml.2026-04-19-pre-unbound on-box
+ /mnt/backup/pfsense/ nightly.

- <unbound> enabled; listens on lan, opt1, wan, lo0
- <forwarding> on + <forward_tls_upstream> → DoT to Cloudflare
  (1.1.1.1 / 1.0.0.1 port 853, SNI cloudflare-dns.com)
- <dnssec>, <prefetch>, <prefetchkey>, <dnsrecordcache> (serve-expired)
- msgcachesize=256MB, cache_max_ttl=7d, cache_min_ttl=60s
- custom_options: auth-zone viktorbarzin.lan master=10.0.20.201
  fallback-enabled=yes for-upstream=yes + serve-expired-ttl=259200
- <dnsmasq><enable> removed; dnsmasq stopped
- NAT rdr WAN UDP 53 → 10.0.20.201 removed (Unbound listens on WAN now)
- Technitium zone viktorbarzin.lan: zoneTransferNetworkACL set to
  10.0.20.1, 10.0.10.1, 192.168.1.2 (pfSense source IPs)

Verified:
- unbound-control list_auth_zones: viktorbarzin.lan serial 49367
- dig @127.0.0.1 idrac.viktorbarzin.lan returns 192.168.1.4 with aa flag
  (served from auth-zone, not forwarded)
- dig @127.0.0.1 example.com +dnssec returns ad flag (DoT + validated)
- /var/unbound/viktorbarzin.lan.zone has ~114 records
- K8s outage drill passed: scale technitium=0 → dig still returns via
  WAN/LAN/OPT1 interfaces → scale restored
- LAN/management/K8s VLAN clients all resolve via pfSense 192.168.1.2 /
  10.0.10.1 / 10.0.20.1 respectively

Trade-off: Technitium Split Horizon hairpin for 192.168.1.x →
*.viktorbarzin.me (non-proxied) no longer runs via pfSense (Unbound
answers locally). Fix if it bites: switch service to proxied or add
Unbound Host Override. Documented in docs/runbooks/pfsense-unbound.md.

Closes: code-k0d
2026-04-19 15:52:41 +00:00
Viktor Barzin
bc866d53fa [servarr/mam-farming] Tune grabber for MAM's real catalogue
## Context

After the Mouse-class unblock on 2026-04-19, end-to-end testing of the
grabber revealed three issues with the plan's original filter values:

1. **`SEEDER_CEILING=50` rejects ~99% of MAM's catalogue.** MAM is a
   well-seeded private tracker — 100-700 seeders per torrent is normal.
   A ceiling of 50 makes the filter too tight: across 140 FL torrents
   sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied
   swarms") is still valid; the threshold was wrong for MAM's shape.

2. **`RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now
   over-tight.** Its job is preventing the death spiral where Mouse-class
   accounts can't announce, so any grab deepens the ratio hole. Once
   class > Mouse, MAM serves peer lists normally and demand-first
   filtering (`leechers>=1`) keeps new grabs upload-positive on average.
   With ratio sitting at 0.7 post-recovery (we over-downloaded while
   unblocking), 1.2 was preventing the very grabs that would earn us
   back to a healthy ratio.

3. **`parse_size` crashed on `"1,002.9 MiB"`.** MAM's pretty-printed
   sizes use thousands separators; `float("1,002.9")` raises
   `ValueError`. Every grabber run that hit a ≥1000-MiB candidate on
   the page crashed with a traceback instead of skipping the size.
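
A minimal sketch of a size parser that tolerates thousands separators, as
item 3 calls for. The `UNITS` table and the bytes return type are
assumptions; the real `parse_size` in `freeleech-grabber.py` may differ.

```python
# Hypothetical unit table; the real script's units are not shown in this commit.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(s: str) -> int:
    """Convert a pretty-printed size such as "1,002.9 MiB" to bytes."""
    # Strip thousands separators first: float("1,002.9") raises ValueError.
    number, unit = s.replace(",", "").split()
    return int(float(number) * UNITS[unit])


if __name__ == "__main__":
    print(parse_size("1,002.9 MiB"))
```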

## This change

- `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was
  rejecting viable demand-first candidates like `Zen and the Art of
  Motorcycle Maintenance` (S=156, L=1, score=125).
- `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips,
  but no longer a steady-state block. Class == Mouse remains an
  absolute skip (separate branch).
- `parse_size`: `s.replace(",", "").split()` before float-parse.
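
The resulting gate logic can be sketched as follows. This is an
illustration under assumptions, not the grabber's code: the `Torrent`
type, the function names, and the inclusive comparisons are hypothetical,
and the scoring formula is omitted because the commit does not give it.

```python
from dataclasses import dataclass

SEEDER_CEILING = 200  # was 50
RATIO_FLOOR = 0.5     # was 1.2


@dataclass
class Torrent:
    title: str
    seeders: int
    leechers: int


def should_run(ratio: float, user_class: str) -> bool:
    """Account-level gate: Mouse class is an absolute skip, then the ratio tripwire."""
    if user_class == "Mouse":
        return False
    return ratio >= RATIO_FLOOR


def is_candidate(t: Torrent, ceiling: int = SEEDER_CEILING) -> bool:
    """Demand-first filter: at least one leecher and not an oversupplied swarm."""
    return t.leechers >= 1 and t.seeders <= ceiling
```

With these values the `Zen and the Art of Motorcycle Maintenance` example
(S=156, L=1) passes the ceiling of 200 but would have been rejected at 50.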

## Verified post-change

Manual grabber loop (5 runs at random offsets) after applying:

    run=1  parse_size crash on "1,002.9" (this crash motivated fix #3)
    run=2  GRABBED 3 torrents:
             Dean and Me: A Love Story      (240.7 MiB, S:18, L:1)  score=194
             Digital Nature Photography      (83.7 MiB, S:42, L:1)  score=182
             Zen and the Art of Motorcycle   (830.3 MiB, S:156, L:1) score=125
    run=3-5 grabbed=0 at offsets that landed on pages with no matches
            (expected — MAM returns 20/page, many offsets yield nothing)

MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock),
BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading
~8 MiB to MAM this session across 2 of the Maxximized comic issues.

## What is NOT in this change

- No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass`
  (1h) already handle the "going back to Mouse" case; lowering the floor
  on the grabber doesn't change alerting.
- No janitor changes — the janitor rules are H&R-based and independent
  of ratio/class state.

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    $ python3 -c 'import ast; ast.parse(open(
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())'

### Manual Verification

1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1 | head -5
       Profile: ratio=0.7 class=User | Farming: 33, 2.0 GiB, tracked IDs: 4
       Search offset=<random>, found=1323, page_results=20
       Added (score=...) ...

2. Repeat 3-5× at different random offsets. At the 30-min cron cadence,
   expect 2-5 grabs per day given MAM's catalogue churn and our filter
   intersection.

## Reproduce locally

    cd infra/stacks/servarr
    ../../scripts/tg plan  # expect: 0 to add, 2 to change (configmap + cronjob)
    ../../scripts/tg apply --non-interactive
    kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
    kubectl -n servarr logs job/g1

Follow-up: `bd close code-qfs` already completed in the parent commit;
this is a post-shipping tune, no beads action needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:46:46 +00:00
Viktor Barzin
0f6321ce86 [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C)
Adds per-node DNS cache that transparently intercepts pod queries on
10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via
hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their
existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet
rollout needed for transparent mode.

Layout mirrors existing stacks (technitium, descheduler, kured):
  stacks/nodelocal-dns/
    main.tf                                 # module wiring + IP params
    modules/nodelocal-dns/main.tf           # SA, Services, ConfigMap, DS

Key decisions:
  - Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
  - Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception)
  - Upstream path: kube-dns-upstream (new headless svc) → CoreDNS pods
    (separate ClusterIP avoids cache looping back through itself)
  - viktorbarzin.lan zone forwards directly to Technitium ClusterIP
    (10.96.0.53), bypassing CoreDNS for internal names
  - priorityClassName: system-node-critical
  - tolerations: operator=Exists (runs on master + all tainted nodes)
  - No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi
  - Kyverno dns_config drift suppressed on the DaemonSet
  - Kubelet clusterDNS NOT changed — transparent mode is sufficient;
    rolling 5 nodes just to switch to 169.254.20.10 has no additional
    benefit and expands the blast radius for no reason.

Verified:
  - DaemonSet 5/5 Ready across k8s-master + 4 workers
  - dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4
  - dig @169.254.20.10 github.com -> 140.82.121.3
  - Deleted all 3 CoreDNS pods; cached queries still resolved via
    NodeLocal DNSCache (resilience confirmed)

Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table,
graph diagram, stacks table; rewrites pod DNS resolution paths to show
the cache layer; adds troubleshooting entry.

Closes: code-2k6
2026-04-19 15:46:41 +00:00