infra

Author	SHA1	Message	Date
Viktor Barzin	3f6dfb10aa	[monitoring] job-hunter: panels 6-9 for comp_points tables + trends Append the structured-comp dashboard surface to the job-hunter dashboard: Panel 6 — Per-company salary by level (p50 base, GBP table). Panel 7 — Total-comp heatmap per (company, level), p50 GBP. Panel 8 — Comp-point volume by source (daily time-series). Panel 9 — Base-salary trend (p50) over time for the top 5 companies. Adds templating: $location (multi, default london), $level (single, default senior), $company (multi, default all) — populated from comp_points + levels metadata so the selection reflects what was actually ingested. Closes: code-5ph	2026-04-19 18:50:48 +00:00
Viktor Barzin	a8280e77b6	[broker-sync] unsuspend IMAP + Panel 15 RSU vest reconciliation (Phase D) Activates the Schwab/InvestEngine IMAP ingest CronJob that's been scaffolded-but-suspended since Phase 2 of broker-sync, now that the Schwab parser can detect vest-confirmation emails. Runs nightly 02:30 UK. Current behaviour once deployed: - Trade confirmations (Schwab sell-to-cover, InvestEngine orders) → Activity rows posted to Wealthfolio. Unchanged. - Release Confirmations (Schwab RSU vests) → parser returns gross-vest BUY + sell-to-cover SELL Activities (to Wealthfolio) and a VestEvent object (NOT YET persisted — Postgres sink + DB grant pending; see follow-up under code-860). Vest detection uses a subject/body heuristic that will need tightening against a real email fixture. Panel 15 of the UK payslip dashboard added: per-vest-month join of payslip.rsu_vest vs rsu_vest_events (gross_value_gbp, tax_withheld_gbp) with delta columns. Tax-delta-percent coloured green/orange/red at 0/2%/5% thresholds. Table is empty until broker-sync starts persisting VestEvents — harmless until then. Before applying: - Verify IMAP creds in Vault (secret/broker-sync: imap_host, imap_user, imap_password, imap_directory) are still valid. - Empty vest-event table is expected; delta columns show NULL until the postgres sink lands. Part of: code-860	2026-04-19 18:29:01 +00:00
Viktor Barzin	1c0e1bcdde	[payslip-ingest] ActualBudget payroll sync CronJob + Panel 14 (Phase C) Wires the daily ActualBudget deposit sync from the payslip-ingest app into K8s as a CronJob, and adds dashboard Panel 14 to overlay bank deposits against payslip net_pay. CronJob: actualbudget-payroll-sync in payslip-ingest namespace, runs 02:00 UTC. Calls `python -m payslip_ingest sync-meta-deposits`, which hits budget-http-api-viktor in the actualbudget namespace and upserts matching Meta payroll deposits into payslip_ingest.external_meta_deposits. ExternalSecret extended with three new Vault keys: - ACTUALBUDGET_API_KEY (same as actualbudget-http-api-viktor's env API_KEY) - ACTUALBUDGET_ENCRYPTION_PASSWORD (Viktor's budget password) - ACTUALBUDGET_BUDGET_SYNC_ID (Viktor's sync_id) These must be seeded at secret/payslip-ingest in Vault before the CronJob will run — it'll CrashLoop on missing env vars otherwise. First run can be triggered on demand via `kubectl -n payslip-ingest create job --from=cronjob/actualbudget-payroll-sync initial-sync`. Panel 14 plots monthly SUM(external_meta_deposits.amount) vs SUM(payslip.net_pay), plus a delta bar series — |delta| > £50 flags likely parser drift on net_pay. Part of: code-860	2026-04-19 18:21:20 +00:00
Viktor Barzin	ef53053ae6	[job-hunter] Bump image to 48f8615d — London filter + AI CLI New image adds Alembic 0002 (primary_location column), London-default query/bands/report commands, and FX-priming on refresh so USD/EUR salaries convert correctly. Applied live; 5826 rows backfilled. Refs: code-snp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 18:13:26 +00:00
Viktor Barzin	fca3dd4976	[monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL Phase A of RSU tax spike fix. Two changes: 1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart is honest once the Phase B back-fill populates cash_income_tax on variant-A slips. For slips where cash_income_tax is already populated (variant B, 2024+) the spike is removed immediately. 2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange) highlights the population still awaiting back-fill — expected to drop to 0 after Phase B lands. Part of: code-860	2026-04-19 18:04:05 +00:00
Viktor Barzin	7e34b67f24	[docs] Architecture docs: registry integrity probe, pin, new CI pipelines Bring the architecture set in line with what's actually deployed after today's registry reliability work (commits `7cb44d72` → `42961a5f`): - docs/architecture/ci-cd.md: expand Infra Pipelines table with build-ci-image (+ verify-integrity step), registry-config-sync, pve-nfs-exports-sync, postmortem-todos, drift-detection, issue-automation, provision-user. Note registry:2.8.3 pin + integrity probe in the image-registry flow section. - docs/architecture/monitoring.md: add Registry Integrity Probe to components table; add 3-alert section (Manifest Integrity Failure / Probe Stale / Catalog Inaccessible). - .claude/CLAUDE.md: one-line on the pin, auto-sync pipeline, and the revision-link-not-blob rule so the next agent knows the right check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:51:26 +00:00
Viktor Barzin	fec0bbb7dd	[job-hunter] Pin to first built image tag 9c42eac9 Locally-built image pushed to registry.viktorbarzin.me/job-hunter:9c42eac9 after Woodpecker v3.13 Forgejo webhook parsing bug left CI unable to build the initial image (server/forge/forgejo/helper.go:57 nil pointer panic on parse — see repaired webhooks still not triggering pipelines). Unblocks code-97n (TF apply) without waiting for CI recovery. Refs: code-snp, code-0c6 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:48:16 +00:00
Viktor Barzin	42961a5f58	[registry] fix-broken-blobs.sh — check revision-link, not blob data The original index-child scan checked if the child's blob data file existed under /blobs/sha256/<child>/data. That's wrong in a subtle way: registry:2 serves a per-repo manifest via the link file at <repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision links for its index's children also disappear — but the blob data survives (GC owns that, and runs weekly). Result: blob present, link absent, API 404 on HEAD — the exact 2026-04-19 failure mode. Live proof: the registry-integrity-probe CronJob just found 38 real orphan children (including 98f718c8 from the original incident) while the previous fix-broken-blobs.sh scan reported 0. After the fix, both tools agree. The probe had been authoritative all along; the scan was a false-negative because it was asking the wrong question. Post-mortem updated to reflect the true mechanism (link-file absence, not blob deletion). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:43:35 +00:00
Viktor Barzin	f4d3fdb2e3	[monitoring] uk-payslip: drop RSU-vest annotations Vertical orange markers at every vest month added more visual noise than signal. Panel 13 (cash-only) already conveys the "no spike on vest months" story without needing markers across panels 1/2/3/7/11/12.	2026-04-19 17:32:49 +00:00
Viktor Barzin	34ee282d88	[ci] Auto-sync modules/docker-registry/* to registry VM + runbook docs Replaces the manual scp+bounce sequence that landed registry:2.8.3 on 10.0.20.10 today (see commit `7cb44d72` + nginx-DNS-trap in runbook). Addresses the "no repeat manual fixes" preference — future changes to docker-compose.yml / fix-broken-blobs.sh / nginx_registry.conf / config-private.yml / cleanup-tags.sh now deploy through CI. Pipeline (.woodpecker/registry-config-sync.yml) mirrors pve-nfs-exports-sync.yml: ssh-keyscan pin, scp the whole managed set, bounce compose only when compose-visible files changed, always restart nginx after a compose bounce (critical — nginx caches upstream DNS), end with a dry-run fix-broken-blobs.sh to catch regressions. Credentials: - Woodpecker repo-secret `registry_ssh_key` (events: push, manual) - Mirror at Vault `secret/woodpecker/registry_ssh_key` (private_key / public_key / known_hosts_entry) - Public key on /root/.ssh/authorized_keys on 10.0.20.10 - Key label: woodpecker-registry-config-sync Runbook updated with "Auto-sync pipeline" section pointing at the new flow + manual override command. Closes: code-3vl Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:32:12 +00:00
Viktor Barzin	a641dc744f	[monitoring] uk-payslip: RSU vest annotations + cash-only tax panel Panel 11 stacks RSU-attributed income tax on top of cash PAYE, which is mathematically correct but emotionally misleading since RSU tax is withheld at source via sell-to-cover and never hits the bank. Adopts the two-view convention: Panel 11 keeps the full PAYE picture; new Panel 13 shows cash-only deductions. Dashboard-level "RSU vests" annotation paints orange markers on every vest month across all timeseries panels, with tooltips like "RSU vest: £31232 gross / £15257 tax withheld". Shifts Panels 4/5/6/8/9/10 down by 9 rows to make room for Panel 13 at y=29.	2026-04-19 17:24:35 +00:00
Viktor Barzin	6e96b436b1	[docs] Capture nginx stale-DNS trap in registry-vm runbook Discovered during the 2026-04-19 registry:2.8.3 pin deploy: nginx caches its upstream DNS at startup and does NOT re-resolve after registry-* containers are recreated. Symptom was /v2/_catalog returning {"repositories": []} and /v2/ returning 200 without auth — nginx was forwarding to a stale IP that a different backend container now owns. Fix is always 'docker restart registry-nginx' after any registry-* bounce. Captured in registry-vm.md so future manual operators and the coming auto-sync pipeline (beads code-3vl) both encode the step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:24:09 +00:00
Viktor Barzin	c9d6343a9b	[job-hunter] Switch ExternalSecret to explicit UPPERCASE data mappings Replaces dataFrom.extract with per-key `data` entries so the Secret keys in K8s (and therefore env vars in the pod) are always UPPERCASE: WEBHOOK_BEARER_TOKEN, CDIO_API_KEY, SMTP_USERNAME, SMTP_PASSWORD, DIGEST_TO_ADDRESS, DIGEST_FROM_ADDRESS. Vault KV keys at secret/job-hunter stay lowercase (webhook_bearer_token etc.). Refs: code-snp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:23:28 +00:00
Viktor Barzin	9f9d7d10ff	[registry] Scope OCI-index scan to private registry only Live run on the registry VM surfaced 632 "orphaned" index children across 156 indexes in the pull-through caches (ghcr, immich, affine, linkwarden, openclaw). These aren't bugs — pull-through caches only fetch what's been requested, so missing arm64 / arm / attestation children are normal partial state. Scanning them generates noise that would mask the real signal from the private registry (where we push full manifests ourselves and a missing child IS always a bug — the 2026-04-13 + 2026-04-19 failure mode). Change: index-child scan is now gated on registry_name == "private". Layer- link scan still runs across all registries (missing blob under a live link is always a bug, regardless of pull-through semantics). Verified: live run now reports 0 orphans in private registry — consistent with the hot-fix rebuild of infra-ci:latest earlier today. Layer scan still inspects 425 links across all registries and finds 0 orphans. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:23:04 +00:00
Viktor Barzin	e7ce545da2	[job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow New service stack at stacks/job-hunter/ mirroring the payslip-ingest pattern: per-service CNPG database + role (via dbaas null_resource), Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app secrets and DB creds, Deployment with alembic-migrate init container, ClusterIP Service, Grafana datasource ConfigMap. Grafana dashboard job-hunter.json in Finance folder: new roles per day, source breakdown, top companies, GBP salary distribution, recent roles table (sorted by parse confidence then salary). n8n weekly-digest workflow calls POST /digest/generate with bearer auth every Monday 07:00 London; digest_runs table provides idempotency. Refs: code-snp Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:09:29 +00:00
Viktor Barzin	7cb44d7264	[registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery Second identical registry incident on 2026-04-19 (first 2026-04-13): the infra-ci:latest image index resolved to child manifests whose blobs had been garbage-collected out from under the index. Pipelines P366→P376 all exited 126 "image can't be pulled". Hot fix (`a05d63e` / `6371e75` / `c113be4`) restored green CI but left the underlying bug unaddressed. Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at 02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly (distribution/distribution#3324 class). Nothing verified pushes end-to-end; nothing probed the registry for fetchability; nothing caught orphan indexes. Phase 1 — Detection: - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity step walks the just-pushed manifest (index + children + config + every layer blob) via HEAD and fails the pipeline on any non-200. Catches broken pushes at the source. - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and three alerts — RegistryManifestIntegrityFailure, RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the "registry serves 404 for a tag that exists" gap that masked the incident for 2+ hours. - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause, timeline, monitoring gaps, permanent fix. Phase 2 — Prevention: - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3 across all six registry services. Removes the floating-tag footgun. - modules/docker-registry/fix-broken-blobs.sh: new scan walks every _manifests/revisions/sha256/<digest> that is an image index and logs a loud WARNING when a referenced child blob is missing. Does NOT auto- delete — deleting a published image is a conscious decision. Layer-link scan preserved. 
Phase 3 — Recovery: - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds don't need a cosmetic Dockerfile edit (matches convention from pve-nfs-exports-sync.yml). - docs/runbooks/registry-rebuild-image.md: exact command sequence for diagnosing + rebuilding after an orphan-index incident, plus a fallback for building directly on the registry VM if Woodpecker itself is down. - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md: cross-references to the new runbook. Out of scope (verified healthy or intentionally deferred): - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s). - Registry HA/replication (single-VM SPOF is a known architectural choice; Synology offsite covers RPO < 1 day). - Diun exclude for registry:2 — not applicable; Diun only watches k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose. Verified locally: - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly flags both orphan layer links and orphan OCI-index children. - terraform fmt + validate on stacks/monitoring: success (only unrelated deprecation warnings). - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and modules/docker-registry/docker-compose.yml: both parse clean. Closes: code-4b8 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:08:28 +00:00
Viktor Barzin	df2c53db8d	[infra] TrueNAS decommission — remove active references from Terraform + configs TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13. The subagent-driven doc sweep in `5a0b24f5` covered the prose. This commit removes the remaining in-code references: - reverse-proxy: drop truenas Traefik ingress + Cloudflare record (truenas.viktorbarzin.me was 502-ing since the VM stopped), drop truenas_homepage_token variable. - config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME truenas`, and the commented-out `iscsi`/`zabbix` A records. - dashy/conf.yml: remove Truenas dashboard entry (&ref_28). - monitoring/loki.yaml: change storageClass from the decommissioned `iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC (Loki is currently disabled). - actualbudget/main.tf + freedify/main.tf: update new-deployment docstrings to cite Proxmox host NFS instead of TrueNAS. - nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass noting the name is historical — 48 bound PVs reference it, SC names are immutable on PVs, rename not worth the churn. Also cleaned out-of-band: - Technitium DNS: deleted `truenas.viktorbarzin.lan` A and `iscsi.viktorbarzin.lan` CNAME records. - Vault: `secret/viktor` → removed `truenas_api_key` and `truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed. - Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas` destroyed the 3 K8s/Cloudflare resources cleanly. Deferred: - VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit user go-ahead. - `nfs-truenas` StorageClass name retained (see nfs-csi comment above). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:57:05 +00:00
Viktor Barzin	5a0b24f54e	[docs] TrueNAS decommission cleanup — remove references from active docs TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been served by Proxmox host (192.168.1.127) since. This commit scrubs remaining references from active docs. VM 9000 itself remains on PVE in stopped state pending user decision on deletion. In-session cleanup already landed: reverse-proxy ingress + Cloudflare record removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key} purged; homepage_credentials.reverse_proxy.truenas_token removed; truenas_homepage_token variable + module deleted; Loki + Dashy cleaned; config.tfvars deprecated DNS lines removed; historical-name comment added to the nfs-truenas StorageClass (48 bound PVs, immutable name — kept). Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally untouched — they describe state at a point in time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:55:43 +00:00
Viktor Barzin	5f832e37d0	[monitoring] UK Payslip — add tax & pension breakdown panels New Panel 11 (monthly) + Panel 12 (YTD cumulative), side-by-side at y=19. Six series each: cash income tax, RSU-attributed income tax, NI, student loan, employee pension, employer pension. Employer pension included to show full retirement contribution picture (paid on top of salary, not deducted from take-home). Downstream panels shifted down by 10.	2026-04-19 16:53:32 +00:00
Viktor Barzin	ab402b3421	[monitoring] UK Payslip Panel 7 — trim to 5 semantic layers Drop ytd_student_loan (~£200-300/mo noise) and ytd_rsu_offset (always £0 on post-2024 Meta variant-B payslips) from the YTD uses stack. Now mirrors Panel 1's 4-way source breakdown clarity: take-home, cash PAYE, RSU PAYE, NI, pension. Student loan + RSU offset still surface on Panel 8 Sankey. Title: "YTD uses — where gross went" (mirrors Panel 1 label pattern).	2026-04-19 16:37:12 +00:00
Viktor Barzin	e55c549c9a	[redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h with 0 alerts firing and 127 ops/sec on the v2 master — skipped the nominal 24h rollback window per user direction. - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm destroy cleaned up the StatefulSet redis-node (already scaled to 0), ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless` ClusterIP services that the chart owned. - Removed `null_resource.patch_redis_service` — the kubectl-patch hack that worked around the Bitnami chart's broken service selector. No Helm chart, no patch needed. - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy deployment. - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete). - Simplified the top-of-file comment and the redis-v2 architecture comment — they talked about the parallel-cluster migration state that no longer exists. Folded in the sentinel hostname gotcha, the redis 8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning so the rationale survives in the code rather than only in beads. - `RedisDown` alert no longer matches `redis-node|redis-v2` — just `redis-v2` since that's the only StatefulSet now. Kept the `or on() vector(0)` so the alert fires when kube_state_metrics has no sample (e.g. after accidental delete). - `docs/architecture/databases.md` trimmed: no more "pending TF removal" or "cold rollback for 24h" language. Verification after apply: - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-* (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only. - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted). - Sentinel: all 3 agree mymaster = redis-v2-0 hostname. - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master. 
- Prometheus: 0 firing redis alerts. Closes: code-v2b Closes: code-2mw Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:32:14 +00:00
Viktor Barzin	c113be4d5e	[ci] Retrigger default workflow — new infra-ci image now in registry P380/build-ci-image pushed a fresh infra-ci image with valid manifest (sha256:d21c47c9 for amd64). Default workflow raced build-ci-image on that pipeline and pulled the stale broken manifest. This empty commit runs default only (build-ci-image path filter doesn't match). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:31:44 +00:00
Viktor Barzin	6371e75ef9	[ci] Rebuild infra-ci image — registry index referenced missing blobs The infra-ci :latest (and :`5319f03e`) tags in the private registry resolved to an OCI image index (sha256:7235cba7...) whose referenced amd64 manifest (98f718c8) and attestation (27d5ab83) blobs returned 404 — either never uploaded or garbage-collected. Every pipeline since P366 exited 126 on image pull. This comment-only Dockerfile change triggers build-ci-image.yml's path filter, which rebuilds + pushes a fresh image. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:29:20 +00:00
Viktor Barzin	b6cd83f85a	[redis] Phase 3-7: cutover to redis-v2, Nextcloud HAProxy-only Phase 3 — replication chain (old → v2): - Discovered the v2 cluster was running redis:7.4-alpine, but the Bitnami old master ships redis 8.6.2 which writes RDB format 13 — the 7.4 replicas rejected the stream with "Can't handle RDB format version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to restore PSYNC compatibility. - Discovered that sentinel on BOTH v2 and old Bitnami clusters auto-discovered the cross-cluster replication chain when v2-0 REPLICAOF'd the old master, triggering a failover that reparented old-master to a v2 replica and took HAProxy's backend offline. Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both clusters) during the REPLICAOF surgery, then re-MONITOR after cutover. This must be done on the OLD sentinels too, not just v2 — they're the ones that kept fighting our REPLICAOF. - Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0. All 76 keys (db0:76, db1:22, db4:16) synced including `immich_bull:` BullMQ queues and `_kombu.` Celery queues — the user-stated must-survive data class. Phase 4 — HAProxy cutover: - Updated `kubernetes_config_map.haproxy` to point at `redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and redis_sentinel backends (removed redis-node-{0,1}). - Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the ConfigMap apply so HAProxy's 1s health-check interval found a role:master within a few seconds. Cutover disruption on HAProxy rollout was brief; old clients naturally moved to new HAProxy pods within the rolling update window. - Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes` + `announce-hostnames yes` were active — this ensures sentinel stores the hostname (not resolved IP) in its rewritten config, so pod-IP churn on restart doesn't break failover. Phase 5 — chaos: - Round 1: killed master v2-0 mid-probe. 
First run exposed the sentinel IP-storage issue (stored 10.10.107.222, went stale on restart) — ~12s probe disruption. Fixed hostname persistence and re-MONITORed. - Round 2: killed new master v2-2 with hostnames correctly stored. Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over 60s — target <3s of actual user-visible disruption. Phase 6 — Nextcloud simplification: - `zzz-redis.config.php` no longer queries sentinel in-process — just points at `redis-master.redis.svc.cluster.local`. Removed 20 lines of PHP. HAProxy handles master tracking transparently now that it's scaled to 3 + PDB minAvailable=2. Phase 7 step 1: - `kubectl scale statefulset/redis-node --replicas=0` (transient — TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}` preserved as cold rollback. Docs: - Rewrote `databases.md` Redis section to reflect post-cutover reality and the sentinel hostname gotcha (so future sessions don't relearn it). - `.claude/reference/service-catalog.md` entry updated. The parallel-bootstrap race documented in the previous commit is still worth watching — the init container now defaults to pod-0 as master when no peer reports role:master-with-slaves, so fresh boots land in a deterministic topology. Closes: code-7n4 Closes: code-9y6 Closes: code-cnf Closes: code-tc4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:13:43 +00:00
Viktor Barzin	f6685a23a9	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E) Workstream E of the DNS hardening push. Two independent pfSense-side changes to eliminate single-point DNS failures and the unauthenticated RFC 2136 update vector. Part 1 — Multi-IP DHCP option 6 - Before: clients on 10.0.10/24 got only 10.0.10.1; clients on 10.0.20/24 got only 10.0.20.1. Internal resolver outage == cluster-wide DNS dark. - After: - 10.0.10/24 -> [10.0.10.1, 94.140.14.14] - 10.0.20/24 -> [10.0.20.1, 94.140.14.14] - 192.168.1/24 deliberately untouched (served by TP-Link AP, not pfSense Kea — pfSense WAN DHCP is disabled); already ships [192.168.1.2, 94.140.14.14] so the end state is consistent across all three subnets. - Applied via PHP: set $cfg['dhcpd']['lan']['dnsserver'] and $cfg['dhcpd']['opt1']['dnsserver'] as arrays. pfSense's services_kea4_configure() implodes the array into "data: a, b" on the "domain-name-servers" option-data entry (services.inc L1214). - Verified: - DevVM (10.0.10.10) resolv.conf shows "nameserver 10.0.10.1" + "nameserver 94.140.14.14" after networkd renew. - k8s-node1 (10.0.20.101) same after networkctl reload + systemd-resolved restart. - Fallback drill on k8s-node1: `ip route add blackhole 10.0.20.1/32`; dig @10.0.20.1 google.com -> "no servers could be reached"; dig @94.140.14.14 google.com -> 216.58.204.110; system resolver (getent hosts) succeeds via the fallback IP. Blackhole route removed. Part 2 — TSIG-signed Kea DHCP-DDNS - Before: /usr/local/etc/kea/kea-dhcp-ddns.conf had `tsig-keys: []` and Technitium's viktorbarzin.lan zone had update=Deny. Unauthenticated update vector was latent (DDNS wiring in Kea DHCP4 is actually off today — "DDNS: disabled" in dhcpd.log) but would activate as soon as anyone turned on ddnsupdate on LAN/OPT1. - Generated HMAC-SHA256 secret, base64-encoded 32 random bytes. - Stored in Vault: secret/viktor/kea_ddns_tsig_secret (version 27). 
- Created TSIG key "kea-ddns" on primary/secondary/tertiary Technitium instances via /api/settings/set (tsigKeys[]). - Updated kea-dhcp-ddns.conf on pfSense with tsig-keys[]={name: "kea-ddns", algorithm: "HMAC-SHA256", secret: …} and key-name: kea-ddns on each forward-ddns / reverse-ddns domain. Pre-change backup at /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - Configured viktorbarzin.lan + 10.0.10.in-addr.arpa + 20.0.10.in-addr.arpa + 1.168.192.in-addr.arpa on Technitium primary: - update = UseSpecifiedNetworkACL - updateNetworkACL = [10.0.20.1, 10.0.10.1, 192.168.1.2] - updateSecurityPolicies = [{tsigKeyName: kea-ddns, domain: "*.<zone>", allowedTypes: [ANY]}] Technitium requires BOTH a source-IP match AND a valid TSIG signature. - Verified TSIG end-to-end: - Signed A-record update from pfSense -> "successfully processed", dig returns 10.99.99.99 (log: "TSIG KeyName: kea-ddns; TSIG Algo: hmac-sha256; TSIG Error: NoError; RCODE: NoError"). - Signed PTR update same zone pattern -> dig -x returns tsig-test FQDN. - Unsigned update from pfSense IP (in ACL) -> "update failed: REFUSED" (log: "refused a zone UPDATE request [...] due to Dynamic Updates Security Policy"). - Test records cleaned up via signed nsupdate. Safety - pfSense config backup: /cf/conf/config.xml.2026-04-19-pre-kea-multi-ip (145898 bytes, pre-change snapshot — keep 30d). - DDNS config backup: /usr/local/etc/kea/kea-dhcp-ddns.conf.2026-04-19-pre-tsig. - TSIG secret lives only in Vault + in config.xml/kea-dhcp-ddns.conf on pfSense; not committed to git. Docs - architecture/dns.md: zone dynamic-updates section records the TSIG policy; Incident History gets a WS E entry. - architecture/networking.md: DHCP Coverage table now shows the DNS option 6 values per subnet; pfSense block notes the TSIG-signed DDNS and config backup path. - runbooks/pfsense-unbound.md: new "Kea DHCP-DDNS TSIG" section covers key rotation, emergency bypass, and enforcement-verification. 
Closes: code-o6j Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:12:23 +00:00
Viktor Barzin	a05d63eefb	[ci] Fix infra pipeline image-pull — drop :5050 from infra-ci image URL P366-P374 default workflow failed with exit 126 "image can't be pulled" — containerd hosts.toml has a mirror entry for `registry.viktorbarzin.me` but NOT for `registry.viktorbarzin.me:5050`, so pulls fell through to direct HTTPS on :5050 (which isn't exposed externally). Convention per infra/.claude/CLAUDE.md is the no-port form; :5050 was an anomaly introduced by the 2026-04-15 CI perf overhaul. build-cli/build-ci-image push paths still use :5050 and work fine — they go through the buildx plugin (pod DNS, not node containerd). Only `image:` fields on a step hit the broken path. Normalizing push URLs left for a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:00:58 +00:00
Viktor Barzin	b7ea122355	payslip-ingest: pin image_tag=4f70681d — includes migrations 0004+0005 Aligns the stack with the repo HEAD carrying migration 0004 (cash_income_tax + ytd_rsu_* columns), migration 0005 (p60_reference table), the bonus-dedup logic, and the Woodpecker path-filter fix. Applied + verified: - pod rolled out with the new image, Alembic ran 0003→0004→0005 - cash_income_tax backfilled on 71/71 existing rows - dashboard Panel 7 YTD split query returns real numbers - no existing (tax_year, bonus) duplicates found — guard ships for future Closes: code-7z0	2026-04-19 15:54:24 +00:00
Viktor Barzin	33d934c32f	[dns] pfSense: Unbound replaces dnsmasq (WS D) Replace pfSense dnsmasq (DNS Forwarder) with Unbound (DNS Resolver) so LAN-side .viktorbarzin.lan resolution survives a full Kubernetes outage. Out-of-band pfSense changes (not in Terraform; pfSense config.xml is VM-managed). Backup at /cf/conf/config.xml.2026-04-19-pre-unbound on-box + /mnt/backup/pfsense/ nightly. - <unbound> enabled; listens on lan, opt1, wan, lo0 - <forwarding> on + <forward_tls_upstream> → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853, SNI cloudflare-dns.com) - <dnssec>, <prefetch>, <prefetchkey>, <dnsrecordcache> (serve-expired) - msgcachesize=256MB, cache_max_ttl=7d, cache_min_ttl=60s - custom_options: auth-zone viktorbarzin.lan master=10.0.20.201 fallback-enabled=yes for-upstream=yes + serve-expired-ttl=259200 - <dnsmasq><enable> removed; dnsmasq stopped - NAT rdr WAN UDP 53 → 10.0.20.201 removed (Unbound listens on WAN now) - Technitium zone viktorbarzin.lan: zoneTransferNetworkACL set to 10.0.20.1, 10.0.10.1, 192.168.1.2 (pfSense source IPs) Verified: - unbound-control list_auth_zones: viktorbarzin.lan serial 49367 - dig @127.0.0.1 idrac.viktorbarzin.lan returns 192.168.1.4 with aa flag (served from auth-zone, not forwarded) - dig @127.0.0.1 example.com +dnssec returns ad flag (DoT + validated) - /var/unbound/viktorbarzin.lan.zone has ~114 records - K8s outage drill passed: scale technitium=0 → dig still returns via WAN/LAN/OPT1 interfaces → scale restored - LAN/management/K8s VLAN clients all resolve via pfSense 192.168.1.2 / 10.0.10.1 / 10.0.20.1 respectively Trade-off: Technitium Split Horizon hairpin for 192.168.1.x → *.viktorbarzin.me (non-proxied) no longer runs via pfSense (Unbound answers locally). Fix if it bites: switch service to proxied or add Unbound Host Override. Documented in docs/runbooks/pfsense-unbound.md. Closes: code-k0d	2026-04-19 15:52:41 +00:00
Viktor Barzin	bc866d53fa	[servarr/mam-farming] Tune grabber for MAM's real catalogue ## Context After the Mouse-class unblock on 2026-04-19, end-to-end testing of the grabber revealed three issues with the plan's original filter values: 1. `SEEDER_CEILING=50` rejects ~99% of MAM's catalogue. MAM is a well-seeded private tracker — 100-700 seeders per torrent is normal. A ceiling of 50 makes the filter too tight: across 140 FL torrents sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied swarms") is still valid; the threshold was wrong for MAM's shape. 2. `RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now over-tight. Its job is preventing the death spiral where Mouse-class accounts can't announce, so any grab deepens the ratio hole. Once class > Mouse, MAM serves peer lists normally and demand-first filtering (`leechers>=1`) keeps new grabs upload-positive on average. With ratio sitting at 0.7 post-recovery (we over-downloaded while unblocking), 1.2 was preventing the very grabs that would earn us back to healthy ratio. 3. `parse_size` crashed on `"1,002.9 MiB"`. MAM's pretty-printed sizes use thousands separators; `float("1,002.9")` raises `ValueError`. Every grabber run that hit a ≥1000-MiB candidate on the page crashed with a traceback instead of skipping the size. ## This change - `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was rejecting viable demand-first candidates like `Zen and the Art of Motorcycle Maintenance` (S=156, L=1, score=125). - `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips, but no longer a steady-state block. Class == Mouse remains an absolute skip (separate branch). - `parse_size`: `s.replace(",", "").split()` before int-parse. 
## Verified post-change Manual grabber loop (5 runs at random offsets) after applying: run=1 parse_size crash on "1,002.9" (this crash motivated fix #3) run=2 GRABBED 3 torrents: Dean and Me: A Love Story (240.7 MiB, S:18, L:1) score=194 Digital Nature Photography (83.7 MiB, S:42, L:1) score=182 Zen and the Art of Motorcycle (830.3 MiB, S:156, L:1) score=125 run=3-5 grabbed=0 at offsets that landed on pages with no matches (expected — MAM returns 20/page, many offsets yield nothing) MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock), BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading ~8 MiB to MAM this session across 2 of the Maxximized comic issues. ## What is NOT in this change - No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass` (1h) already handle the "going back to Mouse" case; lowering the floor on the grabber doesn't change alerting. - No janitor changes — the janitor rules are H&R-based and independent of ratio/class state. ## Test plan ### Automated $ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive Apply complete! Resources: 0 added, 2 changed, 0 destroyed. $ python3 -c 'import ast; ast.parse(open( "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())' ### Manual Verification 1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7: $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1 $ kubectl -n servarr logs job/g1 \| head -5 Profile: ratio=0.7 class=User \| Farming: 33, 2.0 GiB, tracked IDs: 4 Search offset=<random>, found=1323, page_results=20 Added (score=...) ... 2. Repeat 3-5× at different random offsets. Over the course of a 30-min cron cadence, expect 2-5 grabs across the day given MAM's catalogue churn and our filter intersection. 
## Reproduce locally cd infra/stacks/servarr ../../scripts/tg plan # expect: 0 to add, 2 to change (configmap + cronjob) ../../scripts/tg apply --non-interactive kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1 kubectl -n servarr logs job/g1 Follow-up: `bd close code-qfs` already completed in the parent commit; this is a post-shipping tune, no beads action needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:46:46 +00:00
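The `parse_size` fix described in the commit above can be sketched as follows. This is a minimal illustration, not the code from `freeleech-grabber.py`: the unit table, the byte-valued `int` return type, and returning `None` for unparseable input are all assumptions made for the example.

```python
from typing import Optional

# Binary unit multipliers. The exact set of units MAM prints is an assumption.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(text: str) -> Optional[int]:
    """Parse a pretty-printed size such as "1,002.9 MiB" into bytes.

    Returns None instead of raising when the string cannot be parsed,
    so a caller can skip the candidate rather than crash the whole run.
    """
    # Strip thousands separators before splitting into "<number> <unit>".
    parts = text.replace(",", "").split()
    if len(parts) != 2 or parts[1] not in _UNITS:
        return None
    try:
        value = float(parts[0])
    except ValueError:
        return None
    return int(value * _UNITS[parts[1]])
```

Returning `None` rather than raising is one way to get the "skip the size instead of crashing" behaviour the commit asks for; the real script may handle the failure differently.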
Viktor Barzin	0f6321ce86	[dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) Adds a per-node DNS cache that transparently intercepts pod queries on 10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet rollout needed for transparent mode. Layout mirrors existing stacks (technitium, descheduler, kured): stacks/nodelocal-dns/ main.tf # module wiring + IP params modules/nodelocal-dns/main.tf # SA, Services, ConfigMap, DS Key decisions: - Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1 - Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception) - Upstream path: kube-dns-upstream (new headless svc) → CoreDNS pods (separate ClusterIP avoids cache looping back through itself) - viktorbarzin.lan zone forwards directly to Technitium ClusterIP (10.96.0.53), bypassing CoreDNS for internal names - priorityClassName: system-node-critical - tolerations: operator=Exists (runs on master + all tainted nodes) - No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi - Kyverno dns_config drift suppressed on the DaemonSet - Kubelet clusterDNS NOT changed — transparent mode is sufficient; rolling 5 nodes just to switch to 169.254.20.10 would bring no additional benefit and would expand the blast radius for no reason. Verified: - DaemonSet 5/5 Ready across k8s-master + 4 workers - dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4 - dig @169.254.20.10 github.com -> 140.82.121.3 - Deleted all 3 CoreDNS pods; cached queries still resolved via NodeLocal DNSCache (resilience confirmed) Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table, graph diagram, stacks table; rewrites pod DNS resolution paths to show the cache layer; adds troubleshooting entry. Closes: code-2k6	2026-04-19 15:46:41 +00:00
Viktor Barzin	eb6ceac5f5	[dns] static-client DNS — Proxmox host, registry VM dual-resolver setup (WS F) Fixes single-upstream DNS brittleness on non-DHCP hosts. Each host now has a primary internal resolver + external fallback (AdGuard) so DNS keeps working if the primary resolver IP is unreachable. New config: - Proxmox host (192.168.1.127): plain /etc/resolv.conf with nameserver 192.168.1.2 (pfSense LAN) + 94.140.14.14 (AdGuard). Previously: single nameserver 192.168.1.1 — could not resolve internal .lan names at all. Documented in docs/runbooks/proxmox-host.md. - Registry VM (10.0.20.10): systemd-resolved drop-in at /etc/systemd/resolved.conf.d/10-internal-dns.conf (DNS=10.0.20.1, FallbackDNS=94.140.14.14, Domains=viktorbarzin.lan) plus matching per-link nameservers in /etc/netplan/50-cloud-init.yaml. Previously: 1.1.1.1 + 8.8.8.8 only — image pulls referencing .lan hostnames would fail to resolve. Documented in docs/runbooks/registry-vm.md. - TrueNAS (10.0.10.15): host unreachable during this session ("No route to host" on 10.0.10.0/24). Deferred best-effort per WS F instructions; noted on the beads task. Both hosts have pre-change backups at /root/dns-backups/ for one-command rollback. Fallback behaviour was validated by routing each primary to a blackhole and confirming dig answered from the fallback. Both runbooks include the verified resolvectl / resolv.conf state, the fallback-test procedure, and the rollback steps. Closes: code-dw8	2026-04-19 15:43:49 +00:00
Viktor Barzin	3b54983a9f	[ci] build-cli: add logins entry for registry.viktorbarzin.me:5050 ## Context The infra CLI image (`viktorbarzin/infra` + `registry.viktorbarzin.me:5050/infra`) is built by `.woodpecker/build-cli.yml` via plugin-docker-buildx and pushed to two repos. The private-registry htpasswd auth that went in on 2026-03-22 (memory 437) was never wired into this pipeline, so the second push has been failing with `401 Unauthorized` on every blob HEAD for ~4 weeks. That in turn kept every infra pipeline's overall status at `failure`, which fooled the service-upgrade agent into spurious rollbacks before the per-workflow check in bd code-3o3. Now that the agent ignores overall status, this is purely cosmetic — but worth fixing so the pipeline list goes green and the private- registry mirror of the infra CLI image stays fresh. ## This change Extend the plugin's `logins:` array with an entry for `registry.viktorbarzin.me:5050`, pulling credentials from two Woodpecker global secrets `registry_user` / `registry_password`. Secrets plumbing (no CI config changes needed long-term — already `vault-woodpecker-sync` compatible): - Vault `secret/ci/global` now carries `registry_user` + `registry_password`, copied from `secret/viktor` via `vault kv patch`. - `vault-woodpecker-sync` CronJob picks them up on next run and POSTs them to Woodpecker via the API. Also triggered manually as `manual-sync-1776613321` → "Synced 8 global secrets from Vault to Woodpecker". - `curl -H "Authorization: Bearer <wp-api-token>" .../api/secrets` now lists both `registry_user` and `registry_password`. ## What is NOT in this change - A follow-on cleanup of the `docker_username`/`docker_password` globals (which are actually DockerHub creds mis-named). They still work — renaming would cascade across several older pipelines. - Restoring inline BuildKit cache — commit `0c123903` disabled `cache_from/cache_to` due to registry cache corruption; leaving that alone here. 
## Test Plan ### Automated Will be validated by the CI run of this very commit: - `build-cli` workflow should log `#14 [auth] viktor/registry.viktorbarzin.me:5050` successful - blob HEAD returns 200/404 instead of 401 - step `build-image` exits 0 - overall pipeline status: success (FINALLY) ### Manual Verification ``` $ curl -sS -H "Authorization: Bearer $(vault kv get -field=woodpecker_api_token secret/ci/global)" \ https://ci.viktorbarzin.me/api/secrets \| jq '.[] \| .name' \| grep registry "registry_password" "registry_user" $ curl -sSI -u viktor:$PASS https://registry.viktorbarzin.me:5050/v2/infra/manifests/<8-char-sha> HTTP/2 200 ``` Closes: code-12b Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:42:52 +00:00
Viktor Barzin	364df9f2ea	[dns] readiness gate — replace auth-required zone-count probe with DNS parity check Zone-count parity required hitting /api/zones/list which requires auth. The null_resource has no access to the Technitium admin password (it's declared `sensitive = true` on the module variable), so we were probing with an empty token and getting 200 OK with an error JSON — silently returning 0 zones for every instance. Replaced the HTTP probe with a second DNS check: dig idrac.viktorbarzin.lan on each pod, require the same A record from all three. This catches both "zone not loaded on an instance" and "zone drift between primary and replicas" without needing any HTTP client or credentials. The AXFR chain guarantees all three should converge on the same value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:24:56 +00:00
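The DNS parity rule from the commit above (every instance must return the same A record) can be sketched as a pure function. The real gate runs `dig` from a Terraform `null_resource`; the function name, the input shape (pod name mapped to the list of answers from `dig +short`), and the pod names in the usage line are hypothetical.

```python
from typing import Dict, List


def dns_parity_ok(answers: Dict[str, List[str]]) -> bool:
    """Return True when every instance returned the same non-empty answer set."""
    if not answers:
        return False
    # Order of records in a dig response is not significant, so compare as sets.
    normalised = [frozenset(records) for records in answers.values()]
    if any(len(records) == 0 for records in normalised):
        return False  # an instance has not loaded the zone (or did not answer)
    return len(set(normalised)) == 1
```

For example, `dns_parity_ok({"primary": ["192.168.1.4"], "secondary": ["192.168.1.4"], "tertiary": []})` evaluates to `False`, which corresponds to the "zone not loaded on an instance" case.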
Viktor Barzin	f09be1524d	monitoring: split income_tax cash/RSU + add P60 & HMRC reconciliation panels Panel 7 (YTD uses): replace the single `ytd_income_tax` stack segment with two — `ytd_cash_income_tax` (full red, same color as before) and `ytd_rsu_income_tax` (desaturated orange) — computed from the new `cash_income_tax` column on payslip. RSU-vest months now visually separate the cash tax from the PAYE attributable to the grossed-up RSU, matching user mental model of "what I actually paid in cash tax". Panel 8 (Sankey): split the single `Gross → Income Tax` edge into two edges (`Gross → Income Tax (cash)` and `Gross → Income Tax (RSU)`) sourcing the same two figures. Panel 3 (effective rate): left untouched — it's the "all-in" rate and keeps using raw `income_tax`. Panel 9 (P60 reconciliation — new): per-tax-year table comparing HMRC P60 annual figures against SUM(payslip) via LATERAL JOIN on payslip_ingest.p60_reference. Threshold-coloured delta columns (\|Δ\|<1 green, 1-50 yellow, >50 red) surface missing months or parser drift. Panel 10 (HMRC Tax Year Reconciliation — new): placeholder for the hmrc-sync service (code scaffolded, awaiting HMRC prod approval to activate). Queries `hmrc_sync.tax_year_snapshot`; renders empty until that schema lands. Delta > £10 → red. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:23:36 +00:00
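The Panel 9 delta thresholds in the commit above (|Δ|<1 green, 1-50 yellow, >50 red) amount to the small mapping below. The dashboard applies them as Grafana thresholds, not in Python; the function name is hypothetical and the sketch only makes the boundary handling explicit.

```python
def delta_colour(delta: float) -> str:
    """Map a reconciliation delta to the colour band described for Panel 9."""
    magnitude = abs(delta)
    if magnitude < 1:
        return "green"
    if magnitude <= 50:
        return "yellow"
    return "red"
```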
Viktor Barzin	91aa39ef96	[dns] readiness gate — reject all-zero zone counts as probe failure The zone-count parity check was trivially passing when the ephemeral curl pod failed to reach the Technitium web API: all three counts came back as 0, UNIQ=1, gate claimed "PASSED". This happened during today's DNS hardening apply when CoreDNS was in CrashLoopBackOff and the curl pod couldn't resolve service names. Added a MIN > 0 sanity check. Technitium always has built-in zones (localhost, standard reverse PTRs), so a zero count means the probe didn't reach the API, not that the instance truly has zero zones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:23:07 +00:00
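The failure mode in the commit above (three zero counts look "equal", so the parity check passes) and the `MIN > 0` guard can be sketched as follows. The function name and the list-of-ints input are assumptions; the real gate is a shell check inside a `null_resource`.

```python
from typing import Sequence


def zone_count_gate(counts: Sequence[int]) -> bool:
    """Pass only when all instances report the same, strictly positive zone count."""
    if not counts:
        return False
    # A zero count means the probe never reached the API, since every
    # instance always carries built-in zones.
    if min(counts) <= 0:
        return False
    return len(set(counts)) == 1
```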
Viktor Barzin	150f196095	[redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm release so data can migrate via REPLICAOF during a future short maintenance window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still points at redis-node-{0,1}. Architecture: - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter - podManagementPolicy=Parallel + init container that writes fresh sentinel.conf on every boot by probing peer sentinels and redis for consensus master (priority: sentinel vote > role:master with slaves > pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM. - redis.conf `include /shared/replica.conf` — init container writes `replicaof <master> 6379` for non-master pods so they come up already in the correct role. No bootstrap race. - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn. - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec. - PodDisruptionBudget minAvailable=2. Also: - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes the sole client-facing path for all 17 consumers. - New Prometheus alerts: RedisMemoryPressure, RedisEvictions, RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong, RedisReplicasMissing. Updated RedisDown to cover both statefulsets during the migration. - databases.md updated to describe the interim parallel-cluster state. Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded into Prometheus and inactive. Beads: code-v2b (still in progress — Phase 3-7 await maintenance window). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:23:05 +00:00
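The master-selection priority used by the init container in the commit above (sentinel vote, then a `role:master` instance that has replicas, then pod-0) can be sketched as a pure function. The input shapes, the function name, and the use of the most common sentinel answer as "consensus" are assumptions; the actual init container is not shown in the log.

```python
from collections import Counter
from typing import Dict, List, Mapping


def pick_master(
    sentinel_votes: List[str],
    redis_info: Dict[str, Mapping[str, object]],
    fallback: str,
) -> str:
    """Choose the host that replicas should follow.

    sentinel_votes: master host reported by each reachable peer sentinel.
    redis_info: host -> {"role": ..., "connected_slaves": ...} for reachable peers.
    fallback: host to use when nothing else is known (pod-0 in the commit).
    """
    # 1. Sentinel vote: take the answer reported most often (assumed rule).
    if sentinel_votes:
        return Counter(sentinel_votes).most_common(1)[0][0]
    # 2. A peer that claims role:master AND actually has replicas attached.
    candidates = sorted(
        host
        for host, info in redis_info.items()
        if info.get("role") == "master" and int(info.get("connected_slaves", 0)) > 0
    )
    if candidates:
        return candidates[0]
    # 3. Nothing to go on (cold start): fall back to pod-0.
    return fallback
```

Requiring attached replicas in step 2 keeps a freshly booted pod, which starts as a replica-less master, from being mistaken for the real master.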
Viktor Barzin	6ee283c2f0	[docs] Document external-monitor opt-out mechanism in monitoring.md The doc said monitors were created for everything in cloudflare_proxied_names, but since the k8s-api discovery rewrite the ConfigMap is a fallback only. Describe the opt-OUT semantics and how external_monitor=false on a factory call translates to the sync script's skip annotation.	2026-04-19 15:19:06 +00:00
Viktor Barzin	af6574a006	[dns] Fix CoreDNS serve_stale syntax — 24h TTL, no refresh-mode arg CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`: plugin/cache: invalid value for serve_stale refresh mode: 86400s serve_stale takes one DURATION and an optional refresh_mode keyword ("immediate" or "verify"), not two durations. Simplified to `serve_stale 86400s` (serve cached entries for up to 24h when upstream is unreachable). The new CoreDNS pods were CrashLoopBackOff; the two old pods kept serving traffic so there was no outage, but the partial apply left the cluster wedged with the bad ConfigMap. Also collapses the inline viktorbarzin.lan cache block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:18:43 +00:00
Viktor Barzin	752f94ab8f	[monitoring] Opt-out external monitor for family/mladost3/task-webhook/torrserver; drop r730 The `external-monitor-sync` script monitors every *.viktorbarzin.me ingress by default (opt-out semantics), so a missing annotation means "monitored." Both ingress factories previously OMITTED the annotation when `external_monitor = false`, which silently left monitors in place. Fix: when the caller sets `external_monitor = false` explicitly, emit `uptime.viktorbarzin.me/external-monitor = "false"` so the sync script deletes the monitor. Keep the previous behavior (no annotation) for callers that leave external_monitor null — otherwise 19 publicly-reachable services with `dns_type="none"` would lose monitoring. Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy) to match the other two already-flagged services. Delete the r730 ingress module entirely — the Dell server has been decommissioned.	2026-04-19 15:18:27 +00:00
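The tri-state handling of `external_monitor` in the commit above can be sketched as follows. The factories are Terraform modules, so this Python function is only an illustration of the decision table. Treating `True` the same as `None` (no annotation emitted) is an assumption, since the commit only describes the `false` and null cases.

```python
from typing import Dict, Optional

ANNOTATION = "uptime.viktorbarzin.me/external-monitor"


def monitor_annotations(external_monitor: Optional[bool]) -> Dict[str, str]:
    """Return the ingress annotations implied by the external_monitor setting."""
    if external_monitor is False:
        # Explicit opt-out: the sync script needs to see "false" to delete the monitor.
        return {ANNOTATION: "false"}
    # None (unset) keeps the previous behaviour: no annotation, so the
    # sync script's default of "monitored" applies.
    # True is assumed to behave the same way.
    return {}
```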
Viktor Barzin	a0d770d9a7	[cluster-health] Expand to 42 checks, remove pod CronJob path - scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager readiness/expiry/requests, backup freshness per-DB/offsite/LVM, monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS to 42, add --no-fix flag. - Remove the duplicate pod-version .claude/cluster-health.sh (1728 lines) and the openclaw cluster_healthcheck CronJob (local CLI is now the single authoritative runner). Keep the healthcheck SA + Role + RoleBinding — still reused by task_processor CronJob. - Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete the unused setup-monitoring.sh. - Rewrite .claude/skills/cluster-health/SKILL.md: mandates running the script first, refreshes the 42-check table, drops stale CronJob/Slack/post-mortem sections, documents the monorepo-canonical + hardlink layout. File is hardlinked to /home/wizard/code/.claude/skills/cluster-health/SKILL.md for dual discovery. - AGENTS.md + k8s-portal agent page: 25-check → 42-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:13:03 +00:00
Viktor Barzin	5ea079181f	[dns] Technitium — raise memory limit to 2Gi (was 1Gi, originally 512Mi) Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi leaves enough headroom for normal operation but thin margin if blocklists or cache grow. User escalated: OOM cascades are the exact failure mode that causes user-visible DNS outages, so give a full 2x safety margin across all three instances. Replicas currently use 124-155Mi steady-state so they have enormous headroom at 2Gi — accepted for symmetry and future growth (OISD blocklists, in-memory cache). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 15:08:04 +00:00
Viktor Barzin	a86a97deb7	[reverse-proxy] Fix gw.viktorbarzin.me — point at 192.168.1.1 via EndpointSlice The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but Technitium has no record for that name (the router isn't a DHCP client and Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and the `[External] gw` Uptime Kuma monitor was permanently failing. Factory now accepts `backend_ip` as an alternative to `external_name`: it creates a selector-less ClusterIP Service + manual EndpointSlice pointing at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1); the old ExternalName path is retained for every other service. Also add a direct `port` monitor for the router in uptime-kuma's internal_monitors list so we can tell a Cloudflare/tunnel outage apart from the router itself being down. Extended the internal-monitor-sync script to handle non-DB monitor types (hostname + port fields).	2026-04-19 15:07:24 +00:00
Viktor Barzin	4b39fbb717	[dns] readiness gate — use dig-in-pod + retries, ephemeral curl pod for zone parity Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod health check from wget against /api to dig +short against 127.0.0.1. This probes the actual DNS serving path, which is what we care about anyway. Zone-count parity can't be done inside the Technitium pod (no HTTP client), so it spawns a short-lived curlimages/curl pod via kubectl run --rm that curls the three internal web services and exits. Added retry loop on the dig check (6 × 10s) to tolerate zone-load delay after a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to load into memory on a cold start. Relaxed the A-record regex to match any IPv4 rather than 10.x — records may legitimately live outside that range. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:57:29 +00:00
Viktor Barzin	9a21c0f065	[dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e). Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8. Technitium (WS A) - Primary deployment: add Kyverno lifecycle ignore_changes on dns_config (secondary/tertiary already had it) — eliminates per-apply ndots drift. - All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary was restarting near the ceiling; CPU limits stay off per cluster policy). - zone-sync CronJob: parse API responses, push status/failures/last-run and per-instance zone_count gauges to Pushgateway, fail the job on any create error (was silently passing). CoreDNS (WS B) - Corefile: add policy sequential + health_check 5s + max_fails 2 on root forward, health_check on viktorbarzin.lan forward, serve_stale 3600s/86400s on both cache blocks — pfSense flap no longer takes the cluster down; upstream outage keeps cached names resolving for 24h. - Scale deploy/coredns to 3 replicas with required pod anti-affinity on hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch resources); readiness gate asserts state post-apply. - PDB coredns with minAvailable=2. Observability (WS G) - Fix DNSQuerySpike — rewrite to compare against avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous dns_anomaly_avg_queries was computed from a per-pod /tmp file so always equalled the current value (alert could never fire). - New: DNSQueryRateDropped, TechnitiumZoneSyncFailed, TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch, CoreDNSForwardFailureRate. Post-apply readiness gate (WS H) - null_resource.technitium_readiness_gate runs at end of apply: kubectl rollout status on all 3 deployments (180s), per-pod /api/stats/get probe, zone-count parity across the 3 instances. Fails the apply on any check fail. Override: -var skip_readiness=true. 
Docs (WS I) - docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table, zone-sync metrics reference, why DNSQuerySpike was broken. - docs/runbooks/technitium-apply.md (new): what the gate checks, failure modes, emergency override. Out of scope for this commit (see beads follow-ups): - WS C: NodeLocal DNSCache (code-2k6) - WS D: pfSense Unbound replaces dnsmasq (code-k0d) - WS E: Kea multi-IP DHCP + TSIG (code-o6j) - WS F: static-client DNS fixes (code-dw8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:53:41 +00:00
Viktor Barzin	a5e097088a	[ci] Persist VAULT_TOKEN across Woodpecker step commands ## Context Follow-up to commit `2eca011c` (bd code-e1x). That commit attached the `terraform-state` policy to the `ci` Vault role and propagated apply- loop failures so the pipeline actually fails when a stack fails. On the very first push to exercise it (pipeline 361), the platform apply step died with: [vault] Starting apply... state-sync: ERROR — no Vault token and no age key at ~/.config/sops/age/keys.txt [vault] FAILED (exit 1) Root cause: in Woodpecker's `commands:` list, each `- \|` item runs in a fresh shell. The dedicated "Vault auth" command was doing `export VAULT_TOKEN=...`, but that export was lost by the time the apply command ran. Tier-0 stacks depended on Vault Transit (via `scripts/state-sync`), and Tier-1 stacks depend on `vault read database/static-creds/pg-terraform-state` via `scripts/tg` — both silently fell through to their "no Vault" error path. This bug was latent before `2eca011c` because the old apply loop swallowed per-stack exit codes. Now that we surface them, the pipeline fails honestly — but fails on every run. Fixing the missing token propagation is the last mile. ## This change - Pin `VAULT_ADDR` at the step's `environment:` level so every command inherits it without an explicit export. - In the Vault auth command, assert the auth succeeded (non-empty, non-"null" token) then write the token to `~/.vault-token` with `umask 077`. `vault`, `scripts/tg`, and `scripts/state-sync` all fall through to `~/.vault-token` when `VAULT_TOKEN` env is unset. ## What is NOT in this change - A broader refactor to fold the multi-step chain into a single `- \|` script — preserving the existing granular structure keeps individual step logs grep-friendly and failures localised. - Restoring the VAULT_TOKEN export too — redundant once ~/.vault-token is written, and would need duplicating into each command anyway. ## Test Plan ### Automated N/A (pure YAML change). 
Will be verified by the very next CI run — the push creating this commit. ### Manual Verification Watch `ci.viktorbarzin.me/repos/1/pipelines` for the pipeline whose commit matches this one. Expected: - `default` workflow exercises the auth + apply steps. - Platform apply for `vault` stack runs state-sync decrypt → detects no drift (I applied locally already) → OK. - Tier-1 stacks (if any in the diff): `vault read database/static- creds/pg-terraform-state` returns creds → apply runs. - No "state-sync: ERROR" or "Cannot read PG credentials" errors. - `default` workflow state: success. - Overall pipeline status: still failure because `build-cli` is independently broken (bd code-12b); that's cosmetic. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:30:39 +00:00
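The token hand-off described in the commit above (assert the login returned a usable token, then write it to `~/.vault-token` with restrictive permissions) can be sketched as follows. The pipeline does this in shell inside a Woodpecker `commands:` item; the function name and the explicit `path` parameter are hypothetical and exist only to make the sketch testable.

```python
import os


def persist_vault_token(token: str, path: str) -> None:
    """Write a Vault token to `path` so later, separate shells can pick it up.

    Raises ValueError when the token is empty or the literal string "null",
    which is what `jq -r` prints when the login response has no token.
    """
    cleaned = (token or "").strip()
    if cleaned in ("", "null"):
        raise ValueError("Vault auth did not return a usable token")

    previous = os.umask(0o077)  # same effect as `umask 077` in the shell step
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(cleaned)
        # Tighten the mode even if the file already existed with wider bits.
        os.chmod(path, 0o600)
    finally:
        os.umask(previous)
```

Writing to a file works where `export VAULT_TOKEN=...` does not because the file outlives the shell that created it, while an exported variable is gone once that command's shell exits.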
Viktor Barzin	2eca011cc3	[ci,vault] Fix Tier-1 apply silently failing in Woodpecker ## Context For weeks, every push to infra has resulted in `build-cli` workflow failure AND `default` workflow succeed — but the `default` workflow's "success" was a lie. Inside the apply-loop we were swallowing per-stack failures with `set +e ... echo FAILED` and the step exited 0 regardless. Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4): agent commit landed, CI reported `default=success`, but cluster was unchanged. Log inside the step showed: [servarr] Starting apply... ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc [servarr] FAILED (exit 1) Two root causes, two fixes here. ### 1. Vault `ci` role lacks Tier-1 PG backend creds The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses the `pg-terraform-state` static DB role. `scripts/tg` reads it via `vault read database/static-creds/pg-terraform-state`. That path is permitted by the separate `terraform-state` Vault policy, which is bound only to a role in namespace `claude-agent`. The CI runner is in namespace `woodpecker` using role `ci`, whose policy grants only KV + K8s-creds + transit. Net: every Tier-1 stack apply from CI has been dying at the PG-creds fetch since the migration. Fix: attach `vault_policy.terraform_state` to `vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new policy needed — reuses the minimal one from 2026-04-16. ### 2. Apply-loop swallows stack failures `.woodpecker/default.yml`'s platform + app apply loops use `set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ] && echo FAILED` and then continue the while-loop. The step never re-raises, so it exits 0 regardless of how many stacks failed. 
Fix: accumulate failed stack names (excluding lock-skipped ones) into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the platform list to `.platform_failed` so it survives the step boundary, and at the end of the app-stack step exit 1 if either list is non-empty. Lock-skipped stacks remain non-fatal. Together, (1) unblocks real apply and (2) ensures the Woodpecker pipeline + the service-upgrade agent can both trust `default` workflow state again. ## What is NOT in this change - Re-running the qbittorrent upgrade to converge the cluster — the TF file is already at 5.1.4 in git; once CI picks up this commit it'll apply on its own, or Viktor can run `tg apply` locally now that the ci role has access too. - Retiring the `set +e ... continue` pattern entirely — keeping the per-stack continuation so a single bad stack doesn't hide the others' plans from the log. Just making the final status honest. ## Test Plan ### Automated `terraform plan` / apply clean (Tier-0 via scripts/tg): ``` Plan: 0 to add, 2 to change, 0 to destroy. # vault_kubernetes_auth_backend_role.ci will be updated in-place ~ token_policies = [ + "terraform-state", # (1 unchanged element hidden) ] # vault_jwt_auth_backend.oidc will be updated in-place ~ tune = [...] # cosmetic provider-schema drift, pre-existing Apply complete! Resources: 0 added, 2 changed, 0 destroyed. ``` State re-encrypted via `scripts/state-sync encrypt vault`; enc file committed. 
### Manual Verification ``` # Before (on previous commit — expect failure): $ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c ' SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \ -d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" \| jq -r .auth.client_token); curl -s -H "X-Vault-Token: $TOK" \ http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state' → {"errors":["1 error occurred:\n\t* permission denied\n\n"]} # After (this commit): → {"data":{"username":"terraform_state","password":"..."},...} ``` Pipeline-level: the next infra push will exercise `.woodpecker/default.yml`; expected first push is this very commit. Watch `ci.viktorbarzin.me` — the `default` workflow should either succeed for real (and land actual changes) or exit 1 with "=== FAILED STACKS ===" so the cause is visible. Refs: bd code-e1x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:25:52 +00:00
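The second fix in the commit above (keep applying the remaining stacks, but make the final status honest) can be sketched as follows. The pipeline implements this in shell in `.woodpecker/default.yml`; the function names and the `(exit_code, lock_skipped)` return shape of `apply_one` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# apply_one(stack) -> (exit_code, lock_skipped); hypothetical shape for this sketch.
ApplyFn = Callable[[str], Tuple[int, bool]]


def run_stacks(stacks: Iterable[str], apply_one: ApplyFn) -> List[str]:
    """Apply every stack, continuing past failures, and return the failed names."""
    failed: List[str] = []
    for stack in stacks:
        exit_code, lock_skipped = apply_one(stack)
        if exit_code != 0 and not lock_skipped:
            print(f"[{stack}] FAILED (exit {exit_code})")
            failed.append(stack)  # remember it instead of swallowing it
    return failed


def final_exit_code(failed_platform: List[str], failed_app: List[str]) -> int:
    """Exit 1 if either list is non-empty, mirroring the end-of-step check."""
    if failed_platform or failed_app:
        print("=== FAILED STACKS ===")
        for name in [*failed_platform, *failed_app]:
            print(name)
        return 1
    return 0
```

Collecting names and deciding the exit code once at the end preserves the per-stack continuation, so one bad stack does not hide the output of the others.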
Viktor Barzin	2431c6d5fe	[reverse-proxy] ha-sofia per-service retry + ServersTransport Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms) and ha-sofia-transport ServersTransport (dialTimeout=500ms) wired into ha-sofia + music-assistant ingresses. Absorbs the 67-156ms connect/DNS stalls that were surfacing as 18 x 502s/day without disturbing the global 2-attempt retry or Immich's 60s dialTimeout. depends_on the new manifests to avoid the dangling-reference pattern from the 2026-04-17 Traefik P0. Closes: code-rd1	2026-04-19 14:07:07 +00:00
Viktor Barzin	947f1bd75d	[monitoring] UK Payslip v3.2 — stacked YTD panels, YTD-cumulative rate, Sankey Three changes: 1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting- clean stacked-area panels side-by-side: - "YTD sources": salary + bonus + rsu_vest + residual (= gross) - "YTD uses": net + income_tax + NI + pension_employee + student_loan + rsu_offset (= gross, per validate_totals identity) Green for take-home, red/orange for taxes, purple for pension, teal for RSU offset — visually encodes "what you earned vs what was taken". 2. Panel 3 effective rate switched from per-slip attribution to YTD cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike: the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top marginal`, not rsu_vest × average. Cumulative approach blends both proportionally, no attribution hack needed. Also adds a third series: all-deductions rate (income_tax + NI + student_loan / gross). 3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross → uses over the selected time range. Plugin added to grafana Helm values.	2026-04-19 13:42:27 +00:00
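The YTD-cumulative effective rate from change 2 in the commit above (`SUM OVER w / SUM OVER w`) can be sketched in a few lines. The dashboard computes it in SQL with a window function. Here a single tax year is assumed, so no partitioning is shown, and the function name and sample figures are invented for illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def ytd_effective_rates(income_tax: Sequence[float], gross: Sequence[float]) -> List[float]:
    """Return cumulative tax / cumulative gross for each payslip, in order."""
    if len(income_tax) != len(gross):
        raise ValueError("income_tax and gross must have the same length")
    cumulative_tax = accumulate(income_tax)
    cumulative_gross = accumulate(gross)
    return [t / g if g else 0.0 for t, g in zip(cumulative_tax, cumulative_gross)]


# Invented figures: two ordinary months followed by a vest month.
rates = ytd_effective_rates([1000, 1000, 9000], [5000, 5000, 20000])
```

With these figures the third entry is 11000 / 30000, which sits between the ordinary months' 20% and the vest month's per-slip 45%. This is the blending effect the commit describes.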
Service Upgrade Agent	55ade1f9b3	[servarr] Fix qbittorrent container_port 8787 -> 8080 (matches WEBUI_PORT) Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-19 13:37:44 +00:00
Viktor Barzin	3b4a059243	[uptime-kuma] Fix broken Redis monitor + move to TF-managed list The Redis monitor (id=53) was created manually with a connection string pointing at redis-master.redis-headless.redis.svc.cluster.local, which doesn't resolve — headless only exposes pod DNS (redis-node-N.redis-headless), not a synthetic "redis-master" name. Status had been DOWN with ENOTFOUND for weeks. Declare it in local.internal_monitors using redis-master.redis.svc.cluster.local (the HAProxy-fronted ClusterIP that already routes to the Sentinel-elected master). Verified RESP PING through HAProxy returns PONG. Tighten intervals to 60s / 30s retry / 3 retries — Redis is core (Paperless, Immich, Authentik, Dawarich all depend on it), a 5-minute detection window was way too loose given the blast radius. Also teach the sync CronJob to handle no-password monitors (auth disabled on the Bitnami chart), via an optional database_password_vault_key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 13:28:36 +00:00

2975 commits