Commit graph

119 commits

Author SHA1 Message Date
Viktor Barzin
bf4c7618d8 wealth: SQLite→PG ETL sidecar + new Grafana dashboard
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.

Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
  cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
  namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
  busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
  psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
  the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
  net worth / contribution / growth / ROI% stats, then time-series
  for net worth, contribution-vs-market, growth area, per-account
  stacked area, cash-vs-invested, and a 100-row activity log.

Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match the latest SQLite snapshot exactly.
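
For reference, a minimal sketch of one sync cycle (paths, table and
connection string are illustrative; the real sidecar is a shell script
driven by busybox crond):

    import subprocess

    SQLITE_DB = "/data/wealthfolio.db"                 # assumed source path
    SNAPSHOT = "/tmp/wealthfolio.snapshot.db"
    TSV = "/tmp/daily_account_valuation.tsv"
    TABLE = "daily_account_valuation"
    PG_URI = "postgresql://wealthfolio_sync@cnpg-rw/wealthfolio_sync"  # assumed

    # 1. Point-in-time copy so a live WAL-mode DB is never read mid-write.
    subprocess.run(["sqlite3", SQLITE_DB, f".backup {SNAPSHOT}"], check=True)

    # 2. Dump the table as tab-separated values.
    with open(TSV, "w") as out:
        subprocess.run(["sqlite3", "-tabs", SNAPSHOT, f"SELECT * FROM {TABLE};"],
                       stdout=out, check=True)

    # 3. Truncate-and-reload in one transaction so Grafana never sees a half-load.
    script = (f"\\set ON_ERROR_STOP on\nBEGIN;\nTRUNCATE {TABLE};\n"
              f"\\copy {TABLE} FROM '{TSV}'\nCOMMIT;\n")
    subprocess.run(["psql", PG_URI], input=script, text=True, check=True)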
2026-04-25 17:07:33 +00:00
Viktor Barzin
4f5f1ff8c2 monitoring(uk-payslip): add yearly receipt stacked barchart panel
New panel 16 (barchart, h=11, y=179): one stacked bar per tax year showing
total comp split into net pay (bank deposit), cash income tax, RSU tax
(band-aware marginal: PAYE+NI), cash NI, student loan, pension salary-
sacrifice, and RSU offset (Variant A only).

X-axis = tax_year (categorical), y-axis = currencyGBP. Bar height ≈
gross_pay + pension_sacrifice (small over-attribution in Variant A years
where the band-aware model exceeds recorded payslip PAYE).
2026-04-25 16:26:57 +00:00
Viktor Barzin
b3c29eda12 monitoring(uk-payslip): model UK income-tax bands + PA-taper for RSU marginal
Replaces the flat 47% (45 PAYE + 2 NI) RSU marginal across panels 3, 7, 8, 11,
and 12 with an exact piecewise band-aware computation. Each row computes
ani_prior/ani_pre/ani_post over the tax-year YTD (chronological model — the
RSU is taxed at the band its YTD ANI position occupies at the vest date,
mirroring PAYE withholding behaviour).

Bands (2024/25+, applied to all years):
  IT:  0% / 20% / 40% / 60% (PA-taper) / 45%   at 12,570 / 50,270 / 100k / 125,140
  NI:  0% / 8% / 2%                            at 12,570 / 50,270

PA-taper modelled as 60% effective IT marginal in £100k–£125,140
(40% on the £1 + 40% on the £0.50 of lost PA = 60%).
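
A minimal sketch of the band-aware marginal under the thresholds above
(the panels compute this in SQL; the example figures are illustrative):

    IT_BANDS = [(0, 0.00), (12_570, 0.20), (50_270, 0.40), (100_000, 0.60), (125_140, 0.45)]
    NI_BANDS = [(0, 0.00), (12_570, 0.08), (50_270, 0.02)]

    def banded_tax(lo, hi, bands):
        """Tax on the slice of income between lo and hi under piecewise bands."""
        total = 0.0
        for i, (start, rate) in enumerate(bands):
            end = bands[i + 1][0] if i + 1 < len(bands) else float("inf")
            total += max(0.0, min(hi, end) - max(lo, start)) * rate
        return total

    def rsu_marginal_tax(ani_pre, vest):
        # The vest is taxed at whatever bands its YTD ANI position spans at vest date.
        return (banded_tax(ani_pre, ani_pre + vest, IT_BANDS)
                + banded_tax(ani_pre, ani_pre + vest, NI_BANDS))

    # A £25k vest landing entirely in the PA-taper band: 60% IT + 2% NI.
    print(rsu_marginal_tax(100_000, 25_000))   # 15500.0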

Spot-checked per tax-year totals via psql; numbers diverge from the flat
47% baseline most for years where vests cross PA-taper or basic-rate bands
(2020/21 ~35%, 2024/25 ~41%, 2025/26 ~43%).
2026-04-25 16:14:49 +00:00
Viktor Barzin
0d5f53f337 monitoring(uk-payslip): replace misleading take-home rates in Panel 3
Drop the two misleading series in "Effective rate & take-home % (YTD
cumulative)" — both used SUM(gross_pay) as denominator while only
counting cash deductions/net in the numerator, which understated
take-home by 25-30 pp because RSU shares are absent from the cash
deposit but present in gross. Replaced with three semantically clean
angles:

- ytd_paye_rate_pct: SUM(income_tax) / SUM(taxable_pay) — HMRC audit
  rate (~41-42% in additional-rate band), kept as before.
- ytd_cash_take_home_pct: SUM(net_pay) / SUM(gross_pay - rsu_vest) —
  what fraction of cash earnings hits the bank (~62-65%).
- ytd_total_keep_pct: (SUM(net_pay) + 0.53 × SUM(rsu_vest)) /
  SUM(gross_pay) — true "what I actually keep" including post-tax RSU
  shares (47% marginal applied to vest value), ~55-60%.
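
Concretely, the three series reduce to the following sketch over
per-slip rows (column names mirror the payslip table; the panel
computes them in SQL):

    def panel3_rates(slips):
        """Illustrative per-tax-year YTD rates; the panel computes these in SQL."""
        gross = sum(s["gross_pay"] for s in slips)
        taxable = sum(s["taxable_pay"] for s in slips)
        net = sum(s["net_pay"] for s in slips)
        rsu = sum(s["rsu_vest"] for s in slips)
        tax = sum(s["income_tax"] for s in slips)
        return {
            "ytd_paye_rate_pct": 100 * tax / taxable,                # HMRC audit rate
            "ytd_cash_take_home_pct": 100 * net / (gross - rsu),     # cash hitting the bank
            "ytd_total_keep_pct": 100 * (net + 0.53 * rsu) / gross,  # incl. post-tax RSU shares
        }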

Added field overrides for clear color-coding (red/green/blue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:45:47 +00:00
Viktor Barzin
8f0d13282c monitoring(uk-payslip): drop cash PAYE/NI from "Tax & pension — monthly"
Same reasoning as panel 2: cash-side income_tax and NI are inherently
bumpy in vest months due to UK cumulative PAYE catching up on YTD,
and the flat-47% strip can't fix it. Panel now shows only the
explicit RSU vest tax (orange, 47% × rsu_vest), student loan, and
pensions. The smooth view of total cash deductions stays available on
panel 12 (YTD cumulative).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:43:32 +00:00
Viktor Barzin
2230cb6cf4 monitoring(uk-payslip): drop tax/NI from "Monthly cash flow (RSU stripped)" panel
Vest months still bumped 4-5x in this panel after the flat-47% strip
because UK cumulative PAYE genuinely catches up YTD tax in vest
months, on top of the marginal RSU portion — no arithmetic split can
make that line flat without distorting the data. The cash-flow
question this panel answers (what hits the bank, RSU aside) is
already covered cleanly by cash_gross + net_pay; the tax detail lives
on Panel 11 where the RSU split is now linear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:30:46 +00:00
Viktor Barzin
cb3ffa6d8d monitoring(uk-payslip): smooth quarterly RSU tax bumps via flat 47% marginal
Replace the implicit pro-rata RSU/cash split with an explicit flat
47% marginal (45% PAYE + 2% NI) for the RSU vest tax stack. The orange
slice now scales linearly with rsu_vest instead of wobbling around the
month's effective PAYE rate; cash PAYE/NI slices have those amounts
subtracted out so the stack still totals to actual deductions.

Affects panel 11 (monthly), panel 12 (YTD cumulative), panel 7
(YTD uses), and the Sankey panel. Verified on 35 months of live data:
sum invariant holds exactly (cash + rsu_marginal + cash_ni ==
income_tax + national_insurance), no negatives in cash slices.
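
The split and its invariant, as a sketch (the panels implement this in
SQL):

    def split_rsu_tax(income_tax, national_insurance, rsu_vest):
        """Flat-47% attribution: 45% PAYE + 2% NI on the vest; cash slices absorb the rest."""
        rsu_paye, rsu_ni = 0.45 * rsu_vest, 0.02 * rsu_vest
        cash_paye = income_tax - rsu_paye
        cash_ni = national_insurance - rsu_ni
        # Sum invariant: the stack still totals the actual deductions.
        assert abs((cash_paye + cash_ni + rsu_paye + rsu_ni)
                   - (income_tax + national_insurance)) < 1e-9
        return {"cash_paye": cash_paye, "cash_ni": cash_ni,
                "rsu_marginal": rsu_paye + rsu_ni}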

Out of scope (left raw): effective-rate %, data-integrity, payslip
table, P60/HMRC reconciliation — those are audit views that use
unmodified income_tax / cash_income_tax columns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:13:29 +00:00
Viktor Barzin
d231615ebb [monitoring] Fix fuse voltage alerts — divide raw deciVolt reading by 10
The tuya-bridge exporter reports `fuse_main_voltage` and
`fuse_garage_voltage` as raw uint16 from the Tuya protocol, which
encodes voltage in deciVolts (e.g. 2352 = 235.2V). The 200/260V
thresholds were comparing against the raw integer, so both
FuseMainVoltageAbnormal and FuseGarageVoltageAbnormal fired
continuously during normal mains conditions.

Dividing in the expression also makes `{{ $value }}V` render the
correct human-readable value in the alert summary.

Root fix would be in tuya-bridge `_decode_value()` where
`name.startswith("voltage")` returns `int.from_bytes(...)` without the
/10 scaling that `decode_voltage_threshold` applies. Leaving that
alone to avoid breaking the automatic_transfer_switch scrape which
uses a different code path (`parse_voltage_string`).
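
For context, the scaling in question (illustrative only — the payload
bytes and byte order here are examples, not the exporter's exact code):

    raw = bytes([0x09, 0x30])               # example Tuya DP payload, 0x0930 = 2352
    decivolts = int.from_bytes(raw, "big")  # what the exporter currently exports
    volts = decivolts / 10                  # 235.2 — the value the 200/260V thresholds expect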
2026-04-24 11:12:56 +00:00
Viktor Barzin
a5e4db9af8 [monitoring] Tuya Cloud root-cause alert + cascade suppression
New alert TuyaCloudDown fires when any *_tuya_cloud_up gauge == 0
(i.e., the Tuya Cloud API rejects scrape calls — the symptom during
last night's iot.tuya.com trial expiry, code=28841002). 5m for-duration
beats the 15m window of the seven downstream *MetricsMissing alerts, so
the new Alertmanager inhibit rule suppresses the per-device noise and
only TuyaCloudDown pages.

Also flips helm_release.prometheus.force_update from true to false:
force_update was tripping on the pushgateway PVC added in rev 188
(commit e51c104) — Helm's --force path tried to reset spec.volumeName
on a bound PVC. Disabled here; re-enable temporarily when a
StatefulSet volumeClaimTemplate change actually needs --force.

Bundled with pre-existing working-tree additions for Fuse/Thermostat
threshold alerts and expanded PowerOutage inhibit regex (landed in the
same Helm revision 190).

Verified: rule loaded, value=7 (all 7 tuya-bridge devices report
cloud_up=0 right now), TuyaCloudDown moved pending→firing after 5m,
3 *MetricsMissing alerts currently suppressed in Alertmanager with
inhibitedBy=1 (thermostat alerts still pending their 15m window, will
be suppressed on transition).
2026-04-23 09:59:48 +00:00
Viktor Barzin
344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding
   backup_last_success_timestamp{job="offsite-backup-sync"} until the
   next 06:01 UTC push — a ~18h false-negative window. Enable
   persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
   but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
   the main Deployment runs as root, so mkdir /data/cache fails
   every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
   is set on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00
Viktor Barzin
fdced7577b [monitoring] HomeAssistantCriticalSensorUnavailable alert 2026-04-22 14:52:23 +00:00
Viktor Barzin
a4eafafe49 [monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned
After k8s-node1 was silently cordoned and broke Frigate camera streams,
existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the
root cause proactively. This alert fires within 5m of the GPU node being
cordoned, before any pod restart attempts to schedule and fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-22 14:05:12 +00:00
Viktor Barzin
134d6b9a82 vault runbook + raft/HA stuck-leader alerts
Post-2026-04-22 Step 5 deliverables:
- docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart
  sequence that avoids zombie containerd-shim + kernel NFS
  corruption, qm reset no-op gotcha, boot-order gotcha.
- prometheus_chart_values.tpl — VaultRaftLeaderStuck +
  VaultHAStatusUnavailable. Silent until vault telemetry
  scraping lands (tracked as beads code-vkpn).

Epic for moving vault off NFS tracked as beads code-gy7h.
2026-04-22 12:44:46 +00:00
Viktor Barzin
d39770b30d monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection
Threshold was 48h (plus a 30m `for:`) for a job that runs daily. We
don't need to wait 2.5 days to detect a broken timer — bring it down
to 30h + 30m (just over a day of cadence plus minor drift/retry
grace). Also add a description pointing to the restore runbook so the
alert text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.

Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00
Viktor Barzin
9b4970da61 monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits
- HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775)
  labels so the two same-named alerts are distinguishable in routing.
- HeadscaleDown (deployment-replicas flavor, line 1414) → rename to
  HeadscaleReplicasMismatch. Line 2039 keeps HeadscaleDown as the real
  up-metric critical check. NodeDown inhibit rule updated to suppress
  the renamed alert too.
- EmailRoundtripStale (line 1816): for 10m → 20m. Survives one missed
  20-min probe cycle before firing, cuts flapping (12 short-burst fires
  over last 24h).

ATSOverload tuning skipped: 24h fire-count is 0, it's continuously
firing not flapping — already-known sustained 83% ATS load, tuning
would not change behavior.

8 backup *NeverSucceeded rules audited: the 7 that use
kube_cronjob_status_last_successful_time target real K8s CronJobs with
active metrics (not Pushgateway-sourced); the 8th,
PrometheusBackupNeverRun, already uses absent() correctly. No fixes
needed.
2026-04-21 22:29:15 +00:00
Viktor Barzin
9041f52b05 monitoring: TechnitiumZoneCountMismatch — compare replicas only, exclude primary
Primary has only the Primary-type zones it owns (10). Replicas have those
+ built-in zones (localhost, in-addr.arpa reverse, etc.), so their count
(14) can never match primary. Alert expr compared max-min across all
instances, making it chronically firing.

Fix: instance!="primary" filter. The real signal this alert wants is
"did one replica drift from the others" — replica-to-replica comparison
captures that; primary was never comparable.
2026-04-19 22:15:55 +00:00
Viktor Barzin
e092f159b3 monitoring: drop MAM Mouse-class + qBittorrent-unsatisfied alerts
Both alerts fired as expected noise while the MAM account is in new-member
Mouse class — tracker refuses announces and the 72h seed-gate can't be met
until ratio recovers. Keeping the rest of the MAM rules (cookie expiry,
ratio, farming/janitor stalls, qbt disconnect) which still signal real
pipeline failures.

Firing count drops from 7 → 3 in healthcheck.
2026-04-19 21:24:46 +00:00
Viktor Barzin
68a10905e0 [monitoring] uk-payslip Panel 13: stacked bars + sum-in-legend
"Monthly cash flow — tax impact (RSU excluded)" was already stacking
group A in normal mode but rendered as 70%-opacity filled lines — the
overlap made the total-per-month figure visually inaccessible.

Switch drawStyle to bars (100% fill, 0-width lineWidth, no per-point
markers) so each month reads as a single stacked bar whose top edge is
the total cash-side deduction. Add "sum" to legend.calcs so the
tax-year totals per series show in the legend table alongside last and
max.

Panel 11 (Tax & pension — monthly, RSU-inclusive) retains the line/
area style so the two panels remain visually distinct.
2026-04-19 20:31:53 +00:00
Viktor Barzin
3f6dfb10aa [monitoring] job-hunter: panels 6-9 for comp_points tables + trends
Append the structured-comp dashboard surface to the job-hunter
dashboard:

Panel 6 — Per-company salary by level (p50 base, GBP table).
Panel 7 — Total-comp heatmap per (company, level), p50 GBP.
Panel 8 — Comp-point volume by source (daily time-series).
Panel 9 — Base-salary trend (p50) over time for the top 5 companies.

Adds templating: $location (multi, default london), $level (single,
default senior), $company (multi, default all) — populated from
comp_points + levels metadata so the selection reflects what was
actually ingested.

Closes: code-5ph
2026-04-19 18:50:48 +00:00
Viktor Barzin
a8280e77b6 [broker-sync] unsuspend IMAP + Panel 15 RSU vest reconciliation (Phase D)
Activates the Schwab/InvestEngine IMAP ingest CronJob that's been
scaffolded-but-suspended since Phase 2 of broker-sync, now that the
Schwab parser can detect vest-confirmation emails. Runs nightly 02:30 UK.

Current behaviour once deployed:
  - Trade confirmations (Schwab sell-to-cover, InvestEngine orders) →
    Activity rows posted to Wealthfolio. Unchanged.
  - Release Confirmations (Schwab RSU vests) → parser returns gross-vest
    BUY + sell-to-cover SELL Activities (to Wealthfolio) and a VestEvent
    object (NOT YET persisted — Postgres sink + DB grant pending; see
    follow-up under code-860). Vest detection uses a subject/body
    heuristic that will need tightening against a real email fixture.

Panel 15 of the UK payslip dashboard added: per-vest-month join of
payslip.rsu_vest vs rsu_vest_events (gross_value_gbp, tax_withheld_gbp)
with delta columns. Tax-delta-percent coloured green/orange/red at
0/2%/5% thresholds. Table is empty until broker-sync starts persisting
VestEvents — harmless until then.

Before applying:
  - Verify IMAP creds in Vault (secret/broker-sync: imap_host,
    imap_user, imap_password, imap_directory) are still valid.
  - Empty vest-event table is expected; delta columns show NULL until
    the postgres sink lands.

Part of: code-860
2026-04-19 18:29:01 +00:00
Viktor Barzin
1c0e1bcdde [payslip-ingest] ActualBudget payroll sync CronJob + Panel 14 (Phase C)
Wires the daily ActualBudget deposit sync from the payslip-ingest app into
K8s as a CronJob, and adds dashboard Panel 14 to overlay bank deposits
against payslip net_pay.

CronJob: actualbudget-payroll-sync in payslip-ingest namespace, runs
02:00 UTC. Calls `python -m payslip_ingest sync-meta-deposits`, which
hits budget-http-api-viktor in the actualbudget namespace and upserts
matching Meta payroll deposits into payslip_ingest.external_meta_deposits.

ExternalSecret extended with three new Vault keys:
  - ACTUALBUDGET_API_KEY (same as actualbudget-http-api-viktor's env API_KEY)
  - ACTUALBUDGET_ENCRYPTION_PASSWORD (Viktor's budget password)
  - ACTUALBUDGET_BUDGET_SYNC_ID (Viktor's sync_id)

These must be seeded at secret/payslip-ingest in Vault before the
CronJob will run — it'll CrashLoop on missing env vars otherwise. First
run can be triggered on demand via `kubectl -n payslip-ingest create
job --from=cronjob/actualbudget-payroll-sync initial-sync`.

Panel 14 plots monthly SUM(external_meta_deposits.amount) vs
SUM(payslip.net_pay), plus a delta bar series — |delta| > £50 flags
likely parser drift on net_pay.

Part of: code-860
2026-04-19 18:21:20 +00:00
Viktor Barzin
fca3dd4976 [monitoring] uk-payslip: Panel 2 uses COALESCE cash_income_tax; Panel 4 flags NULL
Phase A of RSU tax spike fix. Two changes:

1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite
   the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart
   is honest once the Phase B back-fill populates cash_income_tax on
   variant-A slips. For slips where cash_income_tax is already populated
   (variant B, 2024+) the spike is removed immediately.

2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL
   on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange)
   highlights the remaining back-fill population — expected to drop to 0
   after Phase B lands.

Part of: code-860
2026-04-19 18:04:05 +00:00
Viktor Barzin
f4d3fdb2e3 [monitoring] uk-payslip: drop RSU-vest annotations
Vertical orange markers at every vest month added more visual noise
than signal. Panel 13 (cash-only) already conveys the "no spike on
vest months" story without needing markers across panels 1/2/3/7/11/12.
2026-04-19 17:32:49 +00:00
Viktor Barzin
a641dc744f [monitoring] uk-payslip: RSU vest annotations + cash-only tax panel
Panel 11 stacks RSU-attributed income tax on top of cash PAYE, which
is mathematically correct but emotionally misleading since RSU tax is
withheld at source via sell-to-cover and never hits the bank. Adopts
the two-view convention: Panel 11 keeps the full PAYE picture; new
Panel 13 shows cash-only deductions. Dashboard-level "RSU vests"
annotation paints orange markers on every vest month across all
timeseries panels, with tooltips like "RSU vest: £31232 gross /
£15257 tax withheld".

Shifts Panels 4/5/6/8/9/10 down by 9 rows to make room for Panel 13
at y=29.
2026-04-19 17:24:35 +00:00
Viktor Barzin
e7ce545da2 [job-hunter] Add infra stack + Grafana dashboard + n8n digest workflow
New service stack at stacks/job-hunter/ mirroring the payslip-ingest
pattern: per-service CNPG database + role (via dbaas null_resource),
Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app
secrets and DB creds, Deployment with alembic-migrate init container,
ClusterIP Service, Grafana datasource ConfigMap.

Grafana dashboard job-hunter.json in Finance folder: new roles per
day, source breakdown, top companies, GBP salary distribution, recent
roles table (sorted by parse confidence then salary).

n8n weekly-digest workflow calls POST /digest/generate with bearer
auth every Monday 07:00 London; digest_runs table provides
idempotency.

Refs: code-snp

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:09:29 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
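
The verify-integrity walk from Phase 1 amounts to roughly the
following (illustrative Python that fetches the manifests and HEADs
the blobs; the real step is a shell loop in build-ci-image.yml, and
the registry URL here is hypothetical):

    import requests, sys

    REG, REPO, TAG = "https://registry.example.lan", "infra-ci", "latest"  # hypothetical URL
    ACCEPT = ("application/vnd.oci.image.index.v1+json, "
              "application/vnd.oci.image.manifest.v1+json, "
              "application/vnd.docker.distribution.manifest.list.v2+json, "
              "application/vnd.docker.distribution.manifest.v2+json")

    def manifest(ref):
        r = requests.get(f"{REG}/v2/{REPO}/manifests/{ref}", headers={"Accept": ACCEPT})
        r.raise_for_status()
        return r.json()

    def head_blob(digest):
        r = requests.head(f"{REG}/v2/{REPO}/blobs/{digest}")
        if r.status_code != 200:
            sys.exit(f"MISSING {digest}: HTTP {r.status_code}")

    index = manifest(TAG)
    for child in index.get("manifests", []):       # per-arch children of the OCI index
        m = manifest(child["digest"])
        head_blob(m["config"]["digest"])
        for layer in m["layers"]:
            head_blob(layer["digest"])
    print("index, children, config and every layer blob are fetchable")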

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
df2c53db8d [infra] TrueNAS decommission — remove active references from Terraform + configs
TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13.
The subagent-driven doc sweep in 5a0b24f5 covered the prose. This commit
removes the remaining in-code references:

- reverse-proxy: drop truenas Traefik ingress + Cloudflare record
  (truenas.viktorbarzin.me was 502-ing since the VM stopped), drop
  truenas_homepage_token variable.
- config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME
  truenas`, and the commented-out `iscsi`/`zabbix` A records.
- dashy/conf.yml: remove Truenas dashboard entry (&ref_28).
- monitoring/loki.yaml: change storageClass from the decommissioned
  `iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC
  (Loki is currently disabled).
- actualbudget/main.tf + freedify/main.tf: update new-deployment
  docstrings to cite Proxmox host NFS instead of TrueNAS.
- nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass
  noting the name is historical — 48 bound PVs reference it, SC names
  are immutable on PVs, rename not worth the churn.

Also cleaned out-of-band:
- Technitium DNS: deleted `truenas.viktorbarzin.lan` A and
  `iscsi.viktorbarzin.lan` CNAME records.
- Vault: `secret/viktor` → removed `truenas_api_key` and
  `truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed.
- Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas`
  destroyed the 3 K8s/Cloudflare resources cleanly.

Deferred:
- VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit
  user go-ahead.
- `nfs-truenas` StorageClass name retained (see nfs-csi comment above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:57:05 +00:00
Viktor Barzin
5f832e37d0 [monitoring] UK Payslip — add tax & pension breakdown panels
New Panel 11 (monthly) + Panel 12 (YTD cumulative), side-by-side at
y=19. Six series each: cash income tax, RSU-attributed income tax, NI,
student loan, employee pension, employer pension. Employer pension
included to show full retirement contribution picture (paid on top of
salary, not deducted from take-home). Downstream panels shifted down
by 10.
2026-04-19 16:53:32 +00:00
Viktor Barzin
ab402b3421 [monitoring] UK Payslip Panel 7 — trim to 5 semantic layers
Drop ytd_student_loan (~£200-300/mo noise) and ytd_rsu_offset (always
£0 on post-2024 Meta variant-B payslips) from the YTD uses stack. The
stack now mirrors the clarity of Panel 1's 4-way source breakdown, with
five layers: take-home, cash PAYE, RSU PAYE, NI, pension. Student loan
+ RSU offset still surface on
Panel 8 Sankey.

Title: "YTD uses — where gross went" (mirrors Panel 1 label pattern).
2026-04-19 16:37:12 +00:00
Viktor Barzin
e55c549c9a [redis] Phase 7 step 2: remove Bitnami helm_release + orphan PVCs
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.

 - Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
   destroy cleaned up the StatefulSet redis-node (already scaled to 0),
   ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
   ClusterIP services that the chart owned.
 - Removed `null_resource.patch_redis_service` — the kubectl-patch hack
   that worked around the Bitnami chart's broken service selector. No
   Helm chart, no patch needed.
 - Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
   deployment.
 - `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
   orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
 - Simplified the top-of-file comment and the redis-v2 architecture
   comment — they talked about the parallel-cluster migration state that
   no longer exists. Folded in the sentinel hostname gotcha, the redis
   8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
   so the rationale survives in the code rather than only in beads.
 - `RedisDown` alert no longer matches `redis-node|redis-v2` — just
   `redis-v2` since that's the only StatefulSet now. Kept the `or on()
   vector(0)` so the alert fires when kube_state_metrics has no sample
   (e.g. after accidental delete).
 - `docs/architecture/databases.md` trimmed: no more "pending TF removal"
   or "cold rollback for 24h" language.

Verification after apply:
 - kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
   (3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
 - PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
 - Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
 - HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
 - Prometheus: 0 firing redis alerts.

Closes: code-v2b
Closes: code-2mw

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:32:14 +00:00
Viktor Barzin
f09be1524d monitoring: split income_tax cash/RSU + add P60 & HMRC reconciliation panels
Panel 7 (YTD uses): replace the single `ytd_income_tax` stack segment
with two — `ytd_cash_income_tax` (full red, same color as before) and
`ytd_rsu_income_tax` (desaturated orange) — computed from the new
`cash_income_tax` column on payslip. RSU-vest months now visually
separate the cash tax from the PAYE attributable to the grossed-up
RSU, matching the user's mental model of "what I actually paid in cash tax".

Panel 8 (Sankey): split the single `Gross → Income Tax` edge into two
edges (`Gross → Income Tax (cash)` and `Gross → Income Tax (RSU)`)
sourcing the same two figures.

Panel 3 (effective rate): left untouched — it's the "all-in" rate and
keeps using raw `income_tax`.

Panel 9 (P60 reconciliation — new): per-tax-year table comparing HMRC
P60 annual figures against SUM(payslip) via LATERAL JOIN on
payslip_ingest.p60_reference. Threshold-coloured delta columns (|Δ|<1
green, 1-50 yellow, >50 red) surface missing months or parser drift.

Panel 10 (HMRC Tax Year Reconciliation — new): placeholder for the
hmrc-sync service (code scaffolded, awaiting HMRC prod approval to
activate). Queries `hmrc_sync.tax_year_snapshot`; renders empty until
that schema lands. Delta > £10 → red.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:36 +00:00
Viktor Barzin
150f196095 [redis] Phase 1+2: parallel redis-v2 StatefulSet + Prometheus alerts
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.

Architecture:
 - 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
 - podManagementPolicy=Parallel + init container that writes fresh
   sentinel.conf on every boot by probing peer sentinels and redis for
   consensus master (priority: sentinel vote > role:master with slaves >
   pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
 - redis.conf `include /shared/replica.conf` — init container writes
   `replicaof <master> 6379` for non-master pods so they come up already in
   the correct role. No bootstrap race.
 - master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
   COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
 - RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
 - PodDisruptionBudget minAvailable=2.
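
A sketch of the init container's master-election priority (the real
logic is a shell script; hostnames match the services above, timeouts
are illustrative):

    import redis

    PODS = [f"redis-v2-{i}.redis-v2-headless" for i in range(3)]

    def pick_master():
        # 1. Ask any reachable peer sentinel who the master is.
        for host in PODS:
            try:
                s = redis.Redis(host=host, port=26379, socket_timeout=1)
                addr = s.sentinel_get_master_addr_by_name("mymaster")
                if addr:
                    return addr[0]
            except redis.RedisError:
                continue
        # 2. Fall back to any redis reporting role:master with at least one replica.
        for host in PODS:
            try:
                info = redis.Redis(host=host, port=6379, socket_timeout=1).info("replication")
                if info.get("role") == "master" and info.get("connected_slaves", 0) > 0:
                    return host
            except redis.RedisError:
                continue
        # 3. Nobody knows — bootstrap with pod-0 as master.
        return PODS[0]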

Also:
 - HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
   Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
   the sole client-facing path for all 17 consumers.
 - New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
   RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
   RedisReplicasMissing. Updated RedisDown to cover both statefulsets
   during the migration.
 - databases.md updated to describe the interim parallel-cluster state.

Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.

Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:23:05 +00:00
Viktor Barzin
9a21c0f065 [dns] DNS reliability & hardening — Technitium + CoreDNS + alerts + readiness gate
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.

**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
  (secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
  was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
  per-instance zone_count gauges to Pushgateway, fail the job on any
  create error (was silently passing).

**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
  forward, health_check on viktorbarzin.lan forward, serve_stale
  3600s/86400s on both cache blocks — pfSense flap no longer takes the
  cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
  hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
  resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.

**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
  avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
  dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
  equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
  TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
  CoreDNSForwardFailureRate.

**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
  kubectl rollout status on all 3 deployments (180s), per-pod
  /api/stats/get probe, zone-count parity across the 3 instances.
  Fails the apply on any check fail. Override: -var skip_readiness=true.

**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
  zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
  modes, emergency override.

Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:53:41 +00:00
Viktor Barzin
947f1bd75d [monitoring] UK Payslip v3.2 — stacked YTD panels, YTD-cumulative rate, Sankey
Three changes:

1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting-
   clean stacked-area panels side-by-side:
   - "YTD sources": salary + bonus + rsu_vest + residual (= gross)
   - "YTD uses": net + income_tax + NI + pension_employee + student_loan
     + rsu_offset (= gross, per validate_totals identity)
   Green for take-home, red/orange for taxes, purple for pension, teal
   for RSU offset — visually encodes "what you earned vs what was taken".

2. Panel 3 effective rate switched from per-slip attribution to YTD
   cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike:
   the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but
   Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top
   marginal`, not rsu_vest × average. Cumulative approach blends both
   proportionally, no attribution hack needed. Also adds a third series:
   all-deductions rate ((income_tax + NI + student_loan) / gross).

3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross →
   uses over the selected time range. Plugin added to grafana Helm values.
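
The cumulative approach in (2), reduced to a sketch (the panel uses
SUM(...) OVER a tax-year window in SQL):

    from itertools import accumulate

    def ytd_effective_rate(taxes, grosses):
        """Cumulative-sum rate per pay period; vest-month blips blend away."""
        return [t / g for t, g in zip(accumulate(taxes), accumulate(grosses))]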
2026-04-19 13:42:27 +00:00
Viktor Barzin
13cc5d956e [monitoring] UK Payslip dashboard v3.1 — add YTD reconciliation panel
Adds panel 6 that reconciles each payslip's reported YTD summary block
(ytd_gross, ytd_taxable_pay, ytd_tax_paid) against the cumulative sum
of extracted per-payslip values within the same tax year. Any Δ > £0.02
flags a parser regression, missing slip, or duplicate ingest — the
algebraic companion to the existing missing-months panel.

Variant A payslips (pre-mid-2022) carry no YTD block and are filtered
out via WHERE ytd_gross IS NOT NULL.
2026-04-19 13:12:57 +00:00
Viktor Barzin
ac95973b38 [monitoring] UK Payslip dashboard v3 — consolidate to 5 panels + data-integrity check
Collapse from 11 panels to 5. New hero "Tax-year YTD — gross / net /
taxes / RSU / salary" merges the old YTD cumulative + total-comp +
earnings-breakdown panels into a single line chart (tax-band thresholds
still on ytd_cash_gross). New "Data integrity" table surfaces missing
months and zero-salary anomalies at a glance — catches the 2024-02 gap
(Paperless doc never uploaded) and any future parser regressions.

Monthly cash flow, effective-rate, and full payslip table kept as-is.

Total dashboard height: 39 rows (was ~67). No parser / schema changes.

[ci skip]
2026-04-19 12:47:44 +00:00
Viktor Barzin
789cb61310 [servarr] Rewrite MAM ratio farming — break Mouse death spiral, adopt in TF
## Context

A MAM (MyAnonamouse) freeleech farming workflow was deployed on 2026-04-14
via kubectl apply (outside Terraform). Five days later the account was
still stuck in Mouse class: 715 MiB downloaded, 0 uploaded, ratio 0.
Tracker responses on 7 of 9 active torrents returned
`status=4 | msg="User currently mouse rank, you need to get your ratio up!"`
— MAM was actively refusing to serve peer lists because the account was
in Mouse class, and refusing to serve peer lists made the ratio impossible
to recover. Meanwhile the grabber kept digging: 501 torrents sat in
qBittorrent, 0 completed, 0 bytes uploaded.

Root causes (ranked):
1. Death spiral — Mouse class blocks announces, nothing uploads.
2. BP-spender 30 000 BP threshold blocked the only exit even though the
   account already had 24 500 BP.
3. Grabber selection (`score = 1.0 / (seeders+1)`) preferred low-demand
   torrents filtered to <100 MiB — ratio-hostile by design.
4. Grabber/cleanup deadlock: cleanup only fired on seed_time > 3d, so
   torrents that never started never qualified. Combined with the 500-
   torrent cap this stalled the grabber indefinitely.
5. qBittorrent queueing amplified (4) — 495/501 stuck in queuedDL.
6. Ratio-monitor labelled queued torrents `unknown` (empty tracker
   field), hiding the problem on the MAM Grafana panel.
7. qBittorrent memory limit (256 Mi LimitRange default) too low.
8. All of the above was Terraform drift with no reviewability.

## This change

Introduces `stacks/servarr/mam-farming/` — a new TF module that adopts
the three kubectl-applied resources and replaces their scripts with
demand-first, H&R-aware logic. Also bumps qBittorrent resources, fixes
ratio-monitor labelling, and adds five Prometheus alerts plus a Grafana
panel row.

### Architecture

    MAM API ───┬─── jsonLoad.php (profile: ratio, class, BP)
               ├─── loadSearchJSONbasic.php (freeleech search)
               ├─── bonusBuy.php (50 GiB min tier for API)
               └─── download.php (torrent file)
                               │
    Pushgateway <──┬────────────┤
                   │  mam_ratio            ┌────────────────────┐
                   │  mam_class_code       │ freeleech-grabber  │ */30
                   │  mam_bp_balance   ◄───│  (ratio-guarded)   │
                   │  mam_farming_*        └──────────┬─────────┘
                   │  mam_janitor_*                   │ adds to
                   │                                  ▼
                   │  Grafana panels      qBittorrent (mam-farming)
                   │  + 5 alerts                      ▲
                   │                                  │ deletes by rule
                   │                       ┌──────────┴─────────┐
                   │                   ◄───│ farming-janitor    │ */15
                   │                       │  (H&R-aware)       │
                   │                       └──────────┬─────────┘
                   │                                  │ buys credit
                   │                       ┌──────────┴─────────┐
                   └───────────────────────│ bp-spender         │ 0 */6
                                           │  (tier-aware)      │
                                           └────────────────────┘

### Key decisions

- **Ratio guard on grabber** — refuse to grab if ratio < 1.2 OR class ==
  Mouse. Prevents the death spiral from deepening. Emits
  `mam_grabber_skipped_reason{reason=...}` and exits clean.
- **Demand-first selection** — new score formula
  `leechers*3 - seeders*0.5 + 200 if freeleech_wedge else 0`; size band
  50 MiB – 1 GiB; leecher floor 1; seeder ceiling 50. Picks titles that
  will actually upload.
- **Janitor decoupled from grabber** — runs every 15 min regardless of
  the ratio-guard state. Without this, stuck torrents accumulate
  fastest exactly when the grabber is skipping (Mouse class). H&R-aware:
  never deletes `progress==1.0 AND seeding_time < 72h`. Six delete
  reasons observable via `mam_janitor_deleted_per_run{reason=...}`.
- **BP-spender tier-aware** — MAM imposes a hard 50 GiB minimum on API
  buyers ("Automated spenders are limited to buying at least 50 GB...
  due to log spam"). Valid API tiers: 50/100/200/500 GiB at 500 BP/GiB.
  The spender picks the smallest tier that satisfies the ratio deficit
  AND fits the budget, preserving a 500 BP reserve. If even the 50 GiB
  tier is too expensive, it skips and retries on the next 6-hour cron.
- **Authoritative metrics use MAM profile fields** —
  `downloaded_bytes` / `uploaded_bytes` (integers) rather than the
  pretty-printed `downloaded` / `uploaded` strings like "715.55 MiB"
  that MAM also returns.
- **Ratio-monitor category-first labelling** — `tracker` is empty for
  queued torrents that never announced. Now maps `category==mam-farming`
  to label `mam` first, only falls back to tracker-URL parsing when
  category is absent. Stops hundreds of MAM torrents collecting under
  `unknown`.
- **qBittorrent resources bumped** to `requests=512Mi / limits=1Gi` so
  hundreds of active torrents don't OOM.
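
A condensed sketch of the ratio guard + demand-first selection (field
names loosely mirror the MAM search response; not the deployed script
verbatim):

    MiB = 1024 * 1024

    def should_skip(ratio, class_name):
        return ratio < 1.2 or class_name == "Mouse"   # never deepen the death spiral

    def score(t):
        return t["leechers"] * 3 - t["seeders"] * 0.5 + (200 if t.get("freeleech_wedge") else 0)

    def eligible(t):
        return (50 * MiB <= t["size_bytes"] <= 1024 * MiB
                and t["leechers"] >= 1 and t["seeders"] <= 50)

    def pick(candidates, n=5):
        return sorted((t for t in candidates if eligible(t)), key=score, reverse=True)[:n]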

### Emergency recovery performed this session

1. Adopted 5 in-cluster resources via root-module `import {}` blocks
   (Terraform 1.5+ rejects imports inside child modules).
2. Ran the janitor in DRY_RUN=1 to verify rules against live state —
   466 `never_started` candidates, 0 false positives in any other
   reason bucket. Flipped to enforce mode.
3. Janitor deleted 466 stuck torrents (matches plan's ~495 target; 35
   preserved as active/in-progress).
4. Truncated `/data/grabbed_ids.txt` so newly-popular titles become
   eligible again.

The ratio is still 0 because the API cannot buy below 50 GiB and the
account sits at 24 551 BP (needs 25 000). Manual 1 GiB purchase via the
MAM web UI — 500 BP — would immediately lift the account to ratio ≈ 1.4
and unblock announces. Future automation cannot do this for us due to
MAM's anti-spam rule.

### What is NOT in this change

- qBittorrent prefs reconciliation (max_active_downloads=20,
  max_active_uploads=150, max_active_torrents=150). The plan wanted
  this; deferred to a follow-up because the janitor + ratio recovery
  handles the 500-torrent backlog first. A small reconciler CronJob
  posting to /api/v2/app/setPreferences is the intended follow-up.
- VIP purchase (~100 k BP) — deferred until BP accumulates.
- Cross-seed / autobrr — separate initiative.

## Alerts added

- P1 MAMMouseClass — `mam_class_code == 0` for 1h
- P1 MAMCookieExpired — `mam_farming_cookie_expired > 0`
- P2 MAMRatioBelowOne — `mam_ratio < 1.0` for 24h (replaces old
  QBittorrentMAMRatioLow, now driven by authoritative profile metric)
- P2 MAMFarmingStuck — no grabs in 4h while ratio is healthy
- P2 MAMJanitorStuckBacklog — `skipped_active > 400` for 6h

## Test plan

### Automated

    $ cd infra/stacks/servarr && ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 5 to import, 2 to add, 6 to change, 0 to destroy.

    $ ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 5 imported, 2 added, 6 changed, 0 destroyed.

    # Re-plan after import block removal (idempotent)
    $ ../../scripts/tg plan 2>&1 | grep Plan
    Plan: 0 to add, 1 to change, 0 to destroy.
    # The 1 change is a pre-existing MetalLB annotation drift on the
    # qbittorrent-torrenting Service — unrelated to this change.

    $ cd ../monitoring && ../../scripts/tg apply --non-interactive
    Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

    # Python + JSON syntax
    $ python3 -c 'import ast; [ast.parse(open(p).read()) for p in [
        "infra/stacks/servarr/mam-farming/files/freeleech-grabber.py",
        "infra/stacks/servarr/mam-farming/files/bp-spender.py",
        "infra/stacks/servarr/mam-farming/files/mam-farming-janitor.py"]]'
    $ python3 -c 'import json; json.load(open(
        "infra/stacks/monitoring/modules/monitoring/dashboards/qbittorrent.json"))'

### Manual Verification

1. Grabber ratio-guard path:

       $ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
       $ kubectl -n servarr logs job/g1
       Skip grab: ratio=0.0 class=Mouse (floor=1.2) reason=mouse_class

2. BP-spender tier path:

       $ kubectl -n servarr create job --from=cronjob/mam-bp-spender s1
       $ kubectl -n servarr logs job/s1
       Profile: ratio=0.0 class=Mouse DL=0.70 GiB UL=0.00 GiB BP=24551
         | deficit=1.40 GiB needed=3 affordable=48 buy=0
       Done: BP=24551, spent=0 GiB (needed=3, affordable=48)

   Correctly skips because affordable (48) < smallest API tier (50).

3. Janitor in enforce mode:

       $ kubectl -n servarr create job --from=cronjob/mam-farming-janitor j1
       $ kubectl -n servarr logs job/j1 | tail -3
       Done: deleted=466 preserved_hnr=0 skipped_active=35 dry_run=False
         per reason: {'never_started': 466, ...}

   Second run immediately after: `deleted=0 skipped_active=35` —
   steady state with only active/seeding torrents left.

4. Alerts loaded:

       $ kubectl -n monitoring get cm prometheus-server \
           -o jsonpath='{.data.alerting_rules\.yml}' \
           | grep -E "alert: MAM|alert: QBittorrent"
         - alert: MAMMouseClass
         - alert: MAMCookieExpired
         - alert: MAMRatioBelowOne
         - alert: MAMFarmingStuck
         - alert: MAMJanitorStuckBacklog
         - alert: QBittorrentDisconnected
         - alert: QBittorrentMAMUnsatisfied

5. Dashboard: browse to Grafana "qBittorrent - Seeding & Ratio" → new
   "MAM Profile (from jsonLoad.php)" row at the bottom shows class, BP
   balance, profile ratio, transfer, BP-vs-reserve timeseries, janitor
   deletion stacked chart, janitor state stat, grabber state stat.

## Reproduce locally

1. `cd infra/stacks/servarr && ../../scripts/tg plan` — expect
   0 add / 1 change (unrelated MetalLB annotation drift).
2. `kubectl -n servarr get cronjobs` — expect three:
   mam-freeleech-grabber, mam-bp-spender, mam-farming-janitor.
3. Trigger each via `kubectl create job --from=cronjob/<name> <job>`
   and read logs; outputs match the manual-verification snippets above.

Closes: code-qfs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:45:38 +00:00
Viktor Barzin
a5df175a67 [mailserver] Retire Dovecot exporter + scrape + alerts [ci skip]
## Context

code-vnc confirmed `viktorbarzin/dovecot_exporter` cannot produce real
metrics against docker-mailserver 15.0.0's Dovecot 2.3.19 — the
exporter speaks the pre-2.3 `old_stats` FIFO protocol, which Dovecot
2.3 deprecated in favour of `service stats` + `doveadm-server` with
a different wire format. The scrape only ever returned
`dovecot_up{scope="user"} 0`.

code-1ik listed two paths: (a) switch to a Dovecot 2.3+ exporter, or
(b) retire the exporter + scrape + alerts. Picking (b) — carrying a
no-op exporter + scrape + alert group taxes cluster resources,
clutters Prometheus /targets, and tees up an alert that can never
fire correctly. If a future session needs real Dovecot stats, reach
for a known-good exporter (e.g., jtackaberry/dovecot_exporter) and
rebuild this scaffolding.

## This change

### mailserver stack
- Removes the `dovecot-exporter` container from
  `kubernetes_deployment.mailserver` (was ~28 lines). Pod now
  runs a single `docker-mailserver` container.
- Removes `kubernetes_service.mailserver_metrics` (ClusterIP Service
  added in code-izl). The `mailserver` LoadBalancer (ports 25, 465,
  587, 993) is unaffected.
- Drops the dovecot.cf comment documenting the failed code-vnc
  attempt — the documentation survives here + in bd code-vnc /
  code-1ik.

### monitoring stack
- Removes `job_name: 'mailserver-dovecot'` from `extraScrapeConfigs`.
- Removes the `Mailserver Dovecot` PrometheusRule group
  (`DovecotConnectionsNearLimit`, `DovecotExporterDown`).
- Inline comments in both files point future work at code-1ik's
  decision record.

Prometheus configmap-reload picked up the change; scrape target set
now has zero entries for `mailserver-dovecot`. Pod rolled cleanly to
1/1 Running.

## What is NOT in this change

- No replacement exporter — deliberate. The alert that was removed
  was a false-signal alert; its removal returns cluster alerting to
  a correct, lower-noise state.
- mailserver MetalLB Service + SMTP/IMAP ports — unchanged.
- `auth_failure_delay`, `mail_max_userip_connections` — stay; those
  are unrelated to stats export.

## Test Plan

### Automated
```
$ kubectl get pod -n mailserver -l app=mailserver
NAME                          READY  STATUS   RESTARTS  AGE
mailserver-78589bfd95-swz6h   1/1    Running  0         49s

$ kubectl get svc -n mailserver
NAME            TYPE          PORT(S)
mailserver      LoadBalancer  25/TCP,465/TCP,587/TCP,993/TCP
roundcubemail   ClusterIP     80/TCP
# mailserver-metrics gone

$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
{"status":"success","data":{"activeTargets":[]}}
```

### Manual Verification
1. E2E probe `email-roundtrip-monitor` keeps succeeding (20-min cadence)
2. `EmailRoundtripFailing` stays green — proves IMAP is healthy even
   without the exporter signal
3. Prometheus `/alerts` page no longer shows DovecotConnectionsNearLimit
   or DovecotExporterDown

Closes: code-1ik

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:01:07 +00:00
Viktor Barzin
973f549810 [payslip-ingest] Update extractor agent + dashboard for v2 regex parser
## Context

Companion change to payslip-ingest v2 (regex parser + accurate RSU tax
attribution). The Grafana dashboard now has 4 more panels powered by the
new earnings-decomposition and YTD-snapshot columns, and the Claude
fallback agent's prompt is aligned with the new schema so non-Meta
payslips still land with the full field set.

## This change

### `.claude/agents/payslip-extractor.md`

Rewrites the RSU handling section to match Meta UK's actual template
(rsu_vest = "RSU Tax Offset" + "RSU Excs Refund", no matching
rsu_offset deduction — PAYE uses grossed-up Taxable Pay instead).
Adds a new "Earnings decomposition (v2)" section telling the fallback
agent how to populate salary/bonus/pension_sacrifice/taxable_pay/ytd_*
and when to use pension_employee vs pension_sacrifice without
double-counting.

### `stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json`

- **Panel 4 (Effective rate)** — SQL switched from the naive
  `(income_tax + NIC) / cash_gross` to the YTD-effective-rate
  method: `cash_tax = income_tax - rsu_vest × (ytd_tax_paid /
  ytd_taxable_pay)`. Title updated to "YTD-corrected" so the
  change is discoverable.
- **Panel 5 (Table)** — adds salary, bonus, pension_sacrifice,
  taxable_pay columns so row-level debugging against the parser
  output is trivial.
- **+Panel 8 (Earnings breakdown)** — monthly stacked bars of
  salary / bonus / rsu_vest / -pension_sacrifice. Bonus-sacrifice
  months show up as a massive negative pension_sacrifice spike
  paired with a near-zero bonus bar.
- **+Panel 9 (Accurate cash tax rate)** — timeseries of
  cash_tax_rate_ytd vs naive_tax_rate. Divergence is the RSU
  contribution the payslip hides in the single `Tax paid` line.
- **+Panel 10 (All-in compensation)** — stacked bars of cash_gross
  + rsu_vest per payslip.
- **+Panel 11 (YTD cumulative cash gross vs total comp)** — two
  lines partitioned by tax_year; the gap between them is the RSU
  contribution YTD.

Total panels go from 7 → 11.
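
For reference, the Panel 4 correction reduces to the following sketch
(the panel itself is SQL; keeping cash_gross as the denominator is an
assumption):

```python
# Illustrative form of the "YTD-corrected" rate: strip the PAYE attributable
# to the grossed-up RSU at the slip's own YTD effective rate.
def cash_tax_rate_ytd(income_tax, rsu_vest, ytd_tax_paid, ytd_taxable_pay, cash_gross):
    cash_tax = income_tax - rsu_vest * (ytd_tax_paid / ytd_taxable_pay)
    return cash_tax / cash_gross
```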

## Test Plan

### Automated

Dashboard JSON validity:
```
$ python3 -m json.tool uk-payslip.json > /dev/null && echo ok
ok
```

### Manual Verification

After applying `stacks/monitoring/`:
1. `https://grafana.viktorbarzin.me/d/uk-payslip` loads with 11 panels
2. Bonus-sacrifice months (e.g. March 2024 if present in data) show the
   negative pension_sacrifice bar in panel 8
3. Panel 9 "Accurate cash effective tax rate" shows the
   cash_tax_rate_ytd line sitting ~10-15pp below naive_tax_rate in
   RSU-vest months

## Reproduce locally

1. `cd infra/stacks/monitoring && terragrunt plan`
2. Expected: ConfigMap diff on the payslip dashboard with the new panel
   JSON
3. `terragrunt apply` — Grafana reloads the dashboard automatically
   (configmap-reload sidecar)

Relates to: payslip-ingest commit 9741816

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:54:33 +00:00
Viktor Barzin
c941199f8d [mailserver] Split Dovecot metrics port onto ClusterIP service [ci skip]
## Context

Port 9166 (`dovecot-metrics`) was exposed on the public MetalLB
LoadBalancer 10.0.20.202 alongside SMTP/IMAP. While only LAN-routable,
shipping an internal metric on the same listening IP as external mail
conflated two concerns and over-exposed the port. Prometheus was
scraping via the same LB Service. Addresses code-izl (follow-up to
code-61v which added the scrape job).

## This change

### mailserver stack
- Drops `dovecot-metrics` port from `kubernetes_service.mailserver`
  (LoadBalancer stays: 25, 465, 587, 993).
- Adds new `kubernetes_service.mailserver_metrics` — ClusterIP-only,
  selecting the same `app=mailserver` pod, exposing 9166.

### monitoring stack
- Updates `extraScrapeConfigs` in the Prometheus chart values to
  target the new `mailserver-metrics.mailserver.svc.cluster.local:9166`
  instead of `mailserver.mailserver.svc.cluster.local:9166`.
- helm_release.prometheus updated in-place; configmap-reload sidecar
  picked up the new target within 10s.

```
 mailserver LB              mailserver-metrics ClusterIP
 ┌──────────────────┐       ┌──────────────────┐
 │ 25  smtp         │       │ 9166 dovecot-    │
 │ 465 smtp-secure  │       │      metrics     │ ← Prometheus only
 │ 587 smtp-auth    │       └──────────────────┘
 │ 993 imap-secure  │
 └──────────────────┘
    ↑ 10.0.20.202
```

## What is NOT in this change

- Per-Service RBAC/NetworkPolicy tightening (separate task)
- Moving the metrics port to a dedicated sidecar-only Service Monitor
  (ServiceMonitor CRDs not installed; extraScrapeConfigs is correct
  for the prometheus-community chart in use)

## Test Plan

### Automated
```
$ kubectl get svc -n mailserver
mailserver          LoadBalancer 10.0.20.202  25/TCP,465/TCP,587/TCP,993/TCP
mailserver-metrics  ClusterIP    10.100.102.174  9166/TCP

$ kubectl get endpoints -n mailserver mailserver-metrics
mailserver-metrics   10.10.169.163:9166

$ # Prometheus target (after 10s configmap-reload)
$ kubectl exec -n monitoring <prom-pod> -c prometheus-server -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?scrapePool=mailserver-dovecot'
  scrapeUrl: http://mailserver-metrics.mailserver.svc.cluster.local:9166/metrics
  health: up
```

### Manual Verification
1. From a host outside the cluster: `nc -vz 10.0.20.202 9166` → connection refused
2. Prometheus UI `/targets` → `mailserver-dovecot` UP, labels show new DNS name
3. PromQL: `up{job="mailserver-dovecot"}` returns `1`

Closes: code-izl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:37:30 +00:00
Viktor Barzin
c36b41eabc [monitoring] Scrape mailserver Dovecot exporter + near-limit alerts
Port 9166 (`dovecot-metrics`) is exposed on the mailserver Service but
nothing was scraping it. Added a static `mailserver-dovecot` scrape job
to `extraScrapeConfigs` (we run `prometheus-community/prometheus`, not
`kube-prometheus-stack`, so no ServiceMonitor CRDs are available).

Two alerts in a new `Mailserver Dovecot` rule group:
- `DovecotConnectionsNearLimit` fires at ≥42/50 IMAP connections for
  5m (85% of `mail_max_userip_connections = 50`).
- `DovecotExporterDown` fires if the scrape target is unreachable
  for 10m (catches pod restarts + network issues).

Originally drafted as `kubernetes_manifest` ServiceMonitor + PrometheusRule
on the `mailserver-beta1` branch; that commit was abandoned because the
CRDs aren't installed. This path is functionally equivalent and plans
cleanly.

Closes: code-61v
2026-04-19 00:24:12 +00:00
ac604d4d1f [monitoring] uk-payslip: cash-basis queries + RSU vest panel
- Panels 1/2/4: compute on (gross_pay - rsu_vest) so numbers reflect
  actual UK cash pay, not the RSU-inflated figure the payslip shows; see the
  query sketch below.
- Detailed table: add cash_gross / rsu_vest / rsu_offset columns.
- New RSU panel at the bottom: bar chart of rsu_vest over time
  (only shows months with stock vests). Taxed at Schwab — included
  here for reporting/reconciliation, not for P&L.
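
The cash-basis aggregation is roughly this shape (sketch; `$PAYSLIPS_DSN` and
the `payslips` table name are hypothetical, only the column names come from
the dashboard):

```
$ psql "$PAYSLIPS_DSN" -c "
    SELECT tax_year,
           SUM(gross_pay - rsu_vest) AS cash_gross,
           SUM(net_pay)              AS net_pay,
           SUM(rsu_vest)             AS rsu_vest
    FROM payslips
    GROUP BY tax_year
    ORDER BY tax_year;"
```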
2026-04-18 23:39:46 +00:00
73ed2d9001 [monitoring] Add detailed-payslips table + full-deductions panels
Two new panels below the 4 existing ones:
- Detailed table: every payslip sorted by pay_date DESC with all fields
  (gross, all deductions, net, tax_year, validated flag, paperless_doc_id).
  Footer reducer sums the numeric columns.
- Full deductions stacked bars: income_tax + NI + pension_employee +
  pension_employer + student_loan per payslip. The earlier panel only
  showed 4 deductions; this one shows the complete picture.
2026-04-18 23:32:21 +00:00
4cd8d96b01 [monitoring] Widen uk-payslip default time range to 10y
Oldest payslip in Paperless is July 2019. Previous default (now-2y) hid
everything from 2019-2023, making it look like the backfill was broken.
2026-04-18 23:26:49 +00:00
Viktor Barzin
1698cd1ce1 [mailserver] Add daily backup CronJob for mailserver PVC
## Context

The mailserver stack holds everything valuable and hard to recreate:
243M of maildirs, dovecot/rspamd state, and the DKIM private key that
signs outbound mail. Today the only defense is the LVM thin-pool
snapshots on the PVE host (7-day retention, storage-class scope only)
— there is no app-level backup. Infra/.claude/CLAUDE.md mandates that
every proxmox-lvm(-encrypted) app ship an NFS-backed backup CronJob,
and the mailserver stack was the only one still out of compliance.

Loss of mailserver-data-encrypted without backups = total loss of all
stored mail plus a DKIM key rotation (which requires a DNS update and
breaks signature verification on every message in transit for the TTL
window). Unacceptable for a service people actually use.

Trade-offs considered:
- mysqldump-style single-file dump vs rsync snapshot — maildirs are
  millions of small files, not a DB export. rsync --link-dest gives
  incremental weekly snapshots for ~10% of the cost of a full copy.
- RWO PVC read-only mount — the underlying PVC is ReadWriteOnce, so
  the backup Job has to co-locate with the mailserver pod. vaultwarden
  solves this with pod_affinity; mirrored here.
- Image choice — alpine + apk add rsync matches vaultwarden's pattern
  and keeps the container image small.

## This change

Adds `kubernetes_cron_job_v1.mailserver-backup` + NFS PV/PVC to the
mailserver module. Runs daily at 03:00 (avoids the 00:30 mysql-backup
and 00:45 per-db windows, and the */20 email-roundtrip cadence). The
job rsyncs /var/mail, /var/mail-state, /var/log/mail into
/srv/nfs/mailserver-backup/<YYYY-WW>/ with --link-dest against the
previous week for space-efficient incrementals. 8-week retention.

Data layout (flowed through from the deployment's subPath mounts so
the rsync tree matches the mailserver's own on-disk layout):

    PVC mailserver-data-encrypted (RWO, 2Gi)
      ├─ data/   (subPath) → pod's /var/mail        → backup/<week>/data/
      ├─ state/  (subPath) → pod's /var/mail-state  → backup/<week>/state/
      └─ log/    (subPath) → pod's /var/log/mail    → backup/<week>/log/

Safety:
- PVC mounted read-only (volume.persistent_volume_claim.read_only
  AND all three volume_mounts set read_only=true) so a backup-script
  bug cannot corrupt maildirs.
- pod_affinity on app=mailserver + topology_key=hostname forces the
  Job pod onto the same node holding the RWO PVC attachment.
- set -euxo pipefail + per-directory existence guard so a missing
  subPath short-circuits cleanly instead of silently no-op'ing.

Metrics pushed to Pushgateway match the mysql-backup/vaultwarden-backup
convention (job="mailserver-backup"):
  backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_output_bytes, backup_last_success_timestamp.
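
The job's shell is roughly the following (condensed sketch, not the literal
script: the in-pod NFS mount point `/backup`, busybox wget's `--post-file`
flag, and the retention-by-mtime shortcut are assumptions; the paths, weekly
layout, existence guard, and metric names come from the description above):

```
#!/bin/sh
# Sketch of the backup container command. Assumes the NFS PVC is mounted at /backup.
set -euxo pipefail

WEEK=$(date +%Y-%W)                                    # <YYYY-WW> weekly folder
DEST=/backup/$WEEK
PREV=$(ls -1d /backup/2* 2>/dev/null | grep -v "$WEEK" | sort | tail -n1 || true)
START=$(date +%s)
mkdir -p "$DEST"

copy() {                              # $1 = source dir, $2 = subfolder in the weekly tree
  [ -d "$1" ] || return 0             # existence guard: missing subPath is a clean no-op
  LINK=""
  [ -n "$PREV" ] && LINK="--link-dest=$PREV/$2"
  rsync -a $LINK "$1/" "$DEST/$2/"
}
copy /var/mail       data
copy /var/mail-state state
copy /var/log/mail   log

# 8-week retention, approximated here as "older than 56 days"
find /backup -maxdepth 1 -type d -name '2*' -mtime +56 -exec rm -rf {} \;

# Push metrics in the mysql-backup/vaultwarden-backup convention
# (wget: alpine has no curl; rsync read/written bytes omitted in this sketch)
DURATION=$(( $(date +%s) - START ))
OUTPUT_BYTES=$(( $(du -sk "$DEST" | cut -f1) * 1024 ))
cat > /tmp/push <<EOF
backup_duration_seconds $DURATION
backup_output_bytes $OUTPUT_BYTES
backup_last_success_timestamp $(date +%s)
EOF
wget -q -O- --post-file=/tmp/push \
  http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/mailserver-backup
```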

Alert rules added in monitoring stack, mirroring Mysql/Vaultwarden:
- MailserverBackupStale — 36h threshold, severity critical, for: 30m
- MailserverBackupNeverSucceeded — severity critical, for: 1h

## Reproduce locally

1. cd infra/stacks/mailserver && ../../scripts/tg plan
   Expected: 3 to add (cronjob + NFS PV + PVC), unrelated drift on
   deployment/service is pre-existing.
2. ../../scripts/tg apply --non-interactive \
     -target=module.mailserver.module.nfs_mailserver_backup_host \
     -target=module.mailserver.kubernetes_cron_job_v1.mailserver-backup
3. cd ../monitoring && ../../scripts/tg apply --non-interactive
4. kubectl create job --from=cronjob/mailserver-backup \
     mailserver-backup-test -n mailserver
5. kubectl wait --for=condition=complete --timeout=300s \
     job/mailserver-backup-test -n mailserver
6. Expected: test pod co-locates with mailserver on same node
   (k8s-node2 today), rsync writes ~950M to
   /srv/nfs/mailserver-backup/<YYYY-WW>/, Pushgateway exposes
   backup_output_bytes{job="mailserver-backup"}.

## Test Plan

### Automated

$ kubectl get cronjob -n mailserver mailserver-backup
NAME                SCHEDULE    TIMEZONE   SUSPEND   ACTIVE   LAST SCHEDULE   AGE
mailserver-backup   0 3 * * *   <none>     False     0        <none>          3s

$ kubectl create job --from=cronjob/mailserver-backup \
    mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test created

$ kubectl wait --for=condition=complete --timeout=300s \
    job/mailserver-backup-test -n mailserver
job.batch/mailserver-backup-test condition met

$ kubectl logs -n mailserver job/mailserver-backup-test | tail -5
=== Backup IO Stats ===
duration: 80s
read:    1120 MiB
written: 1186 MiB
output:  947.0M

$ kubectl run nfs-verify --rm --image=alpine --restart=Never \
    --overrides='{...nfs mount /srv/nfs...}' \
    -n mailserver --attach -- ls -la /nfs/mailserver-backup/
947.0M  /nfs/mailserver-backup/2026-15

$ curl http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep mailserver-backup
backup_duration_seconds{instance="",job="mailserver-backup"} 80
backup_last_success_timestamp{instance="",job="mailserver-backup"} 1.776554641e+09
backup_output_bytes{instance="",job="mailserver-backup"} 9.92315701e+08
backup_read_bytes{instance="",job="mailserver-backup"} 1.175027712e+09
backup_written_bytes{instance="",job="mailserver-backup"} 1.244254208e+09

$ curl -s http://prometheus-server/api/v1/rules \
    | jq '.data.groups[].rules[] | select(.name | test("Mailserver"))'
MailserverBackupStale: (time() - kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"}) > 129600
MailserverBackupNeverSucceeded: kube_cronjob_status_last_successful_time{cronjob="mailserver-backup",namespace="mailserver"} == 0

### Manual Verification

1. Wait for the scheduled 03:00 run tonight; verify
   `kubectl get job -n mailserver` shows a new completed job.
2. Check that `backup_last_success_timestamp` advances past today.
3. Confirm `MailserverBackupNeverSucceeded` did not fire.
4. Next week (week 16), confirm `--link-dest` builds hardlinks vs
   2026-15 (the size delta should drop from ~950M to roughly the actual churn).

## Deviations from mysql-backup pattern

- Image: alpine + rsync (mirrors vaultwarden — mysql's `mysql:8.0`
  base is not applicable for a filesystem rsync).
- pod_affinity: required for RWO PVC co-location (mysql uses its own
  MySQL service for network access; mailserver must mount the PVC).
- Metric push via wget (mirrors vaultwarden; alpine has wget, not curl).
- Week-folder layout with --link-dest rotation: rsync pattern, closer
  to the PVE daily-backup script than mysql's single-file gzip dumps.

[ci skip]

Closes: code-z26

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:08 +00:00
06e3425a39 [monitoring] Set rawQuery+editorMode on uk-payslip panel targets
Grafana 11's Postgres plugin shows 'you do not have default database'
on any panel whose target is missing rawQuery:true / editorMode:"code".
The query builder can't reason about a custom schema.table path and
blanks the panel.
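
A bulk way to set both fields across the dashboard JSON (sketch; the file name
is illustrative, and panels without targets are skipped via the optional
iterator):

```
$ jq '.panels[].targets[]? |= (.rawQuery = true | .editorMode = "code")' \
    uk-payslip.json > uk-payslip.fixed.json
```
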
2026-04-18 23:12:45 +00:00
ed820e9b58 [monitoring] Fix uk-payslip datasource type to grafana-postgresql-datasource
The installed Postgres plugin is 'grafana-postgresql-datasource' (the newer
one). Dashboard panels referenced legacy 'postgres' type, which caused Grafana
to fall back to 'default database' and error out when rendering.

Ran sed over the JSON; all 8 panel+target type refs now match the installed
plugin name. UID (payslips-pg) was already correct.
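
The pass was roughly of this shape (sketch; the file name and exact JSON
spacing are assumptions):

```
$ sed -i 's/"type": "postgres"/"type": "grafana-postgresql-datasource"/g' uk-payslip.json
$ grep -c 'grafana-postgresql-datasource' uk-payslip.json    # expect the 8 panel+target refs
```
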
2026-04-18 23:10:13 +00:00
471e946133 [monitoring] Put uk-payslip dashboard in Finance folder
Grafana can't auto-create the reserved 'General' folder ('A folder with
that name already exists'), which aborts the sidecar provisioner's walk
and drops every dashboard in that folder. Move uk-payslip to Finance so
it loads.
2026-04-18 23:03:22 +00:00
Viktor Barzin
b28c76e371 [infra] Wire drift detection to Pushgateway + alert on stale/unaddressed drift
## Context

Wave 7 of the state-drift consolidation plan. The drift-detection pipeline
(`.woodpecker/drift-detection.yml`) already ran terragrunt plan on every
stack daily and Slack-posted a summary, but its output was ephemeral —
nothing persisted in Prometheus, so there was no historical view of which
stacks drift, when, or for how long. Following the convergence work in
waves 1–6 (168 KYVERNO_LIFECYCLE_V1 markers, 4 stacks adopted, Phase 4
mysql cleanup), the baseline is clean enough that *new* drift should
stand out. That only works if we have observability.

## This change

### `.woodpecker/drift-detection.yml`

Enhances the existing cron pipeline to push a batched set of metrics to
the in-cluster Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`)
after each run:

| Metric | Kind | Purpose |
|---|---|---|
| `drift_stack_state{stack}` | gauge, 0/1/2 | 0=clean, 1=drift, 2=error |
| `drift_stack_first_seen{stack}` | gauge (unix seconds) | Preserved across runs for drift-age tracking |
| `drift_stack_age_hours{stack}` | gauge (hours) | Computed from `first_seen` |
| `drift_stack_count` | gauge (count) | Total drifted stacks this run |
| `drift_error_count` | gauge (count) | Total plan-errored stacks |
| `drift_clean_count` | gauge (count) | Total clean stacks |
| `drift_detection_last_run_timestamp` | gauge (unix seconds) | Pipeline heartbeat |

First-seen preservation: on each drift hit, the pipeline queries
Pushgateway for the existing `drift_stack_first_seen{stack=<stack>}`
value. If present and non-zero, reuse it; otherwise stamp with `NOW`.
That means age-hours grows monotonically until the stack goes clean
(at which point state=0 resets first_seen by omission).

Atomic batched push: all metrics for a run are POST'd in a single
HTTP request. Pushgateway doesn't support atomic multi-metric updates
natively, but batching at the pipeline layer prevents half-updated
state if the push is interrupted mid-run (a failed push fails the whole
run, which then surfaces via `DriftDetectionStale`).
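
The preserve-then-push step is roughly this shape (sketch only; the job name
`drift-detection`, the `DRIFTED_STACKS` variable, and the label handling are
illustrative, not the literal pipeline):

```
PGW=http://prometheus-prometheus-pushgateway.monitoring:9091
NOW=$(date +%s)
BATCH=$(mktemp)

for stack in $DRIFTED_STACKS; do
  # Reuse an existing first_seen (may come back in scientific notation) so drift age
  # grows monotonically across runs; stamp NOW only when the stack is newly drifted.
  first=$(curl -s "$PGW/metrics" \
    | awk -v want="stack=\"$stack\"" '$1 ~ /^drift_stack_first_seen/ && index($1, want) {print $2; exit}')
  if [ -z "$first" ] || [ "$first" = "0" ]; then first=$NOW; fi
  age_hours=$(awk -v now="$NOW" -v fs="$first" 'BEGIN{printf "%d", (now - fs) / 3600}')
  {
    echo "drift_stack_state{stack=\"$stack\"} 1"
    echo "drift_stack_first_seen{stack=\"$stack\"} $first"
    echo "drift_stack_age_hours{stack=\"$stack\"} $age_hours"
  } >> "$BATCH"
done
# Clean stacks get drift_stack_state 0 and no first_seen; counts and the heartbeat
# round out the batch, then everything goes up in a single request.
echo "drift_detection_last_run_timestamp $NOW" >> "$BATCH"
curl -sf --data-binary @"$BATCH" "$PGW/metrics/job/drift-detection"
```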

### `stacks/monitoring/.../prometheus_chart_values.tpl`

New `Infrastructure Drift` alert group with three rules:

- **DriftDetectionStale** (warning, 30m): fires if
  `drift_detection_last_run_timestamp` is older than 26h. Gives a 2h
  grace window on top of the 24h cron so transient Pushgateway or
  cluster unavailability doesn't false-alarm. Guards against the
  pipeline silently failing or the cron not firing (spot-check sketch below).
- **DriftUnaddressed** (warning, 1h): fires if any stack has
  `drift_stack_age_hours > 72` — three days of unacknowledged drift.
  Three days is long enough to absorb weekends + typical review cycles
  but short enough to force follow-up before drift compounds.
- **DriftStacksMany** (warning, 30m): fires if `drift_stack_count > 10`
  in a single run. Sudden wide drift usually signals systemic causes
  (new admission webhook, provider version bump, cluster-wide CRD
  upgrade) rather than individual configuration errors, and the alert
  body nudges toward that diagnosis.
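
The staleness rule can be sanity-checked by running an expression of the same
shape as an instant query (26h = 93,600 s; the in-cluster Prometheus DNS name
is an assumption):

```
$ curl -sG http://prometheus-server.monitoring/api/v1/query \
    --data-urlencode 'query=time() - drift_detection_last_run_timestamp > 93600' \
    | jq '.data.result'    # empty while the heartbeat is fresh
```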

Applied to `stacks/monitoring` this session — 1 helm_release changed,
no other drift surfaced.

## What is NOT in this change

- The Wave 7 **GitHub issue auto-filer** — the full plan included
  filing a `drift-detected` issue per drifted stack. Deferred because
  it requires wiring the `file-issue` skill's convention + a gh token
  exposed to Woodpecker, both of which need separate setup. The Slack
  alert covers the same need at lower fidelity in the meantime.
- The Wave 7 **PG drift_history table** — would provide the richest
  historical view but adds a new DB schema dependency for a CI
  pipeline. Pushgateway + Prometheus handle the 72h window we care
  about; PG history is nice-to-have for quarterly reviews.
- Auto-apply marker (`# DRIFT_AUTO_APPLY_OK`) — premature until the
  baseline has been stable for a few cycles.

Follow-ups tracked: file dedicated beads items for GH-issue filer + PG
drift_history.

## Verification

```
$ cd stacks/monitoring && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

# After next cron run (cron expr: "drift-detection" in Woodpecker UI):
$ curl -s http://prometheus-prometheus-pushgateway.monitoring:9091/metrics \
    | grep -c '^drift_'
# expect a positive number
```

## Reproduce locally
1. `git pull`
2. Check Prometheus rules: `curl -sk https://prometheus.viktorbarzin.lan/api/v1/rules | jq '.data.groups[] | select(.name == "Infrastructure Drift")'`
3. Manually trigger the Woodpecker cron and watch Pushgateway populate.

Refs: Wave 7 umbrella (code-hl1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 22:42:51 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift on every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
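
For reference, the injected block has this shape (shown via a heredoc for
illustration only; the real blocks are written into each resource by the
script, and comment placement can vary):

```
$ cat <<'EOF'
  lifecycle {
    ignore_changes = [
      # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      spec[0].template[0].spec[0].dns_config,
      # CronJobs nest one level deeper:
      # spec[0].job_template[0].spec[0].template[0].spec[0].dns_config
    ]
  }
EOF
```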

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  The Python script touched the file; the change was reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00