infra

Author	SHA1	Message	Date
root	423aac0908	Woodpecker CI Update TLS Certificates Commit	2026-04-26 00:03:26 +00:00
Viktor Barzin	21ac619fac	monitoring(uk-payslip): promote yearly receipt + YTD gross YoY to row 4 Move both barchart/timeseries panels into row 4 (y=29, side-by-side w=12 each, h=10) so the per-tax-year overviews appear right after the income-tax-and-pension YTD row. Shift panels 13, 4, 5, 6, 8, 9 down by 10 to accommodate. Final ordering: rows 1–3 = monthly + YTD timeseries (panels 1/7/2/3/11/12), row 4 = yearly receipt + YTD gross YoY (16/17), then the wider deduction/integrity/table panels below.	2026-04-25 23:58:15 +00:00
Viktor Barzin	53f555dc61	monitoring(uk-payslip): drop 3 panels referencing undeployed data Removed: - Panel 10 "HMRC Tax Year Reconciliation — Individual Tax API" → references hmrc_sync.tax_year_snapshot schema. The hmrc-sync service / DB has not been deployed, so the panel always errored with "relation does not exist". - Panel 14 "Meta payroll: bank deposit vs payslip net pay" → references payslip_ingest.external_meta_deposits, which is created by alembic migration 0007. The deployed payslip-ingest image is at 0005, so the table doesn't exist. - Panel 15 "RSU vest reconciliation — payslip vs Schwab" → references payslip_ingest.rsu_vest_events, created by migration 0008. Same image-staleness story. Verified all 14 remaining panels return without error via Grafana /api/ds/query. SQL for the removed panels is preserved in git history; re-add when the data sources are actually deployed.	2026-04-25 23:56:03 +00:00
Viktor Barzin	b2a25775aa	monitoring(uk-payslip): simplify yearly receipt to earned-and-kept view Replace the 7-stack "where total comp went" decomposition with a 3-stack "what I actually earned" view: salary (gross), bonus (gross), and RSU vest after band-aware tax (PAYE+NI withheld via sell-to-cover). Skips income tax / NI / student loan / pension / RSU offset. Bar height = real income kept across all components. RSU is net of tax because it's withheld at source and never hits the bank account; salary and bonus are gross because they're paid in full and taxes are deducted elsewhere. This is the income-side view where tax is implicit, not the deduction waterfall. Per-year RSU after tax: 2020/21 £18k · 2021/22 £39k · 2022/23 £50k · 2023/24 £26k · 2024/25 £71k · 2025/26 £73k.	2026-04-25 23:42:20 +00:00
Viktor Barzin	a17304f735	monitoring(uk-payslip): fix empty YTD gross YoY chart Two bugs: 1. Synthetic dates projected onto 1970/71 fell outside the dashboard's default time range (now-10y → now), so Grafana filtered out every point. Switched to a sliding 12-month window (CURRENT_DATE - INTERVAL '12 months') as the projection base, plus a per-panel timeFrom: "13M" override so the panel always shows the last 13 months regardless of the dashboard's time picker. 2. ORDER BY tax_year, pay_date violated Grafana's long→wide conversion requirement (data must be ascending by time). Wrapped in a CTE and re-ordered by the synthetic time column. Pivoted result is now a single wide frame with 7 series (2019/20…2025/26).	2026-04-25 23:36:16 +00:00
Viktor Barzin	ac18c49a7b	monitoring(wealth): fix x-axis label formatting on yearly bars The default fieldConfig unit (percent on Yearly investment return %, currencyGBP on Annual change decomposition) was being applied to the "year" string column too — so x-axis labels rendered as "2024%" and "£2,024" respectively. Add field overrides on the "year" column to force unit=string. The earlier "tax_year" panels weren't affected because "2024/25" doesn't parse as a number; "2024" did.	2026-04-25 23:31:03 +00:00
Viktor Barzin	77bed10a51	monitoring: investment-only returns + YoY YTD gross line chart Wealth dashboard: - "Yearly growth %" → "Yearly investment return %": switched to modified-Dietz formula `market_gain / (nw_start + 0.5 × contributions)` so contributions don't inflate the return. New money in is excluded — this is portfolio performance, not net-worth change. - "Trailing 12-month growth %" → "Trailing 12-month investment return %": same formula, applied to the trailing 12mo window. Pre-fix vs post-fix: 2020: 155.0% → 5.12% (large contributions on small base) 2021: 344.7% → 26.45% 2022: 26.9% → -25.65% (the actual 2022 bear market) 2023: 123.2% → 41.60% 2024: 87.4% → 25.70% 2025: 46.8% → 8.43% 2026: 16.7% → 3.28% (YTD) UK Payslip dashboard: - Replaced the per-tax-year stacked bar with a year-over-year line chart: one line per tax year, X = month-of-tax-year (April→March, projected onto a 1970/71 fiscal calendar so years overlay), Y = cumulative YTD gross. Five+ lines visible at a glance for trend comparison.	2026-04-25 23:25:42 +00:00
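A minimal sketch of the return formula quoted in the commit above. It assumes `market_gain` means the net-worth change minus net contributions, which is how the "Annual change decomposition" panel describes it. The function and argument names are illustrative and are not taken from the dashboard SQL.

```python
def investment_return(nw_start: float, nw_end: float, contributions: float) -> float:
    """Return market_gain / (nw_start + 0.5 * contributions) as a fraction.

    Contributions are weighted at one half, i.e. treated as if they arrived
    mid-period, matching the formula in the commit message.
    """
    market_gain = nw_end - nw_start - contributions
    denominator = nw_start + 0.5 * contributions
    if denominator == 0:
        raise ValueError("return is undefined when the weighted base is zero")
    return market_gain / denominator


def naive_growth(nw_start: float, nw_end: float) -> float:
    """First-to-last growth, which counts new money as if it were performance."""
    return (nw_end - nw_start) / nw_start


# Hypothetical year: small starting base, large contributions.
print(naive_growth(100.0, 260.0))              # contributions inflate this figure
print(investment_return(100.0, 260.0, 100.0))  # contributions excluded from the gain
```

With these made-up inputs the naive figure is 160% while the contribution-adjusted figure is 40%, the same kind of gap the pre-fix and post-fix numbers in the commit show.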
Viktor Barzin	55d1da41f6	monitoring: more growth detail in Wealth + gross composition in UK Payslip Wealth (4 new panels at the bottom): - Trailing 12-month growth % (stat) — % change in net worth over last 12mo. - Yearly growth % (bar per calendar year) — first→last valuation each year. - Annual change decomposition (stacked bar) — splits each year's NW change into "net contributions" (new money in) and "market gain" (everything else: appreciation, dividends, FX). Answers "did I grow because I saved or because the market did the work?". - Per-account ROI % (horizontal bar) — (value − contribution) / contribution × 100, latest snapshot. Excludes accounts with zero/negative net contribution (Schwab — distorts ratio after RSU sells). UK Payslip (1 new panel below the yearly receipt): - Gross composition by tax year (stacked bar) — salary / bonus / RSU vest / other components per tax year. Bar height = gross pay. Trends in salary growth, bonus levels, and RSU vest sizing at a glance. All queries spot-checked via Grafana /api/ds/query.	2026-04-25 23:21:42 +00:00
Viktor Barzin	d48e222054	monitoring: lock Finance (Personal) folder to admin + fix cash classification Folder ACL: - Move uk-payslip + wealth dashboards to a new "Finance (Personal)" folder; job-hunter + fire-planner stay in "Finance" (open). - New null_resource calls Grafana's folder permissions API after the dashboard sidecar materialises the folder, setting an admin-only ACL ({Admin: 4}). Default Viewer/Editor inheritance is overridden, so anonymous-Viewer (auth.anonymous=true) is denied. Server-admin always retains access. - Verified: anonymous → 403 on uk-payslip + wealth, 200 on control dashboards (node-exporter); admin → 200 on all. Wealth cash fix: - Wealthfolio dumps WORKPLACE_PENSION wrappers entirely into cash_balance because it doesn't track underlying fund holdings. Reclassify pension cash as invested in the "Cash vs invested" panel so the cash series reflects actual uninvested broker cash (~£16k T212 ISA + Schwab) instead of phantom £154k. Pre-fix: cash=£153,789 / invested=£870,282 / total=£1,024,071 Post-fix: cash=£16,064 / invested=£1,008,008 / total=£1,024,071	2026-04-25 23:11:26 +00:00
Viktor Barzin	51bf38815c	vault: record Phase 3 vault Released-PV cleanup Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the proxmox-lvm/encrypted side stays out of scope.	2026-04-25 23:08:45 +00:00
Viktor Barzin	498400173c	wealthfolio-sync: skip the synthetic TOTAL row in ETL Wealthfolio's daily_account_valuation includes a row with account_id='TOTAL' that pre-aggregates the per-account values for that day. Mirroring it into PG verbatim caused every SUM(total_value) in the Wealth dashboard to double-count (showing ~£2M against actual ~£1M). Drop the synthetic row at the dump step so the PG mirror only holds real-account rows. Initial sync after fix: 8,649 DAV rows (was 10,798), net worth resolves to £1,024,071 — matches the per-account latest snapshot.	2026-04-25 22:59:24 +00:00
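A small sketch of why the pre-aggregated row doubles every sum, and of the filter the commit describes. The row shape and the account ids `acct-a` and `acct-b` are hypothetical; only the `account_id='TOTAL'` convention and the `total_value` column name come from the commit message.

```python
# One day's valuations as they might come out of the SQLite dump (illustrative values).
rows = [
    {"account_id": "acct-a", "total_value": 600.0},
    {"account_id": "acct-b", "total_value": 400.0},
    {"account_id": "TOTAL", "total_value": 1000.0},  # synthetic pre-aggregated row
]


def drop_synthetic_total(valuations):
    """Keep only real-account rows so downstream SUMs count each pound once."""
    return [row for row in valuations if row["account_id"] != "TOTAL"]


def net_worth(valuations):
    """Equivalent of SUM(total_value) over the given rows."""
    return sum(row["total_value"] for row in valuations)


print(net_worth(rows))                        # includes TOTAL, so it double-counts
print(net_worth(drop_synthetic_total(rows)))  # real accounts only
```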
Viktor Barzin	f0ce7b0363	fire-planner: add stack, Vault DB role, dashboard, DB New stacks/fire-planner/ mirrors payslip-ingest layout: - ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner - DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner - FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC, scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service - Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData (`cc56ba29` fix; otherwise Grafana 11.2+ hits "you do not have default database") Plus: - vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles - dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role - monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard (NW timeseries, MC fan chart, success heatmap, lifetime tax bars, years-to-ruin table, optimal leave-UK stat, ending wealth stat, UK success-by-strategy bars, sequence-risk correlation table) - monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder Apply order: 1. vault stack — creates the static role 2. dbaas stack — creates the database & role 3. external-secrets stack picks up vault-database refs (no change needed) 4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret before full apply, per the plan-time-data-source pattern 5. monitoring stack — picks up the new dashboard ConfigMap [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 17:27:19 +00:00
Viktor Barzin	484b4c7190	vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1 + vault-2 today). The NFS fsync incompatibility identified in the 2026-04-22 raft-leader-deadlock post-mortem is no longer reachable — raft consensus log + audit log live on LUKS2 block storage with real fsync semantics. Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox dropped to zero after the rolling, so the resource is removed from infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster and will be reclaimed in Phase 3 cleanup. Lesson learned (recorded in plan): pvc-protection finalizer races the StatefulSet controller — pod recreates on the OLD PVCs unless the finalizer is patched out before pod delete. Force-finalize technique applied to vault-1 + vault-2 successfully. Closes: code-gy7h	2026-04-25 17:10:00 +00:00
Viktor Barzin	df2fa0a31d	state(vault): update encrypted state	2026-04-25 17:09:35 +00:00
Viktor Barzin	bf4c7618d8	wealth: SQLite→PG ETL sidecar + new Grafana dashboard Mirrors Wealthfolio's daily_account_valuation / accounts / activities from SQLite into a new PG database (wealthfolio_sync) every hour, so Grafana can chart net worth, contributions, and growth over time. Components: - dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG cluster (dynamic primary lookup so it survives failover). - vault: pg-wealthfolio-sync static role rotates the password every 7d. - wealthfolio: ExternalSecret pulls the rotated password into the WF namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client + busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in the monitoring namespace (uid: wealth-pg). - monitoring: new Wealth dashboard (wealth.json, 10 panels) — current net worth / contribution / growth / ROI% stats, then time-series for net worth, contribution-vs-market, growth area, per-account stacked area, cash-vs-invested, and a 100-row activity log. Initial sync: 6 accounts, 10,798 daily valuations, 518 activities. Verified PG totals match SQLite latest snapshot exactly.	2026-04-25 17:07:33 +00:00
Viktor Barzin	7dd580972a	state(vault): update encrypted state	2026-04-25 16:57:42 +00:00
Viktor Barzin	ac8d2f548b	paperless-ngx: migrate to proxmox-lvm-encrypted Document scans (receipts, contracts, IDs) are unambiguously sensitive PII. Storage decision rule defaults sensitive data to `proxmox-lvm-encrypted`, but paperless-ngx had been left on plain `proxmox-lvm` by an abandoned migration attempt that left a dormant, non-Terraform-managed encrypted PVC sitting unbound for 11 days. Cleaned up the orphan, added the encrypted PVC properly via Terraform, rsynced data with deployment scaled to 0, swapped claim_name. Plain `proxmox-lvm` PVC retained for a 7-day soak before removal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:48:53 +00:00
Viktor Barzin	4f5f1ff8c2	monitoring(uk-payslip): add yearly receipt stacked barchart panel New panel 16 (barchart, h=11, y=179): one stacked bar per tax year showing total comp split into net pay (bank deposit), cash income tax, RSU tax (band-aware marginal: PAYE+NI), cash NI, student loan, pension salary-sacrifice, and RSU offset (Variant A only). X-axis = tax_year (categorical), y-axis = currencyGBP. Bar height ≈ gross_pay + pension_sacrifice (small over-attribution in Variant A years where the band-aware model exceeds recorded payslip PAYE).	2026-04-25 16:26:57 +00:00
Viktor Barzin	288efa89b3	vault: migrate vault-0 storage to proxmox-lvm-encrypted Phase 2 of the NFS-hostile migration: data + audit storageClass on the vault helm release switches from nfs-proxmox to proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between). vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part is what makes this safe (raft quorum maintained by 2 healthy pods while one is replaced). Also restores chart-default pod securityContext fields. The previous `statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}` block REPLACED (not merged) the chart's defaults — fsGroup, runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS exports were permissive enough to mask the missing fsGroup; ext4 LV volume root is root:root and the vault user (UID 100) couldn't open vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly, survives future chart bumps. vault-1 and vault-2 retained their correct securityContext from when their pod specs were written to etcd, before the partial customization landed — the bug only surfaces when a pod is recreated. Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap (recovery anchor). Refs: code-gy7h Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:19:49 +00:00
Viktor Barzin	08b13858dd	state(vault): update encrypted state	2026-04-25 16:16:35 +00:00
Viktor Barzin	b3c29eda12	monitoring(uk-payslip): model UK income-tax bands + PA-taper for RSU marginal Replaces the flat 47% (45 PAYE + 2 NI) RSU marginal across panels 3, 7, 8, 11, and 12 with an exact piecewise band-aware computation. Each row computes ani_prior/ani_pre/ani_post over the tax-year YTD (chronological model — the RSU is taxed at the band its YTD ANI position occupies at the vest date, mirroring PAYE withholding behaviour). Bands (2024/25+, applied to all years): IT: 0% / 20% / 40% / 60% (PA-taper) / 45% at 12,570 / 50,270 / 100k / 125,140 NI: 0% / 8% / 2% at 12,570 / 50,270 PA-taper modelled as 60% effective IT marginal in £100k–£125,140 (40% on the £1 + 40% on the £0.50 of lost PA = 60%). Spot-checked per tax-year totals via psql; numbers diverge from the flat 47% baseline most for years where vests cross PA-taper or basic-rate bands (2020/21 ~35%, 2024/25 ~41%, 2025/26 ~43%).	2026-04-25 16:14:49 +00:00
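The piecewise computation described in the commit above can be sketched as follows. The thresholds and rates are the ones listed in the commit message. The helper names and the single "YTD position before the vest" argument are assumptions made for illustration, and the dashboard's SQL (with its `ani_prior` / `ani_pre` / `ani_post` columns) is not reproduced here.

```python
# (lower threshold, marginal rate) pairs, taken from the commit message.
IT_BANDS = [(0, 0.0), (12_570, 0.20), (50_270, 0.40), (100_000, 0.60), (125_140, 0.45)]
NI_BANDS = [(0, 0.0), (12_570, 0.08), (50_270, 0.02)]


def cumulative_charge(amount: float, bands) -> float:
    """Total charge on the first `amount` pounds under a piecewise-linear schedule."""
    total = 0.0
    for index, (lower, rate) in enumerate(bands):
        upper = bands[index + 1][0] if index + 1 < len(bands) else float("inf")
        if amount > lower:
            total += (min(amount, upper) - lower) * rate
    return total


def rsu_vest_tax(position_before: float, vest: float) -> float:
    """Tax attributed to a vest: the charge on the slice it occupies above the YTD position."""
    position_after = position_before + vest
    income_tax = cumulative_charge(position_after, IT_BANDS) - cumulative_charge(position_before, IT_BANDS)
    national_insurance = cumulative_charge(position_after, NI_BANDS) - cumulative_charge(position_before, NI_BANDS)
    return income_tax + national_insurance


# A vest straddling the 100k threshold: half at 40%, half at 60%, plus 2% NI.
print(rsu_vest_tax(95_000, 10_000))
```

A vest that sits entirely above £125,140 comes out at the flat 47% that the earlier commits used, while one inside the £100k to £125,140 taper comes out at 62%.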
Viktor Barzin	3f85cee1ef	state(vault): update encrypted state	2026-04-25 16:08:38 +00:00
Viktor Barzin	43e4f3f68e	immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per commit on NFS contributed to the 2026-04-22 NFS writeback storm (2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS (append-only, NFS-tolerant). The init container that writes postgresql.override.conf is now gated on PG_VERSION presence — on a fresh PVC the file would otherwise make initdb refuse the non-empty PGDATA. First boot skips the override and initdb's cleanly; second boot (after a forced restart) writes the override so vchord/vectors/pg_prewarm load before the dump restore. Idempotent on initialised PVCs. Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC → REINDEX clip_index/face_index → 111,843 assets verified, external HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only). LV created on PVE host, picked up by lvm-pvc-snapshot. See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md. Phase 2 (Vault Raft) follows under code-gy7h. Closes: code-ahr7 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:47:30 +00:00
Viktor Barzin	0d5f53f337	monitoring(uk-payslip): replace misleading take-home rates in Panel 3 Drop the two misleading series in "Effective rate & take-home % (YTD cumulative)" — both used SUM(gross_pay) as denominator while only counting cash deductions/net in the numerator, which understated take-home by 25-30 pp because RSU shares are absent from the cash deposit but present in gross. Replaced with three semantically clean angles: - ytd_paye_rate_pct: SUM(income_tax) / SUM(taxable_pay) — HMRC audit rate (~41-42% in additional-rate band), kept as before. - ytd_cash_take_home_pct: SUM(net_pay) / SUM(gross_pay - rsu_vest) — what fraction of cash earnings hits the bank (~62-65%). - ytd_total_keep_pct: (SUM(net_pay) + 0.53 × SUM(rsu_vest)) / SUM(gross_pay) — true "what I actually keep" including post-tax RSU shares (47% marginal applied to vest value), ~55-60%. Added field overrides for clear color-coding (red/green/blue). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:45:47 +00:00
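The three rates named in the commit above, written out as a sketch. The column names follow the commit message, while the list-of-dicts input and the function name are illustrative assumptions. The 0.53 keep fraction is the complement of the flat 47% marginal that applied at the time of this commit.

```python
def ytd_rates(payslips, rsu_keep_fraction: float = 0.53):
    """Compute the three YTD percentages from per-payslip rows."""
    gross = sum(p["gross_pay"] for p in payslips)
    taxable = sum(p["taxable_pay"] for p in payslips)
    income_tax = sum(p["income_tax"] for p in payslips)
    net = sum(p["net_pay"] for p in payslips)
    rsu = sum(p["rsu_vest"] for p in payslips)
    return {
        # Audit rate: tax over taxable pay.
        "ytd_paye_rate_pct": 100 * income_tax / taxable,
        # Cash view: RSU value removed from the denominator because it never hits the bank.
        "ytd_cash_take_home_pct": 100 * net / (gross - rsu),
        # Total view: post-tax RSU value added back to the numerator.
        "ytd_total_keep_pct": 100 * (net + rsu_keep_fraction * rsu) / gross,
    }


# Two made-up payslips, the second one containing a vest.
example = [
    {"gross_pay": 10_000, "taxable_pay": 10_000, "income_tax": 4_000, "net_pay": 6_000, "rsu_vest": 0},
    {"gross_pay": 20_000, "taxable_pay": 20_000, "income_tax": 8_000, "net_pay": 6_000, "rsu_vest": 10_000},
]
print(ytd_rates(example))
```

In this example the dropped series (net over gross) would have read 40%, well below both replacement figures, which is the understatement the commit describes.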
Viktor Barzin	8f0d13282c	monitoring(uk-payslip): drop cash PAYE/NI from "Tax & pension — monthly" Same reasoning as panel 2: cash-side income_tax and NI are inherently bumpy in vest months due to UK cumulative PAYE catching up on YTD, and the flat-47% strip can't fix it. Panel now shows only the explicit RSU vest tax (orange, 47% × rsu_vest), student loan, and pensions. The smooth view of total cash deductions stays available on panel 12 (YTD cumulative). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:43:32 +00:00
Viktor Barzin	2230cb6cf4	monitoring(uk-payslip): drop tax/NI from "Monthly cash flow (RSU stripped)" panel Vest months still bumped 4-5x in this panel after the flat-47% strip because UK cumulative PAYE genuinely catches up YTD tax in vest months, on top of the marginal RSU portion — no arithmetic split can make that line flat without distorting the data. The cash-flow question this panel answers (what hits the bank, RSU aside) is already covered cleanly by cash_gross + net_pay; the tax detail lives on Panel 11 where the RSU split is now linear. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:30:46 +00:00
Viktor Barzin	cb3ffa6d8d	monitoring(uk-payslip): smooth quarterly RSU tax bumps via flat 47% marginal Replace the implicit pro-rata RSU/cash split with an explicit flat 47% marginal (45% PAYE + 2% NI) for the RSU vest tax stack. The orange slice now scales linearly with rsu_vest instead of wobbling around the month's effective PAYE rate; cash PAYE/NI slices have those amounts subtracted out so the stack still totals to actual deductions. Affects panel 7 (monthly), panel 12 (YTD cumulative), panel 7 (YTD uses), and the Sankey panel. Verified on 35 months of live data: sum invariant holds exactly (cash + rsu_marginal + cash_ni == income_tax + national_insurance), no negatives in cash slices. Out of scope (left raw): effective-rate %, data-integrity, payslip table, P60/HMRC reconciliation — those are audit views that use unmodified income_tax / cash_income_tax columns. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 15:13:29 +00:00
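A sketch of the flat split and the sum invariant that the commit above reports verifying. The 45% and 2% rates and the invariant come from the commit message. The function name, the dictionary keys, and the sample figures are illustrative.

```python
import math

RSU_PAYE_RATE = 0.45  # flat PAYE share attributed to the vest
RSU_NI_RATE = 0.02    # flat NI share attributed to the vest


def split_deductions(income_tax: float, national_insurance: float, rsu_vest: float):
    """Carve a linear RSU slice out of the recorded deductions."""
    return {
        "cash": income_tax - RSU_PAYE_RATE * rsu_vest,
        "rsu_marginal": (RSU_PAYE_RATE + RSU_NI_RATE) * rsu_vest,
        "cash_ni": national_insurance - RSU_NI_RATE * rsu_vest,
    }


# Made-up vest month: the three slices must add back up to the recorded deductions.
slices = split_deductions(income_tax=6_000.0, national_insurance=500.0, rsu_vest=10_000.0)
assert math.isclose(sum(slices.values()), 6_000.0 + 500.0)
assert all(value >= 0 for value in slices.values())
print(slices)
```

Because the RSU slice is subtracted from the cash slices rather than added on top, the stack total is unchanged by construction. The "no negatives" check is the part that depends on the data.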
Viktor Barzin	4315ed5c2a	[backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count) cmd_prune_count's `log " Pruned: ..."` wrote to stdout, which the caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward (7d retention kicked in), pruned snapshots polluted the captured value with multi-line log text, breaking the Prometheus exposition format on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from Pushgateway). Snapshots themselves were always fine; only the metric push silently failed for ~9 nights, eventually triggering LVMSnapshotNeverRun (alert has 48h `for:`). Fix: redirect the inner log call to stderr so cmd_prune_count's stdout contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh` as the source-of-truth (was edited only on the PVE host) and updates backup-dr.md to point at the .sh and document the scp deploy. Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:30:58 +00:00
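The failure mode in the commit above is specific to shell command substitution, but the same effect can be reproduced in a few lines of Python as an analogy. Everything here is illustrative: the function names are hypothetical, the snapshot name is made up, and the validity check is a deliberately simplified stand-in for the real exposition-format parser.

```python
import io
import sys
from contextlib import redirect_stdout


def prune_count_noisy():
    print(" Pruned: snap-a")  # log line shares stdout with the result
    print(3)


def prune_count_quiet():
    print(" Pruned: snap-a", file=sys.stderr)  # log line moved to stderr
    print(3)


def capture_stdout(func) -> str:
    """Rough equivalent of pruned=$(cmd): collect whatever the callee writes to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue().strip()


def is_valid_sample(line: str) -> bool:
    """Simplified check: exactly `<metric_name> <number>` on one line."""
    parts = line.split(" ")
    if len(parts) != 2:
        return False
    try:
        float(parts[1])
    except ValueError:
        return False
    return True


for func in (prune_count_noisy, prune_count_quiet):
    line = f"lvm_snapshot_pruned_total {capture_stdout(func)}"
    print(func.__name__, is_valid_sample(line))
```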
Viktor Barzin	d231615ebb	[monitoring] Fix fuse voltage alerts — divide raw deciVolt reading by 10 The tuya-bridge exporter reports `fuse_main_voltage` and `fuse_garage_voltage` as raw uint16 from the Tuya protocol, which encodes voltage in deciVolts (e.g. 2352 = 235.2V). The 200/260V thresholds were comparing against the raw integer, so both FuseMainVoltageAbnormal and FuseGarageVoltageAbnormal fired continuously during normal mains conditions. Dividing in the expression also makes `{{ $value }}V` render the correct human-readable value in the alert summary. Root fix would be in tuya-bridge `_decode_value()` where `name.startswith("voltage")` returns `int.from_bytes(...)` without the /10 scaling that `decode_voltage_threshold` applies. Leaving that alone to avoid breaking the automatic_transfer_switch scrape which uses a different code path (`parse_voltage_string`).	2026-04-24 11:12:56 +00:00
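A short sketch of the scaling issue described in the commit above. The 2352 reading and the 200/260 V thresholds come from the commit message. The function names are hypothetical, and it is an assumption that the alert fires when the voltage is below 200 V or above 260 V.

```python
def decivolts_to_volts(raw: int) -> float:
    """Convert a raw deciVolt reading to volts."""
    return raw / 10


def voltage_abnormal(raw: int, low: float = 200.0, high: float = 260.0) -> bool:
    """Compare the scaled reading against the thresholds (assumed: outside [low, high])."""
    volts = decivolts_to_volts(raw)
    return volts < low or volts > high


raw_reading = 2352
print(raw_reading > 260)              # comparing the raw integer: always "too high"
print(voltage_abnormal(raw_reading))  # comparing the scaled value
```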
Viktor Barzin	a5e4db9af8	[monitoring] Tuya Cloud root-cause alert + cascade suppression New alert TuyaCloudDown fires when any _tuya_cloud_up gauge == 0 (i.e., the Tuya Cloud API rejects scrape calls — the symptom during last night's iot.tuya.com trial expiry, code=28841002). 5m for-duration beats the 15m window of the seven downstream MetricsMissing alerts, so the new Alertmanager inhibit rule suppresses the per-device noise and only TuyaCloudDown pages. Also flips helm_release.prometheus.force_update from true to false: force_update was tripping on the pushgateway PVC added in rev 188 (commit e51c104) — Helm's --force path tried to reset spec.volumeName on a bound PVC. Disabled here; re-enable temporarily when a StatefulSet volumeClaimTemplate change actually needs --force. Bundled with pre-existing working-tree additions for Fuse/Thermostat threshold alerts and expanded PowerOutage inhibit regex (landed in the same Helm revision 190). Verified: rule loaded, value=7 (all 7 tuya-bridge devices report cloud_up=0 right now), TuyaCloudDown moved pending→firing after 5m, 3 *MetricsMissing alerts currently suppressed in Alertmanager with inhibitedBy=1 (thermostat alerts still pending their 15m window, will be suppressed on transition).	2026-04-23 09:59:48 +00:00
Viktor Barzin	5ebd3a81c3	tuya-bridge: liveness probe hits /health so k8s restarts silently-hung bridge The bridge was down 10h 40m on 2026-04-22 without being restarted — the liveness probe hit `/` (trivial Flask handler) which passed while the actual Tuya-cloud call path was stuck. /health now reports Tuya cloud reachability via a background probe in the app; point both probes at it. Liveness: 60s grace + 6x30s = 3min of 503s before restart; readiness: 2x15s = 30s before removal from service. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 07:47:41 +00:00
Viktor Barzin	8e55c4357a	[poison-fountain] opt ingress out of Uptime Kuma external monitor Deployment is scaled to replicas=0 to silence ExternalAccessDivergence, but the ingress at poison.viktorbarzin.me was still auto-annotated `external-monitor=true` by ingress_factory (dns_type=non-proxied path), so external-monitor-sync kept creating `[External] poison` which probed a backend with no endpoints and flagged DOWN. Setting `external_monitor = false` emits the explicit opt-out annotation; next sync run deleted the orphaned monitor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 21:24:22 +00:00
Viktor Barzin	344fce3692	[monitoring][poison-fountain] pushgateway persistence + cronjob uid-0 Two independent root-cause fixes surfaced by the 2026-04-22 cluster health check: 1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-backup-sync"} until the next 06:01 UTC push — a ~18h false-negative window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with --persistence.interval=1m. Chart note: values key is `prometheus-pushgateway:` (subchart alias), not `pushgateway:`. 2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100 but the NFS mount /srv/nfs/poison-fountain is root:root 755 and the main Deployment runs as root, so mkdir /data/cache fails every 6h. Set run_as_user=0 on the CronJob container (no_root_squash is set on the export). Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite sync; closes the recurring poison-fountain evicted-pod noise on the next 00:00 UTC cron tick. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:32:29 +00:00
Viktor Barzin	f1f723be83	[technitium] zone-sync now reconciles primaryNameServerAddresses When a zone is created against a stale primary IP (e.g. the old primary pod IP 10.10.36.189 before the technitium-primary ClusterIP service existed), AXFR refresh keeps failing forever while every other zone on the same replica refreshes fine from 10.110.37.186. The resync-only branch didn't touch zone options, so the bad IP was pinned indefinitely. This surfaced as rpi-sofia.viktorbarzin.lan returning 192.168.1.16 (pre-move) on secondaries while primary had the correct .10 from 2026-04-22 morning — Uptime Kuma Sofia RPI monitor DOWN, cluster cluster_healthcheck FAIL. The sync loop now re-applies primaryNameServerAddresses on every run for existing zones. Idempotent — Technitium accepts identical values — and self-heals any drift within 30 min. Env renamed PRIMARY_IP → PRIMARY_HOST for consistency with the reconcile semantics. Hostname form (technitium-primary.technitium.svc.cluster.local) was tried but Technitium's own resolver doesn't forward svc.cluster.local, so the field must stay a literal IP. Terraform tracks the ClusterIP on every apply and the reconcile loop propagates it to replicas.	2026-04-22 17:47:18 +00:00
Viktor Barzin	7dfe89a6e0	[redis] stabilise against node-crash flap cascade — RC1-RC5 fixes Five compounding factors produced the 2026-04-22 flap cascade: soft anti-affinity let 2/3 pods co-locate on k8s-node3 (which bounced NotReady→Ready at 11:42Z and took quorum), aggressive sentinel/probe timing amplified LUKS-encrypted LVM I/O stalls into spurious +switch-master loops, HAProxy's 1s polling raced sentinel failovers and routed writes to demoted masters, publish_not_ready_addresses=true fed not-yet-ready pods into HAProxy DNS, and realestate-crawler-celery CrashLoopBackOff closed the feedback loop. Changes: - Anti-affinity: preferred → required (one redis pod per node, hard) - Sentinel down-after-ms 5000→15000, failover-timeout 30000→60000 - Redis + sentinel liveness: timeout 3→10, failure_threshold 3→5 - HAProxy: check inter 1s→2s / fall 2→3, timeout check 3s→5s - Headless svc: publish_not_ready_addresses true→false Post-rollout verification clean: 0 flaps, 0 +switch-master events, 0 celery ReadOnlyError in the 60s window after settle. Docs updated.	2026-04-22 15:59:00 +00:00
Viktor Barzin	fdced7577b	[monitoring] HomeAssistantCriticalSensorUnavailable alert	2026-04-22 14:52:23 +00:00
Viktor Barzin	dc05c440bc	[hermes-agent] disable deployment — PVC permission mismatch Main container crashes with "mkdir: cannot create directory '/opt/data': Permission denied". Init container writes fine but main container runs with different fsGroup/runAsUser. Scaling to 0 until the PVC permission model is reworked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-22 14:31:50 +00:00
Viktor Barzin	a4eafafe49	[monitoring] Add GPUNodeUnschedulable alert — fires when GPU node is cordoned After k8s-node1 was silently cordoned and broke Frigate camera streams, existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the root cause proactively. This alert fires within 5m of the GPU node being cordoned, before any pod restart attempts to schedule and fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 14:05:12 +00:00
Viktor Barzin	e2146e6916	gpu: schedule off NFD label, not k8s-node1 hostname Remove every hardcoded reference to k8s-node1 that pinned GPU scheduling to a specific host: - GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez, audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is auto-applied by gpu-feature-discovery on any node carrying an NVIDIA PCI device, so the selector follows the card. - null_resource.gpu_node_config: rewrite to enumerate NFD-labeled nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual 'kubectl label gpu=true' since NFD handles labeling. - MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] -> nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off the GPU node) but portable when the card relocates. Net effect: moving the GPU card between nodes no longer requires any Terraform edit. Verified no-op for current scheduling — both old and new labels resolve to node1 today. Docs updated to match: AGENTS.md, compute.md, overview.md, proxmox-inventory.md, k8s-portal agent-guidance string.	2026-04-22 13:43:07 +00:00
Viktor Barzin	134d6b9a82	vault runbook + raft/HA stuck-leader alerts Post-2026-04-22 Step 5 deliverables: - docs/runbooks/vault-raft-leader-deadlock.md — safe pod-restart sequence that avoids zombie containerd-shim + kernel NFS corruption, qm reset no-op gotcha, boot-order gotcha. - prometheus_chart_values.tpl — VaultRaftLeaderStuck + VaultHAStatusUnavailable. Silent until vault telemetry scraping lands (tracked as beads code-vkpn). Epic for moving vault off NFS tracked as beads code-gy7h.	2026-04-22 12:44:46 +00:00
Viktor Barzin	4cb2c157da	post-mortem 2026-04-22: full timeline — second regression + node4 reboot The initial recovery at 11:03 was premature; vault-1's audit writes over NFS started hanging ~15 min later and the cluster regressed to 503. Full recovery required rebooting node4 (to free vault-0's stuck NFS mount and shed PVE NFS thread contention) and a second reboot of node3 (to clear another round of kernel NFS client degradation). Final recovery at 11:43:28 UTC with vault-2 as active leader on the quorum vault-0 + vault-2. vault-1 remains stuck in ContainerCreating on node2 — a third node2 reboot is required for full 3/3 quorum, but 2/3 is operationally sufficient, so that's deferred.	2026-04-22 11:44:56 +00:00
Viktor Barzin	2f1f9107f8	vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that never exited because the default fsGroupChangePolicy (Always) walks every file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and a 1GB audit log, the recursive chown outlasted the deadline and restarted forever — blocking raft quorum recovery. OnRootMismatch makes chown a no-op when the volume root is already correct, which it always is after initial setup. The breakglass fix was applied live via kubectl patch at 10:54 UTC; this commit persists it in Terraform so the next apply doesn't revert. The post-mortem also documents the upstream raft stuck-leader pattern, NFS kernel client corruption after force-kill, and the path to migrate Vault off NFS to proxmox-lvm-encrypted.	2026-04-22 11:12:19 +00:00
Viktor Barzin	6a4a477336	[infra] Update RPi Sofia DNS: 192.168.1.16 → 192.168.1.10 RPi now uses USB Ethernet (eth1) as primary uplink at .10 instead of the old wlan1 address at .16. Camera namespace and DNAT updated to use eth1 with systemd persistence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-22 10:55:34 +00:00
Viktor Barzin	d39770b30d	monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection Threshold was 48h + 30m for: a job that runs daily. We don't need to wait 2.5 days to detect a broken timer — bring it down to 30h + 30m (just over a day of cadence + minor drift/retry grace). Also add a description pointing to the restore runbook so the alert text surfaces the fix path directly. Threshold change: 172800s → 108000s. Docs in backup-dr.md synced. Re-triggers default.yml apply now that ci/Dockerfile is rebuilt with vault CLI — this is the first commit touching a stack that will actually succeed since the `e80b2f02` regression. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-22 08:54:37 +00:00
Viktor Barzin	3eb8b9a4ea	ci: add vault CLI to infra-ci image + surface real errors in scripts/tg The Woodpecker CI pipeline has been silently failing to apply Tier 1 stacks since the state-migration commit `e80b2f02` because the Alpine CI image never had the vault CLI. `scripts/tg` swallowed stderr with `2>/dev/null` and surfaced a misleading "Cannot read PG credentials from Vault" message — the real error was `sh: vault: not found`. Verified with an in-cluster probe: woodpecker/default SA + role=ci already gets the terraform-state policy and has read capability on database/static-creds/pg-terraform-state. Auth was never the problem; the vault binary just wasn't there. - ci/Dockerfile: pin vault v1.18.1 (matches server) and install - scripts/tg: pre-flight check + surface real vault output on failure - Next build-ci-image.yml run rebuilds :latest with vault included; subsequent default.yml runs unblock monitoring apply (code-aoxk) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-22 08:46:50 +00:00
Viktor Barzin	4a343c33f0	monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m Doc claimed >40m; actual fire time is 80m (60m last-success threshold + 20m 'for'). Stale since pre-existing config; now re-stale after raising 'for' from 10m to 20m in `9b4970da`. Files out of sync only on this one alert row.	2026-04-21 22:39:46 +00:00
Viktor Barzin	9b4970da61	monitoring: alert hygiene — disambiguate, rename, tune, fix inhibits - HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775) labels so the two same-named alerts are distinguishable in routing. - HeadscaleDown (deployment-replicas flavor, line 1414) → rename to HeadscaleReplicasMismatch. Line 2039 keeps HeadscaleDown as the real up-metric critical check. NodeDown inhibit rule updated to suppress the renamed alert too. - EmailRoundtripStale (line 1816): for 10m → 20m. Survives one missed 20-min probe cycle before firing, cuts flapping (12 short-burst fires over last 24h). ATSOverload tuning skipped: 24h fire-count is 0, it's continuously firing not flapping — already-known sustained 83% ATS load, tuning would not change behavior. 8 backup *NeverSucceeded rules audited: all 7 using kube_cronjob_status_last_successful_time target real K8s CronJobs with active metrics (not Pushgateway-sourced). PrometheusBackupNeverRun already uses absent() correctly. No fixes needed.	2026-04-21 22:29:15 +00:00
Viktor Barzin	ac695dea38	[registry] bulk-clean 34 orphan manifests + beads-server image bump Registry integrity probe surfaced 38 broken manifest references (34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19 infra-ci incident). Deleted all via registry HTTP API + ran GC; reclaimed ~3GB blob storage. beads-server CronJobs were stuck ImagePullBackOff on claude-agent-service:0c24c9b6 for >6h — bumped variable default to 2fd7670d (canonical tag in claude-agent-service stack, already healthy in registry) so new ticks can fire. Rebuilt in-use broken tags: freedify:{latest,c803de02} and beadboard:{17a38e43,latest} on registry VM; priority-pass via Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly CronJob, next run 2026-05-01). Probe now reports 0/39 failures. RegistryManifestIntegrityFailure alert cleared. Closes: code-8hk Closes: code-jh3c Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:16:34 +00:00
Viktor Barzin	9041f52b05	monitoring: TechnitiumZoneCountMismatch — compare replicas only, exclude primary Primary has only the Primary-type zones it owns (10). Replicas have those + built-in zones (localhost, in-addr.arpa reverse, etc.), so their count (14) can never match primary. Alert expr compared max-min across all instances, making it chronically firing. Fix: instance!="primary" filter. The real signal this alert wants is "did one replica drift from the others" — replica-to-replica comparison captures that; primary was never comparable.	2026-04-19 22:15:55 +00:00
Viktor Barzin	4bedabb9e8	healthcheck: fix three false-positive WARNs (HA token, cert-manager, LVM snap grep) - HA Sofia token: auto-bootstrap from Vault secret/viktor/haos_api_token when HOME_ASSISTANT_SOFIA_{URL,TOKEN} env vars are unset. Default URL = https://ha-sofia.viktorbarzin.me. - cert-manager: add cert_manager_installed() probe (kubectl get crd certificates.cert-manager.io). When not installed — which is our current state — report PASS "N/A" instead of noisy WARN "CRDs unavailable". - LVM snapshot freshness: grep pattern was `-- -snap` but actual LV names use underscore (`foo_snap_YYYY...`), so the grep matched nothing and the check always WARN'd. Fixed to `grep _snap`. After fix: PASS 36→40, WARN 9→6, FAIL 1→1 (new ha_entities FAIL is a real HA issue, not a script bug — 400/1401 sensors stale on ha-sofia).	2026-04-19 22:13:32 +00:00

3029 commits