Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:
1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
--persistence.interval=1m. Chart note: values key is
`prometheus-pushgateway:` (subchart alias), not `pushgateway:`.
2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
the main Deployment runs as root, so mkdir /data/cache fails
every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
is set on the export).
Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Threshold was 48h + 30m for: a job that runs daily. We don't need
to wait 2.5 days to detect a broken timer — bring it down to 30h
+ 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.
Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.
Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.
In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).
Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Healthcheck: add entity availability, integration health, automation
status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.