[monitoring][poison-fountain] pushgateway persistence + cronjob uid-0

Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding
   backup_last_success_timestamp{job="offsite-backup-sync"} until the
   next 06:01 UTC push (a ~18h false-negative window). Enable
   persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100 but
   the NFS mount /srv/nfs/poison-fountain is root:root 755 and the
   main Deployment runs as root, so mkdir /data/cache fails every 6h.
   Set run_as_user=0 on the CronJob container (no_root_squash is set
   on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00 · 2026-04-22 18:32:29 +00:00 · 344fce3692
commit 344fce3692
parent f1f723be83
4 changed files with 42 additions and 0 deletions
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -692,6 +692,16 @@ module "nfs_backup" {
 - ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
 - Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly

+**Pushgateway persistence**: The Pushgateway is configured with
+`--persistence.file=/data/pushgateway.bin --persistence.interval=1m`
+on a 2Gi `proxmox-lvm-encrypted` PVC (helm values:
+`prometheus-pushgateway.persistentVolume`). Without this, every pod
+restart drops in-memory metrics. Once-per-day pushers (offsite-sync,
+weekly backup) are otherwise invisible for up to 24h if the
+Pushgateway restarts between pushes — which is exactly what triggered
+the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
+11:42 UTC terminated the Pushgateway 8h after the 03:12 UTC push).
+
 **Alert routing**:
 - All backup alerts → Slack `#infra-alerts`
 - Vaultwarden integrity fail → Slack `#infra-critical` (immediate action required)
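
The intervals quoted in the commit message and in the added paragraph ("8h after the 03:12 UTC push", "~18h false-negative window") can be checked with a few lines of date arithmetic. This is a minimal sketch, not part of the commit. The clock times are taken as quoted, and the date of the next 06:01 UTC push is assumed to be 2026-04-23, since the restart happened at 11:42 UTC on 2026-04-22. The variable and function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Timestamps as quoted in the commit message and the doc hunk (all UTC).
last_push = datetime(2026, 4, 22, 3, 12, tzinfo=UTC)  # last push before the restart
restart = datetime(2026, 4, 22, 11, 42, tzinfo=UTC)   # node3 kubelet hiccup
# Assumption: "the next 06:01 UTC push" falls on the following calendar day.
next_push = datetime(2026, 4, 23, 6, 1, tzinfo=UTC)


def hours_minutes(delta: timedelta) -> str:
    """Format a timedelta as e.g. '8h30m'."""
    total_minutes = int(delta.total_seconds() // 60)
    return f"{total_minutes // 60}h{total_minutes % 60:02d}m"


# How old the pushed metric was when the Pushgateway lost its memory.
metric_age_at_restart = restart - last_push
# How long the metric stayed missing without persistence.
blind_window = next_push - restart

print("metric age at restart:", hours_minutes(metric_age_at_restart))
print("false-negative window:", hours_minutes(blind_window))
```

With persistence flushing every minute, the expectation is that a restart at the same moment would lose at most the last minute of pushed state rather than the whole window.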