infra

Viktor Barzin 344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0		2026-04-22 18:32:29 +00:00

Two independent root-cause fixes surfaced by the 2026-04-22 cluster health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-backup-sync"} until the next 06:01 UTC push (a ~18h false-negative window). Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with --persistence.interval=1m. Chart note: values key is `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100 but the NFS mount /srv/nfs/poison-fountain is root:root 755 and the main Deployment runs as root, so mkdir /data/cache fails every 6h. Set run_as_user=0 on the CronJob container (no_root_squash is set on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite sync; closes the recurring poison-fountain evicted-pod noise on the next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
..
agent-task-tracking.md	Add agent task tracking documentation	2026-04-15 17:11:26 +00:00
authentication.md	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2	2026-04-15 06:41:56 +00:00
automated-upgrades.md	[docs] automated-upgrades: document long-lived OAuth + expiry monitoring	2026-04-18 13:00:07 +00:00
backup-dr.md	[monitoring][poison-fountain] pushgateway persistence + cronjob uid-0	2026-04-22 18:32:29 +00:00
ci-cd.md	[docs] Architecture docs: registry integrity probe, pin, new CI pipelines	2026-04-19 17:51:26 +00:00
compute.md	gpu: schedule off NFD label, not k8s-node1 hostname	2026-04-22 13:43:07 +00:00
databases.md	[redis] stabilise against node-crash flap cascade — RC1-RC5 fixes	2026-04-22 15:59:00 +00:00
dns.md	[dns] Kea: multi-IP DHCP option 6 (10.0.10, 10.0.20) + TSIG-signed DDNS (WS E)	2026-04-19 16:12:23 +00:00
homepage.md	add homepage auto-discovery documentation [ci skip]	2026-03-25 13:06:43 +02:00
incident-response.md	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP	2026-04-18 10:12:02 +00:00
mailserver.md	monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m	2026-04-21 22:39:46 +00:00
monitoring.md	monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m	2026-04-21 22:39:46 +00:00
multi-tenancy.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00
networking.md	[docs] TrueNAS decommission cleanup — remove references from active docs	2026-04-19 16:55:43 +00:00
overview.md	gpu: schedule off NFD label, not k8s-node1 hostname	2026-04-22 13:43:07 +00:00
secrets.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
security.md	[docs] Update anti-AI and rybbit docs after rewrite-body removal	2026-04-17 21:43:13 +00:00
storage.md	[docs] TrueNAS decommission cleanup — remove references from active docs	2026-04-19 16:55:43 +00:00
vpn.md	[docs] TrueNAS decommission cleanup — remove references from active docs	2026-04-19 16:55:43 +00:00