infra/stacks
Viktor Barzin 4798583db7 backup pipeline: S1 fixes from 2026-05-24 audit
Three immediate fixes surfaced by the backup-pipeline audit:

1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
   `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
   already lives in offsite-sync-backup at line 159, gated on a successful
   sync. With both scripts truncating, an offsite-sync failure followed by
   the next morning's daily-backup would silently wipe yesterday's
   unconsumed manifest entries — those files would only reach Synology
   via the monthly full sync (1st-7th of month). Now only offsite-sync
   truncates, and only on success.

2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
   but was never added to prometheus_chart_values.tpl. Step 1 or Step 2
   failure pushes offsite_sync_last_status=1 but nothing read it. Added.

3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
   snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
   transfers — compression wastes CPU and yields nothing (gigabit local
   path, intermediate disk doesn't benefit).

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
  (the timer is daily, not weekly — legacy from earlier monthly-rotation
  schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
  Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
  config.xml stays daily (it's the primary restore artifact and tiny).
  Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
  ~3G/month to sda + Synology for unchanged content. Weekly is plenty.

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
  both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
  daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
2026-05-24 16:18:44 +00:00
..
_template ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-10 18:53:49 +00:00
actualbudget recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
affine recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
authentik authentik: worker replicas 3 -> 2 2026-05-21 09:14:35 +00:00
beads-server beads-server: codify Keel annotations on Dolt deployment (drift cleanup) 2026-05-17 22:22:40 +00:00
blog nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
broker-sync broker-sync(imap): fix command name + add fsGroup for sync.db writes 2026-05-22 14:41:54 +00:00
calico security(wave1): W1.6 expand observation from recruiter-responder pilot → tier 3+4 (82 namespaces) 2026-05-19 22:14:16 +00:00
changedetection enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
chrome-service recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
city-guesser enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
claude-agent-service claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi) 2026-05-23 10:03:42 +00:00
claude-memory recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
cloudflared mailserver: decommission SendGrid 2026-05-22 20:08:38 +00:00
cnpg cnpg: bump webhook-cert renewal threshold 7d -> 30d 2026-05-22 15:00:41 +00:00
coturn enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
crowdsec crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored 2026-05-24 11:11:29 +00:00
cyberchef final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
dashy enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dawarich enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
dbaas dbaas: opt MySQL out of Keel + add do-not-bump warning 2026-05-19 13:21:03 +00:00
descheduler keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
diun enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebook2audiobook enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
ebooks enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
echo enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
excalidraw enrolled-patch stacks: ignore image drift from Keel auto-update 2026-05-16 13:24:16 +00:00
external-secrets recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
f1-stream final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
fire-planner fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard 2026-05-22 14:15:38 +00:00
foolery recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
forgejo nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
freedify recruiter-responder: bump image_tag to 189ef901 2026-05-16 12:41:05 +00:00
freshrss infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
frigate ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
grampsweb ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hackmd ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
headscale keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
health ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
hermes-agent ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
homepage final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
immich Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
infra [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 18:30:02 +00:00
infra-maintenance [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
insta2spotify ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
instagram-poster Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
isponsorblocktv ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
job-hunter ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
jsoncrack final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-dashboard final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
k8s-portal Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
k8s-version-upgrade k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check 2026-05-24 01:10:44 +00:00
keel upgrade-state: skill + script + Keel scrape for periodic three-pipeline audit 2026-05-18 10:50:43 +00:00
kms final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
kured ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
kyverno kyverno: allowlist woodpeckerci/* for CI step pods 2026-05-23 08:52:48 +00:00
linkwarden infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
llama-cpp llama-cpp: ignore_changes for keel/k8s-managed annotations 2026-05-24 09:01:17 +00:00
local-path final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
mailserver mailserver: decommission SendGrid 2026-05-22 20:08:38 +00:00
matrix ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
meshcentral ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
metallb keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
metrics-server keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
monitoring backup pipeline: S1 fixes from 2026-05-24 audit 2026-05-24 16:18:44 +00:00
n8n nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
navidrome infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
netbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
networking-toolbox ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nextcloud nextcloud(external_storage): add per-mount enableSharing option 2026-05-24 11:39:16 +00:00
nfs-csi keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
nodelocal-dns [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) 2026-04-19 15:46:41 +00:00
novelapp Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
ntfy ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
nvidia nvidia/driver-daemonset: bump memory request 256Mi → 822Mi 2026-05-24 01:11:06 +00:00
onlyoffice ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
openclaw nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
osm_routing final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
owntracks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
paperless-mcp paperless-mcp: deploy MCP for AI document search 2026-05-17 11:14:35 +00:00
paperless-ngx ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
payslip-ingest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
phpipam ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book Woodpecker CI deploy [CI SKIP] 2026-05-16 23:17:44 +00:00
poison-fountain ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
postiz postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy) 2026-05-24 01:11:25 +00:00
priority-pass ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
privatebin ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
proxmox-csi proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation) 2026-05-24 01:10:44 +00:00
pvc-autoresizer [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler realestate-crawler: dockerhub pull-secret + lift image-pin on ui/api 2026-05-18 19:11:43 +00:00
recruiter-responder nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
redis keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
reloader keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
resume ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
reverse-proxy keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
rybbit ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
sealed-secrets keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
send ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
servarr keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
shadowsocks ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
speedtest ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698) 2026-05-16 13:47:13 +00:00
tandoor infra: add kubectl + authentik providers across 6 stacks 2026-05-21 08:07:22 +00:00
technitium technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi 2026-05-23 10:03:51 +00:00
terminal terminal: probe + alerts after Traefik replica routing-table skew 2026-05-17 10:04:26 +00:00
tor-proxy ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
trading-bot trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1) 2026-05-24 01:22:53 +00:00
traefik traefik: bump auth-proxy nginx header buffers to handle Authentik cookie pile 2026-05-23 08:34:33 +00:00
travel_blog final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
tuya-bridge ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
uptime-kuma Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
url nfs-mirror: append transferred files to offsite-sync manifest 2026-05-24 15:32:22 +00:00
vault trading-bot: revive K8s stack + add meet-kevin-watcher 2026-05-22 11:23:30 +00:00
vaultwarden Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) 2026-05-16 23:10:38 +00:00
vpa keel: enroll 11 more namespaces (operators + critical infra) 2026-05-17 20:59:14 +00:00
wealthfolio Woodpecker CI deploy [CI SKIP] 2026-05-16 13:45:45 +00:00
webhook_handler final wave: enroll immich + status-page, retrigger 17 pending Bucket A 2026-05-16 23:19:20 +00:00
whisper ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
wireguard keel: enroll 15 critical-path namespaces for digest-only auto-update 2026-05-17 12:13:22 +00:00
woodpecker ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00
xray xray: drop dead vless ingress + pin Service target_port 2026-05-24 01:13:54 +00:00
ytdlp ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks 2026-05-16 14:06:39 +00:00