cmd_prune_count's `log " Pruned: ..."` call wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(once 7d retention kicked in), each pruned snapshot added a log line to
the captured value, turning it into multi-line text and breaking the
Prometheus exposition format on the metric push
(`lvm_snapshot_pruned_total ${pruned}` → 400 from Pushgateway).
Snapshots themselves were always fine; only the metric push silently
failed for ~9 nights, eventually triggering LVMSnapshotNeverRun (the
alert has a 48h `for:`).
Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopt `infra/scripts/lvm-pvc-snapshot.sh`
as the source of truth (it was previously edited only on the PVE host)
and update backup-dr.md to point at the .sh and document the scp deploy.
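For reference, a minimal sketch of the pattern (the prune loop, the
snapshot names, and the body of log are stand-ins; only cmd_prune_count,
log, pruned, and the metric name come from the script):

    #!/usr/bin/env bash
    set -euo pipefail

    log() { echo "$*"; }   # unchanged: log still writes to stdout elsewhere

    cmd_prune_count() {
      local count=0
      # stand-in for the real loop over expired snapshots (lvremove elided)
      for snap in snap-2026-04-16 snap-2026-04-17; do
        log " Pruned: ${snap}" >&2    # the fix: inner log call now goes to stderr
        count=$((count + 1))
      done
      echo "${count}"                 # stdout carries only the count
    }

    pruned=$(cmd_prune_count)
    echo "lvm_snapshot_pruned_total ${pruned}"   # one clean exposition line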
Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:
1. Pushgateway lost all in-memory metrics when the node3 kubelet
hiccuped at 11:42 UTC, hiding
backup_last_success_timestamp{job="offsite-backup-sync"} until the next
06:01 UTC push, a ~18h false-negative window. Enable persistence on a
2Gi proxmox-lvm-encrypted PVC with --persistence.interval=1m (values
sketch after the list). Chart note: the values key is
`prometheus-pushgateway:` (subchart alias), not `pushgateway:`.
2. The poison-fountain-fetcher CronJob runs curlimages/curl as UID 100,
but the NFS mount /srv/nfs/poison-fountain is root:root 755 (the main
Deployment only gets away with that because it runs as root), so the
CronJob's mkdir /data/cache fails every 6h. Set run_as_user=0 on the
CronJob container (no_root_squash is set on the export).
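As a rough sketch of what item 1 amounts to when driven from the CLI
(release name and chart path are placeholders; persistentVolume.* and
extraArgs are the usual prometheus-pushgateway chart keys, worth
verifying against the chart version in use):

    # Equivalent --set form of the values change; note the subchart-alias
    # prefix. If the embedded flag trips up --set parsing, put it in the
    # values file instead.
    helm upgrade monitoring ./charts/monitoring \
      --reuse-values \
      --set prometheus-pushgateway.persistentVolume.enabled=true \
      --set prometheus-pushgateway.persistentVolume.size=2Gi \
      --set prometheus-pushgateway.persistentVolume.storageClass=proxmox-lvm-encrypted \
      --set 'prometheus-pushgateway.extraArgs[0]=--persistence.interval=1m'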
Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Threshold was 48h (plus a 30m `for:`) on a job that runs daily. We
don't need to wait 2.5 days to detect a broken timer; bring it down to
30h (keeping the 30m `for:`), just over a day of cadence plus minor
drift/retry grace. Also add a description pointing to the restore
runbook so the alert text surfaces the fix path directly.
Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.
Re-triggers the default.yml apply now that ci/Dockerfile is rebuilt with
the vault CLI; this is the first commit touching a stack whose apply
will actually succeed since the e80b2f02 regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.
In-session cleanup already landed:
- reverse-proxy ingress + Cloudflare record removed
- Technitium DNS records deleted
- Vault truenas_{api_key,ssh_private_key} purged
- homepage_credentials.reverse_proxy.truenas_token removed
- truenas_homepage_token variable + module deleted
- Loki + Dashy cleaned
- config.tfvars deprecated DNS lines removed
- historical-name comment added to the nfs-truenas StorageClass
  (48 bound PVs, immutable name; kept)
Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Healthcheck: add entity availability, integration health, automation
status, and system resource checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and the incremental script have already been updated live.