18 commits

Author SHA1 Message Date
Viktor Barzin
4798583db7 backup pipeline: S1 fixes from 2026-05-24 audit
Three immediate fixes surfaced by the backup-pipeline audit:

1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
   `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
   already lives in offsite-sync-backup at line 159, gated on a successful
   sync. With both scripts truncating, an offsite-sync failure followed by
   the next morning's daily-backup would silently wipe yesterday's
   unconsumed manifest entries — those files would only reach Synology
   via the monthly full sync (1st-7th of month). Now only offsite-sync
   truncates, and only on success.

2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
   but never added to prometheus_chart_values.tpl. A Step 1 or Step 2
   failure pushes offsite_sync_last_status=1, but no alert rule read it. Added.

3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
   snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
   transfers — compression wastes CPU and yields nothing (gigabit local
   path, intermediate disk doesn't benefit).
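
The ownership rule behind fix 1 can be sketched as below. This is a minimal illustration, not the real scripts: `MANIFEST` is named in the message above, but its path here and the functions `record_changed_file`, `sync_to_offsite`, and `run_offsite_sync` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the producer appends, the consumer truncates, and only after
# a successful sync. Paths and function names are assumptions.
set -u

MANIFEST="${MANIFEST:-$(mktemp)}"

# Producer side (daily-backup): append only, never truncate.
record_changed_file() {
  printf '%s\n' "$1" >> "${MANIFEST}"
}

# Stand-in for the real transfer; set SYNC_SHOULD_FAIL=1 to simulate a failure.
sync_to_offsite() {
  [ "${SYNC_SHOULD_FAIL:-0}" -eq 0 ]
}

# Consumer side (offsite-sync): truncate only when the sync succeeded.
run_offsite_sync() {
  if sync_to_offsite; then
    : > "${MANIFEST}"
    return 0
  fi
  echo "offsite sync failed; manifest entries kept for the next run" >&2
  return 1
}
```

With a single truncation point, a failed sync leaves the entries in place and the next successful sync still sees them.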

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
  (the timer is daily, not weekly — legacy from earlier monthly-rotation
  schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
  Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
  config.xml stays daily (it's the primary restore artifact and tiny).
  Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
  ~3G/month to sda + Synology for unchanged content. Weekly is plenty.
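
A day-of-week gate like the Sunday-only tar could look roughly like this. It is a sketch under assumptions: `is_full_tar_day` and `pfsense_backup_plan` are hypothetical names, and the real script may gate the tar differently.

```shell
#!/usr/bin/env bash
# Sketch only: keep config.xml daily, run the full filesystem tar on Sundays.
# Function names are hypothetical.

# date +%u prints the ISO day of week: 1 = Monday ... 7 = Sunday.
is_full_tar_day() {
  dow="${1:-$(date +%u)}"
  [ "${dow}" -eq 7 ]
}

# Prints the artifacts to produce for the given day (defaults to today).
pfsense_backup_plan() {
  echo "config.xml"                 # always: small, primary restore artifact
  if is_full_tar_day "${1:-}"; then
    echo "full-filesystem-tar"      # Sundays only: forensic recovery
  fi
}
```

Passing the day as an argument keeps the gate easy to exercise without waiting for a Sunday.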

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
  both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
  daily-backup Mon 05:00.
- Cap the currently unbounded manifest (force a full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
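
Two of the follow-ups above (the manifest flock and the manifest cap) might be sketched as follows. Nothing here exists yet: `append_locked`, `needs_full_sync`, `MANIFEST_MAX_LINES`, and the `.lock` sidecar file are hypothetical, and `flock` is assumed to be the util-linux tool.

```shell
#!/usr/bin/env bash
# Sketch only: serialise manifest appends and detect an oversized manifest.
# All names are assumptions; nothing here is taken from the real scripts.

MANIFEST_MAX_LINES="${MANIFEST_MAX_LINES:-500000}"

# Args: manifest path, entry. Holds an exclusive lock for the append so two
# writers (e.g. nfs-mirror and daily-backup) cannot interleave.
append_locked() {
  (
    flock -x 9
    printf '%s\n' "$2" >> "$1"
  ) 9>> "$1.lock"
}

# Arg: manifest path. Succeeds when the manifest is past the cap, i.e. when
# a full sync is cheaper than replaying the list.
needs_full_sync() {
  lines=$(wc -l < "$1")
  [ "$((lines))" -gt "${MANIFEST_MAX_LINES}" ]
}
```

The lock lives on a sidecar file rather than the manifest itself, so truncating the manifest does not interfere with lock holders.
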
2026-05-24 16:18:44 +00:00
Viktor Barzin
4d756be4f5 backup: consolidate to one local-mirror script + invert offsite filter
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline failed
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.

This commit collapses both into a clean 3-2-1:

  Copy 1 (sdc):       live /srv/nfs/* + cluster block PVCs
  Copy 2 (sda):       /mnt/backup/{pvc-data,sqlite-backup,pfsense,
                                   pve-config,<critical-nfs>/}
                      ← daily-backup + nfs-mirror (one script each)
  Copy 3 (Synology):  /Backup/Viki/{pve-backup,nfs,nfs-ssd}
                      ← offsite-sync-backup Step 1 (sda → Synology)
                        + Step 2 (sda-BYPASS paths only → Synology direct)

scripts/nfs-mirror.{sh,service,timer}:
  New consolidated weekly mirror. Replaces anca-elements-mirror (to be
  removed in a follow-up after the current in-flight rsync completes,
  parity is verified, and the Synology source-of-truth is deleted). Single
  rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
  drops paths not worth a local 2nd copy: immich (1.2T — too big),
  frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
  audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
  temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.

scripts/offsite-sync-backup.sh:
  Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
  anca-elements/`, it now `--include`s only the sda-BYPASS paths
  (immich, frigate, prometheus, *-backup, …). The bypass-include
  regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
  complementary and any drift creates either gaps or duplication on
  Synology. Comment in the script flags this.
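
One way to rule out drift is to derive both filters from a single list. The sketch below swaps the script's regex for plain rsync filter patterns and is not how the scripts currently work: `BYPASS_PATHS` is an illustrative subset, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: generate complementary rsync filters from one list.
# BYPASS_PATHS is an illustrative subset, not the real list.

BYPASS_PATHS="immich frigate prometheus loki *-backup"

# Local mirror leg (sdc -> sda): skip every bypass path.
mirror_filter_args() {
  set -f                            # keep '*-backup' as a literal pattern
  for p in ${BYPASS_PATHS}; do
    printf '%s\n' "--exclude=/${p}/"
  done
  set +f
}

# Direct NFS -> Synology leg: send only the bypass paths, drop the rest.
offsite_filter_args() {
  set -f
  for p in ${BYPASS_PATHS}; do
    printf '%s\n' "--include=/${p}/***"
  done
  set +f
  printf '%s\n' "--exclude=/*"
}
```

Each function prints one rsync argument per line. In bash, the lines can be read into an array (for example with `mapfile -t`) and passed to rsync as separate arguments.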

monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.

docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.

NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
2026-05-24 12:49:20 +00:00
Viktor Barzin
6db64fe060 anca-elements: weekly local mirror sdc → sda (replaces Synology as 2nd copy)
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Synology is being removed as a host for the Anca/Elements archive
(770G). /srv/nfs/anca-elements on PVE becomes the source of truth;
sda /mnt/backup/anca-elements becomes the single-disk-failure mirror.
No offsite for this archive — by design.

- scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent,
  pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to
  Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort).
- .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout.
- .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay.
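
The three metrics could be assembled and pushed roughly as below. This is a sketch, not the deployed script: `PUSHGATEWAY_URL`, the job name in the URL, the function names, and the meaning of status 0 are assumptions. Only the metric names and status=2 on abort come from the message above.

```shell
#!/usr/bin/env bash
# Sketch only: build the text-exposition payload for the mirror metrics.
# PUSHGATEWAY_URL, the job name and status 0 = success are assumptions.

PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.example:9091}"

# Args: status, bytes transferred, [epoch timestamp, defaults to now].
mirror_metrics_payload() {
  status="$1"; bytes="$2"; ts="${3:-$(date +%s)}"
  printf 'anca_elements_mirror_last_run_timestamp %s\n' "${ts}"
  printf 'anca_elements_mirror_last_status %s\n' "${status}"
  printf 'anca_elements_mirror_last_bytes %s\n' "${bytes}"
}

# Sends the payload as the request body; -f makes curl fail on HTTP errors
# so a rejected push is visible to the caller.
push_mirror_metrics() {
  mirror_metrics_payload "$@" |
    curl -fsS --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/anca-elements-mirror"
}
```

Keeping payload construction separate from the push makes the exact lines easy to inspect before anything is sent.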

Deployed to PVE host; timer enabled; initial 770G sync running in
background. Synology original to be deleted after first run completes
and parity is verified.

docs/architecture/backup-dr.md: documents Layer 3a + updated path
exclusion rationale (PVE is now upstream, not downstream).
2026-05-24 11:51:52 +00:00
Viktor Barzin
f55eaae682 docs/backup-dr: document /srv/nfs/anca-elements offsite-sync exclusion
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
2026-05-24 11:03:50 +00:00
Viktor Barzin
cfe969fe43 backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile
daily-backup exceeded its 1h budget and was SIGTERMed for 10 days straight (Apr
30 → May 9). Each failed run left its snapshot mount stacked on /tmp/pvc-mount,
which blocked the next run from completing. This was the root cause of the
WeeklyBackupStale alert going silent (the metric never reached its end-of-script push).

Fixes:
- TimeoutStartSec 1h → 4h (current workload of 118 PVCs needs ~1.5h, was hitting
  the wall during week 18 runs)
- Recursive umount + LUKS cleanup on EXIT trap, plus the same at script start as
  belt-and-braces for any inherited stuck state from a prior crashed run
- TERM/INT trap pushes status=2 metric so WeeklyBackupFailing fires instead of
  the alert going blind on systemd kills
- pfsense metric pushed in BOTH success and failure paths (was only on success;
  any ssh-to-pfsense outage made PfsenseBackupStale silent until the alert
  threshold expired)
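
The trap layout described in the fixes above can be sketched like this. It is an illustration only: `push_status`, `cleanup_mounts`, `run_backup`, and the metric name `daily_backup_last_status` are hypothetical, and the real script pushes to Pushgateway rather than writing a file.

```shell
#!/usr/bin/env bash
# Sketch only: a killed run reports status=2 and cleanup always runs.
# All names are assumptions; the status file stands in for a metric push.

STATUS_FILE="${STATUS_FILE:-$(mktemp)}"

push_status() {
  printf 'daily_backup_last_status %s\n' "$1" > "${STATUS_FILE}"
}

cleanup_mounts() {
  # The real cleanup would unmount stacked snapshots and close LUKS
  # mappings; here it only logs.
  echo "cleanup: releasing snapshot mounts" >&2
}

# Runs the given command as the backup work. The ( ) body is a subshell,
# so the traps stay local to this run.
run_backup() (
  trap 'cleanup_mounts' EXIT
  trap 'push_status 2; exit 143' TERM INT
  cleanup_mounts          # belt-and-braces: clear state from a prior crash
  "$@"
  push_status 0
)
```

Because the TERM/INT trap exits explicitly, the EXIT trap still fires afterwards, so a systemd kill both records the failure and releases the mounts.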

Postiz backup CronJob: bundled bitnami PG/Redis live on local-path (K8s node
OS disk) — outside Layer 1+2 of the 3-2-1 pipeline. Added postiz-postgres-backup
that pg_dumps postiz + temporal + temporal_visibility daily 03:00 to
/srv/nfs/postiz-backup, getting Layer 3 offsite coverage. Verified end-to-end:
3 dumps written, Pushgateway metric received. Note: bitnamilegacy/postgresql
image is stripped (no curl/wget/python) — switched to docker.io/library/postgres
matching the dbaas/postgresql-backup pattern with apt-installed curl.

Doc reconcile (backup-dr.md): metric names had drifted (e.g. the docs claimed
backup_weekly_last_success_timestamp but the script pushes
daily_backup_last_run_timestamp). Updated to match what's actually emitted, and
added a "default-covered" footnote to the Service Protection Matrix so the
~40 services with PVCs not enumerated in the table are no longer ambiguous.

Manual PVE-host actions (out-of-band, not in TF):
- unmounted 6 stacked snapshots from /tmp/pvc-mount
- pruned 5 stale snapshots on vm-9999-pvc-67c90b6b... (origin LV that the
  loop got SIGTERMed against repeatedly, so prune kept failing)
- created /srv/nfs/postiz-backup directory
- triggered a one-shot daily-backup run with the new TimeoutStartSec to
  validate the fix end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:41:04 +00:00
Viktor Barzin
4315ed5c2a [backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count)
cmd_prune_count's `log "  Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(7d retention kicked in), pruned snapshots polluted the captured value
with multi-line log text, breaking the Prometheus exposition format
on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from
Pushgateway). Snapshots themselves were always fine; only the metric
push silently failed for ~9 nights, eventually triggering
LVMSnapshotNeverRun (alert has 48h `for:`).

Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source-of-truth (was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.
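
A minimal reproduction of the fixed pattern, for reference. The function bodies are illustrative and not the real lvm-pvc-snapshot code; only the names `log`, `cmd_prune_count`, and `lvm_snapshot_pruned_total` come from the message above.

```shell
#!/usr/bin/env bash
# Sketch only: a function whose stdout is captured must log to stderr.
# Bodies are illustrative, not the real lvm-pvc-snapshot code.

log() {
  # >&2 keeps log lines out of any $(...) capture in the caller.
  printf '[%s] %s\n' "$(date +%H:%M:%S)" "$*" >&2
}

# Prints only the number of pruned snapshots on stdout.
cmd_prune_count() {
  count=0
  for snap in "$@"; do
    log "  Pruned: ${snap}"
    count=$((count + 1))
  done
  printf '%s\n' "${count}"
}

pruned="$(cmd_prune_count snap-a snap-b)"
# stdout: lvm_snapshot_pruned_total 2
printf 'lvm_snapshot_pruned_total %s\n' "${pruned}"
```

If `log` wrote to stdout instead, `pruned` would hold the log lines plus the count, and the metric line would no longer be a single `name value` sample.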

Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:30:58 +00:00
Viktor Barzin
344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
   backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
   window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
   but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
   the main Deployment runs as root, so mkdir /data/cache fails
   every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
   is set on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00
Viktor Barzin
d39770b30d monitoring: tighten LVMSnapshotStale to 30h for daily-cadence detection
Threshold was 48h + 30m `for:` on a job that runs daily. We don't need
to wait over two days to detect a broken timer; bring it down to 30h
+ 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.

Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.
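
The conversion and the comparison the alert expresses, as a quick check. This is a sketch: `hours_to_seconds` and `is_stale` are hypothetical helpers, not part of the alert rule.

```shell
#!/usr/bin/env bash
# Sketch only: threshold arithmetic and the staleness comparison.
# Function names are hypothetical.

hours_to_seconds() {
  echo $(( $1 * 3600 ))
}

# Args: last successful run (epoch s), now (epoch s), threshold (s).
is_stale() {
  [ $(( $2 - $1 )) -gt "$3" ]
}

OLD_THRESHOLD="$(hours_to_seconds 48)"   # 172800
NEW_THRESHOLD="$(hours_to_seconds 30)"   # 108000
```

A job last seen 36h ago (129600s) is stale under the new threshold but would not have been under the old one.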

Re-triggers default.yml apply now that ci/Dockerfile is rebuilt
with vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-22 08:54:37 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
72d832fee7 add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs
- Healthcheck: add entity availability, integration health, automation
  status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
2026-04-06 11:57:36 +03:00
Viktor Barzin
10f22350c5 exclude frigate, audiblez, ollama, real-estate-crawler from Synology backup [ci skip]
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.
2026-03-29 13:44:32 +03:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00