Commit graph

9 commits

Author SHA1 Message Date
Viktor Barzin
c948dc0dbe backup pipeline: flock manifest + cap + drop LAN -z
Three more audit fixes from the 2026-05-24 backup-pipeline review:

#5 (S1 race) — manifest flock
  daily-backup and nfs-mirror both append to /mnt/backup/.changed-files.
  If they overlap (nfs-mirror Mon 04:11 running long, daily-backup
  starting Mon 05:00), concurrent appends from `find | tee` and
  `find | sed >>` could interleave mid-line — partial paths would slip
  past rsync's --files-from. Both scripts now share a manifest_append()
  helper using `flock -x` on /mnt/backup/.changed-files.lock. The 4
  daily-backup call sites + the 1 nfs-mirror call site all pipe through
  it instead of redirecting directly.
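
A minimal sketch of what such a helper could look like, assuming bash and
util-linux `flock`. The body below is illustrative, not the deployed
manifest_append(); the variable names MANIFEST and MANIFEST_LOCK are
hypothetical, only the two paths come from this message.

```shell
# Illustrative only: the deployed helper is not reproduced in this message.
MANIFEST="${MANIFEST:-/mnt/backup/.changed-files}"
MANIFEST_LOCK="${MANIFEST_LOCK:-/mnt/backup/.changed-files.lock}"

# Append stdin to the manifest while holding an exclusive lock, so two
# writers can never interleave inside a line.
manifest_append() {
    (
        flock -x 9           # block until no other writer holds the lock
        cat >> "$MANIFEST"   # the function's stdin is appended in one go
    ) 9>"$MANIFEST_LOCK"
}
```

A call site would then pipe into it, e.g. `find ... | manifest_append`,
instead of redirecting with `>>` directly.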

#7 (S2 unbounded manifest)
  daily-backup gains check_manifest_size() invoked after the PVE-config
  append (the last manifest writer of the run). Above MANIFEST_MAX_LINES
  (500k) it touches /mnt/backup/.force-full-sync — offsite-sync's Step 1
  now treats that flag the same as day-of-month ≤ 7 (full sync with
  --delete) and clears it on success. Catches the "Synology unreachable
  for many days" edge case where the manifest would grow unbounded.
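
The cap and the flag check can be sketched roughly as follows, assuming
bash. check_manifest_size and MANIFEST_MAX_LINES are named above; the
FORCE_FLAG variable and the needs_full_sync() helper are hypothetical
names introduced for illustration, and the real scripts may structure
this differently.

```shell
# Illustrative only: names other than check_manifest_size and
# MANIFEST_MAX_LINES are assumptions.
MANIFEST="${MANIFEST:-/mnt/backup/.changed-files}"
FORCE_FLAG="${FORCE_FLAG:-/mnt/backup/.force-full-sync}"
MANIFEST_MAX_LINES="${MANIFEST_MAX_LINES:-500000}"

# daily-backup side: request a full sync once the manifest exceeds the cap.
check_manifest_size() {
    local lines=0
    if [ -f "$MANIFEST" ]; then
        lines=$(wc -l < "$MANIFEST")
    fi
    if [ "$lines" -gt "$MANIFEST_MAX_LINES" ]; then
        touch "$FORCE_FLAG"
    fi
}

# offsite-sync side: full sync in the first week of the month OR when the
# flag exists. Takes the day of month as its argument (e.g. "$(date +%d)").
needs_full_sync() {
    local dom=$((10#$1))   # force base 10 so "08" is not read as octal
    [ "$dom" -le 7 ] || [ -e "$FORCE_FLAG" ]
}
```

After a successful full sync the caller would remove the flag, for
example with `rm -f "$FORCE_FLAG"`.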

#9 (wear — drop -z on LAN hops)
  offsite-sync rsync calls to Synology over the same 192.168.1.0/24
  gigabit LAN had `-rltz`. Compression burns CPU on the PVE host (already
  IO-busy) and gives nothing on a saturated GigE link. Dropped to `-rlt`
  on all 5 offsite rsync invocations (Step 1 full + Step 1 incremental +
  Step 2 full nfs + Step 2 full nfs-ssd + Step 2 incremental).

Other adjustments:
- nfs-mirror's find-after-rsync now also excludes the new state files
  (.changed-files.lock, .force-full-sync) when populating the manifest.
- offsite-sync Step 1 full-sync excludes the same .force-full-sync flag
  so it doesn't ship to Synology.

Deployed to the PVE host (/usr/local/bin/{daily-backup,nfs-mirror,
offsite-sync-backup}). The nfs-mirror run currently in flight is
unaffected (its bash process was started from the old script).
Subsequent runs use the new behaviour.

Refs: 2026-05-24 audit Section 2 items #1 (manifest race), #4 (unbounded
manifest), #6 (LAN -z wear).
2026-05-24 16:27:42 +00:00
Viktor Barzin
4d756be4f5 backup: consolidate to one local-mirror script + invert offsite filter
CI (woodpecker): push/build-cli failed; push/default failed
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.

This commit collapses both into a clean 3-2-1:

  Copy 1 (sdc):       live /srv/nfs/* + cluster block PVCs
  Copy 2 (sda):       /mnt/backup/{pvc-data,sqlite-backup,pfsense,
                                   pve-config,<critical-nfs>/}
                      ← daily-backup + nfs-mirror (one script each)
  Copy 3 (Synology):  /Backup/Viki/{pve-backup,nfs,nfs-ssd}
                      ← offsite-sync-backup Step 1 (sda → Synology)
                        + Step 2 (sda-BYPASS paths only → Synology direct)

scripts/nfs-mirror.{sh,service,timer}:
  New consolidated weekly mirror. Replaces anca-elements-mirror (to be
  removed in a follow-up once the current in-flight rsync completes,
  parity is verified, and the Synology source-of-truth copy is deleted). Single
  rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
  drops paths not worth a local 2nd copy: immich (1.2T — too big),
  frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
  audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
  temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.

scripts/offsite-sync-backup.sh:
  Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
  anca-elements/`, it now `--include`s only the sda-BYPASS paths
  (immich, frigate, prometheus, *-backup, …). The bypass-include
  regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
  complementary and any drift creates either gaps or duplication on
  Synology. Comment in the script flags this.
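
One way to make drift structurally impossible, shown here only as a
hedged alternative and NOT what the scripts currently do (they keep two
lists plus a comment), is to derive both rsync filter sets from a single
list. The BYPASS_PATHS array and its four entries are a hypothetical,
shortened stand-in for the real EXCLUDES list; bash arrays are assumed.

```shell
# Hypothetical shared list (shortened); the real EXCLUDES list is longer.
BYPASS_PATHS=(immich frigate prometheus '*-backup')

mirror_excludes=()    # for nfs-mirror: skip these under /srv/nfs/
offsite_includes=()   # for offsite-sync Step 2: ship ONLY these
for p in "${BYPASS_PATHS[@]}"; do
    mirror_excludes+=(--exclude="/$p/")
    # "dir/***" matches the directory itself and everything below it
    offsite_includes+=(--include="/$p/***")
done
# Everything not included above is dropped at the top level.
offsite_includes+=(--exclude='/*')
```

The arrays would be passed as `rsync ... "${mirror_excludes[@]}" SRC DST`
and `rsync ... "${offsite_includes[@]}" SRC DST` respectively.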

monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.

docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.

NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
2026-05-24 12:49:20 +00:00
Viktor Barzin
05f047f290 offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements
CI (woodpecker): push/build-cli failed; push/default succeeded
The 771G under /srv/nfs/anca-elements is a downstream replica synced
FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh.
The offsite-sync pipeline was copying it back to Synology under
/volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate
(~122G already partially copied during the last monthly full sync).

- nfs-change-tracker.service: drop anca-elements/ from inotify watch
  (incremental syncs no longer queue these paths)
- offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly
  full rsync; grep -v on the incremental files-from list
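
The incremental-side filter amounts to something like the sketch below.
It assumes the files-from list holds one path per line, relative to
/srv/nfs; the function name filter_changed_list is hypothetical.

```shell
# Illustrative only: assumes paths in the list are relative to /srv/nfs.
# Drop anca-elements/ entries from a files-from list read on stdin.
filter_changed_list() {
    # grep exits 1 when nothing is left; treat that as success so an
    # all-excluded list does not abort a script running under `set -e`.
    grep -v '^anca-elements/' || [ "$?" -eq 1 ]
}
```

The filtered output can then be fed to rsync via `--files-from=-`.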

Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup +
/etc/systemd/system/nfs-change-tracker.service; service reloaded.
2026-05-24 11:03:09 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
ca5039f8aa switch backup + offsite sync from weekly to daily — RPO 7d → 1d [ci skip]
- weekly-backup.timer: Sun 05:00 → daily 05:00
- offsite-sync-backup.timer: Sun 08:00 → daily 06:00
- Monthly full rsync --delete unchanged (1st-7th of month)
- Total daily I/O cost: ~20GB sdc reads, ~3.5GB sda writes, seconds of network
- Updated script headers and service descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:24:38 +00:00
Viktor Barzin
28ad11d12c consolidate offsite backup: inotify change tracking, deduplicate Synology paths [ci skip]
Architecture overhaul:
- Synology truenas/ renamed to nfs/, immich paths flattened to match source
- Created nfs-ssd/ on Synology for SSD data (thumbs, ML cache)
- Deleted pve-backup/nfs-mirror (53GB duplication eliminated)
- New inotifywait daemon (nfs-change-tracker.service) watches /srv/nfs + /srv/nfs-ssd
- offsite-sync Step 2: reads inotify change log, rsync --files-from only changed files
- weekly-backup: removed NFS mirror step entirely (NFS goes direct to Synology)
- Cleaned 9 orphaned LVs (101GB + 38 snapshots reclaimed from thin pool)

Performance: incremental sync completes in seconds (vs 30+ min with full rsync)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:06:20 +00:00
Viktor Barzin
aa4c125f9c improve 3-2-1 backup: auto-discover dirs, Immich offsite sync, SQLite backup [ci skip]
- weekly-backup.sh: replace hardcoded BACKUP_DIRS with glob auto-discovery
  (catches nextcloud-backup, council-complaints-backup, future dirs)
- weekly-backup.sh: add auto SQLite backup from PVC snapshots
  (magic number check, ?mode=ro URI, fallback to raw copy)
- offsite-sync-backup.sh: add NFS media direct-to-Synology sync
  (Immich, calibre, audiobookshelf — reuses existing TrueNAS Cloud Sync paths)
- Cleaned up 9 orphaned LVs + 38 snapshots on PVE host (101GB reclaimed)
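
The "magic number check" mentioned above can be sketched as follows. A
SQLite 3 database file begins with the 16-byte header string
"SQLite format 3" followed by a NUL byte; the helper name is_sqlite_db
is hypothetical and the script's actual check may differ.

```shell
# Illustrative only: succeeds if the file starts with the SQLite 3 header.
is_sqlite_db() {
    [ -f "$1" ] || return 1
    local magic
    # Read the first 15 bytes (the text part of the 16-byte header);
    # tr strips any NUL bytes so bash does not warn about them.
    magic=$(LC_ALL=C head -c 15 "$1" 2>/dev/null | tr -d '\0')
    [ "$magic" = "SQLite format 3" ]
}
```

Only files passing this check would be handed to the SQLite backup step;
anything else is skipped.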

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 15:47:56 +00:00
Viktor Barzin
d2af5339af fix offsite sync: use --chmod for Synology permission compatibility
The Synology Administrator user can't create directories that carry the
root-owned permissions copied from PVC snapshots. Switch from -az to
-rltz --chmod so the destination gets writable permissions. Also updated
the Cloud Sync Task 1 excludes to prevent duplication of backup dirs on
Synology.
2026-04-06 16:01:42 +03:00
Viktor Barzin
d009f9a0f2 add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync
- weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data
  with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS,
  backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots.
- offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk).
  Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency.
- lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily)
- Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale,
  OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.
2026-04-06 14:53:28 +03:00