diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 98ebf460..529c699c 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -265,9 +265,10 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { **Copy 2**: sda backup disk (`/mnt/backup`, 1.1TB ext4, VG `backup`) **Copy 3**: Synology NAS offsite (two-tier: sda + NFS) -**PVE host scripts** (source: `infra/scripts/`): -- `/usr/local/bin/daily-backup` — Daily 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data////` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Auto-discovered BACKUP_DIRS (glob, not hardcoded). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. -- `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (PVC snapshots, pfSense, PVE config). Step 2: NFS → Synology `nfs/` + `nfs-ssd/` via inotify change-tracked `rsync --files-from`. Monthly full `rsync --delete` on 1st Sunday. +**PVE host scripts** (source: `infra/scripts/`; deployed manually via `scp` to `/usr/local/bin/` — strip the `.sh`): +- `/usr/local/bin/nfs-mirror` — Daily 02:00. `rsync --delete /srv/nfs// → /mnt/backup//` (sda leg 1), appends transferred paths to `/mnt/backup/.changed-files` for offsite Step 1. **EXCLUDES**: immich (too big — direct leg), frigate/temp (no backup), anca-elements (in Immich), and **(2026-06-01) ollama, prometheus-backup, audiblez, ebook2audiobook** — regenerable, live-only on sdc, kept off the space-constrained offsite. Does NOT mirror `/srv/nfs-ssd`. +- `/usr/local/bin/daily-backup` — Daily 05:00. Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data////` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. **Skip-list (2026-06-01)**: `nextcloud/nextcloud-data-proxmox` (orphaned pre-encryption PV). 
+- `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (incremental via manifest; monthly full `rsync --delete` days 1–7). Step 2: NFS direct → Synology — **immich-only on BOTH `nfs/` and `nfs-ssd/` (2026-06-01)**; ollama/llamacpp on the SSD no longer ship offsite. - `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore `. - `nfs-change-tracker.service` — Continuous inotifywait on `/srv/nfs` + `/srv/nfs-ssd`. Logs changed file paths to `/mnt/backup/.nfs-changes.log`. Consumed by offsite-sync-backup for incremental rsync (completes in seconds instead of 30+ minutes). diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md index ab9c903c..60c1c77d 100644 --- a/docs/architecture/backup-dr.md +++ b/docs/architecture/backup-dr.md @@ -1,6 +1,31 @@ # Backup & Disaster Recovery Architecture -Last updated: 2026-05-26 +Last updated: 2026-06-01 + +> **2026-06-01 — regenerable services carved back out** (offsite Synology hit +> 97%; the `Backup` share had grown +670 G in a week, traced to the 2026-05-26 +> change below that started mirroring large regenerable data offsite): +> - **`nfs-mirror` re-excludes** `ollama` (20 G), `prometheus-backup` (64 G), +> `audiblez` (24 G), `ebook2audiobook` (11 G). Live copy stays on sdc; no +> sda/Synology copy. `--delete` reaps them from sda on the next run. +> `*-backup` DB dumps (sqlite-backup etc.) are KEPT — real DB safety copies. +> - **`offsite-sync` Step 2 nfs-ssd → immich-only**: `ollama` (59 G) + +> `llamacpp` (26 G) on the SSD no longer ship to Synology (re-pullable +> models). Was a blanket `/srv/nfs-ssd/` sync; now immich-only like nfs/. +> - **`daily-backup` skips `nextcloud/nextcloud-data-proxmox`** — orphaned +> pre-encryption PV (Released, Retain) that was still backed up weekly. 
+> - **Nextcloud backup shrunk**: the dedicated nextcloud-backup CronJob +> (`stacks/nextcloud`) kept 7 full copies incl. a 10 GB+ `nextcloud.log` +> (87 G total). Now: `log_rotate_size=10 MB` caps the log at source, backup +> excludes `nextcloud.log*` + preview cache, retention 7 → 1 (pvc-data holds +> the version history). Footprint < 5 G. +> - **Nextcloud image pinned to `32.0.9`** in chart_values — the 2026-05-26 +> Keel bump (32.0.3 → 32.0.9, data migrated to 32.0.9.2) was never pinned in +> TF, so this session's apply rolled a 32.0.3 pod and CrashLooped on the +> downgrade. Pinning eliminates the drift. +> - **One-off Synology delete** of the existing copies above + emptied the +> `Backup`/`Emo shared` recycle bins (~31 G). ~340 G total; reclaims as the +> 3-day `Backup`-share snapshots roll off (or via manual snapshot expiry). > **2026-05-26 — bypass list pruned to a single path** (follow-up to the > 2026-05-24 changes below): @@ -48,9 +73,9 @@ The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5 - **Copy 1** (live): all PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD); all NFS data at `/srv/nfs[-ssd]/` - **Copy 2** (local backup): sda `/mnt/backup` (1.1TB RAID1 SAS) — **46% used** post-2026-05-26 (was 87% before anca-elements cleanup; bypass-list pruning added ~260 G of *-backup + ollama + audiblez + ebook2audiobook) - **Copy 3** (offsite): Synology NAS at 192.168.1.13 - - `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs, now also includes ollama/audiblez/ebook2audiobook/*-backup) + - `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs incl. `*-backup` DB dumps. 
**ollama/audiblez/ebook2audiobook/prometheus-backup excluded 2026-06-01** — regenerable, live-only) - `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26) - - `Synology/Backup/Viki/nfs-ssd/` — full SSD NFS (immich-ML, ollama, llamacpp); SSD has no sda-mirror leg, so all three go direct + - `Synology/Backup/Viki/nfs-ssd/` — **immich-ML only (2026-06-01)**; ollama/llamacpp dropped (re-pullable models, live-only on the SSD) ## Architecture Diagram diff --git a/scripts/daily-backup.sh b/scripts/daily-backup.sh index 7b896780..ab61b6ca 100644 --- a/scripts/daily-backup.sh +++ b/scripts/daily-backup.sh @@ -215,6 +215,17 @@ else continue fi + # Skip-list: PVCs we deliberately don't keep offsite copies of. + # nextcloud-data-proxmox — orphaned pre-encryption PV (Released, + # Retain). Nextcloud moved to nextcloud-data-encrypted on 2026-04-13; + # this old unencrypted PV lingers (Retain) and was still being backed + # up weekly, filling the offsite Synology. Stop copying it (2026-06-01). + case "${ns_pvc}" in + nextcloud/nextcloud-data-proxmox) + log " skip ${ns_pvc} (orphaned pre-encryption PVC)" + continue ;; + esac + # Detect LUKS-encrypted volumes and set up mount device LUKS_NAME="" MOUNT_DEV="/dev/pve/${snap}" diff --git a/scripts/nfs-mirror.sh b/scripts/nfs-mirror.sh index e644d495..2e322ede 100644 --- a/scripts/nfs-mirror.sh +++ b/scripts/nfs-mirror.sh @@ -13,21 +13,30 @@ # destination layout (anca-elements lives at /mnt/backup/anca-elements/), # but now covers every other critical NFS subtree in one pass. 
# -# SKIP-LIST rationale (2026-05-26 simplification — see commit notes): +# SKIP-LIST rationale (2026-05-26 simplification; REGENERABLE-SERVICE +# CARVE-OUT added 2026-06-01 — see below): # immich — 1.5T, doesn't fit on sda; offsite-sync ships it direct to Synology # frigate — camera ring buffer; intentionally NOT backed up anywhere # temp — scratch; intentionally NOT backed up # -# Everything else (ollama, audiblez, ebook2audiobook, *-backup, …) now -# flows sdc → sda (this script) → Synology pve-backup/ via offsite-sync -# Step 1. Previously they went sdc → Synology DIRECT via Step 2; the -# bypass list got pruned to just `immich` so we have a single canonical -# mirror at sda. Prometheus/loki/alertmanager were live-orphan entries -# that no longer exist on /srv/nfs (cleaned 2026-05-26) — dropped from -# the exclude list as a no-op. +# 2026-06-01 carve-out: the offsite Synology (5.3T) hit 97% and the +# `Backup` share had grown +670G in a week — traced to the 2026-05-26 +# change that started mirroring large *regenerable* services to sda and +# thence to Synology pve-backup/. These are now re-excluded because they +# cost offsite capacity for data we can rebuild on demand: +# ollama (20G) — LLM model blobs, re-pullable +# prometheus-backup (64G) — metrics TSDB snapshots; was offsite-excluded +# pre-2026-05-26 by original intent +# audiblez (24G) — generated audiobooks, re-derivable from ebooks +# ebook2audiobook (11G) — same, generation output +# Their live copy stays on sdc (/srv/nfs); only the sda + Synology copies +# are dropped. `*-backup` DB dumps (sqlite-backup et al.) are intentionally +# KEPT — they are real database safety copies, not regenerable. # -# Note: /srv/nfs-ssd is intentionally NOT mirrored — its three dirs -# (immich, ollama, llamacpp) all go direct to Synology nfs-ssd/. 
+# Note: /srv/nfs-ssd is intentionally NOT mirrored here. Of its dirs +# (immich, ollama, llamacpp), only immich still goes direct to Synology +# nfs-ssd/ via offsite-sync Step 2; that leg was narrowed to immich-only +# (also 2026-06-01), so ollama + llamacpp on the SSD are live-only too. set -euo pipefail @@ -68,6 +77,14 @@ EXCLUDES=( --exclude='/frigate/' # ring buffer — no backup anywhere --exclude='/temp/' # scratch — no backup anywhere + # ---- regenerable services: live-only on sdc, no offsite (2026-06-01) ---- + # See header carve-out. --delete reaps any existing copies from sda on + # the next run; a one-off direct delete already cleared them from Synology. + --exclude='/ollama/' # LLM models — re-pullable + --exclude='/prometheus-backup/' # metrics TSDB snapshots + --exclude='/audiblez/' # generated audiobooks + --exclude='/ebook2audiobook/' # generated audiobooks + # ---- Synology / Windows / macOS cruft ---- --exclude='/@eaDir/' --exclude='*@synoeastream' @@ -119,7 +136,7 @@ mountpoint -q /mnt/backup || { log "FATAL: /mnt/backup not mounted"; push_metric [ -d "$SRC" ] || { log "FATAL: source $SRC missing"; push_metrics 1 0; exit 1; } log "=== mirror starting: $SRC → $DST ===" -log "skip: immich (Synology direct), frigate (no backup), temp (no backup), anca-elements" +log "skip: immich (Synology direct), frigate/temp (no backup), anca-elements, ollama/prometheus-backup/audiblez/ebook2audiobook (regenerable, live-only)" # Marker file used to identify files written by this rsync run, so we can append # their paths to the offsite-sync manifest. Touch BEFORE rsync; `find -newer` AFTER. diff --git a/scripts/offsite-sync-backup.sh b/scripts/offsite-sync-backup.sh index 790215e1..1adeacd7 100644 --- a/scripts/offsite-sync-backup.sh +++ b/scripts/offsite-sync-backup.sh @@ -95,16 +95,17 @@ fi # reaching Synology via Step 1 (sda → pve-backup/). frigate and temp are # excluded from both legs — intentionally NOT backed up. 
# -# nfs-ssd is handled separately below: its three dirs (immich, ollama, -# llamacpp) all go direct to Synology since /srv/nfs-ssd is not mirrored -# to sda. ollama+llamacpp are small enough (~85G total) that the direct -# leg is fine and we don't need to extend nfs-mirror to cover the SSD. +# nfs-ssd: as of 2026-06-01 this leg is ALSO immich-only. ollama (59G) and +# llamacpp (26G) on the SSD were filling the offsite Synology (5.3T hit 97%) +# for re-pullable model blobs, so they're dropped — live copy stays on the +# SSD, no offsite. The monthly --delete pass below reaps them from Synology +# nfs-ssd/; a one-off direct delete cleared the bulk on 2026-06-01. # -# Keep this aligned with /usr/local/bin/nfs-mirror's EXCLUDES — the -# excludes there are { immich (this leg), frigate (no backup), temp -# (no backup), anca-elements (deleted), pvc-data and friends (owned by -# daily-backup) }. Only the bypass-leg subset matters here: { immich }. -log "--- Step 2: NFS → Synology (immich-only direct leg + nfs-ssd) ---" +# Keep this aligned with /usr/local/bin/nfs-mirror's EXCLUDES. Both legs now +# carry immich only; everything else is either curated through sda (Step 1) +# or intentionally live-only (frigate, temp, ollama, llamacpp, audiblez, +# ebook2audiobook, prometheus-backup). +log "--- Step 2: NFS → Synology (immich-only on both nfs/ and nfs-ssd/) ---" # Regex matching paths NOT on sda (must reach Synology directly). NFS_SDA_BYPASS_RE='^/srv/nfs/immich/' @@ -123,9 +124,9 @@ if [ "${DAY_OF_MONTH}" -le 7 ]; then log "Monthly full NFS sync (immich-only — reaps legacy bypass dirs)..." rsync -rlt --delete "${NFS_FULL_INCLUDES[@]}" /srv/nfs/ "${NFS_DEST}/" 2>&1 \ && log " OK: nfs/ full sync (immich-only)" || { warn "nfs/ full sync failed"; STATUS=1; } - # nfs-ssd: full sync of all three dirs (immich, ollama, llamacpp). 
- rsync -rlt --delete /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \ - && log " OK: nfs-ssd/ full sync" || { warn "nfs-ssd/ full sync failed"; STATUS=1; } + # nfs-ssd: immich-only (2026-06-01) — --delete reaps legacy ollama/llamacpp. + rsync -rlt --delete "${NFS_FULL_INCLUDES[@]}" /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \ + && log " OK: nfs-ssd/ full sync (immich-only)" || { warn "nfs-ssd/ full sync failed"; STATUS=1; } > "${NFS_CHANGE_LOG}" elif [ -s "${NFS_CHANGE_LOG}" ]; then # Incremental: only sync changed files matching the bypass leg (immich). @@ -147,8 +148,8 @@ elif [ -s "${NFS_CHANGE_LOG}" ]; then || { warn "nfs/ incremental failed"; STATUS=1; } fi - # SSD NFS — every nfs-ssd path (immich/ollama/llamacpp) ships direct. - grep '^/srv/nfs-ssd/' /tmp/nfs-changes-deduped | \ + # SSD NFS — immich-only (2026-06-01); ollama/llamacpp are live-only, no offsite. + grep '^/srv/nfs-ssd/immich/' /tmp/nfs-changes-deduped | \ while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs-ssd/}"; done \ > /tmp/sync-nfs-ssd.list 2>/dev/null || true SSD_COUNT=$(wc -l < /tmp/sync-nfs-ssd.list 2>/dev/null || echo 0) diff --git a/stacks/nextcloud/chart_values.yaml b/stacks/nextcloud/chart_values.yaml index 3b44d270..760a6aec 100644 --- a/stacks/nextcloud/chart_values.yaml +++ b/stacks/nextcloud/chart_values.yaml @@ -1,3 +1,14 @@ +# Pin the image to 32.0.9 (apache). On 2026-05-26 Keel bumped the live +# Deployment 32.0.3 → 32.0.9-apache and the DATA migrated to 32.0.9.2; Keel +# was then disabled but chart_values was never pinned, so it kept defaulting +# to the chart's appVersion (32.0.3). A 2026-06-01 `terragrunt apply` +# reconciled that drift, rolled a 32.0.3 pod, and Nextcloud refused to +# downgrade (data 32.0.9.2 > image 32.0.3.2) → CrashLoopBackOff. Pinning here +# keeps TF the source of truth and matches the on-disk data version. 
+image: + flavor: apache + tag: "32.0.9" + nextcloud: host: nextcloud.viktorbarzin.me trustedDomains: @@ -51,6 +62,10 @@ nextcloud: 2, + // Cap + rotate nextcloud.log. Without this it grew unbounded to + // 10GB+ and bloated every backup (2026-06-01 space incident). + // At 10MB the log rotates to nextcloud.log.1 (1 kept) → ~20MB max. + 'log_rotate_size' => 10485760, 'mail_smtpdebug' => false, ); zzz-mysql.config.php: | diff --git a/stacks/nextcloud/main.tf b/stacks/nextcloud/main.tf index 1fbba67a..04fc3825 100644 --- a/stacks/nextcloud/main.tf +++ b/stacks/nextcloud/main.tf @@ -382,14 +382,28 @@ resource "kubernetes_config_map" "backup-script" { # Create backup directory mkdir -p "$BACKUP_PATH" - # Backup everything (config, data, custom_apps, themes, etc.) + # Back up config/, data/, custom_apps/ only. Skipped (2026-06-01 space fix): + # - html/ — the Nextcloud app code, reproducible from the pinned image + # (real config is at config/config.php; html/config/config.php is empty). + # - nextcloud.log* — capped at source via log_rotate_size; previously grew + # to 10GB+ and bloated every dated copy (backups hit 20G each). + # - preview cache — regenerable thumbnails, no need to back up. echo "Backing up Nextcloud installation..." - rsync -a "$DATA_DIR/" "$BACKUP_PATH/" + rsync -a \ + --exclude='/html/' \ + --exclude='nextcloud.log' \ + --exclude='nextcloud.log.*' \ + --exclude='data/appdata_*/preview/' \ + "$DATA_DIR/" "$BACKUP_PATH/" - # Keep only last 7 backups + # Keep only the latest backup. The version history lives in daily-backup's + # pvc-data (4 weekly snapshot-consistent copies of this same encrypted PVC), + # so this browsable app-level copy only needs the most recent. Keeping the + # whole installation (incl. 
logs) x7 here was the bulk of the 87G that + # filled the offsite Synology. echo "Cleaning old backups..." cd "$BACKUP_DIR" - ls -dt */ | tail -n +8 | xargs -r rm -rf + ls -dt */ | tail -n +2 | xargs -r rm -rf echo "Backup completed at $(date)" echo "Backup stored at: $BACKUP_PATH"
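
Review note on the retention change in the last hunk (`tail -n +8` to `tail -n +2`): `tail -n +K` starts output at line K, so with a newest-first listing it selects everything older than the newest K-1 entries. The sketch below replays the same pipeline in a scratch directory to confirm that `+2` keeps exactly one backup. The `prune_keep_newest` helper, the temp directory, and the dated directory names are all hypothetical and not part of this diff; it also assumes GNU `xargs` (`-r`) and directory names without whitespace, as the dated backup dirs appear to be.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the diff): keep the N newest subdirectories of
# a directory and delete the rest, using the same pipeline as the backup script.
prune_keep_newest() {
  local dir="$1" keep="$2"
  (
    cd "$dir"
    # ls -dt lists newest first; tail -n +K starts at line K, so
    # K = keep + 1 selects every entry older than the newest $keep.
    ls -dt */ | tail -n +"$((keep + 1))" | xargs -r rm -rf
  )
}

# Scratch directory with three fake dated backups (names are made up).
demo=$(mktemp -d)
mkdir "$demo/2026-05-30" "$demo/2026-05-31" "$demo/2026-06-01"
touch -t 202605300000 "$demo/2026-05-30"
touch -t 202605310000 "$demo/2026-05-31"
touch -t 202606010000 "$demo/2026-06-01"

prune_keep_newest "$demo" 1   # equivalent to: tail -n +2
ls "$demo"                    # only the newest directory remains
rm -rf "$demo"
```

Calling the helper with `7` reproduces the old `tail -n +8` behaviour. Note that the pipeline sorts by mtime rather than by name, so a directory touched after creation would count as "newest" regardless of its date stamp.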