From bc33cd5ac40722f627c676b8dca3c0174eec2459 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 5 Jun 2026 18:18:31 +0000
Subject: [PATCH] monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook

The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free).

NOTE: NodeFilesystemFull is a global node-filesystem rule, so this also
loosens the warning on k8s node/system disks; BackupDiskFull (sda
/mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup), btrfs
async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment (94%
full; Backup share 4.42TiB), and ~500GiB of homelab cleanup candidates
(redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8
---
 docs/runbooks/synology-storage.md             | 127 ++++++++++++++++++
 .../monitoring/prometheus_chart_values.tpl    |   4 +-
 2 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100644 docs/runbooks/synology-storage.md

diff --git a/docs/runbooks/synology-storage.md b/docs/runbooks/synology-storage.md
new file mode 100644
index 00000000..9030569b
--- /dev/null
+++ b/docs/runbooks/synology-storage.md
@@ -0,0 +1,127 @@
+# Runbook: Synology NAS storage — navigate, assess, clean
+
+**Target:** Synology DS218 (`NAS_Barzini`), `192.168.1.13`, `/volume1`
+(5.3 TiB btrfs). This is the **offsite backup target** (Copy 3 of the
+3-2-1 strategy) **and a shared family volume** — homelab data is only
+under `Backup/Viki/`; `Anca/`, `Emo/`, `Common/`, `music`, `video`,
+`photo` etc. are family data.
+
+Related: [storage architecture](../architecture/storage.md) ·
+[backup & DR](../architecture/backup-dr.md)
+
+## Access
+
+- SSH: `ssh Administrator@192.168.1.13` (capital `A`; key-auth works
+  from devvm and the PVE host). `Administrator` can `sudo`.
+- sudo password: Vault `secret/viktor` → `synology_admin_password`
+  (`VAULT_ADDR=https://vault.viktorbarzin.me`). DSM Web API has 2FA, so
+  **SSH+sudo is the only unattended path** (`read -r PW; printf '%s\n'
+  "$PW" | sudo -S -p '' ` to keep the secret out of `argv`).
+
+## ⚠️ NEVER run `du` / `find` / `ncdu` on this NAS
+
+Recursive walks over the multi-TB `Backup` share take 10+ min (often
+never finish) and burn disk/IO on the NAS. Use Synology's own
+pre-indexed data instead:
+
+| Need | Instant, non-walking source |
+|---|---|
+| Volume fill | `df -h /volume1` |
+| btrfs real usage | `btrfs filesystem df /volume1` |
+| Per-subvolume | `sudo btrfs qgroup show -prce --raw /volume1` |
+| **Per-share / per-owner / per-type / largest / oldest / dupes** | **Storage Analyzer weekly report** (below) |
+
+### Storage Analyzer weekly report
+
+Storage Analyzer is installed and writes a report every **Monday
+~00:00** to:
+
+```
+/volume1/Backup/Viki/synoreport/weekly storage report//
+```
+
+Data is up to ~7 days stale. The useful files are zipped CSVs in
+`csv/` — **content is UTF-16, and there is no `unzip` on the box**, so
+read them with Python:
+
+```python
+import zipfile, os
+R="...//csv"
+def readcsv(n):
+    z=zipfile.ZipFile(os.path.join(R,n)); raw=z.read(z.namelist()[0])
+    for enc in ("utf-16","utf-8-sig","utf-8"):
+        try: return raw.decode(enc)
+        except Exception: pass
+```
+
+Key CSVs: `volume_usage`, `share_list` (per-share, incl/excl recycle),
+`quota_usage.share` (**per-owner within a share**), `file_group`
+(per-file-type), `large_file`, `least_modify` (oldest), `duplicate_file`.
+The `*.db` files (`folder.db` etc.) are a **custom Synology format —
+NOT sqlite**; `report.html` does not embed clean folder totals.
+
+## btrfs space-reclaim is ASYNCHRONOUS — and snapshot-pinned
+
+- Deleting files/snapshots returns instantly but `df` lags minutes
+  while the btrfs cleaner reclaims extents (~30 GB/min on the DS218).
+- Data deleted from the live share **stays on disk until the share
+  snapshots that still reference it also rotate out.** There are 4
+  daily `Backup` share snapshots (`GMT-*-21.00.02`), so **expect up to
+  ~4 days of lag** before a delete fully frees space.
+- Snapshot CLI (sudo, full path): `/usr/syno/sbin/synosharesnapshot
+  {list|delete} Backup ...`. Retention:
+  `/usr/syno/etc/sharesnap/sharesnap.conf`.
+
+## Capacity alert
+
+The Synology mount surfaces to Prometheus as the PVE host NFS mount
+`/mnt/synology-backup` (`job="proxmox-host"`, `fstype=nfs4`), caught by
+the **global `NodeFilesystemFull`** rule in
+`stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`.
+
+- **2026-06-05:** threshold changed **90% → 95%** (`* 100 < 5`) at
+  user request — a backup target legitimately runs hot, so 90% was
+  noisy. NOTE: this rule is **global**, so the looser 95% now applies to
+  all node/system disks too. `BackupDiskFull` (the sda `/mnt/backup`
+  disk, separate alert) stays at 85%.
+
+## Current assessment — 2026-06-05
+
+`/volume1` at **94% (5.0 TiB used / 5.3 TiB, 324 GiB free)**, down from
+98% on 2026-05-24. The **`Backup` share is 4.42 TiB (86%)**:
+Administrator/homelab **3.92 TiB**, Emo/family **504 GiB**. By type:
+Other 1.76 TiB, Videos 1.33 TiB, Pictures 631 GiB, Zipped 495 GiB,
+DiskImage 77 GiB. The ~1.9 TiB of media is mostly the **Immich offsite
+backup** (`Viki/nfs/immich` + `nfs-ssd/immich`), which **grows daily —
+the structural capacity driver now that one-off cleanups are spent.**
+
+### Already reclaimed (verified gone)
+
+`Anca/Elements` (770 GiB — dir now empty), `prometheus-backup` (63 GiB),
+`ollama`/`llamacpp`/`audiblez`/`ebook2audiobook` — removed in the
+2026-06-01 cleanup; nfs-mirror now excludes the regenerable services.
+
+### Cleanup candidates — homelab (`Backup/Viki/`, Administrator-owned)
+
+| Target | Size | Notes |
+|---|---|---|
+| `Photos/gphotos-1/` | **208 GiB** zips (+ extracted) | 2023 Google Takeout, **already imported to Immich** (`immich-go.exe` beside them; dupes confirmed). Redundant. |
+| `laptop/` | ~167 GiB | old VM images (Kali/windows vdis, metasploitable, soton-rpi.img) |
+| `All-in-one/` | ~95 GiB | 2015–2018 archives |
+| `#recycle/` (Backup) | ~16 GiB | recycle bin (HA backup rotation) |
+| loose `*.asc`/`*.mov` in `Viki/` root | ~8 GiB | old encrypted archives, phone videos |
+| `sgs7/` | ~3.5 GiB | 2021 Galaxy S7 backup |
+
+**~500 GiB** reclaimable without touching live backups or family data.
+
+### Cleanup candidates — family (flag to Emo, do not delete)
+
+- `Emo/D/` Windows 7 vmdks — **3 identical 39.5 GiB copies** (one live +
+  two under `_SYNCAPP/Versioning/`) → 79 GiB dedup.
+- Emo-shared recycle bin: 12.6 GiB.
+
+### Do NOT touch
+
+`Viki/pve-backup/` (live structured backup), `Viki/nfs/immich` +
+`nfs-ssd/immich` (irreplaceable), `HomeAssistant/` + `ha_backup_vermont/`
+(~7 GiB, healthy 3-copy retention).
diff --git a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
index 898bf6b4..b3c53062 100755
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@@ -1257,12 +1257,12 @@ serverFiles:
       - name: Storage
         rules:
           - alert: NodeFilesystemFull
-            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 10
+            expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 5
             for: 15m
             labels:
               severity: warning
             annotations:
-              summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free (threshold: 10%)"
+              summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free (threshold: 5%)"
           # PVAutoExpanding removed — was info-only at >80% used, but
           # pvc-autoresizer's threshold is 10% free (= 90% used), so the
           # alert always fired ~10 percentage points before any action