monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook

The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-05 18:18:31 +00:00
parent f526af694d
commit bc33cd5ac4
2 changed files with 129 additions and 2 deletions

View file

@ -0,0 +1,127 @@
# Runbook: Synology NAS storage — navigate, assess, clean
**Target:** Synology DS218 (`NAS_Barzini`), `192.168.1.13`, `/volume1`
(5.3 TiB btrfs). This is the **offsite backup target** (Copy 3 of the
3-2-1 strategy) **and a shared family volume** — homelab data is only
under `Backup/Viki/`; `Anca/`, `Emo/`, `Common/`, `music`, `video`,
`photo` etc. are family data.
Related: [storage architecture](../architecture/storage.md) ·
[backup & DR](../architecture/backup-dr.md)
## Access
- SSH: `ssh Administrator@192.168.1.13` (capital `A`; key-auth works
from devvm and the PVE host). `Administrator` can `sudo`.
- sudo password: Vault `secret/viktor``synology_admin_password`
(`VAULT_ADDR=https://vault.viktorbarzin.me`). DSM Web API has 2FA, so
**SSH+sudo is the only unattended path** (`read -r PW; printf '%s\n'
"$PW" | sudo -S -p '' <cmd>` to keep the secret out of `argv`).
## ⚠️ NEVER run `du` / `find` / `ncdu` on this NAS
Recursive walks over the multi-TB `Backup` share take 10+ min (often
never finish) and burn disk/IO on the NAS. Use Synology's own
pre-indexed data instead:
| Need | Instant, non-walking source |
|---|---|
| Volume fill | `df -h /volume1` |
| btrfs real usage | `btrfs filesystem df /volume1` |
| Per-subvolume | `sudo btrfs qgroup show -prce --raw /volume1` |
| **Per-share / per-owner / per-type / largest / oldest / dupes** | **Storage Analyzer weekly report** (below) |
### Storage Analyzer weekly report
Storage Analyzer is installed and writes a report every **Monday
~00:00** to:
```
/volume1/Backup/Viki/synoreport/weekly storage report/<YYYY-MM-DD_..>/
```
Data is up to ~7 days stale. The useful files are zipped CSVs in
`csv/`**content is UTF-16, and there is no `unzip` on the box**, so
read them with Python:
```python
import zipfile, os
R=".../<date>/csv"
def readcsv(n):
z=zipfile.ZipFile(os.path.join(R,n)); raw=z.read(z.namelist()[0])
for enc in ("utf-16","utf-8-sig","utf-8"):
try: return raw.decode(enc)
except Exception: pass
```
Key CSVs: `volume_usage`, `share_list` (per-share, incl/excl recycle),
`quota_usage.share` (**per-owner within a share**), `file_group`
(per-file-type), `large_file`, `least_modify` (oldest), `duplicate_file`.
The `*.db` files (`folder.db` etc.) are a **custom Synology format —
NOT sqlite**; `report.html` does not embed clean folder totals.
## btrfs space-reclaim is ASYNCHRONOUS — and snapshot-pinned
- Deleting files/snapshots returns instantly but `df` lags minutes
while the btrfs cleaner reclaims extents (~30 GB/min on the DS218).
- Data deleted from the live share **stays on disk until the share
snapshots that still reference it also rotate out.** There are 4
daily `Backup` share snapshots (`GMT-*-21.00.02`), so **expect up to
~4 days of lag** before a delete fully frees space.
- Snapshot CLI (sudo, full path): `/usr/syno/sbin/synosharesnapshot
{list|delete} Backup <snap>...`. Retention:
`/usr/syno/etc/sharesnap/sharesnap.conf`.
## Capacity alert
The Synology mount surfaces to Prometheus as the PVE host NFS mount
`/mnt/synology-backup` (`job="proxmox-host"`, `fstype=nfs4`), caught by
the **global `NodeFilesystemFull`** rule in
`stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`.
- **2026-06-05:** threshold changed **90% → 95%** (`* 100 < 5`) at
user request — a backup target legitimately runs hot, so 90% was
noisy. NOTE: this rule is **global**, so the looser 95% now applies to
all node/system disks too. `BackupDiskFull` (the sda `/mnt/backup`
disk, separate alert) stays at 85%.
## Current assessment — 2026-06-05
`/volume1` at **94% (5.0 TiB used / 5.3 TiB, 324 GiB free)**, down from
98% on 2026-05-24. The **`Backup` share is 4.42 TiB (86%)**:
Administrator/homelab **3.92 TiB**, Emo/family **504 GiB**. By type:
Other 1.76 TiB, Videos 1.33 TiB, Pictures 631 GiB, Zipped 495 GiB,
DiskImage 77 GiB. The ~1.9 TiB of media is mostly the **Immich offsite
backup** (`Viki/nfs/immich` + `nfs-ssd/immich`), which **grows daily —
the structural capacity driver now that one-off cleanups are spent.**
### Already reclaimed (verified gone)
`Anca/Elements` (770 GiB — dir now empty), `prometheus-backup` (63 GiB),
`ollama`/`llamacpp`/`audiblez`/`ebook2audiobook` — removed in the
2026-06-01 cleanup; nfs-mirror now excludes the regenerable services.
### Cleanup candidates — homelab (`Backup/Viki/`, Administrator-owned)
| Target | Size | Notes |
|---|---|---|
| `Photos/gphotos-1/` | **208 GiB** zips (+ extracted) | 2023 Google Takeout, **already imported to Immich** (`immich-go.exe` beside them; dupes confirmed). Redundant. |
| `laptop/` | ~167 GiB | old VM images (Kali/windows vdis, metasploitable, soton-rpi.img) |
| `All-in-one/` | ~95 GiB | 20152018 archives |
| `#recycle/` (Backup) | ~16 GiB | recycle bin (HA backup rotation) |
| loose `*.asc`/`*.mov` in `Viki/` root | ~8 GiB | old encrypted archives, phone videos |
| `sgs7/` | ~3.5 GiB | 2021 Galaxy S7 backup |
**~500 GiB** reclaimable without touching live backups or family data.
### Cleanup candidates — family (flag to Emo, do not delete)
- `Emo/D/` Windows 7 vmdks — **3 identical 39.5 GiB copies** (one live +
two under `_SYNCAPP/Versioning/`) → 79 GiB dedup.
- Emo-shared recycle bin: 12.6 GiB.
### Do NOT touch
`Viki/pve-backup/` (live structured backup), `Viki/nfs/immich` +
`nfs-ssd/immich` (irreplaceable), `HomeAssistant/` + `ha_backup_vermont/`
(~7 GiB, healthy 3-copy retention).

View file

@ -1257,12 +1257,12 @@ serverFiles:
- name: Storage
rules:
- alert: NodeFilesystemFull
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 10
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 5
for: 15m
labels:
severity: warning
annotations:
summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free (threshold: 10%)"
summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }}: {{ $value | printf \"%.1f\" }}% free (threshold: 5%)"
# PVAutoExpanding removed — was info-only at >80% used, but
# pvc-autoresizer's threshold is 10% free (= 90% used), so the
# alert always fired ~10 percentage points before any action