Viktor Barzin bc33cd5ac4 monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook

The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.

Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-05 18:18:31 +00:00


Runbook: Synology NAS storage — navigate, assess, clean

Target: Synology DS218 (NAS_Barzini), 192.168.1.13, /volume1 (5.3 TiB btrfs). This is the offsite backup target (Copy 3 of the 3-2-1 strategy) and a shared family volume — homelab data is only under Backup/Viki/; Anca/, Emo/, Common/, music, video, photo etc. are family data.

Access

SSH: ssh Administrator@192.168.1.13 (capital A; key-auth works from devvm and the PVE host). Administrator can sudo.
sudo password: Vault secret/viktor → synology_admin_password (VAULT_ADDR=https://vault.viktorbarzin.me). The DSM Web API requires 2FA, so SSH+sudo is the only unattended path. To keep the secret out of argv, pass it on stdin: read -r PW; printf '%s\n' "$PW" | sudo -S -p '' <cmd>.
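The same stdin hand-off can be sketched in Python for scripted use on the NAS. This is a minimal sketch, not part of the runbook's tooling: `sudo_argv` and `run_sudo` are hypothetical helper names, and the `runner` parameter is an assumption added only so the call can be exercised without a real sudo.

```python
import subprocess

def sudo_argv(cmd):
    """Build the sudo command line; the password never appears in it."""
    # -S makes sudo read the password from stdin, -p '' suppresses the prompt text.
    return ["sudo", "-S", "-p", "", *cmd]

def run_sudo(cmd, password, runner=subprocess.run):
    """Run cmd (a list of arguments) under sudo, passing the password on stdin only."""
    return runner(
        sudo_argv(cmd),
        input=password + "\n",   # same effect as printf '%s\n' "$PW" | sudo -S ...
        text=True,
        capture_output=True,
        check=False,
    )
```

A call such as `run_sudo(["btrfs", "qgroup", "show", "-prce", "--raw", "/volume1"], pw)` would then return a `CompletedProcess` whose `stdout` holds the command output; how `pw` is fetched from Vault is left out of the sketch.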

⚠️ NEVER run `du` / `find` / `ncdu` on this NAS

Recursive walks over the multi-TB Backup share take 10+ minutes, often never finish, and burn disk I/O on the NAS. Use Synology's own pre-indexed data instead:

| Need | Instant, non-walking source |
| --- | --- |
| Volume fill | `df -h /volume1` |
| btrfs real usage | `btrfs filesystem df /volume1` |
| Per-subvolume | `sudo btrfs qgroup show -prce --raw /volume1` |
| Per-share / per-owner / per-type / largest / oldest / dupes | Storage Analyzer weekly report (below) |
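For a scripted volume-fill check, Python's `shutil.disk_usage` asks the filesystem for its totals (via `statvfs` on Unix) and never walks the tree, so it is as cheap as `df`. A minimal sketch; `fill_percent` is a hypothetical name, and the demo uses `/` only so the snippet runs anywhere.

```python
import shutil

def fill_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)  # filesystem-level query, no directory walk
    return 100.0 * usage.used / usage.total

if __name__ == "__main__":
    # On the NAS this would be "/volume1"; "/" keeps the sketch runnable anywhere.
    print(f"{fill_percent('/'):.1f}% used")
```

The result can differ slightly from the Use% column of `df`, which is computed against used plus available space rather than the raw total.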

Storage Analyzer weekly report

Storage Analyzer is installed and writes a report every Monday ~00:00 to:

/volume1/Backup/Viki/synoreport/weekly storage report/<YYYY-MM-DD_..>/

Data is up to ~7 days stale. The useful files are zipped CSVs in csv/ — content is UTF-16, and there is no unzip on the box, so read them with Python:

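import os, zipfile
R = ".../<date>/csv"  # csv/ directory of one weekly report
def readcsv(n):
    # Read the first member of zip `n`; return its text, or None if undecodable.
    with zipfile.ZipFile(os.path.join(R, n)) as z:
        raw = z.read(z.namelist()[0])
    for enc in ("utf-16", "utf-8-sig", "utf-8"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass  # try the next encoding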

Key CSVs: volume_usage, share_list (per-share, incl/excl recycle), quota_usage.share (per-owner within a share), file_group (per-file-type), large_file, least_modify (oldest), duplicate_file. The *.db files (folder.db etc.) are a custom Synology format — NOT sqlite; report.html does not embed clean folder totals.
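Two small helpers round out the reading step: picking the newest report directory and turning decoded CSV text into rows. This is a sketch under assumptions: `latest_report` and `rows` are hypothetical names, only the `YYYY-MM-DD_` prefix of the directory name is relied on, and the comma delimiter is a guess that should be checked against a real file.

```python
import csv
import io
import os
import re

# Report directories start with an ISO date, so lexicographic order is chronological.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}_")

def latest_report(base):
    """Return the newest report directory name under `base`, or None."""
    # A single directory listing, not a recursive walk.
    names = [n for n in os.listdir(base) if DATE_PREFIX.match(n)]
    return max(names) if names else None

def rows(text, delimiter=","):
    """Parse decoded CSV text into a list of rows (delimiter is an assumption)."""
    # Drop a leftover BOM in case the text was decoded as plain utf-8.
    return list(csv.reader(io.StringIO(text.lstrip("\ufeff")), delimiter=delimiter))
```

With the report base directory as `base`, `os.path.join(base, latest_report(base), "csv")` gives the directory the zipped CSVs live in.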

btrfs space-reclaim is ASYNCHRONOUS — and snapshot-pinned

Deleting files/snapshots returns instantly but df lags minutes while the btrfs cleaner reclaims extents (~30 GB/min on the DS218).
Data deleted from the live share stays on disk until the share snapshots that still reference it also rotate out. There are 4 daily Backup share snapshots (GMT-*-21.00.02), so expect up to ~4 days of lag before a delete fully frees space.
Snapshot CLI (sudo, full path): /usr/syno/sbin/synosharesnapshot {list|delete} Backup <snap>.... Retention: /usr/syno/etc/sharesnap/sharesnap.conf.
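The two delays above combine into a rough expectation for when a delete shows up in `df`. The sketch below only restates the runbook's figures as arithmetic; the helper names are hypothetical, and it assumes one snapshot per day and that retention stays at four.

```python
from datetime import date, timedelta

CLEANER_GB_PER_MIN = 30   # approximate cleaner rate on the DS218, from the runbook
DAILY_SNAPSHOTS = 4       # Backup share snapshots currently retained

def cleaner_minutes(deleted_gb):
    """Rough time for the btrfs cleaner to reclaim `deleted_gb` of unpinned extents."""
    return deleted_gb / CLEANER_GB_PER_MIN

def fully_freed_by(deleted_on):
    """Latest date by which every snapshot referencing the data has rotated out."""
    return deleted_on + timedelta(days=DAILY_SNAPSHOTS)

if __name__ == "__main__":
    print(cleaner_minutes(300))              # 300 GB of unpinned data: 10.0 minutes
    print(fully_freed_by(date(2026, 6, 1)))  # 2026-06-05
```

Deleting the pinning snapshots by hand shortens the second delay, leaving only the cleaner time.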

Capacity alert

The Synology mount surfaces to Prometheus as the PVE host NFS mount /mnt/synology-backup (job="proxmox-host", fstype=nfs4), caught by the global NodeFilesystemFull rule in stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl.

2026-06-05: threshold changed 90% → 95% (* 100 < 5) at user request — a backup target legitimately runs hot, so 90% was noisy. NOTE: this rule is global, so the looser 95% now applies to all node/system disks too. BackupDiskFull (the sda /mnt/backup disk, separate alert) stays at 85%.
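The effect of the change can be checked against the 2026-06-05 figures in this runbook (5.3 TiB volume, 324 GiB free). The sketch assumes the rule compares a free-space percentage against the threshold, as the `* 100 < 5` fragment suggests; the full PromQL expression is not reproduced here, and `fires` is a hypothetical name.

```python
GIB_PER_TIB = 1024

def fires(avail_gib, size_gib, min_free_percent):
    """True when the free-space percentage drops below the threshold."""
    return avail_gib / size_gib * 100 < min_free_percent

size_gib = 5.3 * GIB_PER_TIB   # /volume1 size from the 2026-06-05 assessment
avail_gib = 324                # free space from the same assessment

free_pct = avail_gib / size_gib * 100        # about 5.97% free
old_rule = fires(avail_gib, size_gib, 10)    # 90% threshold: True, alert firing
new_rule = fires(avail_gib, size_gib, 5)     # 95% threshold: False, alert quiet
```

At roughly 6% free the volume sits only about one percentage point above the new threshold, so the alert returns once free space falls below 5% of the volume.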

Current assessment — 2026-06-05

/volume1 at 94% (5.0 TiB used / 5.3 TiB, 324 GiB free), down from 98% on 2026-05-24. The Backup share is 4.42 TiB (86%): Administrator/homelab 3.92 TiB, Emo/family 504 GiB. By type: Other 1.76 TiB, Videos 1.33 TiB, Pictures 631 GiB, Zipped 495 GiB, DiskImage 77 GiB. The ~1.9 TiB of media is mostly the Immich offsite backup (Viki/nfs/immich + nfs-ssd/immich), which grows daily — the structural capacity driver now that one-off cleanups are spent.

Already reclaimed (verified gone)

Anca/Elements (770 GiB — dir now empty), prometheus-backup (63 GiB), ollama/llamacpp/audiblez/ebook2audiobook — removed in the 2026-06-01 cleanup; nfs-mirror now excludes the regenerable services.

Cleanup candidates — homelab (`Backup/Viki/`, Administrator-owned)

| Target | Size | Notes |
| --- | --- | --- |
| `Photos/gphotos-1/` | 208 GiB zips (+ extracted) | 2023 Google Takeout, already imported to Immich (`immich-go.exe` beside them; dupes confirmed). Redundant. |
| `laptop/` | ~167 GiB | old VM images (Kali/windows vdis, metasploitable, soton-rpi.img) |
| `All-in-one/` | ~95 GiB | 2015–2018 archives |
| `#recycle/` (Backup) | ~16 GiB | recycle bin (HA backup rotation) |
| loose `.asc`/`.mov` in `Viki/` root | ~8 GiB | old encrypted archives, phone videos |
| `sgs7/` | ~3.5 GiB | 2021 Galaxy S7 backup |

~500 GiB reclaimable without touching live backups or family data.
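The ~500 GiB figure is the sum of the table rows, and the same arithmetic gives a rough post-cleanup fill level. This is a back-of-the-envelope sketch: the sizes are the approximate values listed above, the volume figures come from the 2026-06-05 assessment, and extracted Takeout copies are not counted.

```python
candidates_gib = {
    "Photos/gphotos-1/": 208,   # zips only; extracted copies would add more
    "laptop/": 167,
    "All-in-one/": 95,
    "#recycle/": 16,
    "loose .asc/.mov": 8,
    "sgs7/": 3.5,
}
total_gib = sum(candidates_gib.values())   # 497.5 GiB

size_gib = 5.3 * 1024                      # /volume1 size
free_after_gib = 324 + total_gib           # current free space plus reclaimed space
used_pct_after = 100 * (1 - free_after_gib / size_gib)   # about 85% used
```

Because of snapshot pinning, the volume would only reach that level once the Backup share snapshots referencing the deleted data have rotated out.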

Cleanup candidates — family (flag to Emo, do not delete)

Emo/D/ Windows 7 vmdks — 3 identical 39.5 GiB copies (one live + two under _SYNCAPP/Versioning/) → 79 GiB dedup.
Emo-shared recycle bin: 12.6 GiB.

Do NOT touch

Viki/pve-backup/ (live structured backup), Viki/nfs/immich + nfs-ssd/immich (irreplaceable), HomeAssistant/ + ha_backup_vermont/ (~7 GiB, healthy 3-copy retention).
