Compare commits

2 commits

Author SHA1 Message Date
Viktor Barzin
2e08c85308 backup: retire anca-elements-mirror + anca-elements-sync.sh
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
ci/woodpecker/push/build-cli Pipeline was canceled
Both subsumed by nfs-mirror (deployed earlier this session) — see
commit 4d756be4. anca-elements-sync.sh is now dead code because its
upstream (Synology /volume1/Backup/Anca/Elements) was deleted today
once the sda mirror was parity-verified (109,624 files /
827,480,937,976 bytes equal both sides). PVE NFS is the source of
truth for the archive from here on.
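
Review note: the message does not say how parity was checked. A minimal sketch of one way to obtain a "file count / total bytes" pair per tree is shown here. The function name `tree_fingerprint` and the throwaway demo directories are hypothetical, it relies on GNU `find -printf`, and it compares sizes only, not contents.

```bash
#!/usr/bin/env bash
set -euo pipefail

# tree_fingerprint DIR: print "<file count> <total bytes>" for regular files under DIR.
# %.0f keeps large byte totals intact on awk builds with a 32-bit %d.
tree_fingerprint() {
  find "$1" -type f -printf '%s\n' |
    awk '{ n++; bytes += $1 } END { printf "%d %.0f\n", n, bytes }'
}

# Demo on two temporary trees; a real check would point at the source and the mirror.
a="$(mktemp -d)"; b="$(mktemp -d)"
trap 'rm -rf "$a" "$b"' EXIT
printf 'hello' > "$a/x.txt"
mkdir "$a/sub"
printf 'abc' > "$a/sub/y.txt"
cp -r "$a/." "$b/"

if [ "$(tree_fingerprint "$a")" = "$(tree_fingerprint "$b")" ]; then
  echo "parity OK: $(tree_fingerprint "$a")"   # prints: parity OK: 2 8
else
  echo "parity MISMATCH" >&2
  exit 1
fi
```

Equal counts and byte totals do not prove identical contents; a checksum pass (for example an `rsync --checksum` dry run) would be the stricter test.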

Final script inventory on the PVE host (down from 6 to 4):
- /usr/local/bin/daily-backup           (block PVCs + sqlite + pfsense)
- /usr/local/bin/lvm-pvc-snapshot       (snapshot management)
- /usr/local/bin/nfs-mirror             (NFS local mirror to sda)
- /usr/local/bin/offsite-sync-backup    (sda + bypass-list NFS to Synology)
2026-05-24 14:58:30 +00:00
Viktor Barzin
d6590612b2 immich: bulk-import Anca's Elements photo archive into her account
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.

Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
  consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
  posting to immich-server.immich.svc:2283 with Anca's API key (synced
  via the existing immich-secrets ExternalSecret from
  secret/immich.anca_api_key). Filters to image extensions, bans the
  non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
  docs, etc.), puts every asset in the album "Poze (Elements)". Default
  `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
  procedure (no pve-host TF stack exists for this).

Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:12:30 +00:00
7 changed files with 181 additions and 223 deletions

View file

@@ -1,6 +1,6 @@
# Storage Architecture
-Last updated: 2026-05-09
+Last updated: 2026-05-24
## Overview
@@ -13,7 +13,7 @@ The cluster uses two storage backends: **Proxmox CSI** for database block storag
All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on 2026-04-15. This eliminates the previous double-CoW (ZFS + LVM-thin) path and ensures data-at-rest encryption.
**NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127`. Two NFS export roots exist:
-- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (3TB) — bulk media and backup targets
+- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
@@ -31,7 +31,7 @@ graph TB
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>~67 proxmox-lvm PVCs<br/>~28 proxmox-lvm-encrypted PVCs"]
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
-NFS_HDD["LV pve/nfs-data (3TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
+NFS_HDD["LV pve/nfs-data (4TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)<br/>/srv/nfs-ssd<br/>High-performance data<br/>(Immich ML)"]
NFS_Exports["NFS Exports<br/>managed by /etc/exports"]
NFS_HDD --> NFS_Exports
@@ -74,7 +74,7 @@ graph TB
| **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
| **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Non-sensitive stateful apps |
| **StorageClass `proxmox-lvm-encrypted`** | RWO, WaitForFirstConsumer, LUKS2 | Cluster-wide | **All sensitive data** (databases, auth, email, passwords, git) |
-| Proxmox NFS (HDD) | LV `pve/nfs-data`, 3TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
+| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |

View file

@@ -0,0 +1,47 @@
# Runbook: Grow `/srv/nfs` LV (`pve/nfs-data`)
Use when `/srv/nfs` on the PVE host is filling up and the workloads writing to it cannot be slimmed down. The LV sits on the LVM-thin pool `pve/data` (10.54 TB total). Thin-pool free space is the real gate — confirm before extending.
## When to use
- `df -h /srv/nfs` shows usage > ~85 % and projected growth exceeds free space within a backup retention window.
- An upcoming bulk write (media import, restore) needs headroom that the current free space won't absorb.
## Steps
1. **Check thin-pool headroom on PVE host:**
```bash
ssh root@192.168.1.127 'lvs pve/data; lvs pve/nfs-data; df -h /srv/nfs'
```
The `pve/data` thin pool's `Data%` should leave room for the extension (target `Data%` after extend < 90 %).
2. **Extend the LV and online-resize ext4:**
```bash
ssh root@192.168.1.127 '
lvextend -L +1T pve/nfs-data &&
resize2fs /dev/pve/nfs-data
'
```
Both commands are safe online: `lvextend` only grows allocation, `resize2fs` extends ext4 while mounted.
3. **Verify:**
```bash
ssh root@192.168.1.127 'lvs pve/nfs-data; df -h /srv/nfs'
```
`df` should show the new size; `Use%` should drop proportionally.
## Notes
- **Not Terraform-managed.** PVE host LVs live outside the IaC tree (no `infra/stacks/pve-host/`). Record the new size in `docs/architecture/storage.md` (the "HDD NFS" line, the diagram label, and the "Proxmox NFS (HDD)" table row) in the same commit.
- **Thin-pool overcommit warning** from `lvextend` is informational — it reports the sum of all thin volume virtual sizes (currently ~12 TiB) vs. the physical pool (10.7 TiB). Real fill is `pve/data` `Data%`; ignore the overcommit warning unless `Data%` itself is climbing toward 100 %.
- **`/srv/nfs-ssd`** lives on a separate LV (`ssd/nfs-ssd-data`) backed by SSDs — the same `lvextend`/`resize2fs` pattern applies, but the source pool is `ssd/data`.
## Backout
There is no in-place backout: ext4 cannot be shrunk while mounted (`resize2fs` only grows a mounted filesystem). Don't try to shrink `pve/nfs-data`; restore from snapshot or migrate data out and rebuild the LV instead.
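
The headroom check in step 1 can be sketched as a small calculation. This is a hypothetical helper, not part of the runbook: `projected_data_pct` is an invented name, and it assumes the worst case where every byte of the newly added space is eventually written to the thin pool. Pool size and extension size must be given in the same unit.

```bash
#!/usr/bin/env bash
set -euo pipefail

# projected_data_pct POOL_SIZE CURRENT_DATA_PCT EXTEND_SIZE
# Worst-case thin-pool Data% once the added space has been filled.
projected_data_pct() {
  awk -v pool="$1" -v pct="$2" -v ext="$3" \
    'BEGIN { printf "%.1f\n", pct + (ext / pool) * 100 }'
}

# On the PVE host the inputs could be read with something like:
#   lvs --noheadings --units t --nosuffix -o lv_size,data_percent pve/data
# Made-up numbers for illustration: a 10 TiB pool at 50 % and a +1 TiB extension.
projected_data_pct 10 50 1   # prints 60.0
```

If the result is at or above the 90 % target, free space in the pool first or extend by less.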

View file

@@ -1,15 +0,0 @@
[Unit]
Description=Mirror /srv/nfs/anca-elements to /mnt/backup (single-disk-failure protection)
After=network-online.target local-fs.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/anca-elements-mirror
StandardOutput=journal
StandardError=journal
SyslogIdentifier=anca-elements-mirror
# Big sustained IO — don't compete with foreground services.
Nice=10
IOSchedulingClass=idle
TimeoutStartSec=18000

View file

@@ -1,82 +0,0 @@
#!/usr/bin/env bash
# anca-elements-mirror — single-disk-failure mirror of /srv/nfs/anca-elements → /mnt/backup
#
# Deploy to PVE host at /usr/local/bin/anca-elements-mirror.
# Schedule: weekly Mon 04:00 via systemd timer (anca-elements-mirror.timer).
#
# WHY: /srv/nfs/anca-elements lives on the sdc thin pool. Synology no longer
# holds the original (deleted after this mirror was verified). sda /mnt/backup
# is the only other local disk with room (~770G) — this gives us a single-
# disk-failure copy. No offsite for this archive (intentional, see backup-dr.md).
#
# Idempotent: `rsync -rlt -H --delete` makes destination contents match the source.
# Re-runs only transfer changed files.
set -euo pipefail
SRC=/srv/nfs/anca-elements
DST=/mnt/backup/anca-elements
LOG=/var/log/anca-elements-mirror.log
LOCKFILE=/run/anca-elements-mirror.lock
PUSHGATEWAY="${ANCA_MIRROR_PUSHGATEWAY:-http://10.0.20.100:30091}"
PUSHGATEWAY_JOB=anca-elements-mirror
log() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG"; }
warn() { log "WARN: $*"; }
push_metrics() {
local status="${1:-0}" bytes="${2:-0}"
cat <<EOF | curl -s --connect-timeout 5 --max-time 10 --data-binary @- "${PUSHGATEWAY}/metrics/job/${PUSHGATEWAY_JOB}" 2>/dev/null || true
anca_elements_mirror_last_run_timestamp $(date +%s)
anca_elements_mirror_last_status ${status}
anca_elements_mirror_bytes ${bytes}
EOF
}
KILLED=""
cleanup() {
rm -f "$LOCKFILE"
if [ -n "$KILLED" ]; then
push_metrics 2 0 # status=2 → aborted (matches lvm-pvc-snapshot convention)
fi
}
trap cleanup EXIT
trap 'KILLED=1; exit 143' TERM INT
if ! ( set -o noclobber; echo $$ > "$LOCKFILE" ) 2>/dev/null; then
log "FATAL: another instance running (pid $(cat "$LOCKFILE" 2>/dev/null || echo unknown))"
exit 1
fi
mountpoint -q /mnt/backup || { log "FATAL: /mnt/backup not mounted"; push_metrics 1 0; exit 1; }
[ -d "$SRC" ] || { log "FATAL: source $SRC missing"; push_metrics 1 0; exit 1; }
mkdir -p "$DST"
log "=== mirror starting: $SRC → $DST ==="
SRC_SIZE_GB=$(du -sBG "$SRC" 2>/dev/null | awk '{print $1}')
log "source size: $SRC_SIZE_GB"
# -H preserves hardlinks (probably none here, cheap insurance).
# --info=stats2 emits a final transfer summary into the log.
# --no-perms / --no-owner / --no-group: source has root:www-data 2775 and
# we don't need to perfectly preserve those on the mirror copy — dest will
# inherit /mnt/backup's defaults. (Symmetric with anca-elements-sync.sh's
# choice when copying FROM Synology.)
RSYNC_RC=0
rsync \
-rlt --delete -H \
--no-perms --no-owner --no-group \
--info=stats2 \
"$SRC/" "$DST/" 2>&1 | tee -a "$LOG" || RSYNC_RC=${PIPESTATUS[0]}
DST_BYTES=$(du -sb "$DST" 2>/dev/null | awk '{print $1}')
if [ "$RSYNC_RC" -eq 0 ]; then
log "=== mirror complete; dest size: $(du -sh "$DST" | cut -f1) ==="
push_metrics 0 "$DST_BYTES"
else
log "=== mirror failed: rsync exited $RSYNC_RC ==="
push_metrics 1 "$DST_BYTES"
exit "$RSYNC_RC"
fi
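
Review note on the lock handling in the script above: `trap cleanup EXIT` is installed before the lock attempt, so an instance that fails to take the lock still runs `rm -f "$LOCKFILE"` on its way out and removes the running instance's lock. A minimal sketch of the same noclobber technique with the trap installed only after acquisition follows. The `acquire_lock` name and the temporary lock path are placeholders for the demo, not taken from the original.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder path; the retired script used /run/anca-elements-mirror.lock.
LOCKFILE="$(mktemp -u "${TMPDIR:-/tmp}/demo-lock.XXXXXX")"

# acquire_lock FILE: succeeds only if FILE does not exist yet.
# With noclobber set, '>' refuses to overwrite an existing file.
acquire_lock() {
  ( set -o noclobber; echo "$$" > "$1" ) 2>/dev/null
}

if ! acquire_lock "$LOCKFILE"; then
  echo "another instance holds $LOCKFILE" >&2
  exit 1
fi
# Install cleanup only once the lock is ours, so a losing instance never deletes it.
trap 'rm -f "$LOCKFILE"' EXIT
echo "lock acquired by pid $$"

# A second attempt while the lock is held is refused.
if acquire_lock "$LOCKFILE"; then
  echo "unexpected: lock acquired twice" >&2
  exit 1
fi
echo "second attempt refused"
```

Like the original, this does not detect stale locks left behind by a hard kill or reboot; placing the file under `/run` (a tmpfs) covers the reboot case.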

View file

@@ -1,10 +0,0 @@
[Unit]
Description=Weekly anca-elements mirror to /mnt/backup
[Timer]
OnCalendar=Mon *-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target

View file

@@ -1,112 +0,0 @@
#!/usr/bin/env bash
# anca-elements-sync.sh — copy Anca's WD-Elements backup from Synology to PVE NFS
#
# Usage:
# /usr/local/bin/anca-elements-sync.sh
#
# Idempotent: re-running after a successful sync is a no-op (only the dry-run
# verification runs, which reports "sync verified clean" immediately).
#
# Resumable: if fpsync was interrupted, resume with:
# fpsync -r /var/tmp/fpsync \
# -n 4 -s 4G \
# -o "-lptgoD -H --no-perms --no-owner --no-group --exclude=@eaDir/ --exclude=*@synoeastream --exclude=.DS_Store --exclude=Thumbs.db" \
# /mnt/synology-backup/Anca/Elements/ /srv/nfs/anca-elements/
#
# NOTE: fpsync -o = rsync options override (what we want)
# fpsync -O = fpart partition options override (NOT rsync)
# NOTE: Do NOT use -a or -r in fpsync rsync options — fpsync handles
# recursion via fpart; -r causes fpsync to warn and skip the slab.
#
# Log: /var/log/anca-elements-sync.log
set -euo pipefail
LOG=/var/log/anca-elements-sync.log
SRC_HOST=192.168.1.13
SRC_EXPORT=/volume1/Backup
SRC_SUBPATH=Anca/Elements
MOUNT_POINT=/mnt/synology-backup
DEST=/srv/nfs/anca-elements
log() {
echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG"
}
# ── 1. Ensure destination + mount-point directories exist ────────────────────
log "Step 1: ensuring directories"
mkdir -p "$DEST" "$MOUNT_POINT"
# ── 2. NFS-mount Synology read-only (skip if already mounted) ───────────────
MOUNTED_HERE=0
if mountpoint -q "$MOUNT_POINT"; then
log "Step 2: $MOUNT_POINT already mounted — skipping"
else
log "Step 2: mounting ${SRC_HOST}:${SRC_EXPORT} at $MOUNT_POINT (read-only)"
mount -t nfs \
-o ro,vers=4,nolock,soft,timeo=300,retrans=2 \
"${SRC_HOST}:${SRC_EXPORT}" \
"$MOUNT_POINT"
MOUNTED_HERE=1
log "Step 2: mount successful"
fi
# ── 3. Ensure fpsync (from fpart package) is available ──────────────────────
log "Step 3: checking for fpsync"
if ! command -v fpsync >/dev/null 2>&1; then
log "Step 3: fpsync not found — installing fpart"
apt-get install -y fpart
log "Step 3: fpart installed"
else
log "Step 3: fpsync already available"
fi
# ── 4. Run fpsync (4-way parallel, no compression — source is already-compressed media) ──
log "Step 4: starting fpsync"
log " source : ${MOUNT_POINT}/${SRC_SUBPATH}/"
log " dest : ${DEST}/"
log " workers: 4, slab: 4G"
fpsync \
-n 4 \
-s 4G \
-o "-lptgoD -H --no-perms --no-owner --no-group --exclude=@eaDir/ --exclude=*@synoeastream --exclude=.DS_Store --exclude=Thumbs.db" \
"${MOUNT_POINT}/${SRC_SUBPATH}/" \
"${DEST}/" \
2>&1 | tee -a "$LOG"
log "Step 4: fpsync completed"
# ── 5. Verification dry-run ──────────────────────────────────────────────────
log "Step 5: running dry-run verification rsync"
VERIFY_OUT=$(rsync \
-rlptgoD -H --no-perms --no-owner --no-group \
--exclude='@eaDir/' --exclude='*@synoeastream' \
--exclude='.DS_Store' --exclude='Thumbs.db' \
-n --delete \
--info=progress2 \
--out-format='%o %f' \
"${MOUNT_POINT}/${SRC_SUBPATH}/" \
"${DEST}/" \
2>&1 || true)
# Count lines that represent actual file changes (send / del. operations)
CHANGE_COUNT=$(echo "$VERIFY_OUT" | grep -cE '^(send|del\.)' || true)
if [ "$CHANGE_COUNT" -eq 0 ]; then
log "Step 5: sync verified clean — no pending changes"
else
log "Step 5: WARNING — verification found ${CHANGE_COUNT} pending change(s). First 50 lines:"
# `|| true` guards against a non-zero pipeline status if head closes the pipe early (set -o pipefail)
{ echo "$VERIFY_OUT" | head -50; } >> "$LOG" 2>&1 || true
fi
# ── 6. Unmount (only if we mounted it) ──────────────────────────────────────
if [ "$MOUNTED_HERE" -eq 1 ]; then
log "Step 6: unmounting $MOUNT_POINT"
umount "$MOUNT_POINT"
rmdir "$MOUNT_POINT"
log "Step 6: unmounted"
else
log "Step 6: mount was pre-existing — leaving in place"
fi
log "Done. Final size: $(du -sh "${DEST}" | cut -f1)"
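
Review note: the change counting in step 5 of the script above can be exercised without a Synology mount. The sketch below isolates it. `count_pending_changes` is a hypothetical wrapper name and the sample input is hand-written, not captured from a real rsync run. It uses the same pattern as the original, which relies on `--out-format='%o %f'` prefixing each itemized line with the operation (`send`, `recv` or `del.`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_pending_changes: read rsync dry-run output on stdin and print the number
# of lines that start with a "send" or "del." operation.
count_pending_changes() {
  # grep -c prints the count but exits 1 when the count is zero;
  # '|| true' keeps 'set -e' from aborting on a clean result.
  grep -cE '^(send|del\.)' || true
}

# Two pending operations plus a summary line that must not be counted.
printf '%s\n' \
  'send photos/a.jpg' \
  'del. photos/old.jpg' \
  'sent 120 bytes  received 19 bytes' |
  count_pending_changes   # prints 2
```

A clean sync yields `0`, which is the value the original compares against before logging "sync verified clean".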

View file

@@ -124,6 +124,19 @@ module "nfs_ml_cache_host" {
nfs_path = "/srv/nfs-ssd/immich/machine-learning"
}
# Read-only source for one-shot bulk imports into individual users' accounts
# (currently: Anca's WD Elements dump, mirrored to /srv/nfs/anca-elements from
# her Synology). Consumed only by the import Job below; NOT mounted into the
# immich-server Deployment. PVC stays after the Job is removed so videos can
# follow in batch 2.
module "nfs_anca_elements_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "immich-anca-elements-host"
namespace = kubernetes_namespace.immich.metadata[0].name
nfs_server = var.proxmox_host
nfs_path = "/srv/nfs/anca-elements"
}
resource "kubernetes_namespace" "immich" {
metadata {
name = "immich"
@@ -865,6 +878,123 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
}
}
# One-shot bulk import of Anca's Synology Elements photo archive into her
# Immich account. Reads /srv/nfs/anca-elements via the RO PVC above and posts
# assets to immich-server in-cluster (bypasses ingress + CrowdSec entirely).
#
# Auth: Anca's personal Immich API key. Add to Vault `secret/immich` under key
# `anca_api_key`, then force-refresh the existing `immich-secrets` ExternalSecret:
# kubectl annotate externalsecret immich-secrets -n immich \
# force-sync=$(date +%s) --overwrite
#
# After successful completion: REMOVE this resource block + apply again. The
# PVC stays for a videos batch later. Filters target a photo-only subset of
# the dump (videos / installers / docs / courses banned); EXIF is preserved
# end-to-end since immich-go uploads originals byte-for-byte.
resource "kubernetes_job_v1" "anca_elements_import" {
metadata {
name = "anca-elements-import"
namespace = kubernetes_namespace.immich.metadata[0].name
labels = {
app = "anca-elements-import"
tier = local.tiers.gpu
}
}
# Don't block `terragrunt apply` on the multi-hour upload: TF returns once
# the Job is created; monitor via `kubectl logs -n immich -f job/...`.
wait_for_completion = false
spec {
backoff_limit = 2
ttl_seconds_after_finished = 604800
template {
metadata {
labels = {
app = "anca-elements-import"
}
}
spec {
restart_policy = "OnFailure"
container {
name = "immich-go"
image = "alpine:3.20"
command = [
"/bin/sh",
"-c",
<<-EOT
set -eu
apk add --no-cache curl tar ca-certificates >/dev/null
IMMICH_GO_VERSION="v0.31.0"
cd /tmp
echo "Downloading immich-go $${IMMICH_GO_VERSION}"
curl -sL "https://github.com/simulot/immich-go/releases/download/$${IMMICH_GO_VERSION}/immich-go_Linux_x86_64.tar.gz" \
| tar -xz
chmod +x ./immich-go
echo "Starting upload from /data → http://immich-server.immich.svc.cluster.local:2283 …"
exec ./immich-go upload from-folder /data \
--server http://immich-server.immich.svc.cluster.local:2283 \
--api-key "$${IMMICH_API_KEY}" \
--include-extensions .jpg,.jpeg,.png,.heic,.heif,.gif,.tif,.tiff,.webp,.nef,.cr2,.dng,.raw \
--into-album "Poze (Elements)" \
--ban-file "filme/" --ban-file "Music/" --ban-file "carti/" \
--ban-file "cursuri/" --ban-file "Adobe.*/" \
--ban-file "Fullstack Web Development*/" \
--ban-file "Contracte and CV/" --ban-file "Cv/" \
--ban-file "docum/" --ban-file "finance/" \
--ban-file "download/" --ban-file "kit/" \
--ban-file "csp/" --ban-file "KOREAN/" \
--ban-file "System Volume Information/" \
--pause-immich-jobs=false \
--concurrent-tasks 8 \
--client-timeout 1h \
--no-ui \
--on-errors continue
EOT
]
env {
name = "IMMICH_API_KEY"
value_from {
secret_key_ref {
name = "immich-secrets"
key = "anca_api_key"
}
}
}
volume_mount {
name = "anca-elements"
mount_path = "/data"
read_only = true
}
resources {
requests = {
cpu = "500m"
memory = "1Gi"
}
limits = {
memory = "1Gi"
}
}
}
volume {
name = "anca-elements"
persistent_volume_claim {
claim_name = module.nfs_anca_elements_host.claim_name
read_only = true
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
depends_on = [kubernetes_manifest.external_secret]
}
# POWER TOOLS
# resource "kubernetes_deployment" "powertools" {