forgejo: pin to v11.0.14 + disable Keel (image-rewrite incident 2026-05-24)

On 2026-05-24T15:35:37Z Keel's force-policy rewrote the image tag from `11.0.14 → 1.18` (codeberg.org/forgejo/forgejo). v1.18 is a Gitea-era Forgejo (Forgejo forked from Gitea at 1.18 and used pre-Forgejo versioning early on); the DB had already been migrated to schema 305 by 11.0.14, and 1.18 only knows up to migration 231 → pod refused to start ("Your database (migration version: 305) is for a newer Gitea, you can not use the newer database for this old Gitea release (231)"). Exact replay of the 2026-05-16 force-policy tag-rewriting bug (memory id=1933).

Changes:
- Pin image to explicit `:11.0.14` (latest 11.x, published 2026-05-12)
- Add `keel.sh/policy: "never"` deploy annotation, which overrides the Kyverno-stamped `force` policy via the chart's `+()` anchor semantics (memory id=1972). Keel will no longer touch this workload.
- Drop KEEL_IGNORE_IMAGE from `lifecycle.ignore_changes` (TF owns the image now). Restore it if you flip Keel back to `force`.
- Add the KEEL_LIFECYCLE_V1 trio (`kubernetes.io/change-cause`, `deployment.kubernetes.io/revision`, `keel.sh/update-time` on the pod template) so future TF applies don't fight K8s rollout metadata.

Verified: new pod on v11.0.14 came up Running 1/1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

f1-stream: revive aceztrims + pitsport, more ppv variants
2026-05-24 22:06:59 +00:00 · 2026-05-24 22:05:37 +00:00 · 2026-05-24 18:34:41 +00:00 · 2026-05-24 16:27:42 +00:00 · 2026-05-24 16:18:44 +00:00
9 changed files with 347 additions and 167 deletions
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -1,18 +1,34 @@
 # Backup & Disaster Recovery Architecture

-Last updated: 2026-04-13
+Last updated: 2026-05-24
+
+> **2026-05-24 session — what changed today** (deeper structural review pending — see the open backup-pipeline simplification audit):
+> - **anca-elements archive direction inverted** — Synology `/Backup/Anca/Elements` (770G) deleted; PVE `/srv/nfs/anca-elements` is now source of truth. `anca-elements-sync.sh` retired.
+> - **`anca-elements-mirror.{sh,service,timer}` retired**, subsumed into the new **`nfs-mirror`** weekly job covering all critical NFS subtrees (anca-elements + ~80 services) → sda.
+> - **`offsite-sync-backup` Step 2 filter inverted**: NFS-direct-to-Synology now only carries the sda-bypass paths (immich + frigate + prometheus + `*-backup` + …). Two-leg invariant: `nfs-mirror.sh EXCLUDES` ≡ `offsite-sync-backup Step 2 INCLUDES`. Cross-referenced in both scripts.
+> - **Synology `/Backup/Viki/nfs/<svc>/` orphan cleanup** — 84 dirs renamed in-place (btrfs metadata-only) to `/Backup/Viki/pve-backup/<svc>/` so daily-incremental Step 1 sees them as pre-existing and only ships deltas. No re-transfer.
+> - **Synology snapshot retention 7d → 3d**, all 8 backlog snapshots deleted via `sudo synosharesnapshot delete Backup ...`. Reclaimed ~800G btrfs (98% → 83% used). DSM API was blocked by 2FA; `sudo` over the existing `Administrator` SSH key worked with the Vault-stored password.
+> - **Manifest mechanism extended**: `nfs-mirror` now appends its transferred file list to `/mnt/backup/.changed-files` so daily Step 1 incremental picks it up (was previously only fed by `daily-backup`).

 ## Overview

-The homelab uses a defense-in-depth 3-2-1 backup strategy: **3 copies** (live PVCs on sdc, weekly backups on sda, offsite on Synology), **2 media types** (SSD thin LVM, HDD), **1 offsite copy** (Synology NAS). This architecture provides <1s RPO for recent changes (via 7-day LVM snapshots), <7d RPO for file-level recovery, and <30min RTO for most services.
+The homelab runs a 3-2-1 strategy with a **two-leg** path to Synology so every NFS byte takes exactly one route to offsite (no duplication, no gaps):
+
+```
+sdc /srv/nfs/<svc>/   ──nfs-mirror weekly──→  sda /mnt/backup/<svc>/   ──offsite-sync Step 1──→  Synology /Backup/Viki/pve-backup/<svc>/      [leg 1]
+sdc /srv/nfs/<bypass>/  ──inotify (nfs-change-tracker)──→  offsite-sync Step 2  ──→  Synology /Backup/Viki/nfs/<bypass>/                       [leg 2]
+sdc PVCs (LVM thin)   ──daily-backup~snapshot~rsync──→  sda /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config}/  ──Step 1──→  Synology /Backup/Viki/pve-backup/
+```
+
+The **bypass list** (paths that take leg 2 — too big for sda, transient, or already-a-backup): `immich`, `frigate`, `prometheus`, `loki`, `temp`, `alertmanager`, `ollama`, `audiblez`, `ebook2audiobook`, `*-backup`. Anything NOT in this list rides leg 1 via `nfs-mirror`.

 **3-2-1 Breakdown**:
- **Copy 1** (live): All PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD)
- **Copy 2** (local backup): Weekly file-level backup to sda `/mnt/backup` (1.1TB RAID1 SAS)
- **Copy 3** (offsite): Synology NAS at 192.168.1.13:
-  - `Synology/Backup/Viki/pve-backup/` — PVC snapshots, pfSense, PVE config (rsync from sda weekly)
-  - `Synology/Backup/Viki/nfs/` — NFS HDD data (inotify change-tracked rsync from `/srv/nfs`)
-  - `Synology/Backup/Viki/nfs-ssd/` — NFS SSD data (inotify change-tracked rsync from `/srv/nfs-ssd`)
+- **Copy 1** (live): all PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD); all NFS data at `/srv/nfs[-ssd]/`
+- **Copy 2** (local backup): sda `/mnt/backup` (1.1TB RAID1 SAS) — at **~90% used** post-2026-05-24 (was ~10% in April)
+- **Copy 3** (offsite): Synology NAS at 192.168.1.13 — at **~83% used / 934G free** post-2026-05-24 (was 98% / 121G before today's cleanup)
+  - `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs)
+  - `Synology/Backup/Viki/nfs/` — bypass-list NFS (immich, frigate, etc.)
+  - `Synology/Backup/Viki/nfs-ssd/` — bypass-list SSD NFS (immich-ML, ollama, llamacpp)

 ## Architecture Diagram

@@ -366,6 +382,38 @@ Pushes `nfs_mirror_last_run_timestamp` + `nfs_mirror_last_status` + `nfs_mirror_

 > TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04-13). The current offsite path is inotify-change-tracked rsync from the Proxmox host NFS (`/srv/nfs`, `/srv/nfs-ssd`) to Synology.

+### Synology snapshot management
+
+Synology DSM keeps daily btrfs snapshots of every shared folder (the `Backup` share most importantly). Retention is configured per-share in DSM's Snapshot Replication app, and persists in `synosharesnapshot shareconf`.
+
+**Current settings** (`Backup` share, 2026-05-24): daily at 02:00, **`snap_auto_remove_keep_days=3`** (tightened from 7 to reduce the window where deleted data continues to consume space).
+
+Snapshots are CoW — deleting a file from the live filesystem does NOT free its blocks while any retained snapshot references them. Reclaim only happens after ALL referencing snapshots roll off.
+
+**DSM Web API is gated by 2FA (FIDO/OTP)** — programmatic snapshot management has to go via SSH + sudo instead:
+
+```bash
+# Password is in Vault: secret/viktor → synology_admin_password
+PASS=$(VAULT_ADDR=https://vault.viktorbarzin.me vault kv get -field=synology_admin_password secret/viktor)
+
+# List snapshots on the Backup share
+ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup"
+
+# Bulk delete ALL snapshots (reclaims everything once btrfs cleaner runs)
+ssh Administrator@192.168.1.13 "
+  SNAPS=\$(echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup 2>/dev/null \
+    | grep -oE 'GMT-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+\.[0-9]+\.[0-9]+' | sort -u)
+  echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot delete Backup \$SNAPS
+"
+
+# Tighten retention
+ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot shareconf set Backup snap_auto_remove_keep_days=3"
+```
+
+The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by minutes (typical reclaim rate observed 2026-05-24: ~300 MB/s sustained, with bursts of 800 GB in 2 minutes).
+
+> Memory: id=2673-2676 (Synology snapshot retention gotcha — deletion vs reclaim timing).
+
 ## Configuration

 ### Key Files
@@ -387,6 +435,8 @@ Pushes `nfs_mirror_last_run_timestamp` + `nfs_mirror_last_status` + `nfs_mirror_
 | `stacks/vault/` | Terraform: Vault backup CronJob |
 | `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
 | `stacks/monitoring/` | Terraform: Prometheus alerts |
+| `synology:Administrator@192.168.1.13` | Synology SSH; sudo password = Vault `secret/viktor` `synology_admin_password`; DSM API itself gated by 2FA |
+| `/usr/syno/sbin/synosharesnapshot` | Synology: btrfs snapshot CLI — must run as root via sudo |

 ### Vault Paths

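The two-leg invariant stated in the doc change (`nfs-mirror.sh EXCLUDES` ≡ `offsite-sync-backup Step 2 INCLUDES`) is currently enforced only by cross-referencing comments. A minimal sketch of a mechanical check follows, assuming both lists can be reduced to top-level directory patterns. The names `route`, `check_two_leg_invariant`, and `allowed_gaps`, plus the sample services `nextcloud` and `mysql-backup`, are hypothetical and do not appear in either script. Only the bypass list itself is taken from the doc.

```python
from fnmatch import fnmatch

# Bypass list as written in backup-dr.md (paths that take leg 2).
BYPASS = ["immich", "frigate", "prometheus", "loki", "temp", "alertmanager",
          "ollama", "audiblez", "ebook2audiobook", "*-backup"]


def route(service: str, bypass_patterns: list[str]) -> str:
    """Return which leg a top-level NFS directory takes to Synology."""
    if any(fnmatch(service, pat) for pat in bypass_patterns):
        return "leg2"  # inotify change log, then offsite-sync Step 2
    return "leg1"      # nfs-mirror to sda, then offsite-sync Step 1


def check_two_leg_invariant(mirror_excludes, step2_includes, allowed_gaps=frozenset()):
    """Return patterns present in only one list; an empty set means the invariant holds.

    allowed_gaps is a hypothetical allowlist for deliberate exceptions,
    i.e. paths excluded from nfs-mirror that are also not shipped via leg 2.
    """
    return (set(mirror_excludes) ^ set(step2_includes)) - set(allowed_gaps)


print(route("immich", BYPASS), route("nextcloud", BYPASS))
print(check_two_leg_invariant(BYPASS + ["anca-elements"], BYPASS))
```

The `/anca-elements/` exclude that the `nfs-mirror.sh` diff adds is one such deliberate exception: it is excluded from leg 1 without being added to leg 2, so a strict equality check would flag it unless it is allowlisted.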
--- a/scripts/daily-backup.sh
+++ b/scripts/daily-backup.sh
@@ -20,6 +20,34 @@ log()  { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
 warn() { log "WARN: $*" >&2; }
 die()  { log "FATAL: $*" >&2; push_metrics 1 0; exit 1; }

+# --- Manifest append helper ---
+# Both daily-backup and nfs-mirror append to /mnt/backup/.changed-files.
+# If their runs overlap (e.g. nfs-mirror Mon 04:11 still running when
+# daily-backup starts Mon 05:00) the appends can interleave mid-line.
+# `flock -x` on a sibling lock file makes appends atomic across processes.
+MANIFEST_LOCK="${MANIFEST}.lock"
+manifest_append() {
+    (
+        flock -x 200
+        cat >> "${MANIFEST}"
+    ) 200>"${MANIFEST_LOCK}"
+}
+
+# Cap manifest size to prevent unbounded growth (e.g. Synology unreachable
+# for many days, every daily-backup keeps appending). At >500k lines,
+# `--files-from=` rsync becomes pathological — fall back to a full Step 1
+# sync by signalling offsite-sync to ignore the manifest this round.
+MANIFEST_MAX_LINES=500000
+check_manifest_size() {
+    [ -f "${MANIFEST}" ] || return 0
+    local lines
+    lines=$(wc -l < "${MANIFEST}" 2>/dev/null || echo 0)
+    if [ "${lines:-0}" -gt "${MANIFEST_MAX_LINES}" ]; then
+        warn "manifest at ${lines} lines (>${MANIFEST_MAX_LINES}) — flagging next offsite-sync as full"
+        touch "${BACKUP_ROOT}/.force-full-sync"
+    fi
+}
+
 # --- Locking ---
 # Track whether we got SIGTERM/SIGINT so cleanup can push a non-success metric.
 # Without this, a systemd timeout-kill leaves WeeklyBackupFailing alerts blind:
@@ -123,7 +151,7 @@ check_nfs_exports() {
 }

 # --- Main ---
-log "=== Weekly backup starting ==="
+log "=== daily-backup starting ==="

 if ! mountpoint -q "${BACKUP_ROOT}"; then
    die "${BACKUP_ROOT} is not mounted"
@@ -138,16 +166,25 @@ check_nfs_exports || {
 STATUS=0
 TOTAL_BYTES=0

-# Clear manifest for this run
-> "${MANIFEST}"
+# DO NOT truncate the manifest here.
+#
+# Truncation lives in offsite-sync-backup (only on successful sync). If
+# offsite-sync failed yesterday — Synology unreachable, transient error —
+# the manifest holds yesterday's unconsumed file list. Truncating at the
+# start of today's daily-backup would silently lose those entries; they'd
+# only reach Synology on the next monthly full sync.
+#
+# Appending duplicates across multiple runs is harmless — rsync transfers
+# each file once. If the manifest grows pathologically (Synology down for
+# weeks), the OffsiteBackupSync{Stale,Failing} alerts catch it.

-# NFS data is synced directly to Synology via inotifywait + offsite-sync-backup.sh
-# No NFS mirror step on sda — saves 53GB and eliminates duplication.
+# NFS data is synced to Synology via two paths: nfs-mirror → sda → Step 1
+# for the curated subset, and inotify + Step 2 for the sda-bypass list.

 # ============================================================
 # STEP 1: PVC file-level copy from LVM thin snapshots
 # ============================================================
-log "--- Step 2: PVC file copy from snapshots ---"
+log "--- Step 1: PVC file copy from snapshots ---"
 WEEK=$(date +%Y-%W)
 PREV=$(ls -1d "${BACKUP_ROOT}/pvc-data"/????-?? 2>/dev/null | tail -1 || true)

@@ -215,7 +252,7 @@ else
            # (immich-postgres ~10 GiB, ~3 min on local ext4) and well
            # below the unit-level budget so we still have headroom to
            # finish the rest.
-            timeout 1800 rsync -az --delete \
+            timeout 1800 rsync -a --delete \
                ${PREV:+--link-dest="${PREV}/${ns_pvc}/"} \
                "${PVC_MOUNT}/" "${dst}/" 2>&1 || rsync_rc=$?
            if [ "$rsync_rc" -eq 0 ]; then
@@ -274,10 +311,10 @@ else
    log "  PVC copy: ${PVC_COUNT} OK, ${PVC_FAIL} failed"
    [ "${PVC_FAIL}" -gt 0 ] && STATUS=1

-    # Add PVC files to manifest
+    # Add PVC files to manifest (locked append)
    if [ -d "${BACKUP_ROOT}/pvc-data/${WEEK}" ]; then
        find "${BACKUP_ROOT}/pvc-data/${WEEK}" -type f 2>/dev/null | \
-            sed "s|^${BACKUP_ROOT}/||" >> "${MANIFEST}"
+            sed "s|^${BACKUP_ROOT}/||" | manifest_append
    fi

    # Prune old weekly versions (keep 4)
@@ -301,23 +338,31 @@ if timeout 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@10.0.20.1 true 2>/de
    # config.xml — primary restore artifact
    if scp -o ConnectTimeout=10 root@10.0.20.1:/cf/conf/config.xml "${PFSENSE_DEST}/config-${DATE}.xml" 2>/dev/null; then
        log "  OK: config.xml"
-        echo "pfsense/config-${DATE}.xml" >> "${MANIFEST}"
+        echo "pfsense/config-${DATE}.xml" | manifest_append
    else
        warn "Failed to copy pfsense config.xml"
        STATUS=1
        PFSENSE_STATUS=1
    fi

-    # Full filesystem tar
-    if ssh -o ConnectTimeout=10 root@10.0.20.1 \
-        "tar czf - --exclude=/dev --exclude=/proc --exclude=/tmp --exclude=/var/run /" \
-        > "${PFSENSE_DEST}/pfsense-full-${DATE}.tar.gz" 2>/dev/null; then
-        log "  OK: full tar ($(du -sh "${PFSENSE_DEST}/pfsense-full-${DATE}.tar.gz" | cut -f1))"
-        echo "pfsense/pfsense-full-${DATE}.tar.gz" >> "${MANIFEST}"
+    # Full filesystem tar — Sundays only (weekly).
+    # config.xml is the primary restore artifact and runs daily above; the
+    # full filesystem tar is for forensic / package-state recovery only and
+    # rarely-needed. Re-tarring 100M+ daily writes ~3G/month to sda + Synology
+    # for unchanged content. Keep one fresh tarball per week instead.
+    if [ "$(date +%u)" = "7" ]; then
+        if ssh -o ConnectTimeout=10 root@10.0.20.1 \
+            "tar czf - --exclude=/dev --exclude=/proc --exclude=/tmp --exclude=/var/run /" \
+            > "${PFSENSE_DEST}/pfsense-full-${DATE}.tar.gz" 2>/dev/null; then
+            log "  OK: weekly full tar ($(du -sh "${PFSENSE_DEST}/pfsense-full-${DATE}.tar.gz" | cut -f1))"
+            echo "pfsense/pfsense-full-${DATE}.tar.gz" | manifest_append
+        else
+            warn "Failed to tar pfsense filesystem"
+            STATUS=1
+            PFSENSE_STATUS=1
+        fi
    else
-        warn "Failed to tar pfsense filesystem"
-        STATUS=1
-        PFSENSE_STATUS=1
+        log "  skip weekly full tar (only runs Sundays)"
    fi

    # Retention: keep 4 weekly copies
@@ -344,13 +389,15 @@ fi
 # ============================================================
 log "--- Step 4: PVE host config ---"
 mkdir -p "${BACKUP_ROOT}/pve-config/scripts"
-timeout 300 rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
+timeout 300 rsync -a --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
 for script in /usr/local/bin/lvm-pvc-snapshot /usr/local/bin/daily-backup /usr/local/bin/offsite-sync-backup; do
    [ -f "${script}" ] && cp "${script}" "${BACKUP_ROOT}/pve-config/scripts/" 2>/dev/null || true
 done
-find "${BACKUP_ROOT}/pve-config" -type f 2>/dev/null | sed "s|^${BACKUP_ROOT}/||" >> "${MANIFEST}"
+find "${BACKUP_ROOT}/pve-config" -type f 2>/dev/null | sed "s|^${BACKUP_ROOT}/||" | manifest_append
 log "  OK: PVE config"

+check_manifest_size
+
 # ============================================================
 # STEP 5: Prune LVM snapshots older than 7 days
 # ============================================================
@@ -361,6 +408,6 @@ log "--- Step 5: Snapshot pruning (7-day retention) ---"
 # Done
 # ============================================================
 MANIFEST_LINES=$(wc -l < "${MANIFEST}" 2>/dev/null || echo 0)
-log "=== Weekly backup complete (status=${STATUS}, ${TOTAL_BYTES} bytes, ${MANIFEST_LINES} files in manifest) ==="
+log "=== daily-backup complete (status=${STATUS}, ${TOTAL_BYTES} bytes, ${MANIFEST_LINES} files in manifest) ==="
 push_metrics "${STATUS}" "${TOTAL_BYTES}"
 exit "${STATUS}"
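The `manifest_append` and `check_manifest_size` helpers added in this file combine two ideas: serialize appends through an exclusive lock on a sibling file, and flag a full sync once the manifest passes a line cap. A rough Python equivalent is sketched below for readers who want to reason about the behaviour outside bash. It assumes a POSIX host (`fcntl.flock` is not available on Windows). The function `needs_full_sync` is a hypothetical stand-in that returns a boolean instead of touching `.force-full-sync`.

```python
import fcntl
import os


def manifest_append(manifest_path: str, lines: list[str]) -> None:
    """Append lines to the manifest while holding an exclusive lock on <manifest>.lock."""
    lock_path = manifest_path + ".lock"
    with open(lock_path, "w") as lock_fh:
        # Blocks until any other writer holding the lock releases it,
        # so two concurrent appenders cannot interleave mid-line.
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        try:
            with open(manifest_path, "a") as fh:
                fh.writelines(line + "\n" for line in lines)
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def needs_full_sync(manifest_path: str, max_lines: int = 500_000) -> bool:
    """Mirror of the size cap: True once the manifest exceeds max_lines."""
    if not os.path.isfile(manifest_path):
        return False
    with open(manifest_path, "rb") as fh:
        return sum(1 for _ in fh) > max_lines
```

As in the shell version, the lock only protects writers that go through the helper. Any code path that still appends with a bare `>>` bypasses it.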
--- a/scripts/nfs-mirror.sh
+++ b/scripts/nfs-mirror.sh
@@ -57,6 +57,14 @@ EXCLUDES=(
    --exclude='/.lv-pvc-mapping.json'
    --exclude='/.nfs-changes.log'

+    # ---- anca-elements: photos are being ingested into Immich (2026-05-24),
+    # so /srv/nfs/immich/library/ becomes the canonical copy and the separate
+    # anca-elements tree is redundant. Excluded from nfs-mirror going forward.
+    # The historical 771G at /mnt/backup/anca-elements/ stays put until manual
+    # cleanup once Immich ingest completes; offsite-sync Step 1 also excludes
+    # it from the Synology pve-backup/ upload so we don't ship the redundant copy.
+    --exclude='/anca-elements/'
+
    # ---- NFS paths: too big / transient / re-fetchable ----
    --exclude='/immich/'
    --exclude='/frigate/'
@@ -81,6 +89,17 @@ EXCLUDES=(
 log()  { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG"; }
 warn() { log "WARN: $*"; }

+# Locked manifest append (shared with daily-backup) — see daily-backup.sh
+# for the rationale. flock prevents interleaved appends when nfs-mirror
+# (Mon 04:11) overruns into daily-backup (Mon 05:00).
+MANIFEST_LOCK="${MANIFEST}.lock"
+manifest_append() {
+    (
+        flock -x 200
+        cat >> "${MANIFEST}"
+    ) 200>"${MANIFEST_LOCK}"
+}
+
 push_metrics() {
    local status="${1:-0}" bytes="${2:-0}"
    cat <<EOF | curl -s --connect-timeout 5 --max-time 10 --data-binary @- "${PUSHGATEWAY}/metrics/job/${PUSHGATEWAY_JOB}" 2>/dev/null || true
@@ -132,10 +151,12 @@ if [ "$RSYNC_RC" -eq 0 ]; then
    # manifest so daily Step 1 incremental picks them up tomorrow morning.
    NEW_COUNT=$(find /mnt/backup -newer "$STAMP" -type f \
        ! -path '/mnt/backup/.changed-files' \
+        ! -path '/mnt/backup/.changed-files.lock' \
        ! -path '/mnt/backup/.lv-pvc-mapping.json' \
        ! -path '/mnt/backup/.nfs-changes.log' \
        ! -path '/mnt/backup/.last-offsite-sync' \
-        -printf '%P\n' 2>/dev/null | tee -a "$MANIFEST" | wc -l)
+        ! -path '/mnt/backup/.force-full-sync' \
+        -printf '%P\n' 2>/dev/null | tee >(manifest_append) | wc -l)
    log "=== mirror complete; ${NEW_COUNT} files added to offsite manifest ==="
    log "/mnt/backup used: $(df -h --output=used /mnt/backup | tail -1 | tr -d ' ')"
    push_metrics 0 "$DST_BYTES"
--- a/scripts/offsite-sync-backup.sh
+++ b/scripts/offsite-sync-backup.sh
@@ -54,18 +54,32 @@ DAY_OF_MONTH=$(date +%d)
 # ============================================================
 log "--- Step 1: sda → Synology pve-backup/ ---"

-if [ "${DAY_OF_MONTH}" -le 7 ]; then
-    log "Monthly full sync (1st Sunday)..."
-    rsync -rltz --delete --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r \
+# Trigger: monthly cleanup window OR daily-backup signalled the manifest grew
+# past its cap (Synology was unreachable too long for incremental to keep up).
+FORCE_FULL_FLAG="${BACKUP_ROOT}/.force-full-sync"
+FORCE_FULL=""
+[ -f "${FORCE_FULL_FLAG}" ] && FORCE_FULL=1
+if [ "${DAY_OF_MONTH}" -le 7 ] || [ -n "${FORCE_FULL}" ]; then
+    [ -n "${FORCE_FULL}" ] && log "Forced full sync (manifest size cap tripped)..." || log "Monthly full sync (1st Sunday)..."
+    # No -z on LAN: gigabit hop to 192.168.1.13 doesn't benefit from compression
+    # and burns CPU on the PVE host that's already busy with cluster IO.
+    rsync -rlt --delete --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r \
        --exclude='.changed-files' \
+        --exclude='.changed-files.lock' \
        --exclude='.last-offsite-sync' \
        --exclude='.lv-pvc-mapping.json' \
        --exclude='.nfs-changes.log' \
+        --exclude='.force-full-sync' \
+        --exclude='/anca-elements/' \
        "${BACKUP_ROOT}/" "${PVE_BACKUP_DEST}/" 2>&1 || STATUS=1
+    rm -f "${FORCE_FULL_FLAG}"
 elif [ -s "${MANIFEST}" ]; then
    MANIFEST_LINES=$(wc -l < "${MANIFEST}")
    log "Incremental sync (${MANIFEST_LINES} files from manifest)..."
-    rsync -rltz --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r --files-from="${MANIFEST}" \
+    # /anca-elements is being ingested into Immich (Immich becomes canonical) —
+    # skip the redundant copy in /mnt/backup/anca-elements/ until manual cleanup.
+    rsync -rlt --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r --files-from="${MANIFEST}" \
+        --exclude='anca-elements/' \
        "${BACKUP_ROOT}/" "${PVE_BACKUP_DEST}/" 2>&1 || STATUS=1
 else
    log "No changed files in manifest, nothing to sync"
@@ -110,11 +124,11 @@ NFS_FULL_INCLUDES=(
 if [ "${DAY_OF_MONTH}" -le 7 ]; then
    # Monthly: full sync with --delete for cleanup, restricted to bypass-list.
    log "Monthly full NFS sync (sda-bypass paths only)..."
-    rsync -rltz --delete "${NFS_FULL_INCLUDES[@]}" /srv/nfs/ "${NFS_DEST}/" 2>&1 \
+    rsync -rlt --delete "${NFS_FULL_INCLUDES[@]}" /srv/nfs/ "${NFS_DEST}/" 2>&1 \
        && log "  OK: nfs/ full sync (bypass-list)" || { warn "nfs/ full sync failed"; STATUS=1; }
    # nfs-ssd: every dir under it (immich/ollama/llamacpp) is in the bypass list,
    # so a plain --delete still applies cleanly.
-    rsync -rltz --delete /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \
+    rsync -rlt --delete /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \
        && log "  OK: nfs-ssd/ full sync" || { warn "nfs-ssd/ full sync failed"; STATUS=1; }
    > "${NFS_CHANGE_LOG}"
 elif [ -s "${NFS_CHANGE_LOG}" ]; then
@@ -127,7 +141,7 @@ elif [ -s "${NFS_CHANGE_LOG}" ]; then
        > /tmp/sync-nfs.list 2>/dev/null
    NFS_COUNT=$(wc -l < /tmp/sync-nfs.list 2>/dev/null || echo 0)
    if [ "${NFS_COUNT:-0}" -gt 0 ]; then
-        rsync -rltz --files-from=/tmp/sync-nfs.list /srv/nfs/ "${NFS_DEST}/" 2>&1 \
+        rsync -rlt --files-from=/tmp/sync-nfs.list /srv/nfs/ "${NFS_DEST}/" 2>&1 \
            && log "  OK: nfs/ (${NFS_COUNT} bypass files)" \
            || { warn "nfs/ incremental failed"; STATUS=1; }
    fi
@@ -138,7 +152,7 @@ elif [ -s "${NFS_CHANGE_LOG}" ]; then
        > /tmp/sync-nfs-ssd.list 2>/dev/null || true
    SSD_COUNT=$(wc -l < /tmp/sync-nfs-ssd.list 2>/dev/null || echo 0)
    if [ "${SSD_COUNT:-0}" -gt 0 ]; then
-        rsync -rltz --files-from=/tmp/sync-nfs-ssd.list /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \
+        rsync -rlt --files-from=/tmp/sync-nfs-ssd.list /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \
            && log "  OK: nfs-ssd/ (${SSD_COUNT} files)" \
            || { warn "nfs-ssd/ incremental failed"; STATUS=1; }
    fi
--- a/stacks/f1-stream/files/backend/extractors/aceztrims.py
+++ b/stacks/f1-stream/files/backend/extractors/aceztrims.py
@@ -1,13 +1,24 @@
-"""Aceztrims extractor - scrapes F1 streaming links from Aceztrims pages.
+"""Aceztrims extractor — scrapes embed URLs from acestrlms.pages.dev/f11/.

-Parses HTML for iframe button onclick handlers and extracts streams from:
- /iframe1?s=<m3u8_url> → direct m3u8
- https://pooembed.eu/embed/... → embed URL
+The page (Cloudflare Pages, no anti-bot) hosts an iframe + a strip of
+onclick channel-switcher buttons. Each button rewrites the iframe via
+`document.getElementById('iframe').src = '<embed_url>'`. The initial
+channel is hard-coded as `<iframe id='iframe' src='...'>`.
+
+We strip HTML comments first because the page keeps ~20 legacy channel
+buttons inside `<!-- ... -->` blocks for easy re-enablement; the previous
+loose regex picked them up as false positives.
+
+All channels are iframe embeds (no direct m3u8) — `stream_type='embed'`.
+
+Site naming note: the extractor key stays `aceztrims` (the previous
+domain) so registry/cache identifiers don't churn. The current domain
+is `acestrlms.pages.dev` and the F1 path is `/f11/` (two ones — `/f1/`
+is the cross-sport schedule page and has no stream buttons).
 """

 import logging
 import re
-from urllib.parse import parse_qs, urlparse

 import httpx

@@ -17,9 +28,8 @@ from backend.extractors.models import ExtractedStream
 logger = logging.getLogger(__name__)

 BASE_URL = "https://acestrlms.pages.dev"
-# Pages to scrape for streams
 F1_PAGES = [
-    ("/f1/", "Formula 1"),
+    ("/f11/", "Formula 1"),
 ]

 USER_AGENT = (
@@ -28,13 +38,21 @@ USER_AGENT = (
    "Chrome/120.0.0.0 Safari/537.36"
 )

+# `document.getElementById('iframe').src = '<URL>'` — current channel-switcher format.
+_ONCLICK_IFRAME_SRC = re.compile(
+    r"""document\.getElementById\(['"]iframe['"]\)\.src\s*=\s*['"]([^'"]+)['"]""",
+    re.IGNORECASE,
+)
+# `<iframe id='iframe' src='<URL>'>` — the default/initial channel.
+_DEFAULT_IFRAME = re.compile(
+    r"""<iframe[^>]*id\s*=\s*['"]iframe['"][^>]*src\s*=\s*['"]([^'"]+)['"]""",
+    re.IGNORECASE,
+)
+_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
+

 class AceztrimsExtractor(BaseExtractor):
-    """Extracts streams from Aceztrims pages by parsing HTML for iframe URLs.
-
-    Looks for onclick handlers on buttons/links that open iframes, and
-    extracts the stream URLs from them.
-    """
+    """Pulls iframe embed URLs out of the acestrlms.pages.dev F1 page."""

    @property
    def site_key(self) -> str:
@@ -45,7 +63,6 @@ class AceztrimsExtractor(BaseExtractor):
        return "Aceztrims"

    async def extract(self) -> list[ExtractedStream]:
-        """Scrape all configured F1 pages for stream URLs."""
        streams: list[ExtractedStream] = []

        async with httpx.AsyncClient(
@@ -55,12 +72,9 @@ class AceztrimsExtractor(BaseExtractor):
        ) as client:
            for path, category in F1_PAGES:
                try:
-                    page_streams = await self._scrape_page(client, path, category)
-                    streams.extend(page_streams)
+                    streams.extend(await self._scrape_page(client, path, category))
                except Exception:
-                    logger.exception(
-                        "[aceztrims] Failed to scrape page %s", path
-                    )
+                    logger.exception("[aceztrims] Failed to scrape %s", path)

        logger.info("[aceztrims] Extracted %d stream(s)", len(streams))
        return streams
@@ -68,85 +82,39 @@ class AceztrimsExtractor(BaseExtractor):
    async def _scrape_page(
        self, client: httpx.AsyncClient, path: str, category: str
    ) -> list[ExtractedStream]:
-        """Scrape a single page for stream URLs."""
        url = f"{BASE_URL}{path}"
        resp = await client.get(url)
        if resp.status_code != 200:
            logger.warning(
-                "[aceztrims] Page %s returned HTTP %d", path, resp.status_code
+                "[aceztrims] %s returned HTTP %d", path, resp.status_code
            )
            return []

-        html = resp.text
+        # The page keeps a block of legacy channel buttons inside
+        # `<!-- ... -->` for quick re-enablement. Strip comments first so
+        # the regex only sees live buttons.
+        html = _HTML_COMMENT.sub("", resp.text)
+
+        seen: set[str] = set()
        streams: list[ExtractedStream] = []
-        seen_urls: set[str] = set()

-        # Pattern 1: /iframe1?s=<m3u8_url> — direct m3u8
-        iframe1_pattern = re.compile(
-            r"""['"]((?:https?://[^'"]*)?/iframe1\?s=([^'"&]+))['""]""",
-            re.IGNORECASE,
-        )
-        for match in iframe1_pattern.finditer(html):
-            m3u8_url = match.group(2)
-            if m3u8_url in seen_urls:
-                continue
-            seen_urls.add(m3u8_url)
-
-            streams.append(
-                ExtractedStream(
-                    url=m3u8_url,
-                    site_key=self.site_key,
-                    site_name=self.site_name,
-                    quality="",
-                    title=f"{category} Stream",
-                    stream_type="m3u8",
+        for pattern in (_DEFAULT_IFRAME, _ONCLICK_IFRAME_SRC):
+            for match in pattern.finditer(html):
+                embed_url = match.group(1).strip()
+                if not embed_url or embed_url in seen:
+                    continue
+                seen.add(embed_url)
+                streams.append(
+                    ExtractedStream(
+                        url=embed_url,
+                        site_key=self.site_key,
+                        site_name=self.site_name,
+                        quality="",
+                        title=f"{category} Stream",
+                        stream_type="embed",
+                        embed_url=embed_url,
+                    )
                )
-            )
-
-        # Pattern 2: embed URLs (pooembed.eu or similar)
-        embed_pattern = re.compile(
-            r"""['"]((https?://(?:pooembed\.eu|[^'"]*embed)[^'"]*))['"]""",
-            re.IGNORECASE,
-        )
-        for match in embed_pattern.finditer(html):
-            embed_url = match.group(1)
-            if embed_url in seen_urls:
-                continue
-            seen_urls.add(embed_url)
-
-            streams.append(
-                ExtractedStream(
-                    url=embed_url,
-                    site_key=self.site_key,
-                    site_name=self.site_name,
-                    quality="",
-                    title=f"{category} Stream (Embed)",
-                    stream_type="embed",
-                    embed_url=embed_url,
-                )
-            )
-
-        # Pattern 3: Generic onclick handlers with URLs
-        onclick_pattern = re.compile(
-            r"""onclick\s*=\s*['"].*?['"]?(https?://[^'")\s]+\.m3u8[^'")\s]*)['"]?""",
-            re.IGNORECASE,
-        )
-        for match in onclick_pattern.finditer(html):
-            m3u8_url = match.group(1)
-            if m3u8_url in seen_urls:
-                continue
-            seen_urls.add(m3u8_url)
-
-            streams.append(
-                ExtractedStream(
-                    url=m3u8_url,
-                    site_key=self.site_key,
-                    site_name=self.site_name,
-                    quality="",
-                    title=f"{category} Stream",
-                    stream_type="m3u8",
-                )
-            )

        logger.info(
            "[aceztrims] Found %d stream(s) on %s", len(streams), path
--- a/stacks/f1-stream/files/backend/extractors/pitsport.py
+++ b/stacks/f1-stream/files/backend/extractors/pitsport.py
@@ -34,7 +34,7 @@ USER_AGENT = (
 # to also surface MotoGP and adjacent motorsports — keeps the f1-stream
 # UI useful between race weekends and during the off-season.
 MOTORSPORT_CATEGORIES = {
-    "formula 1", "formula 2", "formula 3",
+    "f1", "formula 1", "formula 2", "formula 3",
    "motogp", "moto gp", "moto2", "moto3", "motoe",
    "world rally championship", "wrc",
    "world endurance championship", "wec",
@@ -85,27 +85,61 @@ _is_f1_category = _is_motorsport_category
 _is_f1_event = _is_motorsport_event


-def _parse_live_events(html: str) -> list[_PitsportEvent]:
-    """Parse live events from the main page RSC payload.
+def _decode_rsc_payload(html: str) -> str:
+    """Concatenate and unescape all `self.__next_f.push([1, "..."])` chunks.

-    The main page contains event cards with props:
-        category, title, time, imageUrl
-    wrapped in <a href="/watch/{UUID}"> links.
+    Next.js RSC ships its tree as escape-encoded strings inside repeated
+    `self.__next_f.push` calls. Regex over the raw HTML misses everything
+    interesting; we have to decode unicode escapes first.
    """
+    chunks = re.findall(r'self\.__next_f\.push\(\[1,"(.*?)"\]\)', html, re.DOTALL)
+    if not chunks:
+        return ""
+    payload = ""
+    for chunk in chunks:
+        try:
+            payload += chunk.encode().decode("unicode_escape")
+        except Exception:
+            payload += chunk
+    return payload
+
+
+def _parse_live_events(html: str) -> list[_PitsportEvent]:
+    """Parse live events from the main page (or `/live-now`) RSC payload.
+
+    The pages embed event cards inside the Next.js RSC payload; the raw
+    HTML keeps it escape-encoded so we decode first, then match.
+    Two shapes are common:
+      1) Older card props: "category":"...","title":"..." next to
+         "href":"/watch/UUID".
+      2) Newer `event` prop: an `event` object with `uri:"/watch/UUID"`
+         carrying `category` and `title`.
+    """
+    payload = _decode_rsc_payload(html) or html
+
    events: list[_PitsportEvent] = []

-    # Match event cards in the RSC payload - they appear as JSON-like structures
-    # Pattern: href="/watch/UUID" ... category":"...", "title":"..."
-    # In the RSC payload, the data is in the format:
-    #   ["$","$L2","/watch/UUID",{"href":"/watch/UUID","children":["$","$L10",null,
-    #     {"category":"...","title":"...","time":...,"imageUrl":"..."}]}]
-    pattern = re.compile(
+    href_pattern = re.compile(
        r'"href":"(/watch/([0-9a-f-]{36}))"[^}]*?"category":"([^"]+)","title":"([^"]+)"',
    )
-    for match in pattern.finditer(html):
+    for match in href_pattern.finditer(payload):
        _, uuid, category, title = match.groups()
        events.append(_PitsportEvent(category=category, title=title, watch_uuid=uuid))

+    event_pattern = re.compile(
+        r'"event":\{[^{}]*?"title":"([^"]+)"[^{}]*?"uri":"/watch/([0-9a-f-]{36})"[^{}]*?"category":"([^"]+)"',
+    )
+    for match in event_pattern.finditer(payload):
+        title, uuid, category = match.groups()
+        events.append(_PitsportEvent(category=category, title=title, watch_uuid=uuid))
+
+    event_pattern_alt = re.compile(
+        r'"event":\{[^{}]*?"category":"([^"]+)"[^{}]*?"title":"([^"]+)"[^{}]*?"uri":"/watch/([0-9a-f-]{36})"',
+    )
+    for match in event_pattern_alt.finditer(payload):
+        category, title, uuid = match.groups()
+        events.append(_PitsportEvent(category=category, title=title, watch_uuid=uuid))
+
    return events


@ -301,13 +335,12 @@ def _is_m3u8_method(method: str) -> bool:


 def _extract_m3u8_url(link: str) -> str:
-    """Convert a serveplay.site player URL to an m3u8 playlist URL.
+    """Pass through the link from pushembdz's `api/stream/<slug>` response.

-    Input:  https://dash.serveplay.site/{channel}/index.html
-    Output: https://dash.serveplay.site/{channel}/index.html
-
-    The index.html IS the m3u8 playlist (served with proper content-type
-    when fetched with the correct Referer header).
+    The host has rotated over time (serveplay.site → oe1.ossfeed.store →
+    …); the response is always a master playlist URL we hand to the
+    player as-is. Content-Type may be `text/css` or `application/json` —
+    treat as HLS based on body sniffing (`#EXTM3U`), not MIME.
    """
    return link

@ -388,6 +421,24 @@ class PitsportExtractor(BaseExtractor):
        except Exception:
            logger.exception("[pitsport] Failed to fetch main page")

+        # Fetch /live-now — canonical "currently live" list, added 2026.
+        try:
+            resp = await client.get(f"{PITSPORT_BASE}/live-now")
+            if resp.status_code == 200:
+                live_now_events = _parse_live_events(resp.text)
+                logger.info(
+                    "[pitsport] Live-now page: %d event(s)", len(live_now_events)
+                )
+                for ev in live_now_events:
+                    if _is_f1_event(ev.category, ev.title):
+                        all_events.append(ev)
+            else:
+                logger.warning(
+                    "[pitsport] Live-now page returned HTTP %d", resp.status_code
+                )
+        except Exception:
+            logger.exception("[pitsport] Failed to fetch live-now page")
+
        # Fetch schedule page for upcoming events
        try:
            resp = await client.get(f"{PITSPORT_BASE}/schedule")
--- a/stacks/f1-stream/files/backend/extractors/ppv.py
+++ b/stacks/f1-stream/files/backend/extractors/ppv.py
@ -153,21 +153,37 @@ class PPVExtractor(BaseExtractor):
                    if viewers and int(viewers) > 0:
                        title += f" ({viewers} viewers)"

-                    # Check for substreams (multiple quality/language options)
+                    # Always emit the parent stream — substreams are
+                    # additional language/source variants, not replacements.
+                    streams.append(
+                        ExtractedStream(
+                            url=embed_url,
+                            site_key=self.site_key,
+                            site_name=self.site_name,
+                            quality=quality,
+                            title=title,
+                            stream_type="embed",
+                            embed_url=embed_url,
+                        )
+                    )
+
                    substreams = stream_obj.get("substreams")
-                    if isinstance(substreams, list) and substreams:
+                    if isinstance(substreams, list):
                        for i, sub in enumerate(substreams):
                            sub_embed = sub.get("iframe", "") or sub.get("embed_url", "")
                            if not sub_embed:
-                                # Fall back to the parent embed URL
                                sub_embed = embed_url
-                            sub_name = sub.get("name", "") or sub.get("label", "")
+                            sub_name = (
+                                sub.get("source_tag", "")
+                                or sub.get("name", "")
+                                or sub.get("label", "")
+                            )
                            sub_quality = sub.get("tag", "") or sub.get("quality", "") or quality
                            sub_title = f"{name}"
                            if sub_name:
                                sub_title += f" - {sub_name}"
-                            elif i > 0:
-                                sub_title += f" #{i + 1}"
+                            else:
+                                sub_title += f" #{i + 2}"

                            streams.append(
                                ExtractedStream(
@ -180,19 +196,6 @@ class PPVExtractor(BaseExtractor):
                                    embed_url=sub_embed,
                                )
                            )
-                    else:
-                        # Single stream, no substreams
-                        streams.append(
-                            ExtractedStream(
-                                url=embed_url,
-                                site_key=self.site_key,
-                                site_name=self.site_name,
-                                quality=quality,
-                                title=title,
-                                stream_type="embed",
-                                embed_url=embed_url,
-                            )
-                        )

        except Exception:
            logger.exception("[ppv] Failed to extract streams")
--- a/stacks/forgejo/main.tf
+++ b/stacks/forgejo/main.tf
@ -61,6 +61,12 @@ resource "kubernetes_deployment" "forgejo" {
      app  = "forgejo"
      tier = local.tiers.edge
    }
+    annotations = {
+      # Keel disabled here — its `force` policy rewrote the image tag
+      # from 11.0.14 → 1.18 on 2026-05-24 (same bug as memory id=1933).
+      # TF owns the tag now; bump it manually here when upgrading.
+      "keel.sh/policy" = "never"
+    }
  }
  spec {
    replicas = 1
@ -89,7 +95,14 @@ resource "kubernetes_deployment" "forgejo" {
        }
        container {
          name  = "forgejo"
-          image = "codeberg.org/forgejo/forgejo:11"
+          # Pinned to 11.0.14 (latest 11.x as of 2026-05-12) — was on
+          # floating `:11`. On 2026-05-24T15:35:37Z Keel force-policy
+          # rewrote the tag from `11.0.14 → 1.18` (Gitea-era Forgejo
+          # v1.18), exact replay of the 2026-05-16 force-policy
+          # tag-rewriting incident (memory id=1933). The pod crashlooped
+          # because the DB had already been migrated to schema 305 by
+          # 11.0.14 and v1.18 only knows up to migration 231.
+          image = "codeberg.org/forgejo/forgejo:11.0.14"
          env {
            name  = "USER_UID"
            value = 1000
@ -182,10 +195,16 @@ resource "kubernetes_deployment" "forgejo" {
  lifecycle {
    ignore_changes = [
      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
-      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
-      metadata[0].annotations["keel.sh/policy"],
+      # KEEL_IGNORE_IMAGE removed 2026-05-24 — Keel is disabled for this
+      # workload now (keel.sh/policy=never annotation above), so TF owns
+      # the image tag. Restore this ignore_changes line if you flip
+      # keel.sh/policy back to `force` later.
+      metadata[0].annotations["keel.sh/match-tag"],
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      metadata[0].annotations["kubernetes.io/change-cause"],
+      metadata[0].annotations["deployment.kubernetes.io/revision"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
    ]
  }
 }
--- a/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
+++ b/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
@ -1562,6 +1562,13 @@ serverFiles:
              severity: warning
            annotations:
              summary: "Offsite backup sync is {{ $value | humanizeDuration }} old (threshold: 9d)"
+          - alert: OffsiteBackupSyncFailing
+            expr: offsite_sync_last_status{job="offsite-backup-sync"} != 0
+            for: 0m
+            labels:
+              severity: warning
+            annotations:
+              summary: "Offsite backup sync last run reported errors (status={{ $value }})"
          - alert: NfsMirrorStale
            expr: (time() - nfs_mirror_last_run_timestamp{job="nfs-mirror"}) > 1382400
            for: 30m
Author	SHA1	Message	Date
Viktor Barzin	5cdac421c2	forgejo: pin to v11.0.14 + disable Keel (image-rewrite incident 2026-05-24)

Checks: ci/woodpecker/push/build-cli pipeline failed; ci/woodpecker/push/default pipeline was successful.

On 2026-05-24T15:35:37Z Keel's force-policy rewrote the image tag from `11.0.14 → 1.18` (codeberg.org/forgejo/forgejo). v1.18 is a Gitea-era Forgejo (Forgejo forked from Gitea at 1.18 and used pre-Forgejo versioning early on); the DB had already been migrated to schema 305 by 11.0.14, and 1.18 only knows up to migration 231 → pod refused to start ("Your database (migration version: 305) is for a newer Gitea, you can not use the newer database for this old Gitea release (231)"). Exact replay of the 2026-05-16 force-policy tag-rewriting bug (memory id=1933).

Changes:
- Pin image to explicit `:11.0.14` (latest 11.x, published 2026-05-12)
- Add `keel.sh/policy: "never"` deploy annotation — overrides the Kyverno-stamped `force` policy via the chart's `+()` anchor semantics (memory id=1972). Keel will no longer touch this workload.
- Drop KEEL_IGNORE_IMAGE from `lifecycle.ignore_changes` (TF owns the image now). Restore it if you flip Keel back to `force`.
- Add the KEEL_LIFECYCLE_V1 trio (`kubernetes.io/change-cause`, `deployment.kubernetes.io/revision`, `keel.sh/update-time` on the pod template) so future TF applies don't fight K8s rollout metadata.

Verified: new pod on v11.0.14 came up Running 1/1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 22:06:59 +00:00
Viktor Barzin	5a0e4b3dac	f1-stream: revive aceztrims + pitsport, more ppv variants

- aceztrims: scrape /f11/ (the actual stream page), not /f1/ (the cross-sport schedule). Drop the dead /iframe1?s= + onclick m3u8 regexes (site moved to `getElementById('iframe').src = '...'` ~20 channels ago). Strip HTML comments first so the ~20 legacy buttons kept inside <!-- ... --> stop showing up as false positives. Also pick up the default inline <iframe id='iframe' src='...'>. Local run: 11 channels (was 0).
- pitsport: decode the RSC payload before regex-matching in _parse_live_events (raw HTML had it escape-encoded, so the homepage card path was silently 0). Add the new /live-now route (canonical what's-live-right-now list). Add "f1" to MOTORSPORT_CATEGORIES — the site labels Formula 1 events as just "F1". Refresh the stale serveplay.site docstring (host rotates; pushembdz's api/stream link is authoritative). Local run: 7 m3u8 streams covering Canadian GP (EN1/EN2/MULTI/ITA/ESP) + NASCAR Coke 600 (was 0).
- ppv: always emit the parent embed alongside substreams (was dropping it whenever substreams existed). Prefer source_tag in substream titles so users see "Sky Sport 1 NZ" / "Apple TV (F1TV)" instead of generic #1/#2 suffixes.

Diagnosed against the live cluster (curated + 7 other extractors returning 0 cached streams, only 2 dead hmembeds curated 24/7 channels visible to users). Each fix verified with the extractor run against live sites this turn.	2026-05-24 22:05:37 +00:00
Viktor Barzin	d5f73ce109	backup: exclude /anca-elements/ from nfs-mirror + offsite Step 1

Anca's photos are being ingested into Immich (started 2026-05-24 afternoon), so /srv/nfs/immich/library/ becomes the canonical copy for those photos. The separate /srv/nfs/anca-elements/ archive tree + its sda mirror at /mnt/backup/anca-elements/ are now redundant.

Going forward:
- nfs-mirror EXCLUDES /anca-elements/ so future weekly runs don't re-touch the 771G subtree (also no longer required since Immich has the data via its NFS library).
- offsite-sync Step 1 also excludes /anca-elements/ — the historical 771G under /mnt/backup/anca-elements/ stays on sda for now but is NOT shipped to Synology pve-backup/ (Immich's library reaches Synology via Step 2 bypass leg anyway).

The 771G on /mnt/backup/anca-elements/ will be cleaned up manually once Immich ingest completes and we verify all photos are in the Immich library. Same for /srv/nfs/anca-elements/ on sdc thin pool — freeing both would reclaim ~1.5 TB across sdc + sda.

In-flight context: today's nfs-mirror first run was killed mid-flight at ~70% (was at /srv/nfs/postgresql/). The killed run wrote ~200G of service NFS subtrees to /mnt/backup/<svc>/, then sda hit 95% used, prompting this change. Next nfs-mirror run will not touch anca-elements and will fit comfortably (~250G total for the keep-list minus anca-elements).	2026-05-24 18:34:41 +00:00
Viktor Barzin	c948dc0dbe	backup pipeline: flock manifest + cap + drop LAN -z

Three more audit fixes from the 2026-05-24 backup-pipeline review:

#5 (S1 race) — manifest flock
daily-backup and nfs-mirror both append to /mnt/backup/.changed-files. If they overlap (nfs-mirror Mon 04:11 running long, daily-backup starting Mon 05:00), concurrent appends from `find | tee` and `find | sed >>` could interleave mid-line — partial paths would slip past rsync's --files-from. Both scripts now share a manifest_append() helper using `flock -x` on /mnt/backup/.changed-files.lock. The 4 daily-backup call sites + the 1 nfs-mirror call site all pipe through it instead of redirecting directly.

#7 (S2 unbounded manifest)
daily-backup gains check_manifest_size() invoked after the PVE-config append (the last manifest writer of the run). Above MANIFEST_MAX_LINES (500k) it touches /mnt/backup/.force-full-sync — offsite-sync's Step 1 now treats that flag the same as day-of-month ≤ 7 (full sync with --delete) and clears it on success. Catches the "Synology unreachable for many days" edge case where the manifest would grow unbounded.

#9 (wear — drop -z on LAN hops)
offsite-sync rsync calls to Synology over the same 192.168.1.0/24 gigabit LAN had `-rltz`. Compression burns CPU on the PVE host (already IO-busy) and gives nothing on a saturated GigE link. Dropped to `-rlt` on all 5 offsite rsync invocations (Step 1 full + Step 1 incremental + Step 2 full nfs + Step 2 full nfs-ssd + Step 2 incremental).

Other adjustments:
- nfs-mirror's find-after-rsync now also excludes the new state files (.changed-files.lock, .force-full-sync) when populating the manifest.
- offsite-sync Step 1 full-sync excludes the same .force-full-sync flag so it doesn't ship to Synology.

Deployed to PVE host (/usr/local/bin/{daily-backup,nfs-mirror,offsite-sync-backup}). Currently in-flight nfs-mirror run is unaffected (bash loaded the old script into memory at start). Next runs use the new behaviour.

Refs: 2026-05-24 audit Section 2 items #1 (manifest race), #4 (unbounded manifest), #6 (LAN -z wear).	2026-05-24 16:27:42 +00:00
Viktor Barzin	4798583db7	backup pipeline: S1 fixes from 2026-05-24 audit

Three immediate fixes surfaced by the backup-pipeline audit:

1. S1 silent-loss race fix (daily-backup.sh:142): remove the `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation already lives in offsite-sync-backup at line 159, gated on a successful sync. With both scripts truncating, an offsite-sync failure followed by the next morning's daily-backup would silently wipe yesterday's unconsumed manifest entries — those files would only reach Synology via the monthly full sync (1st-7th of month). Now only offsite-sync truncates, and only on success.
2. Missing alert OffsiteBackupSyncFailing: documented in backup-dr.md but was never added to prometheus_chart_values.tpl. Step 1 or Step 2 failure pushes offsite_sync_last_status=1 but nothing read it. Added.
3. wear: drop `-z` from local-only rsyncs (daily-backup.sh:218 PVC snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda transfers — compression wastes CPU and yields nothing (gigabit local path, intermediate disk doesn't benefit).

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete" (the timer is daily, not weekly — legacy from earlier monthly-rotation schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no Step 1 above).
- wear: pfSense full filesystem tar now Sunday-only instead of daily. config.xml stays daily (it's the primary restore artifact and tiny). Full tar is forensic recovery only — re-tarring ~100MB+ daily writes ~3G/month to sda + Synology for unchanged content. Weekly is plenty.

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to reflect today's two-leg architecture; added a "2026-05-24 session" changelog summary at the top; added a "Synology snapshot management" subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated by 2FA so this is the only programmatic path); updated Key Files table with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).	2026-05-24 16:18:44 +00:00
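
The manifest flock and the 500k-line cap named in the commit messages above can be illustrated with a minimal Python sketch. This is not the deployed code: the real helpers are shell functions in daily-backup and nfs-mirror whose bodies are not shown on this page. The function names and the 500k threshold are taken from the commit messages; the signatures, the Python port, and the temporary paths in the demo are assumptions made for illustration only.

```python
import fcntl
import tempfile
from pathlib import Path

MANIFEST_MAX_LINES = 500_000  # threshold quoted in the commit message


def manifest_append(manifest: Path, lock: Path, paths: list[str]) -> None:
    """Append paths to the manifest while holding an exclusive lock.

    Hypothetical Python analogue of the shell helper: the lock file plays
    the role of /mnt/backup/.changed-files.lock under `flock -x`.
    """
    with open(lock, "w") as lock_fh:
        # Blocks until any other writer holding the lock has finished.
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        try:
            with open(manifest, "a") as out:
                out.writelines(f"{p}\n" for p in paths)
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def check_manifest_size(
    manifest: Path, flag: Path, max_lines: int = MANIFEST_MAX_LINES
) -> bool:
    """Touch the force-full-sync flag if the manifest exceeds max_lines."""
    if not manifest.exists():
        return False
    with open(manifest) as fh:
        line_count = sum(1 for _ in fh)
    if line_count > max_lines:
        flag.touch()  # the sync job would treat this flag as "do a full sync"
        return True
    return False


if __name__ == "__main__":
    # Demo in a throwaway directory; the real files live under /mnt/backup.
    workdir = Path(tempfile.mkdtemp())
    manifest = workdir / ".changed-files"
    lock = workdir / ".changed-files.lock"
    flag = workdir / ".force-full-sync"

    manifest_append(manifest, lock, ["svc-a/file1", "svc-b/file2"])
    manifest_append(manifest, lock, ["svc-c/file3"])
    # A tiny cap is used here only to trigger the flag in the demo.
    print(check_manifest_size(manifest, flag, max_lines=2))
```

Holding the lock for the whole append is what keeps two writers from interleaving partial lines; the cap check only reads the manifest, so it takes no lock in this sketch.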