5 changed files with 23 additions and 65 deletions
--- a/docs/architecture/backup-dr.md
+++ b/docs/architecture/backup-dr.md
@@ -16,11 +16,6 @@ Last updated: 2026-05-26
 >   exist on `/srv/nfs`. Dropped from the exclude/include lists as no-ops.
 > - `/mnt/backup/anca-elements` (423 G) deleted — canonical copy lives in
 >   Immich since the 2026-05-24 ingest.
-> - **`nfs-mirror.timer`: weekly Mon 04:00 → daily 02:00.** Steady-state
->   delta is 10-20 min of mostly-metadata rsync, so the IO cost is
->   negligible. RPO for non-CronJob app data (nextcloud shared files,
->   audiobookshelf library, mailserver Maildir, real-estate-crawler scraped
->   data, etc.) drops from 7 days to ~24h.
 > - Aftermath: sda 87% → 46% used; Synology `/Viki/nfs/` shrinks to
 >   immich-only on next monthly `--delete` pass (or manual cleanup —
 >   see runbook).
@@ -37,7 +32,7 @@ Last updated: 2026-05-26
 The homelab runs a 3-2-1 strategy with a **two-leg** path to Synology so every NFS byte takes exactly one route to offsite (no duplication, no gaps):

 ```
-sdc /srv/nfs/<svc>/   ──nfs-mirror daily 02:00──→  sda /mnt/backup/<svc>/   ──offsite-sync Step 1──→  Synology /Backup/Viki/pve-backup/<svc>/  [leg 1]
+sdc /srv/nfs/<svc>/   ──nfs-mirror weekly──→  sda /mnt/backup/<svc>/   ──offsite-sync Step 1──→  Synology /Backup/Viki/pve-backup/<svc>/      [leg 1]
 sdc /srv/nfs/immich/  ──inotify (nfs-change-tracker)──→  offsite-sync Step 2  ──→  Synology /Backup/Viki/nfs/immich/                          [leg 2]
 sdc PVCs (LVM thin)   ──daily-backup~snapshot~rsync──→  sda /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config}/  ──Step 1──→  Synology /Backup/Viki/pve-backup/
 ```
@@ -378,7 +373,7 @@ Two-step offsite sync:

 **`/srv/nfs/anca-elements/` history**: had its own dedicated Synology exclusion line earlier in 2026-05-24 because the original Synology source (`/volume1/Backup/Anca/Elements`) was being preserved while we moved canonical to PVE. After the original was deleted (same day), anca-elements joined the broader "NOT bypassing sda" category and is covered by Step 1 via `nfs-mirror`.

-**Layer 3a: NFS local mirror on sda (3-2-1 second copy)**: `/usr/local/bin/nfs-mirror` rsyncs `/srv/nfs/` → `/mnt/backup/<service>/` daily at 02:00 (switched from weekly Mon 04:00 on 2026-05-26 — steady-state delta is 10-20 min of mostly-metadata rsync, cuts non-CronJob app-data RPO from 7d to ~24h). Single rsync invocation, single destination. As of 2026-05-26 the skip-list (in `nfs-mirror.sh` `EXCLUDES`) is intentionally minimal:
+**Layer 3a: NFS local mirror on sda (3-2-1 second copy)**: `/usr/local/bin/nfs-mirror` rsyncs `/srv/nfs/` → `/mnt/backup/<service>/` weekly (Mon 04:00). Single rsync invocation, single destination. As of 2026-05-26 the skip-list (in `nfs-mirror.sh` `EXCLUDES`) is intentionally minimal:

 - **immich** (1.5 T) — too big for sda; ships sdc → Synology direct (leg 2)
 - **frigate** (camera ring buffer) — intentionally NOT backed up
@@ -449,8 +444,8 @@ The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by
 | `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
 | `/etc/systemd/system/daily-backup.timer` | Daily 05:00 (file backup) |
 | `/etc/systemd/system/offsite-sync-backup.timer` | Daily 06:00 (offsite sync) |
-| `/usr/local/bin/nfs-mirror` | PVE host: daily 02:00 mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
-| `/etc/systemd/system/nfs-mirror.timer` | Daily 02:00 (NFS local mirror to sda) |
+| `/usr/local/bin/nfs-mirror` | PVE host: weekly selective mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
+| `/etc/systemd/system/nfs-mirror.timer` | Weekly Mon 04:00 (NFS local mirror to sda) |
 | `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
 | `stacks/vault/` | Terraform: Vault backup CronJob |
 | `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
--- a/scripts/nfs-mirror.timer
+++ b/scripts/nfs-mirror.timer
@@ -1,14 +1,8 @@
 [Unit]
-Description=Daily local NFS mirror to /mnt/backup
+Description=Weekly local NFS mirror to /mnt/backup

 [Timer]
-# Daily 02:00 — runs 3h before daily-backup (05:00) so the .changed-files
-# manifest is populated and offsite-sync (06:00) ships both legs' deltas.
-# Switched from weekly Mon 04:00 → daily 2026-05-26: steady-state delta is
-# 10-20 min of mostly-metadata rsync, so the IO cost is negligible and it
-# cuts non-CronJob app-data RPO from 7d to ~24h (matters for nextcloud
-# shared files, audiobookshelf library, mailserver Maildir, etc.).
-OnCalendar=*-*-* 02:00:00
+OnCalendar=Mon *-*-* 04:00:00
 Persistent=true
 RandomizedDelaySec=15min

--- a/stacks/keel/main.tf
+++ b/stacks/keel/main.tf
@@ -46,17 +46,16 @@ resource "helm_release" "keel" {
  atomic = true

  values = [yamlencode({
-    # 2026-05-26 17:30: re-enabled after switching the Kyverno-injected
-    # default from `force + match-tag=true` (proven unreliable — see
-    # stacks/kyverno/modules/kyverno/keel-annotations.tf) to `patch` which
-    # is semver-parser-bounded. Under `patch`:
-    #   - Semver-tagged workloads get patch bumps only (1.2.3 → 1.2.4).
-    #   - Float / SHA / non-semver tags are IGNORED — no tag rewriting.
-    # The 2026-05-26 emergency-stop scope (replicaCount=0) is reverted now
-    # that the default is safe. Workloads pinned out-of-band (uptime-kuma
-    # via keel.sh/policy=never LABEL) stay opted-out via the Kyverno
-    # exclude rule, not via Keel's own annotation.
-    replicaCount = 1
+    # EMERGENCY STOP — scaled to 0 on 2026-05-26 16:42 UTC. Keel was actively
+    # rewriting tag strings (not just digests) despite the
+    # `keel.sh/match-tag=true` annotation injected by Kyverno that's supposed
+    # to constrain it to digest-only watches. Known casualties this round:
+    # uptime-kuma (2 → 1, 4h CrashLoopBackOff), n8n (1.80.5 → 0.1.2, silent
+    # degradation), beads-server/dolt-workbench (0.3.73 → 0.1.0), and ~10
+    # other deployments with downgrade-flavored change-cause annotations.
+    # Re-enable only after root-causing why match-tag isn't being enforced,
+    # OR after migrating each app to a content-addressed (SHA) tag pin.
+    replicaCount = 0
    # Prometheus pod-annotation scrape — picks up Keel-specific metrics
    # (pending_approvals, poll_trigger_tracked_images, registries_scanned_total{image,registry})
    # on container port 9300 /metrics. The cluster's `kubernetes-pods`
--- a/stacks/kyverno/modules/kyverno/keel-annotations.tf
+++ b/stacks/kyverno/modules/kyverno/keel-annotations.tf
@@ -177,42 +177,13 @@ resource "kubectl_manifest" "policy_inject_keel_annotations" {
                #                                 to bypass this mutation)
                # Per-namespace opt-out:
                #   Remove the `keel.sh/enrolled=true` namespace label.
-                # 2026-05-26: switched default from `force + match-tag=true`
-                # to `patch` after the 2026-05-26 incident proved match-tag
-                # does NOT reliably constrain Keel — tag strings got rewritten
-                # (uptime-kuma :2→:1, n8n :1.80.5→:0.1.2, dolt-workbench
-                # :0.3.73→:0.1.0, wealthfolio :3.2.1→:2.0→:3.2 truncated).
-                #
-                # `patch` is semver-parser-bounded:
-                #   - Only patch bumps within current major.minor
-                #     (e.g. 1.2.3 → 1.2.4; never 1.3.x or 2.x).
-                #   - Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`)
-                #     are IGNORED entirely — Keel does nothing for them.
-                #   - No more string-comparison surprises.
-                #
-                # `match-tag` annotation dropped — it was only meaningful as
-                # the (failed) safety net under `force`. Irrelevant under
-                # semver-bounded policies.
-                #
-                # `+(...)` anchor = "add only if missing". With the anchor,
-                # this policy ONLY sets defaults on new workloads — existing
-                # per-workload overrides (set via TF or kubectl annotate)
-                # are preserved across policy updates. This was DROPPED for
-                # one apply on 2026-05-26 to migrate the 151 stale `force`
-                # annotations to `patch`, then re-added in the same session
-                # after observing that the label-based exclude rule below
-                # doesn't reliably filter mutateExistingOnPolicyUpdate scans
-                # (22 workloads with LABEL keel.sh/policy=never still got
-                # their ANNOTATION rewritten and had to be repatched). Keep
-                # the anchor unless you genuinely want a cluster-wide flip.
-                #
-                # To override per workload, set the ANNOTATION directly:
-                #   - keel.sh/policy=never  (Keel won't touch)
-                #   - keel.sh/policy=minor  (wider semver bumps, still bounded)
-                #   - keel.sh/policy=major  (any semver bump)
-                # The corresponding LABEL keel.sh/policy=never is for the
-                # exclude rule below (defense-in-depth against future mutations).
-                "+(keel.sh/policy)"       = "patch"
+                # `+(...)` anchor — only add if not present. This preserves
+                # per-workload overrides set out-of-band (e.g. `never` for
+                # phased rollout). Without the anchor, every policy update
+                # would overwrite existing annotations, breaking the phased
+                # rollout state.
+                "+(keel.sh/policy)"       = "force"
+                "+(keel.sh/match-tag)"    = "true"
                "+(keel.sh/trigger)"      = "poll"
                "+(keel.sh/pollSchedule)" = "@every 1h"
              }
--- a/stacks/kyverno/modules/kyverno/security-policies.tf
+++ b/stacks/kyverno/modules/kyverno/security-policies.tf
@@ -26,7 +26,6 @@ locals {
    "kured",          # kured DaemonSet is privileged (manages node reboots)
    "default",        # etcd backup + defrag CronJobs use hostNetwork
    "changedetection", # uses SYS_ADMIN for chromium sandbox
-    "woodpecker",     # CI pipeline pods (wp-*) run privileged docker builds
  ]
 }