fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00 · 2026-06-09 08:45:33 +00:00 · fd0f4a0365
commit fd0f4a0365
parent 6d224861c4
1166 changed files with 358546 additions and 0 deletions
--- a/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
+++ b/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html
--- a/docs/post-mortems/2026-03-16-nfs-csi-cascade-failure.md
+++ b/docs/post-mortems/2026-03-16-nfs-csi-cascade-failure.md
@ -0,0 +1,130 @@
+# Post-Mortem: NFS CSI Cascade Failure
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-03-16 |
+| **Duration** | ~47h (ongoing) |
+| **Severity** | SEV1 |
+| **Affected Services** | 40+ pods across 20+ namespaces |
+| **Status** | Draft |
+
+## Summary
+
+The NFS CSI driver entered a crash-loop due to a liveness-probe port conflict (~47h ago), causing all NFS-backed PV mounts to fail. This cascaded into 15+ pods stuck in ContainerCreating, MySQL InnoDB data file lock failures, Vault Raft storage timeouts, and MetalLB speaker port conflicts. Critical path services (Traefik, PostgreSQL) remained partially operational.
+
+## Impact
+
+- **User-facing**: Services dependent on NFS storage (calibre, forgejo, pgadmin, uptime-kuma, etc.) completely unavailable. Grafana down (MySQL dependency). Vault unavailable.
+- **Services affected**: 40+ pods across 20+ namespaces — storage layer, databases, monitoring, CI/CD
+- **Duration**: ~47h and ongoing at time of investigation
+- **Data loss**: None confirmed; MySQL ibdata1 lock issue may indicate risk if not resolved cleanly
+
+## Timeline (UTC)
+
+| Time | Event | Source |
+|------|-------|--------|
+| ~47h ago | NFS CSI driver starts crash-looping — liveness-probe port 29653 conflict | cluster-health-checker (pod age + restart count) |
+| ~30h ago | iSCSI CSI controller deployed (stable) | pod age |
+| ~27h ago | Vault, MySQL, Headscale, Woodpecker, CrowdSec agents start failing | pod ages, events |
+| ~26h ago | mysql-cluster-0 enters ContainerStatusUnknown | sre investigation |
+| ~20-22h ago | Cascade of service restarts across cluster | pod restart timestamps |
+| ~1h ago | Latest wave of pod rescheduling/restarts | events |
+
+## Root Cause
+
+**NFS CSI driver liveness-probe port conflict**: The liveness-probe containers on all worker nodes fail with `listen tcp 127.0.0.1:29653: bind: address already in use`. The port conflict suggests a previous liveness-probe process did not cleanly terminate, or pods were restarted while old processes lingered in the network namespace.
+
+**Impact chain**: NFS CSI not registered on nodes → all NFS PV mounts fail → 15+ pods stuck in ContainerCreating with "driver name nfs.csi.k8s.io not found in the list of registered CSI drivers"
+
+## Contributing Factors
+
+- **MySQL InnoDB data corruption**: Cannot open `ibdata1` (OS error 11 — EAGAIN). Likely caused by NFS storage instability or stale lock from mysql-cluster-0 in ContainerStatusUnknown
+- **Vault Raft lock timeout**: vault-0 fails with "failed to open bolt file: timeout" — BoltDB locked by previous instance. vault-2 cannot mount NFS volume at all
+- **MetalLB speaker port conflicts**: 3/4 speakers fail with memberlist port 7946 already in use — same pattern as NFS CSI, suggesting containerd instability ~47h ago
+- **Node3 memory pressure**: 80% utilization, hosting both mysql-cluster-1 and mysql-cluster-2
+
+## Detection
+
+- **How detected**: Manual investigation (this post-mortem)
+- **Time to detect**: ~47h from start of NFS CSI crash-loop
+- **Gap analysis**: No alerting on CSI driver health. Existing pod alerts likely firing but root cause (CSI driver) not surfaced. Need CSI-specific health alerts.
+
+## Resolution
+
+Not yet resolved at time of investigation. Recommended steps:
+
+1. Delete NFS CSI node pods one at a time (DaemonSet will recreate with clean port allocation)
+2. Delete MetalLB crash-looping speaker pods (same approach)
+3. Force-delete mysql-cluster-0 (ContainerStatusUnknown) to release ibdata1 lock
+4. Once NFS healthy, vault-2 should start; vault-0 may need BoltDB lock file cleared
+
+## Action Items
+
+### Preventive (stop recurrence)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P1 | Investigate containerd health on worker nodes | Investigation | Port conflicts across NFS CSI + MetalLB suggest containerd restart/instability ~47h ago |
+| P1 | Fix NFS CSI liveness probe port allocation | Config | Use ephemeral ports or add port uniqueness checks to avoid stale port conflicts |
+| P2 | Add node anti-affinity for MySQL replicas | Terraform | mysql-cluster-1 and -2 both on node3 (80% memory) — spread across nodes |
+
+### Detective (catch faster)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P1 | Add CSI driver health alerting | Alert | PrometheusRule for CSI driver pod crash-loops — `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="democratic-csi"}` |
+| P1 | Add NFS mount failure alerting | Alert | Alert on pods stuck in ContainerCreating > 10min with volume mount errors |
+| P2 | Add Uptime Kuma monitor for Vault | Monitor | Vault health endpoint check |
+
+### Mitigative (reduce blast radius)
+
+| Priority | Action | Type | Details |
+|----------|--------|------|---------|
+| P2 | Add PDB for NFS CSI node DaemonSet | Config | Ensure at least N-1 nodes always have healthy CSI driver |
+| P2 | Document NFS CSI recovery runbook | Runbook | Steps to recover from port conflict scenario |
+| P3 | Evaluate moving MySQL off NFS to iSCSI | Architecture | iSCSI remained stable throughout; MySQL on NFS is fragile |
+
+## Lessons Learned
+
+- **Went well**: Critical path services (Traefik 3/3, Authentik 2/3, PostgreSQL) survived due to not depending on NFS CSI
+- **Went poorly**: 47h detection gap — no alerting on CSI driver health despite it being a single point of failure for all NFS workloads
+- **Got lucky**: iSCSI CSI remained stable, keeping PostgreSQL (CNPG) operational. If both CSI drivers had failed, the entire cluster would have been fully down
+
+## Raw Investigation Data
+
+<details>
+<summary>Cluster State Summary</summary>
+
+- **Nodes**: All 5 Ready. k8s-node2 metrics `<unknown>`. k8s-node3 at 80% memory.
+- **Tier 1 (Critical)**: Traefik OK (3/3), Authentik degraded (2/3), PostgreSQL degraded (1/2), Vault DOWN, Redis starting, MetalLB degraded (1/4)
+- **Tier 2 (Storage)**: NFS CSI FAILING (port conflict on all workers), iSCSI CSI OK
+- **Tier 3 (Apps)**: 15+ ContainerCreating (NFS), 5+ Pending (GPU/memory), 10+ CrashLoopBackOff (DB deps)
+- **Tier 4 (Databases)**: PostgreSQL recovering, MySQL DOWN (ibdata1 lock)
+
+</details>
+
+<details>
+<summary>NFS CSI Error Details</summary>
+
+Controller pods (2): CrashLoopBackOff — `listen tcp 127.0.0.1:29653: bind: address already in use`
+Node DaemonSet: 4/5 crash-looping (same error). Only k8s-master healthy.
+Age: ~47h with 96+ restarts.
+
+</details>
+
+<details>
+<summary>MySQL Error Details</summary>
+
+mysql-cluster-0: ContainerStatusUnknown
+mysql-cluster-1, mysql-cluster-2: CrashLoopBackOff — `Cannot open datafile './ibdata1'` (OS error 11: Resource temporarily unavailable)
+mysql-cluster-router: crash-looping (no healthy backend)
+
+</details>
+
+<details>
+<summary>Vault Error Details</summary>
+
+vault-0: CrashLoopBackOff — "failed to open bolt file: timeout" (Raft BoltDB lock)
+vault-2: ContainerCreating — NFS mount failure
+
+</details>
--- a/docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md
+++ b/docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md
@ -0,0 +1,223 @@
+# Post-Mortem: NFS fsid=0 Cascade — DNS + Vault + Multi-Service Outage
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-14 |
+| **Duration** | ~5h initial (05:37–10:40 EEST), then ~2h secondary (NFS restart broke NFSv3 + DNS zone sync gap) |
+| **Severity** | SEV1 |
+| **Affected Services** | 48+ pods across 20+ namespaces. DNS (all instances), Vault, MySQL, Grafana, Uptime Kuma, ebooks, phpipam, immich, servarr, and more |
+| **Status** | Complete |
+
+## Summary
+
+An `fsid=0` flag in the PVE host's `/etc/exports` caused all NFSv4 subdirectory mount paths from k8s to fail. Combined with a `lockd` configuration failure that broke NFSv3 fallback, this made ALL new NFS mounts impossible and ALL existing NFS mounts stale after the NFS server had been restarted on Apr 11. The failure cascaded into DNS (primary unreachable), Vault (lost Raft quorum), Alertmanager (monitoring blind), and 20+ other services.
+
+## Impact
+
+- **User-facing**: DNS intermittent failures for 192.168.1.x network users (primary down, secondary/tertiary covered ~66% of queries). Vault-dependent services unable to rotate secrets.
+- **Blast radius**: 57,405 NFS error messages across 4 k8s nodes + PVE host. 53 NFS-backed PVs at risk.
+- **Duration**: ~5h of active NFS failure. Some services (ebooks) were in CreateContainerError for 2d22h before detection.
+- **Data loss**: None. NFS data remained intact on disk; only the NFS service was broken.
+- **Monitoring gap**: Alertmanager itself was on NFS storage, so it couldn't alert about the NFS failure.
+
+## Timeline (EEST)
+
+| Time | Event |
+|------|-------|
+| **Apr 11 00:44:52** | NFS server restarts on PVE. Logs `exportfs: can't open /etc/exports for reading` and `nfsdctl: lockd configuration failure`. Server starts with broken export configuration. |
+| **Apr 11 (later)** | `/etc/exports` recreated with `fsid=0` on `/srv/nfs` and `fsid=1` on `/srv/nfs-ssd`. Exports reloaded. Existing k8s NFS mounts continue working (cached file handles). |
+| **Apr 13 23:52** | TrueNAS (10.0.10.15) goes completely unreachable. PVE host's hard NFS mounts to TrueNAS start blocking with D-state kernel process. |
+| **Apr 14 05:04** | `daily-backup.service` starts on PVE. 25/88 LUKS PVC snapshots fail to mount. |
+| **Apr 14 05:37** | **CASCADE BEGINS**: k8s-node1, node3, node4 simultaneously start reporting `nfs: server 192.168.1.127 not responding, timed out`. Existing cached file handles expire; new mount operations hit the fsid=0 + lockd issues. |
+| **Apr 14 ~07:30** | DNS outage reported by users on 192.168.1.x network. Investigation begins. |
+| **Apr 14 10:40** | **FIX**: `fsid=0` removed from `/etc/exports`, NFS server restarted. New mounts work. |
+| **Apr 14 10:41** | Vault pods restarted — all 3 come up 2/2 Running. Raft quorum restored. |
+| **Apr 14 10:42** | Technitium primary pod restarted with fresh NFS mount. DNS fully restored. |
+| **Apr 14 10:44–11:04** | Technitium primary migrated from NFS to `proxmox-lvm-encrypted` PVC. Data restored (104.9M including all 11 zones). Terraform state reconciled. |
+
+## Root Cause Chain
+
+```
+[1] fsid=0 in /etc/exports (introduced Apr 11)
+ └─> NFSv4 treats /srv/nfs as pseudo-root
+     └─> Subdirectory paths (/srv/nfs/technitium) resolve incorrectly → ENOENT
+         ├─> [2] lockd configuration failure (since Apr 11)
+         │    └─> NFSv3 fallback fails (statd not running, no 'nolock' in CSI mount options)
+         │         └─> ALL new NFS mounts fail with "No such file or directory"
+         └─> [3] Existing mounts go stale (cached file handles expire ~Apr 14 05:37)
+              ├─> Technitium primary: I/O errors on /etc/dns → degraded DNS
+              ├─> Vault 0+1: CreateContainerError → lost Raft quorum → Vault down
+              ├─> Alertmanager: I/O errors → monitoring blind spot
+              ├─> Grafana: CrashLoopBackOff (MySQL password rotation failed during Vault outage)
+              ├─> ebooks: CreateContainerError (2d22h undetected)
+              └─> 20+ CronJobs and services: Error/FailedMount
+```
+
+### Why fsid=0 was there
+
+During the TrueNAS → Proxmox NFS migration on Apr 11, the NFS server was restarted. At startup, `/etc/exports` was missing/unreadable. When the exports file was recreated, `fsid=0` was included — likely copied from an NFSv4 example. This flag is only appropriate for dedicated NFSv4-only exports, not for NFS CSI dynamic provisioning which mounts subdirectories.
+
+### Why the failure was delayed 3 days
+
+Existing NFS mounts from before the Apr 11 restart continued working because:
+- NFSv3 mounts cache file handles and don't re-negotiate protocol on every I/O
+- The `soft,timeo=30` mount options return errors after timeout, but cached operations succeed
+- Only when cached handles expired or new mounts were needed did the failure manifest
+
+The trigger was likely the `daily-backup.service` at 05:04 on Apr 14, which accesses NFS exports and may have caused the NFS server to recycle state, invalidating cached handles cluster-wide.
+
+## Contributing Factors
+
+1. **TrueNAS unreachable since Apr 13 23:52**: PVE host has `hard` NFS mounts to TrueNAS that will retry forever. D-state kernel process stuck since Apr 13. This may have contributed to NFS thread contention on the PVE host.
+
+2. **Alertmanager on NFS storage**: The very system meant to alert about storage failures was stored on the failing storage. Circular dependency.
+
+3. **`/etc/exports` not managed by Terraform or git**: Changes to this critical configuration file are untracked, making it impossible to audit when `fsid=0` was introduced.
+
+4. **No NFS-specific health alerts**: While CLAUDE.md mentions "NFS responsiveness" alerts, no alert fired during this 5+ hour outage. The Prometheus rule may not cover mount failures from the k8s node perspective.
+
+5. **CSI mount options lack `nfsvers=3` and `nolock`**: The NFS CSI driver uses default mount options that rely on version auto-negotiation. When NFSv4 fails (fsid=0) and NFSv3 fails (lockd), there's no fallback path.
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| No alert on NFS mount failures from k8s nodes | 5h to detection | Add PrometheusRule: `node_nfs_requests_total` error rate |
+| Alertmanager on NFS storage | Alerting blind during NFS outage | Move Alertmanager to `proxmox-lvm-encrypted` |
+| `/etc/exports` not in git/Terraform | Can't audit config changes | Manage via Ansible or TF `remote-exec` |
+| No TrueNAS reachability alert | 11h unnoticed before cascade | Add ping/ICMP monitor in Uptime Kuma |
+| No CSI mount failure alert | Pods stuck for days unnoticed | Alert on `kube_pod_container_status_waiting_reason{reason="ContainerCreating"}` > 10m |
+| ebooks pods broken for 2d22h | Zero notification | Above alert covers this |
+| Grafana down 37h | Dashboard monitoring blind | Uptime Kuma HTTP check already exists; verify it alerts |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure from recurring
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Remove `fsid=0` from `/etc/exports` on PVE host | Done | Completed Apr 14 |
+| Fix `lockd configuration failure` on PVE NFS server | Done | Disabled NFSv3 entirely (`vers3=n`). lockd is an nfsdctl bug on kernel 6.14 — not fixable without Proxmox patch. |
+| Force-unmount hung TrueNAS NFS mounts on PVE | Done | `umount -l /mnt/truenas-src /mnt/truenas-ssd`. Not in fstab — won't recur. |
+| Manage `/etc/exports` in git (add to `infra/scripts/` and deploy via PVE provisioning) | Done | `scripts/pve-nfs-exports` added with fsid=0 safety comments. Deploy: scp + exportfs -ra |
+| Migrate all NFS PVs to NFSv4 | Done | Patched 52 PVs to `nfsvers=4`. Updated TF module + StorageClass. Applied all 20 stacks. |
+| Add DNS zone sync CronJob | Done | `technitium-zone-sync` runs every 30min, replicates all primary zones to secondary/tertiary via AXFR |
+
+### P1 — Eliminate the NFS single point of failure for critical services
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Migrate Technitium primary to `proxmox-lvm-encrypted` | Done | Completed Apr 14 |
+| Migrate Alertmanager PV from NFS to `proxmox-lvm-encrypted` | Done | `storage-prometheus-alertmanager-0` now on `proxmox-lvm-encrypted`. Helm `persistence.storageClass` set. |
+| Migrate Vault PVCs from `nfs-proxmox` to `proxmox-lvm-encrypted` | TODO | Vault is too critical for NFS dependency |
+| Review all 53 NFS PVs — identify which are critical-path and migrate | TODO | Reduce NFS blast radius |
+
+### P2 — Detect NFS failures before users notice
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Add PrometheusRule: NFS mount errors from node kernel logs | Done | `NFSHighRPCRetransmissions` alert: rate > 5/s for 5m → warning |
+| Add PrometheusRule: Pods in ContainerCreating > 10 minutes | Done | `NFSMountFailures` + `NFSCSINodeDown` alerts added to Prometheus |
+| Add Uptime Kuma monitor: TrueNAS ping (10.0.10.15) | Done | Already existed (id=66, PING, 60s). Verified active. |
+| Add Uptime Kuma monitor: PVE NFS port 2049 TCP check | Done | Already existed (id=96 + id=328, PORT 2049, active). Verified. |
+| Verify Grafana Uptime Kuma alert actually fires | Done | Monitor id=32 has notificationIDList=[1] (Slack). Root cause of 37h gap: Uptime Kuma itself on NFS. |
+
+### P3 — Improve NFS resilience
+
+| Action | Owner | Status |
+|--------|-------|--------|
+| Remove hung TrueNAS `hard` mounts from PVE fstab (TrueNAS is sunset) | TODO | Eliminates D-state kernel process risk |
+| Add NFS export health check to daily-backup script | Done | `check_nfs_exports()` added to `scripts/daily-backup.sh`: checks fsid=0, nfs-server active, exportfs count |
+| Document NFS CSI mount option requirements in CLAUDE.md | Done | Added NFS CSI section: nfsvers=4 mandatory, fsid=0 forbidden, critical services → proxmox-lvm-encrypted |
+
+## Phase 2: NFS Restart Broke NFSv3 + DNS Zone Sync Gap
+
+### What happened after the 10:40 fix
+
+The NFS server restart at 10:40 (which fixed the fsid=0 issue) introduced two new problems:
+
+#### Problem 1: NFSv3 completely broken after restart
+
+After the restart, **ALL NFSv3 mount(2) system calls returned EIO** from k8s worker nodes, even though:
+- NFS port 2049 was reachable from all nodes
+- `showmount -e` listed correct exports
+- NFSv4 mounts worked perfectly from all nodes
+- NFSv3 mounts worked from k8s-master (which had no prior NFS mounts — clean kernel state)
+
+**Root cause**: `nfsdctl: lockd configuration failure` on PVE kernel 6.14.11-4-pve — a bug where nfsdctl's `autostart` command tries to call a non-existent `lockd` subcommand. This warning was present since Apr 11 but NFSv3 only broke after the restart. Worker nodes retained corrupted NFS client kernel state from the stale mount period that could not be cleared without a reboot.
+
+**Resolution**: Patched all 52 NFS PVs to add `nfsvers=4` mount option via `kubectl patch pv`. Updated Terraform `nfs_volume` module and `nfs-csi` StorageClass. Disabled NFSv3 on PVE (`vers3=n` in `/etc/nfs.conf`). Applied to all 20+ Terraform stacks.
+
+#### Problem 2: DNS zone sync gap — .lan resolution failures
+
+**Finding**: Technitium secondary and tertiary instances had **only 5 default zones** (localhost, arpa). Custom zones (`viktorbarzin.lan`, `viktorbarzin.me`, reverse lookup zones, etc.) only existed on the primary. This was a **pre-existing gap** — the zone setup was a one-time Kubernetes Job that ran at initial deployment and never synced new zones created afterward.
+
+**Impact**: The MetalLB VIP (10.0.20.201) load-balances across all 3 instances. 2/3 of queries hit secondary/tertiary → NXDOMAIN for `.lan` → cached for 300s by CoreDNS → ExternalName services (e.g., `ha-sofia.viktorbarzin.lan`) returned 502 Bad Gateway.
+
+**Why it surfaced now**: The NFS outage restarted the primary Technitium pod, flushing client DNS caches. This increased the visible failure rate for `.lan` queries.
+
+**Resolution**: 
+1. Created `viktorbarzin.lan` and `viktorbarzin.me` as Secondary zones on both secondary and tertiary via Technitium API
+2. **Converted one-time setup Job to a CronJob** (`technitium-zone-sync`) running every 30 minutes that:
+   - Gets all zones from primary
+   - Enables zone transfer (AXFR) on primary
+   - Creates missing zones as Secondary type on replicas
+   - Resyncs existing zones
+3. 20 zones were synced to tertiary that were previously missing
+
+### Phase 2 Timeline
+
+| Time | Event |
+|------|-------|
+| **10:40** | NFS server restarted (fsid=0 fix). NFSv3 breaks. 48+ stale mounts across all workers. |
+| **10:41–10:50** | Vault, DNS primary come back. But NFSv3 mounts stay stale. |
+| **11:00–11:40** | Investigation: NFSv3 mount(2) returns EIO on workers, NFSv4 works. Patched 52 PVs to nfsvers=4. |
+| **11:40–12:00** | Restarted all NFS-dependent pods. Fixed MySQL (Vault rotation mismatch), Redis HAProxy, Woodpecker DB. |
+| **12:00–12:15** | Users report .lan resolution failures. Discovered secondary/tertiary missing all custom zones. |
+| **12:15** | Created secondary zones on secondary/tertiary via API. .lan resolution restored. |
+| **12:15–12:20** | Converted one-time setup Job to zone-sync CronJob. Applied via Terraform. |
+
+### Additional collateral damage fixed during Phase 2
+
+| Issue | Root Cause | Fix |
+|-------|-----------|-----|
+| Grafana, realestate-crawler, shlink MySQL access denied | Vault rotated passwords during NFS outage, credentials mismatched | Force-rotated Vault DB roles, manually synced Grafana password |
+| Uptime Kuma `mysql_native_password` plugin not loaded | MySQL 8.4 disabled plugin by default | Enabled via `mysql-native-password=ON` in mycnf, changed user auth plugin |
+| Redis HAProxy all backends DOWN | Health check timeout during cluster turbulence | Restarted HAProxy pods |
+| Woodpecker missing PostgreSQL database | DB init Job ran at deploy, DB dropped during cluster recreation | Manually created database |
+| Nextcloud PVC deleted | `nextcloud-data-proxmox` was Terminating and got garbage collected | Rebound Released PV to new PVC |
+| MySQL InnoDB cluster OFFLINE | rollout restart invalidated operator state, kopf finalizer blocked deletion | Removed finalizer, recreated cluster via `dba.createCluster()` |
+
+## Lessons Learned
+
+1. **NFSv4 `fsid=0` is dangerous for CSI subdirectory mounts**: It changes path resolution semantics in non-obvious ways. Never use it on exports that serve dynamic subdirectory mounts.
+
+2. **Critical monitoring infrastructure must not depend on the thing it monitors**: Alertmanager on NFS cannot alert about NFS failures. This is the same anti-pattern as "DNS depends on DNS" or "monitoring depends on the monitored database".
+
+3. **Stale NFS mounts have delayed-action failure modes**: The 3-day gap between the config change (Apr 11) and the outage (Apr 14) made root cause analysis much harder. Cached file handles mask configuration errors.
+
+4. **`/etc/exports` is a single-point-of-configuration**: Unmanaged, unversioned, no review process. A single flag caused a cluster-wide outage.
+
+5. **This is the SECOND DNS outage related to NFS migration** (first was Apr 6 — unbound PVC). Storage migrations for DNS infrastructure need extra scrutiny and pre-migration testing.
+
+6. **DNS HA requires zone replication, not just pod replication**: Having 3 Technitium pods with a PDB is useless if only the primary has the zone data. A one-time setup Job is insufficient — zones created after initial deployment are never synced. This is now fixed with a recurring CronJob.
+
+7. **NFSv3 client kernel state survives mount cleanup**: Force-unmounting all NFS mounts from a node does NOT clear the kernel's per-server NFS client state. The only reliable fix was switching to NFSv4 (different protocol path). NFSv3 is now disabled on the PVE server.
+
+8. **`kubectl rollout restart statefulset` is dangerous for operator-managed StatefulSets**: The MySQL InnoDB operator lost track of its cluster state after the rollout restart changed the pod template. Recovery required manually removing kopf finalizers, recreating the InnoDB cluster, and re-bootstrapping the router.
+
+## Follow-up Implementation
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
+| 2026-04-14 | Add `NFSHighRPCRetransmissions` PrometheusRule (`node_nfs_rpc_retransmissions_total` rate > 5/s for 5m) | P2 | Alert | [`35646841`](https://github.com/ViktorBarzin/infra/commit/35646841) | postmortem-todo-resolver |
+| 2026-04-14 | Migrate Alertmanager PV from NFS to `proxmox-lvm-encrypted` (eliminates circular alerting dependency) | P2 | Alert | [`35646841`](https://github.com/ViktorBarzin/infra/commit/35646841) | postmortem-todo-resolver |
+| 2026-04-14 | Add `/etc/exports` to git as `scripts/pve-nfs-exports` with fsid=0 safety documentation | P2 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
+| 2026-04-14 | Add `check_nfs_exports()` to `scripts/daily-backup.sh` (detects fsid=0, nfs-server down, no active exports) | P3 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
+| 2026-04-14 | Document NFS CSI mount requirements in `.claude/CLAUDE.md` (nfsvers=4 mandatory, fsid=0 forbidden) | P3 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
+| 2026-04-14 | TrueNAS ICMP ping monitor (id=66) — already existed, verified active | P2 | Monitor | — | Verified existing |
+| 2026-04-14 | PVE NFS port 2049 TCP monitor (id=96, 328) — already existed, verified active | P2 | Monitor | — | Verified existing |
+| 2026-04-14 | Grafana Uptime Kuma monitor — alert wired to Slack (notificationIDList=[1]). Root cause of 37h gap: Uptime Kuma itself on NFS storage | P2 | Alert | — | Verified existing |
+| — | Migrate Vault PVCs from `nfs-proxmox` to `proxmox-lvm-encrypted` | P2 | Migration | — | Needs human review |
+| — | Review all 53 NFS PVs — identify critical-path and migrate | P2 | Migration | — | Needs human review |
+| — | Remove hung TrueNAS `hard` mounts from PVE fstab (TrueNAS is sunset) | P2 | Migration | — | Needs human review |
--- a/docs/post-mortems/2026-04-14-postmortem-pipeline-test.md
+++ b/docs/post-mortems/2026-04-14-postmortem-pipeline-test.md
@ -0,0 +1,36 @@
+# Post-Mortem: Pipeline E2E Test
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-14 |
+| **Duration** | N/A |
+| **Severity** | SEV3 |
+| **Affected Services** | None (test) |
+| **Status** | Draft |
+
+## Summary
+
+Test post-mortem to validate the automated TODO implementation pipeline end-to-end.
+
+## Prevention Plan
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | Add Uptime Kuma monitor for Proxmox web UI port 8006 | Monitor | TCP check on 192.168.1.127:8006 to detect PVE management plane down | TODO |
+| P2 | Migrate Alertmanager to encrypted storage | Architecture | Move from NFS to proxmox-lvm-encrypted to avoid circular alerting dependency | TODO |
+
+## Lessons Learned
+
+1. Automated post-mortem pipelines reduce mean time to remediation.
+
+## Follow-up Implementation
+
+_This section is auto-populated by the postmortem-todo-resolver agent._
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
+
+# E2E test 17:12
+# E2E validation 17:27:45
+# Final E2E test Tue Apr 14 05:43:38 PM UTC 2026
+# 1776188690
--- a/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
+++ b/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
@ -0,0 +1,150 @@
+# Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-18 |
+| **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
+| **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
+| **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
+| **Status** | Root cause fixed; permanent mitigation applied; alerting still TODO |
+
+## Summary
+
+The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
+
+Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
+
+## Impact
+
+- **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
+- **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
+- **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
+- **Data loss**: None. Session state in tmpfs is ephemeral by design.
+- **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
+| **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
+| **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
+| **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
+| **Apr 18 09:00–12:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
+| **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
+| **Apr 18 12:34** | Viktor reports he can no longer log in either. |
+| **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
+| **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
+| **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
+| **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
+| **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
+| **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
+
+## Root Cause Chain
+
+```
+[1] goauthentik/proxy outpost uses gorilla/sessions FileStore
+ └─> each forward-auth request that has no valid session cookie writes
+     /dev/shm/session_<random> (~1500 bytes/file)
+         │
+         ├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
+         │    └─> each file's MaxAge = 7 days
+         │        └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
+         │            delete files whose MaxAge has EXPIRED, not whose age exceeds any
+         │            shorter threshold
+         │
+         ├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
+         │    real user traffic)
+         │    └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
+         │
+         └─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
+              └─> 64 MB / 1500 bytes ≈ 44,000 files maximum
+                  └─> Full in approx 44,000 / (18 × 60) min ≈ 41 hours
+                      └─> Actual observed time: pod started Apr 15 ~09:21,
+                          first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
+                          (some excess from Uptime-Kuma bursts)
+
+[ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
+         -> Traefik forwards the 400 to the browser
+         -> user sees "400 Bad Request" on every protected site
+```
+
+## Why Diagnosis Took So Long
+
+The initial report was framed as "Emil can't access terminal" — a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, all of which turned out to be falsified.
+
+Contributing distractions:
+1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
+2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
+3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
+4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
+
+## Contributing Factors
+
+1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
+2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
+3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
+4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
+5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
+6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
+| No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
+| Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
+| Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
+
+### P1 — Detect this next time
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P1 | Prometheus alerts on outpost `/dev/shm` fill (two thresholds) | Alert | Group `Authentik Outpost` added in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`. `AuthentikOutpostMemoryHigh` (warning, working set > 1.5 GiB for 15m) + `AuthentikOutpostMemoryCritical` (critical, > 1.8 GiB for 5m) + `AuthentikOutpostRestarts` (warning, > 2 restarts in 30m). Applied 2026-04-18 13:16 UTC; loaded in Prometheus, state=inactive. | **DONE** |
+| P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
+| P1 | Bump tmpfs `sizeLimit` from 512Mi → 2Gi + set explicit container memory limit 2560Mi | Config | Patched outpost `kubernetes_json_patches` via Authentik API. 2026-04-18 13:06 UTC (sizeLimit), 13:22 UTC (container limit). **Gotcha**: `sizeLimit` alone is insufficient — writes to tmpfs count against container cgroup memory, and Kyverno's `tier-defaults` LimitRange sets a default `limits.memory: 256Mi` which OOM-kills the container before tmpfs fills. Fix is to also set `containers[0].resources.limits.memory` ≥ `sizeLimit + working_set_headroom`. Verified 1.5 GB file write succeeds on the configured pod; df reports 2.0 GB tmpfs. Gives ~8× growth headroom at current probe rate. | **DONE** |
+
+### P2 — Codify the fix so it survives drift
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. **Done 2026-05-10**: `authentik_outpost.embedded` resource + `authentik_provider_proxy.catchall.access_token_validity` codified, plan-to-zero on the whole stack. The `Outpost.managed` field is server-set (not in provider schema) and preserved across applies because TF only writes known fields. Same-day work also flipped the outpost's session backend from filesystem (`/dev/shm`) to PostgreSQL — see `.claude/reference/authentik-state.md`. | **DONE** |
+| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
+
+### P3 — Upstream + architectural
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
+| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Original idea: shrink steady-state session file count (~7× reduction) at the cost of daily re-auth. **Resolved differently 2026-05-10**: switched the outpost to the PostgreSQL session backend (`Outpost.managed = goauthentik.io/outposts/embedded` + `AUTHENTIK_POSTGRESQL__*` envFrom), which makes session count irrelevant for tmpfs sizing and lets us BUMP `access_token_validity` to `weeks=4` for better UX without cost. | **DONE (alt)** |
+| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | Original framing: external, multi-replica outpost with Redis-backed sessions. **Resolved 2026-05-10** by enabling the postgres-backed session store on the embedded outpost itself (PR goauthentik/authentik#16628). Sessions now persist across pod restarts; the original "in-memory state" concern is moot. Multi-replica still requires a goauthentik upstream fix (PgBouncer-friendly session migration), but the loss-of-state class of failures is gone. | **DONE (alt)** |
+
+## Lessons Learned
+
+1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
+
+2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
+
+3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context** — `curl` with no cookie jar is the fastest way.
+
+4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
+
+5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
+
+6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
+
+## References
+
+- Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
+- Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
+- Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
+- Beads task: `code-zru` (P1 bug).
--- a/docs/post-mortems/2026-04-19-registry-orphan-index.md
+++ b/docs/post-mortems/2026-04-19-registry-orphan-index.md
@ -0,0 +1,246 @@
+# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
+| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
+| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
+| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest` — `default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
+| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
+
+## Summary
+
+On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
+`clone` step with "image can't be pulled". The image in question — the CI
+toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
+to an OCI image index whose `linux/amd64` platform manifest
+(`sha256:98f718c8…`) and its in-toto attestation
+(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
+The index record itself still existed — it's the children that had been
+garbage-collected out from under it.
+
+This is the **second identical incident**: the same failure mode occurred
+on 2026-04-13 against a different image. Both times the immediate fix was
+to rebuild the image from scratch; both times the root cause was left
+unaddressed.
+
+## Impact
+
+- **User-facing**: all CI pipelines failed. No automated Terraform applies,
+  no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
+  reruns) all failed with the same error.
+- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
+  k8s workloads (those pull via containerd, which goes through the
+  pull-through proxy on :5000/:5010 — a completely different code path).
+- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
+  commit `c113be4d` — roughly 40 min. Pipelines that had already been
+  triggered queued up until the rebuild restored `:latest`.
+- **Data loss**: none. The registry has the index object; the child
+  manifests are re-producible by rebuilding the source image.
+- **Monitoring gap**: nothing alerted. The only signal was the individual
+  pipeline failures from Woodpecker. No Prometheus alert fires on "the
+  registry served a 404 for a tag that exists".
+
+## Timeline (UTC, 2026-04-19)
+
+| Time | Event |
+|------|-------|
+| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
+| 09:00–11:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
+| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
+| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
+| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
+| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
+| 12:00 | Investigation begins (this document's work). |
+
+## Root Cause Chain
+
+```
+[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
+ └─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
+     This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
+         │
+         ├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
+         │    BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
+         │    per-repo revision-link files under
+         │    `<repo>/_manifests/revisions/sha256/<child-digest>/link` for
+         │    every child referenced by that tag's index. The raw blob data
+         │    under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
+         │    is NOT touched — GC owns that, and GC only runs Sunday.
+         │
+         ├─> [3] If ANOTHER tag's index still references one of those same
+         │    children (common — successive rebuilds share layers), the child
+         │    blob survives. But the revision-link is gone, so the registry
+         │    API can no longer map `<repo>/manifests/sha256:<child>` back
+         │    to the blob. HEAD → 404, even though the bytes are on disk.
+         │    distribution/distribution#3324 is the upstream class of this bug.
+         │
+         └─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
+              intact on disk, its children's blob data files are intact on
+              disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
+              returns 404. The registry has the bytes, but cannot find them
+              through the API because the per-repo link bridge is gone.
+
+[pull] containerd resolves `infra-ci:latest`
+         │
+         ├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
+         │
+         └─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
+              └─> containerd fails the pull with "manifest unknown"
+                    └─> woodpecker exit 126
+```
+
+> **Detection-gotcha** uncovered 2026-04-19 while implementing
+> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
+> presence is NOT equivalent to "can the registry serve this child?" The
+> authoritative check is whether
+> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
+> was rewritten to check the per-repo link file after the HTTP probe
+> caught 38 real orphans the filesystem scan had reported clean.
+
+## Why Existing Remediation Missed It
+
+1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
+   walks `_layers/sha256/` and removes link files whose blob `data` is
+   missing. It does NOT inspect `_manifests/revisions/sha256/` to see
+   whether an image-index's referenced children still exist. That's
+   exactly the class of orphan this incident represents.
+2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
+   only to `registry:2`. Whatever Docker Inc. last rebuilt as
+   "v2-current" was running, with no version pin. Any regression in
+   the upstream walker would silently swap in.
+3. **No integrity monitoring.** Prometheus alerted on cache hit rate
+   and registry-down, but nothing probes "are the manifests the registry
+   advertises actually fetchable?"
+4. **CI pipeline didn't verify its own push.** `buildx --push` returns
+   success as soon as it uploads. If a child blob upload 0-byted or
+   the client disconnected mid-push (distinct from the GC mode but the
+   same on-disk symptom), nothing would notice until the next pull.
+
+## Permanent Fix — Three Phases
+
+### Phase 1 — Detection (ship today)
+
+1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
+   After `build-and-push`, a new step walks the just-pushed manifest
+   (and every child of an image index) and HEADs every referenced blob.
+   Any non-200 fails the pipeline immediately, catching broken pushes at
+   the source rather than leaking them to consumers.
+2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
+   CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
+   namespace) walks the private registry's catalog, HEADs every tag's
+   manifest, follows each image index's children, and pushes
+   `registry_manifest_integrity_failures` to Pushgateway. Accompanying
+   alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
+3. **Post-mortem** — this document. Linked from
+   `.claude/reference/service-catalog.md` via the new runbook.
+
+### Phase 2 — Prevention
+
+4. **Pin `registry:2` → `registry:2.8.3`** in
+   `modules/docker-registry/docker-compose.yml` (all six registry
+   services). Removes the floating-tag footgun.
+5. **Extend `fix-broken-blobs.sh`** to scan every
+   `_manifests/revisions/sha256/<digest>` that is an image index and
+   flag children whose blob `data` file is missing. The script prints a
+   loud WARNING per orphan; it does not auto-delete the index, because
+   deleting a published image is a conscious decision, not an automated
+   repair.
+
+### Phase 3 — Recovery tooling
+
+6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
+   need a cosmetic Dockerfile edit — POST to the Woodpecker API or
+   click "Run manually" in the UI.
+7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
+   command sequence for the next time this happens, plus fallback paths.
+
+## Out of Scope
+
+- **Pull-through caches.** The DockerHub / GHCR mirrors on
+  `:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
+  orphan problem is private-registry-only. No changes to nginx or
+  containerd `hosts.toml`.
+- **Registry HA / replication.** Single-VM SPOF is a known
+  architectural choice. Harbor or a replicated registry would solve
+  more than this incident requires, at multi-day cost. Synology offsite
+  snapshots already give RPO < 1 day.
+- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
+  necessary; the fix is detection + rebuild, not "stop cleaning up".
+
+## Lessons
+
+- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
+  2026-04-13 incident was closed when CI turned green. Without a probe
+  and without a scan for orphan indexes, the next incident was
+  inevitable — and it happened six days later against a different image.
+- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
+  outage feature.** The registry was serving 404s for 2+ hours before
+  anyone noticed, because our only signal was "pipeline failures" and
+  our eyes were elsewhere. The new probe closes that gap.
+- **CI pipelines should verify their own output.** The `buildx --push`
+  "success" exit code is not a guarantee of pulled-back integrity — as
+  this incident proves. A 30-second post-push HEAD walk is cheap
+  insurance.
+
+## Related
+
+- **Prior incident (same failure mode, different image)**: memory `709`
+  / `710` — 2026-04-13.
+- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
+- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
+- **Upstream bug class**: `distribution/distribution#3324`.
+
+## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
+
+Same failure class, broader scope. The `registry-integrity-probe`
+surfaced 38 broken manifest references persisting after the 04-19
+infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
+`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
+affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
+manifests were absent from blob storage (same orphan pattern).
+
+**Action taken**:
+1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
+   from `0c24c9b6` → `2fd7670d` (the canonical tag in
+   `claude-agent-service/main.tf`), reused — same image already healthy
+   on the registry. `scripts/tg apply` on `beads-server`. Deleted the
+   stuck Jobs so new CronJob ticks could fire.
+2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
+   probe using `registry-probe-credentials` K8s Secret. Deleted each
+   via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
+   claude-agent-service:latest pointed at an already-deleted digest).
+3. Ran `docker exec registry-private /bin/registry garbage-collect
+   /etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
+   storage.
+4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
+   at missing children, so no cached copies would survive pod
+   reschedule):
+   - `freedify:latest` / `freedify:c803de02` — built on registry VM
+     directly (no CI pipeline exists for this image; python FastAPI).
+   - `beadboard:17a38e43` / `beadboard:latest` — GHA
+     `workflow_dispatch` failed at registry login (missing
+     `REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
+     registry VM directly as the fallback. GitHub secret gap is a
+     follow-up — beads `code-8hk` notes it.
+   - `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
+     — Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
+     the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
+     that drift is pre-existing, not introduced by this cleanup).
+   - `wealthfolio-sync:latest` — **not rebuilt**. Monthly CronJob (next
+     run 2026-05-01), no source tree or CI pipeline available in the
+     monorepo; deferred for separate follow-up.
+
+**Post-cleanup state**:
+- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
+- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
+  5h 32m).
+- No `ImagePullBackOff` pods anywhere in the cluster.
+- 28 of 34 deleted manifests were **dangling tags not referenced by any
+  workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
+  deletes, no rebuilds needed.
+
+**Permanent fix still in flight**: Phase 2/3 of this post-mortem
+(post-push verification in CI, atomic `cleanup-tags.sh`) — not
+addressed by this cleanup. The probe continues to be the
+authoritative detector.
--- a/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md
+++ b/docs/post-mortems/2026-04-22-vault-raft-leader-deadlock.md
@ -0,0 +1,155 @@
+# Post-Mortem: Vault Raft Leader Deadlock + NFS Kernel Client Corruption Cascade
+
+> **Resolution status (2026-04-25):** Resolved structurally by code-gy7h
+> migration. All 3 vault voters now on `proxmox-lvm-encrypted` block
+> storage; the NFS fsync incompatibility that triggered the original
+> raft hang is no longer reachable. See
+> `docs/plans/2026-04-25-nfs-hostile-migration-plan.md` Phase 2.
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-04-22 |
+| **Duration** | External endpoint 503 from ~09:00 UTC to ~11:43 UTC (~2h 43m). vault-2 became active leader 11:43:28 UTC. |
+| **Severity** | SEV1 (Vault — single source of secrets for 40+ services) |
+| **Affected Services** | All ESO-backed services (password rotation paused). CronJobs that read plan-time secrets (14 stacks). Woodpecker CI (blocked pipeline `d39770b3`). Everything with `ExternalSecret` refresh interval ≤ 2h. |
+| **Status** | Vault HA operational with vault-0 + vault-2 quorum. vault-1 still stuck ContainerCreating on node2 (third node2 reboot pending; workload can accept 2/3 quorum). Terraform fix committed as `2f1f9107`; apply pending. |
+
+## Summary
+
+A Vault raft leader (`vault-2`) entered a stuck goroutine state where its cluster port (8201) accepted TCP but never completed msgpack RPC. Standbys could not detect leader death because the TCP layer looked healthy, so no re-election fired. The only recovery was to kill the leader. During recovery, abrupt `kubectl delete --force` of the stuck Vault pods left kernel-side NFS client state on k8s-node1/node3/node4 in a corrupted state — **all new NFS mounts from those nodes timed out at 110s**, while existing mounts kept working. This created a cascade: the stuck leader blocked quorum, killing the leader broke NFS on the destination node for the recreated pod, force-killing the stuck pods left zombie `containerd-shim` processes kubelet couldn't clean up, and the resulting volume-manager loops pegged kubelet into 2-minute timeouts. Recovery required a VM hard-reset for node2 and node3 (kubelet was zombie on both). vault-0 remains down pending node4 reboot.
+
+## Impact
+
+- **User-facing**: `vault.viktorbarzin.me` returned HTTP 503 for ~2h. Any service that needed a Vault token during that window was degraded; Woodpecker CI pipeline blocked.
+- **Blast radius**: 3/3 Vault pods affected (raft deadlock blocked re-election even with standbys up). Three k8s nodes degraded simultaneously with kernel NFS client stuck state (node1, node3, node4). Two nodes required VM hard-reset to recover kubelet (node2, node3).
+- **Duration**: Degraded ~2h; resolution required sequential hard reboots.
+- **Data loss**: None. Raft data integrity preserved on NFS. vault-1 came up with index 2475732, caught up to 2476009+ once leader was elected.
+- **Observability gap**: No alert fired for the stuck raft leader. Standbys report `HA Mode: standby, Active Node Address: <leader IP>` as if healthy even when leader is hung.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **~09:00** | `vault-2` (original raft leader) enters hung state — port 8201 open but msgpack RPCs hang. Its own logs go silent. Standbys continue heartbeat/appendEntries with `msgpack decode error [pos 0]: i/o timeout`. Neither standby triggers re-election because raft transport does not distinguish "TCP open + silent" from "TCP open + healthy". |
+| **~09:15** | External endpoint starts serving 503. Woodpecker CI pipeline `d39770b3` blocks waiting for Vault. |
+| **09:59** | Operator force-deletes `vault-2` pod — replacement comes up on node3 and enters candidate loop (term=32), cannot get quorum because DNS for `vault-0` is NXDOMAIN (ContainerCreating) and vault-1 does not respond (its raft goroutine also hung). |
+| **10:07** | Operator force-deletes `vault-1` — new `vault-1` gets scheduled to node2. Its raft would be fine, but kubelet on node2 hangs in the pod cleanup path for the old pod's NFS mount. Concurrently, a new `vault-0` pod is attempted on node4, but **NFS mount from node4 times out at 110s** — the host kernel NFS client is in a degraded state that blocks all new mounts (including to completely different NFS paths like `/srv/nfs/ytdlp`). |
+| **10:09** | Diagnostic test: from node1 and node4 CSI pods, `mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs/ytdlp /tmp/test` times out. From node2 and node3 the same mount succeeds. NFS server is healthy (`showmount -e` works; `rpcinfo` shows all programs registered). The common factor on the broken nodes: they had a force-terminated Vault pod earlier in the session, leaving stuck `mount.nfs` processes in D-state. |
+| **10:18** | Manual unmount of stale NFS mount from the force-deleted old vault-0 pod on node4. New mount attempts from CSI still time out — clearing the old mount did not recover kernel NFS client state. |
+| **10:22** | Workaround discovered: mounting with `nfsvers=4.0` or `nfsvers=4.1` (instead of default `nfsvers=4` which negotiates to 4.2) succeeds on broken nodes. Confirms the stuck state is version-specific (NFSv4.2 session state), not a general NFS issue. Decision: rather than change CSI mount options cluster-wide (risk of remounting existing 48+ PVs), fix the nodes directly. |
+| **10:31** | Investigated node2 kubelet state: old `vault-1` container shows `vault` process in **Z (zombie)** state with its `sh` wrapper stuck in `do_wait` in kernel (`zap_pid_ns_processes`). Containerd-shim PID killed manually — `sh` and zombie reparented to init but remained stuck (uninterruptible kernel wait tied to NFS). |
+| **10:34** | Attempted `systemctl restart kubelet` on node2 — kubelet itself went into Z (zombie) with 2 tasks still attached. Classic NFS-related kernel deadlock. |
+| **10:42** | **Decision: hard-reset node2 VM** (`qm reset 202`). Disruption: 22 pods evicted. |
+| **10:43** | node2 back up (Ready). CSI registered. New `vault-1` scheduled to node2. NFS mount succeeded (fresh kernel state). Kubelet began chowning volume — **extremely slow, ~3 files per minute over NFS**. |
+| **10:48** | `vault-1` (2/2 Running) unsealed. **Raft leader elected: `vault-2` wins term 32, election tally=2** (vault-1 voted yes once it came up, vault-0 unreachable). However vault-2's vault-layer (HA active/standby) never transitioned to active — raft leader with `active_time: 0001-01-01T00:00:00Z` and `/sys/ha-status` returning 500. |
+| **10:50** | Restarted `vault-2` pod to force clean leader transition. New `vault-2` stuck in chown loop on node3 (same pattern as node2 earlier). |
+| **10:54** | Patched the Vault `StatefulSet` with `fsGroupChangePolicy: OnRootMismatch` so subsequent recreations skip the recursive chown. |
+| **10:57** | Force-deleted `vault-2` and `06fa940b` pod directory on node3. New pod spawned but kubelet again stuck on phantom state from the old pod. |
+| **11:01** | **Hard-reset node3 VM** (`qm reset 203`). |
+| **11:03** | First 200 response: vault-1 elected leader, vault-2 standby. Premature celebration — vault-1's audit log on node2 NFS starts timing out; `/sys/ha-status` returns 500 even though raft thinks vault-1 is active. |
+| **~11:18** | Service regresses. `vault-1` audit writes hanging (`event not processed by enough 'sink' nodes, context deadline exceeded`). Readiness probe fails; pod goes 1/2; `vault-active` endpoint stays pointed at vault-1's IP but backend unresponsive → 503. |
+| **11:22** | Force-restart `vault-1` to trigger re-election with new pod. Delete + containerd-shim cleanup leaves yet another zombie on node2. Same pattern: force-delete → zombie. |
+| **11:29** | **Hard-reset node4 VM** (`qm reset 204`). Rationale: vault-0 was still blocked there; 74 pods on node4 contribute to NFS server load (load avg 16 on PVE). After reboot, vault-0 mounts its PVCs on fresh kernel state and comes up 2/2 Running 11:31. |
+| **11:31** | Increased PVE NFS threads from 16 to 64 (`echo 64 > /proc/fs/nfsd/threads`). Did not help immediate mount failures — the stuck state is per-client kernel, not server capacity. |
+| **11:38** | Discover DNS resolution issue: vault-2's Go resolver returns NXDOMAIN for short names `vault-0.vault-internal` even though glibc resolver works. CoreDNS restart issued earlier didn't fix. Restart vault-2 pod to force fresh resolver state. |
+| **11:42** | **Second hard-reset of node3 VM** (`qm reset 203`). Kubelet+CSI re-register; vault-2 scheduled, NFS mounts finally succeed on fresh kernel state. |
+| **11:43:28** | **vault-2 becomes active leader.** External endpoint returns 200 and stays there. vault-0 follower, catches up to index 2477632+. vault-1 still stuck on node2; left for later recovery. |
+
+## Root Cause Chain
+
+```
+[1] Vault-2 raft goroutine hang (root cause — upstream Vault bug or infra-induced)
+ └─> Cluster port 8201 accepts TCP but never responds to msgpack RPCs
+     └─> Standbys' appendEntries calls return `msgpack decode error [pos 0]: i/o timeout`
+         └─> Raft protocol: no re-election because leader is heartbeating at the TCP level
+            └─> External endpoint returns 503 because HA layer has no active leader
+
+[2] Recovery complication — abrupt pod termination
+ └─> `kubectl delete --force --grace-period=0` on vault-0/1/2
+     └─> containerd-shim fails to kill container cleanly (NFS I/O in D-state)
+         └─> vault process ends as zombie; sh wrapper stuck in do_wait
+             └─> Kubelet retries forever, cannot tear down old pod volumes
+                └─> NFS-CSI unmount requests succeed at the NFS layer but kubelet's
+                    volume state-machine never marks the volume as unmounted
+                    (stale 0000-mode mount directory blocks teardown completion)
+
+[3] Kernel NFS client corruption on node1/node4
+ └─> Force-terminated Vault pod left stuck `mount.nfs` processes in D-state
+     └─> Kernel NFS4.2 client session state corrupted (held open mount slot)
+         └─> All subsequent mount syscalls for nfsvers=4 block 110s+ waiting for
+             session slot that will never be freed
+            └─> Manual workaround: nfsvers=4.1 bypasses the corrupted session state
+
+[4] Kubelet starvation
+ └─> Combination of (2) and (3) means kubelet is stuck in a 2-minute volume-setup
+     context deadline loop — each iteration times out, new iteration restarts,
+     infinite loop
+     └─> Hard VM reset is the only exit
+         └─> After reset, kubelet starts clean, CSI re-registers, mounts succeed
+
+[5] Slow recursive chown amplifies impact
+ └─> Default fsGroupChangePolicy: Always (Vault Helm chart 0.29.1 default)
+     └─> Kubelet walks every file on NFS setting gid=1000
+        └─> Over a 1GB audit log and a 47MB raft.db on NFS with timeo=30,retrans=3,
+            each chown syscall takes seconds; kubelet 2-minute deadline runs out
+            before the walk finishes
+           └─> Loop never exits even when ownership is already correct
+```
+
+## Why This Failed
+
+1. **Raft transport does not detect stuck leaders.** If TCP is open and the process is alive enough to hold the port, standbys assume the leader is healthy. A stuck goroutine that never responds to RPCs appears to raft as "leader with high RTT" and does not trigger re-election. This is an upstream Vault bug (or at least a missing liveness check).
+
+2. **Abrupt pod termination + NFS = kernel-level zombie.** When a Vault pod holding an NFS mount is force-killed before it cleanly closes file handles, the kernel's NFS4.2 client session state enters a corrupted state. This blocks all new mounts from that node — not just to the same NFS path, but to ANY NFS path on the same server. The fix is a kernel reboot; there is no userspace recovery.
+
+3. **Vault data on NFS violates the documented rule.** `infra/.claude/CLAUDE.md` explicitly states: *"Critical services MUST NOT use NFS storage — circular dependency risk."* Vault currently uses `nfs-proxmox` for both `dataStorage` and `auditStorage`. If Vault had been on `proxmox-lvm-encrypted`, none of the NFS corruption cascade would have happened.
+
+4. **fsGroupChangePolicy: Always is the Helm default.** Every pod restart walks every file over NFS. On a 1GB audit log with degraded NFS RTT, this takes longer than kubelet's internal 2-minute deadline, causing infinite restart loops. `OnRootMismatch` makes chown a no-op when the root is already correct (which it always is after first setup).
+
+5. **No alert for this failure mode.** Prometheus alerts exist for `VaultSealed`, `VaultDown` (`up` metric), and backup staleness, but none for "raft leader has been running without advancing commit index" or "standby reports leader but leader's `/sys/ha-status` returns 500".
+
+## Remediation (Applied)
+
+- [x] Hard-reset node2 and node3 VMs to clear kernel NFS state and kubelet zombies.
+- [x] Manually patched live `StatefulSet vault/vault` with `fsGroupChangePolicy: OnRootMismatch` to stop the chown loop.
+- [x] Lazy-unmounted stale NFS mounts from force-deleted pod directories on node2 and node3.
+- [x] Removed stale kubelet pod directories (`/var/lib/kubelet/pods/<UID>`) that had 0000-mode mount subdirectories blocking teardown.
+- [x] Updated `stacks/vault/main.tf` with the `fsGroupChangePolicy` setting so the next `scripts/tg apply vault` makes it durable.
+
+## Remediation (Pending)
+
+- [ ] **Hard-reset node4** to recover vault-0 (same NFS kernel corruption pattern).
+- [ ] **Run `scripts/tg apply` on the vault stack** to persist the fsGroupChangePolicy change.
+- [ ] **Add Prometheus alert `VaultRaftLeaderStuck`** — fire when `vault_raft_last_index_gauge` (or derivation from `vault_runtime_total_gc_runs`) stops advancing for >2 minutes while `vault_core_active` is 1.
+- [ ] **Add Prometheus alert `VaultHAStatusUnavailable`** — fire when `vault_core_active{}` reports 0 across all pods but `up{job="vault"}` reports 1 (HA layer broken but pods alive).
+- [ ] **Migrate Vault to `proxmox-lvm-encrypted` block storage** — eliminates the entire NFS failure class. This follows the rule already documented in `infra/.claude/CLAUDE.md`. Tracked as beads task (open after Dolt is back up; currently down on node4).
+- [ ] **Consider raising kubelet volume-manager deadline** for large-volume chown scenarios, or document the `fsGroupChangePolicy: OnRootMismatch` requirement for all NFS-backed StatefulSets.
+- [ ] **Runbook**: `docs/runbooks/vault-raft-leader-deadlock.md` — how to detect stuck leader, safe force-restart procedure that avoids zombie pods, NFS kernel state recovery.
+
+## Contributing Factors
+
+1. **NFS mount options use bare `nfsvers=4`**. This negotiates to the highest version the server supports (NFSv4.2). When 4.2 session state corrupts, mounts fail; 4.1 works. Pinning to `nfsvers=4.1` in the `nfs-proxmox` StorageClass would make the failure mode recoverable without node reboot, but would also require recreating 48+ existing PVs (volumeAttributes are immutable). Deferred.
+
+2. **`kubectl delete --force` is the default for stuck pods**. Operators reach for force-delete when a pod won't terminate, but this leaves containerd in an inconsistent state when the underlying storage is hung. Better approach: identify the stuck process (typically `mount.nfs` or a kernel NFS callback) and fix the root cause before force-deleting.
+
+3. **Beads / Dolt server was on node4**, so beads task tracking went offline during this incident and couldn't be used to log progress cross-session.
+
+4. **node1 was cordoned mid-incident** to prevent rescheduling to a node with confirmed NFS issues, but this reduced the scheduling surface for anti-affinity-sensitive StatefulSets.
+
+## Learnings
+
+1. **NFS for stateful critical services is structurally unsafe.** When NFS breaks, the recovery involves killing pods → which can break NFS further → until a reboot. The rule exists for a reason; Vault should never have been on NFS.
+
+2. **Raft liveness needs application-layer probing, not TCP.** Every time we've seen a "stuck leader" issue in the homelab, TCP was fine and the app was unresponsive. A lightweight RPC probe with a short timeout and Prometheus alert would catch this in minutes instead of hours.
+
+3. **kubelet volume-manager is fragile against stuck NFS.** Once kubelet enters a chown loop with a context deadline shorter than the chown duration, it cannot make progress — even when the filesystem is otherwise healthy. `OnRootMismatch` is effectively mandatory for any pod with `fsGroup` and a volume >100MB.
+
+4. **VM hard-reset is cheap but disruptive.** The two reboots took ~60 seconds each but evicted 22+44 = 66 pods. Doing this twice in one session is a lot of churn. A post-mortem-driven improvement: pre-prepare "hot-standby" capacity so we can cordon+drain instead of hard-reset when kubelet zombies appear.
+
+5. **Documentation of this rule is worth more than the rule itself.** The CLAUDE.md already says "critical services must not use NFS". The vault stack violates it. The rule without enforcement (validation, linting, CI) is ignored during the rush to ship.
+
+## References
+
+- Related: `docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md` — previous Vault+NFS incident (different root cause, similar blast pattern).
+- Vault helm chart 0.29.1 default `fsGroupChangePolicy` is unset (behaves as `Always`).
+- Upstream Vault HA layer: raft leader → vault-active transition is in `vault/external_tests/raft`. Stuck goroutine pattern not documented as a known issue.
--- a/docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md
+++ b/docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md
@ -0,0 +1,56 @@
+# Post-Mortem: IO Pressure Stalls from Stale NFS Client to Decommissioned TrueNAS
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-09 (issue first observable in journal at 2026-05-08 00:00:04) |
+| **Duration** | Intermittent IO PSI stalls and kubectl TLS handshake timeouts during the session; PVE host loadavg ~15 sustained. No user-visible outage. |
+| **Severity** | SEV3 (degraded host I/O, no service down) |
+| **Affected Components** | PVE host (192.168.1.127), `node_exporter` (PID 1479, D-state), kernel NFS kthread `[10.0.10.15-manager]`, k8s-node3 (downstream IO PSI). |
+| **Status** | Resolved structurally. Stale connection source removed; recurring trigger eliminated. Wedged kthread persists in kernel queue — clears on next PVE reboot. |
+
+## Summary
+
+The PVE host's NFS client was retaining a wedged connection to `10.0.10.15` — the IP of the TrueNAS VM that was operationally decommissioned 2026-04-13 (storage migrated to `192.168.1.127:/srv/nfs`). The connection was created by `/usr/local/bin/weekly-backup`, a legacy script left over from before the NFS migration that had never been removed. Its kernel kthread `[10.0.10.15-manager]` parked itself in `rpc_wait_bit_killable` and stayed there. Any process that touched `/proc/mountstats` — including `node_exporter` — got dragged into D-state alongside it, which in turn fed back into IO pressure metrics. cluster-health surfaced this as `k8s-node3 full avg10=23%` and PVE loadavg sustained at ~15.
+
+## Impact
+
+- **User-facing**: None directly. Intermittent kubectl TLS handshake timeouts during the session, attributable to the elevated PVE loadavg.
+- **Blast radius**: Single PVE host. node_exporter (PID 1479) wedged in D-state with the kthread. k8s-node3 downstream IO PSI peaked at `full avg10=23%`.
+- **Data loss**: None.
+- **Observability gap**: No alert fired for "stale NFS connection to decommissioned host". The IO PSI watchdog caught the symptom, not the cause.
+
+## Root Cause
+
+`/usr/local/bin/weekly-backup` was an artifact of the pre-2026-04-13 backup pipeline (when TrueNAS at `10.0.10.15` was the NFS server). After the TrueNAS decommission and migration to host NFS at `192.168.1.127`, the script was never deleted. It executed at least once recently (manually, or via a cron entry that has since been pruned), opening an NFS RPC session to `10.0.10.15`. With no peer answering, the kernel's RPC retry timer parked the manager kthread in `rpc_wait_bit_killable`. The kthread holds a lock that any reader of `/proc/mountstats` must take — `node_exporter` reads that file every scrape interval, so its scrape goroutine wedged in D-state too.
+
+## Resolution
+
+1. `lvextend -L +1T /dev/pve/nfs-data` + `resize2fs` — `/srv/nfs` 2 TiB → 3 TiB (90% → 60% used). Unrelated to the IO issue but bundled because `/srv/nfs` was at 90% and the user picked "grow LV" over "diet Immich". Thinpool (sdc) had ~4.6 TiB free.
+2. `rm /usr/local/bin/weekly-backup` — eliminates the trigger. Backup pipeline is now `daily-backup.service` + `offsite-sync-backup.service` + per-app CronJobs (mysql/postgres/vault/etc.); `weekly-backup` was fully redundant.
+3. `systemctl restart node_exporter` — replaces the wedged process. New PID 183319 healthy, `:9100/metrics` responsive.
+4. `mysql-standalone` memory bump 2 Gi → 4 Gi limit, 1.5 Gi → 3 Gi request (commit forthcoming). Coincident May 8 18:05 OOM, not caused by this incident — `innodb_buffer_pool_size=1Gi` plus connection buffers and InnoDB internals didn't fit in 2 Gi.
+
+## Open / Out-of-Scope
+
+- **Wedged kthread `[10.0.10.15-manager]` (PID 3796184)** persists in the kernel queue. The kernel will eventually reap it once the RPC retry timer gives up, or it clears at next PVE reboot. With the script gone, no new ops queue against it. **Plan**: if PVE host PSI does not fully clear within 24 h, fold a PVE reboot into the next maintenance window. Not done in this change.
+- **Transient OOMs unrelated to this incident**:
+  - `mysql-standalone-0` May 8 18:05 (anon-rss 2 GB at 2 Gi limit) — addressed by the limit bump above.
+  - postgres helpers May 9 12:37 — anon-rss <8 MB, pods no longer exist, no recurrence. No action.
+  - python pod May 9 13:36 (anon-rss 518 MB on k8s-node2) — pod no longer exists, no recurrence. No action.
+- **Pre-existing TF drift**: `null_resource.pg_job_hunter_db` in `stacks/dbaas/modules/dbaas/main.tf` execs against `pg-cluster-1`, but the current CNPG primary is `pg-cluster-2`. Unrelated to this incident; surfaced during the targeted MySQL apply. Fix is a separate ticket — should resolve the primary dynamically (e.g., via the `cnpg.io/instanceRole=primary` selector) instead of hardcoding pod ordinal.
+
+## Action Items
+
+- [x] Delete `/usr/local/bin/weekly-backup` on PVE host.
+- [x] Restart `node_exporter.service` on PVE host.
+- [x] Grow `pve/nfs-data` LV to 3 TiB; online `resize2fs`.
+- [x] Bump `mysql-standalone` memory request/limit to 3 Gi / 4 Gi.
+- [x] Update `docs/architecture/storage.md` to record the new LV size.
+- [ ] Reboot PVE host at next maintenance window if `[10.0.10.15-manager]` kthread does not clear within 24 h.
+- [ ] (Separate ticket) Fix `null_resource.pg_*_db` resources to target the actual CNPG primary instead of hardcoding `pg-cluster-1`.
+
+## Related
+
+- TrueNAS decommission: memory `id=674` (2026-04-13).
+- Prior LV grow on `pve/nfs-data` (2 TiB out-of-band): memory `id=691` (2026-04-12).
+- Architecture: `docs/architecture/storage.md`, `docs/architecture/backup-dr.md`.
--- a/docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md
+++ b/docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md
@ -0,0 +1,164 @@
+# Post-Mortem: kured Reboots Silently Stalled for 6 Days + Anubis HA Lift
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-16 |
+| **Duration** | 6 days of unbooted pending-reboot packages (2026-05-10 → 2026-05-16) |
+| **Severity** | SEV3 — no user-facing impact; latent risk (kernel/libc CVEs queued, not landing) |
+| **Affected Services** | None directly; OS-reboot pipeline halted on all 5 K8s nodes |
+| **Status** | Root cause fixed (kured Helm value), defensive defaults added (Anubis HA, kured drain-timeout, CNPG 3 instances) |
+
+## Summary
+
+After unattended-upgrades was re-enabled on the K8s nodes on 2026-05-10,
+kured was supposed to drive rolling node reboots within the Mon–Fri
+02:00–06:00 London window. Instead, kured logged "Reboot not required"
+every hour for six straight days while the `kured-sentinel-gate`
+DaemonSet on every host happily reported "ALL CHECKS PASSED — creating
+/var/run/gated-reboot-required". The gate WAS open. kured was looking
+in the wrong place.
+
+The kured Helm chart derives the sentinel hostPath from
+`dirname(configuration.rebootSentinel)`. The stack set
+`rebootSentinel = "/sentinel/gated-reboot-required"` — which pointed
+the chart at hostPath `/sentinel/` (an empty auto-created directory).
+The sentinel-gate writes to `/var/run/gated-reboot-required` on the
+host. Two different host directories. kured silently skipped reboots
+for six days.
+
+Found on 2026-05-16 while auditing why "automatic upgrades aren't
+happening" alongside the K8s version-upgrade Job-chain (PM
+2026-05-11). Fixed in one commit; took the opportunity to also
+eliminate three latent drain-time hazards (Anubis single-replica PDB
+deadlock, kured unbounded drain timeout, CNPG-only-2-instances).
+
+## Impact
+
+- **User-facing**: None. Existing kernels, libc, and userspace kept running. CVEs queued in `/var/run/reboot-required.pkgs` on every node but were never exploited.
+- **Backlog**: All 5 nodes accumulated `linux-image-*` + `libc6` queued for reboot. Largest gap was master at ~6 days. Workers also 5–6 days.
+- **Detection gap**: kured exposes no Prometheus signal for "I checked but said no". The hourly "Reboot not required" line in stdout is the only trace, and nobody was tailing it. The architecture had two layers (sentinel-gate gate + kured sentinel check) but no verification that the two layers were looking at the same path.
+- **Side discovery**: 8 Anubis instances would have stalled drain anyway via single-replica + `PDB minAvailable=1` (the same trap that stalled the manual K8s upgrade on 2026-05-11). Even if the kured path bug were fixed in isolation, Monday's first reboot would have hit the Anubis trap and idled forever (kured default `--drain-timeout=0` = unlimited).
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **Mar 16 21:26** | kured-sentinel-gate DaemonSet introduced after the 26h overlayfs cascade outage. Original sentinel cool-down 30m. |
+| **May 10 ~16:57** | Last successful kured pod restart picked up new Helm values. `rebootSentinel = "/sentinel/gated-reboot-required"`. Same commit re-enabled unattended-upgrades in cloud_init and stretched the sentinel cool-down 30m → 24h. |
+| **May 10 ~17:00 → May 15 06:16** | unattended-upgrades on every node successfully installs kernel + libc patches, writes `/var/run/reboot-required`. |
+| **May 10–15** | sentinel-gate Check 1–4 all pass every 5 min on every host. Touches `/var/run/gated-reboot-required`. Logs "ALL CHECKS PASSED". |
+| **May 10–15** | kured polls `/sentinel/gated-reboot-required` (empty dir, file does not exist). Returns "Reboot not required" every hour. No reboots happen. |
+| **May 11 20:40–21:00** | Separate K8s-version-upgrade incident (master upgraded to v1.34.7, workers stalled mid-rollout because the upgrade agent drained its own host). Manual recovery 5/11–5/12. **kured stall noticed but not investigated**: cluster healthy, K8sVersionSkew firing was tracked as the urgent issue. |
+| **May 11 22:47 → May 12 00:01** | Manual worker drains hit the Anubis single-replica PDB trap (drain loops). Resolved by direct-deleting Anubis pods to bypass eviction API. This was the first signal that single-replica `minAvailable=1` patterns deadlock drains. |
+| **May 16 10:56 UTC** | While auditing "what runs the upgrades" for the user, the kured + sentinel-gate log/path mismatch became visible. |
+| **May 16 11:13 UTC** | `stacks/kured/main.tf`: `rebootSentinel = "/sentinel/..."` → `"/var/run/gated-reboot-required"`. Re-init, plan, apply. |
+| **May 16 11:14 UTC** | kured DaemonSet rolls out the new spec. Volume hostPath becomes `/var/run`. kured pod can now see `/sentinel/reboot-required` (32B, from uu) AND `/sentinel/gated-reboot-required` (0B, from gate). Confirmed via `kubectl exec` listing. |
+| **May 16 11:44 UTC** | Anubis HA module change deployed: `shared_store_url` variable → `store: { backend: valkey }` block appended to policy YAML, default replicas 2, PDB `maxUnavailable=1`, topology `DoNotSchedule`. Cyberchef applied as canary. Confirmed: Redis DB 5 starts receiving challenge state. |
+| **May 16 11:48–11:53 UTC** | Remaining 7 Anubis stacks applied (DBs 6–12). 8/8 deployments at 2/2 Ready, replicas spread on different nodes. Smoke-tested 6 of 8 public URLs return 200. |
+| **May 16 12:05 UTC** | kured `drainTimeout: "30m"` added + applied. pg-cluster bumped from 2 → 3 instances. |
+| **May 16 12:11 UTC** | pg-cluster phase = "Cluster in healthy state", 3/3 ready. |
+
+## Root Cause
+
+The Helm chart `kured-5.11.0` computes:
+```
+{{- $sentinel_dir := dir .Values.configuration.rebootSentinel -}}
+# template renders both volume mount and hostPath using $sentinel_dir
+```
+
+So `rebootSentinel` is doubly-purposed: it's both the **CLI arg path inside
+the pod** AND the **hostPath on the node**. Setting it to `/sentinel/...`
+caused:
+- pod arg: `--reboot-sentinel=/sentinel/gated-reboot-required` (looks at `/sentinel/` inside the pod)
+- hostPath: `/sentinel/` (auto-created empty directory by `type: Directory`)
+- mountPath inside pod: `/sentinel/` (mapped from hostPath above)
+
+Meanwhile the gate DaemonSet was configured with hostPath `/var/run` →
+mountPath `/host/var-run`, and wrote `gated-reboot-required` to its local
+`/host/var-run/` which became the host's `/var/run/gated-reboot-required`.
+
+The two daemons never touched the same directory.
+
+**Why this was hard to spot**:
+
+1. Both layers logged success: sentinel-gate said "ALL CHECKS PASSED", kured said "Reboot not required". Neither claimed an error.
+2. No Prometheus alert exists for "kured polled, gate is open, kured still didn't act". The Upgrade Gates alert group catches firing-alert-during-rollout, not silently-skipped-rollout.
+3. The Helm chart's auto-derivation of hostPath from a config value is undocumented surprising behavior. The mental model is "rebootSentinel is just the in-pod path"; the hostPath co-mutation is invisible.
+
+## Remediation
+
+### Primary fix
+- `stacks/kured/main.tf`: `rebootSentinel = "/var/run/gated-reboot-required"`. Both the chart-derived hostPath and the kured CLI arg now align with where the gate writes.
+
+### Defensive companion changes (same session)
+
+| Change | Purpose | Stack |
+|---|---|---|
+| `drainTimeout = "30m"` on kured | Fail closed instead of looping forever if a future PDB or finalizer stalls drain. Node stays Schedulable (no silent capacity loss). | `stacks/kured/main.tf` |
+| Anubis: shared-state Valkey/Redis backend | Eliminate the single-replica drain deadlock + provide real HA. PDB changed `minAvailable=1` → `maxUnavailable=1`. Replicas 1 → 2 with `topologySpreadConstraint: DoNotSchedule`. | `modules/kubernetes/anubis_instance/main.tf` + 8 callers |
+| pg-cluster: 2 → 3 instances | Failover during primary's node drain no longer depends on the lone replica being caught up. CNPG always has a fully-current candidate. | `stacks/dbaas/modules/dbaas/main.tf` |
+| Orphan `mysql-standalone` PDB deleted | Helm-stamped leftover (selector required 4 labels, pod has 3 → matched 0 pods). Was dead code; deletion is safe. | `kubectl` (not TF-managed) |
+
+### Verified post-fix
+
+- `kubectl -n kured exec deploy/kured -- ls /sentinel/` lists both `reboot-required` and `gated-reboot-required` on every node.
+- 8 Anubis Deployments at 2/2 Ready; pods spread across different nodes (verified via `kubectl get pods -o wide`).
+- Redis DBs 5, 7, 8, 10 receiving challenge state from real public traffic post-apply (Palo Alto Networks scanner hit blog).
+- pg-cluster 3/3 healthy, phase = "Cluster in healthy state".
+- kured args show `--drain-timeout=30m`.
+
+## Lessons
+
+1. **Auto-derivation in Helm charts is invisible drift surface.** The chart's
+   habit of deriving hostPath from a CLI-arg-shaped value is the kind of
+   "convenient default" that hides during normal review. Mitigation:
+   pin `hostFilePath` explicitly in `configuration` so the host path is
+   declared, not derived. (Did not do this in the fix because the
+   single-config approach is now correct; flagging as future improvement.)
+
+2. **"Silently skipped" needs a Prometheus signal.** The Upgrade Gates
+   alerts cover "rollout in progress + something went wrong". They don't
+   cover "we haven't rolled in 7 days when we should have". Suggested:
+   add `KuredRebootBacklog` — fires when `kured_reboot_required ==
+   1` (kured exposes this) for more than 24h continuously. The kured
+   chart already serves `/metrics`; just needs a rule. (Deferred.)
+
+3. **Single-replica `PDB: minAvailable=1` is a deadlock pattern.** It
+   reads as "protect this pod" but actually means "block all voluntary
+   disruption forever". Manifested in 9 places (8 Anubis + mysql-standalone
+   with broken selector). The Anubis fix is now in place via shared-store
+   replicas=2; the `mysql-standalone` selector was already broken so it
+   matched 0 pods (and was deleted as cruft). Worth auditing the cluster
+   periodically for any new pattern of the same shape.
+
+4. **k8s-node1 containerd source drift** (Ubuntu archive's `containerd`
+   vs Docker's `containerd.io`) is benign but should be documented.
+   Audited during this session: not a blocker for kured because both
+   variants are in the Package-Blacklist and both are apt-held. The
+   version skew with master (1.6.22 vs 1.7.24/1.7.27) is what the
+   K8s version-upgrade Stage 3 "containerd bump" exists to fix.
+
+5. **CNPG drain handling at 2 replicas is fragile.** Switchover works
+   but the lone replica must be caught up; in practice this means
+   on a busy cluster, a primary-node drain could stall for tens of
+   seconds while CNPG promotes. 3 instances eliminates this. Worth
+   considering for every long-running multi-instance stateful workload.
+
+## Detection / Prevention Followups
+
+- [ ] `KuredRebootBacklog` Prometheus alert. Spec: `kured_reboot_required == 1 and (time() - timestamp(kured_reboot_required)) > 86400`.
+- [ ] Add a `hostFilePath` value to the kured Helm release for explicit declaration (current setup is correct but undocumented).
+- [ ] Audit periodically for new single-replica + `minAvailable=1` PDB patterns (could be a Kyverno warn policy).
+- [ ] Phase 4: clean up the InnoDB Cluster CR + remaining `mysql-cluster-pdb` once the bitnami legacy is fully decommissioned.
+
+## File pointers
+
+| What | Where | Commit |
+|---|---|---|
+| kured sentinel path fix | `infra/stacks/kured/main.tf` | c17d87e1 |
+| Anubis HA (module + 8 callers) | `infra/modules/kubernetes/anubis_instance/` + 8 `stacks/<app>/main.tf` | 6e920f96 |
+| kured drainTimeout + CNPG 3-replica | `infra/stacks/kured/main.tf` + `infra/stacks/dbaas/modules/dbaas/main.tf` | a726e963 |
+| K8s version-upgrade Job-chain (related context) | `infra/stacks/k8s-version-upgrade/` | 01bc16d5 (5/11) |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` | (updated 5/11) |
+| Runbook | `infra/docs/runbooks/k8s-version-upgrade.md` | (updated 5/11) |
+| Deprecated agent prompt (self-preemption history) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | 01bc16d5 |
--- a/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
+++ b/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
@ -0,0 +1,212 @@
+# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
+
+**Date:** 2026-05-17
+**Author:** Viktor Barzin / Claude (incident response)
+**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
+**Beads:** `code-8vr0` (P1)
+**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
+
+## Summary
+
+`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
+with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
+LTS (Resolute Raccoon)** at some point, putting the running kernel at
+`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
+runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
+noble repositories (the container's base OS), which don't carry
+`linux-headers-7.0.0-15-generic`, so the build aborts with:
+
+    Could not resolve Linux kernel version
+
+Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
+and `kernelModuleType: open`) succeeded at the chart level but produced
+a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
+and constructs the image tag `<version>-ubuntu26.04`, which 404s on
+pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
+ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
+
+Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
+to-working state pending an upstream fix or kernel rollback.
+
+## Impact
+
+- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
+- All GPU-bound workloads Pending or 0/N Ready:
+  - `frigate/frigate`
+  - `immich/immich-machine-learning`
+  - `llama-cpp/llama-swap`
+  - `nvidia/nvidia-exporter`
+  - `ytdlp/yt-highlights`
+- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
+  (Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
+  firing (node is schedulable, just no GPU advertised)
+- No data loss; no user-facing service degradation outside the GPU stack
+
+## Timeline (Europe/Sofia, UTC+3)
+
+- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
+  k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
+  the upgrade (rotated by `do-release-upgrade`).
+- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
+  reports `system-os_release.VERSION_ID = 26.04`,
+  `kernel-version.full = 7.0.0-15-generic`.
+- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
+  on every kubelet restart cycle. Error: "Could not resolve Linux kernel
+  version".
+- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
+  570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
+  cleanly but driver pod ImagePullBackOff on
+  `driver:580.105.08-ubuntu26.04`.
+- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
+  nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
+  the gotcha, file the kernel rollback as the next step.
+
+## Root Causes
+
+1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
+   support window. NVIDIA typically lags new Ubuntu LTS releases by
+   weeks-to-months on the driver-container front.
+2. **gpu-operator chart was not pinned** prior to today. The TF
+   `helm_release` had `version` commented out, so any apply could
+   re-resolve to the latest chart and follow its OS-auto-detection
+   logic. With v25.10.1, the operator fell back to ubuntu24.04 image
+   suffix (which pulls successfully but fails to compile against kernel
+   7.0). With v26.3.1, the operator picks the correct (per-NFD)
+   ubuntu26.04 suffix — which doesn't exist.
+3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
+   had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
+   fires only when the metrics exporter itself stops scraping, not when
+   the operator's driver pod is unhealthy.
+
+## What We Changed in This Session
+
+- `stacks/nvidia/modules/nvidia/main.tf` — pinned
+  `helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
+  applies don't surprise us with v26.3.1's stricter OS detection.
+- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
+  the situation; driver version stays at `570.195.03` as the last-known
+  config that produced a pullable image.
+- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md` —
+  this file.
+
+## What We Did NOT Do (Pending User Decision)
+
+- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
+  to `6.8.0-117-generic`. The 6.8 kernel is still installed at
+  `/lib/modules/6.8.0-117-generic` and the matching headers at
+  `/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
+  the driver image's apt sources (Ubuntu 24.04 noble) carry
+  `linux-headers-6.8.0-117-generic`. This would require draining the
+  node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
+  and rebooting — needs explicit user OK.
+- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
+  GPU node. Should fire within 10 minutes of the operator failing to
+  publish the resource, regardless of which sub-pod failed.
+
+## Recovery Procedure (next time)
+
+### If the driver-installer fails with "Could not resolve Linux kernel version"
+
+1. Identify the running kernel: `uname -r` on the affected node.
+2. Check whether NVIDIA ships an image for that kernel/distro combo:
+
+       docker run --rm quay.io/skopeo/stable list-tags \
+           docker://nvcr.io/nvidia/driver \
+         | python3 -c "import json,sys; d=json.load(sys.stdin); \
+             print([t for t in d['Tags'] if '<distro>' in t][:5])"
+
+3. If yes, point the chart at the right version + ensure NFD reports
+   the matching OS.
+4. If no (and a kernel rollback is acceptable):
+   - `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
+   - `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
+   - `nsenter -t 1 -m -p -u update-grub`
+   - `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
+   - Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
+   - After boot: `kubectl uncordon <node>` and wait for the GPU
+     daemonset to come Ready
+
+## Action Items
+
+- [x] Pin gpu-operator chart to v25.10.1 in TF
+- [x] Document situation in this post-mortem
+- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
+      kernel rollback succeeded and NFD now reports
+      `kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
+- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
+      (50 min) in TF `values.yaml` — 2026-05-25. On this hardware the
+      full install sequence (apt headers + gcc compilation + file copy) takes
+      ~21min which exactly exhausted the old 120×10s window.
+- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
+      labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
+      of 0 for >10m
+- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
+      tags — file as a quarterly checkup once we see the first 26.04
+      tag, unpin the chart and revert this post-mortem's mitigation
+- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
+      memory of the March 2026 outage says we disabled
+      `unattended-upgrades` — `do-release-upgrade` is a separate path
+      that should be gated too
+
+## Follow-up Incident: Driver install hang (2026-05-25)
+
+**Date**: 2026-05-25  
+**Status**: Resolved  
+
+After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
+(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
+Installing Linux kernel headers..." with no progress for 15–20 min.
+
+**Actual root causes (two compounding issues)**:
+
+1. **Deadlock between k8s-driver-manager and operator-validator**: The
+   `k8s-driver-manager` init container waits for `nvidia-operator-validator`
+   to shut down before it can begin the install sequence. The validator's
+   `driver-validation` init container was in an infinite retry loop polling
+   `/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
+   ready). Since the driver never finished, the validator never exited. The
+   validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
+   it — the container received SIGTERM but remained in "Terminating" state
+   indefinitely, blocking the new driver from starting.
+   **Fix**: Force-deleted the stuck validator pod
+   (`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
+   This broke the deadlock immediately.
+
+2. **Startup probe timeout**: The full driver install sequence on this hardware
+   (6 vCPUs, 16Gi RAM) takes ~21 minutes:
+   - `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
+   - `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
+     nvidia-peermem): ~12 min  
+   - nvidia-installer file copy + archive integrity check: ~7 min
+   The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
+   This caused a SIGKILL (exit 137) at 21 minutes even when the install was
+   progressing normally.
+   **Fix**: Patched `driver.startupProbe.failureThreshold` from 120 → 300
+   in `stacks/nvidia/modules/nvidia/values.yaml` (gives 51 min headroom).
+
+**Key observation**: "Installing Linux kernel headers..." is NOT a hang — the
+apt install just takes 2+ min and produces no log output during execution. The
+log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
+the container to confirm apt/dpkg are actively running.
+
+## Lessons
+
+- **Operator-style charts that auto-detect host OS can silently break
+  when the host fleet leapfrogs upstream image support.** Pin the chart
+  version + driver version, and treat upstream support gaps as a hard
+  blocker rather than a guaranteed-to-resolve race condition.
+- **Drain-and-revert host kernel is the right escape hatch when
+  upstream image lags.** Make sure the previous kernel and its headers
+  stay installed (don't aggressively purge old kernels in apt
+  autoremove).
+- **NFD labels are authoritative for the operator's image-tag
+  construction.** If you need to lie about OS version (e.g., to force a
+  24.04 image on a 26.04 host), edit the NFD label — but only as a last
+  resort; the chart upgrade made clear the operator will eventually
+  reconcile this.
+- **A k8s-driver-manager deadlock on a stuck Terminating validator pod is
+  indistinguishable from an apt hang** — `ps auxf` inside the container is
+  the key diagnostic. Force-deleting a stuck Terminating pod with no
+  finalizers is safe and immediately resolves the deadlock.
+- **Driver startup probe must be sized for the full install wall-clock time**,
+  not just apt or just compilation. On slow hardware, 21 min is tight.
--- a/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
+++ b/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
@ -0,0 +1,133 @@
+# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
+
+**Date:** 2026-05-17
+**Author:** Viktor Barzin / Claude (incident response)
+**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
+**Duration:** ~2 hours from first detection to all-green
+
+## Summary
+
+The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
+`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
+replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
+used `hostNetwork: true` and tried to bind the same host ports
+(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
+controller pod CrashLoopBackOff'd with `bind: address already in use`. The
+upgrade also left behind multiple orphan controller pods in containerd that
+kubelet could no longer reconcile — they held the host ports even after the
+helm rollback removed them from K8s state.
+
+The `csi-nfs-node` DaemonSet pod on master then could not start either: its
+own `node-driver-registrar` and `liveness-probe` containers tried to bind
+the same host ports and lost to the zombies.
+
+## Impact
+
+- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
+- CSI plugin unregistered on master → no NFS volumes could be mounted on
+  master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
+  failed)
+- Controller flap (2 replicas fighting) → intermittent
+  `csi-resizer`/`csi-snapshotter` failure for the whole cluster
+- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
+  csi-node-driver (Calico) all bounced on master while kubelet thrashed
+
+No data loss; no production-facing outages observed (CSI mounts on the four
+worker nodes kept working).
+
+## Timeline (Europe/Sofia, UTC+3)
+
+- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
+  the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
+- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
+  scheduled (no `affinity` block → both replicas land on `k8s-master`)
+- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
+  CrashLoopBackOff
+- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
+  begins
+- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
+  on master deleted via K8s, but containerd retains them as live sandboxes
+- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
+  added to the controller Deployment via patch (controllers now correctly
+  schedule on node1+node3)
+- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
+  held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
+  plus 5× csi-provisioner from zombie controller pods)
+- 09:00 — Privileged pkill via `hostPID: true` pod failed
+  (`permission denied` from runc — containerd refused to signal init in the
+  zombie containers)
+- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
+  the orphan containers via cgroup GC; ports freed
+- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
+- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
+  `controller.affinity` to values
+
+## Root Causes
+
+1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
+   no `version = ...` field, so it floated to whatever the chart repo
+   advertised. Keel polled this and rolled forward.
+2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
+   shipped with. Without it, the K8s scheduler chose master for both
+   controller replicas.
+3. **Two controller replicas + hostNetwork = port conflict on the same node.**
+   The chart did not add `podAntiAffinity` between the replicas. Live state
+   has it now; TF now does too.
+4. **Helm rollback does not always clean containerd sandboxes.** When the
+   prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
+   containerd can keep multiple sandbox instances for the same pod-UID.
+   Kubelet GC is the only thing that reliably reaps these — restarting it
+   forces a reconciliation pass that drops orphans.
+
+## What We Fixed
+
+- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
+  - `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
+    is already excluded from Kyverno-Keel injection, but the chart could still
+    drift on a `terraform apply` without a pin)
+  - `controller.affinity` block with `podAntiAffinity` (different hosts for
+    replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
+  - Inline comments explaining both decisions
+- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
+  list (decision from authentik incident 2026-05-17). Verified still there
+  in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
+
+## Recovery Procedure (next time)
+
+If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
+
+1. **Find which host ports are bound** — `lsof -i :19809`, `lsof -i :29653`
+   (from a privileged hostPID pod on the affected node).
+2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
+   tracks). Will fail with `unable to signal init: permission denied` if
+   the containers are sufficiently stuck.
+3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
+   systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
+   reconciles containerd state and reaps the orphans.
+4. **Force-delete the DaemonSet pod** to clear the back-off
+   (`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
+   DaemonSet recreates it; with the ports free, containers start cleanly.
+
+## Action Items
+
+- [x] Pin `csi-driver-nfs` chart version in TF
+- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
+- [x] Document recovery procedure (this post-mortem)
+- [ ] Audit other unpinned `helm_release` blocks — every chart used in
+      Kyverno-excluded namespaces should still be pinned to prevent
+      `terraform apply` drift. (Filed as follow-up — not blocking.)
+- [ ] Consider adding a `kured` or daily script that detects orphan
+      containerd sandboxes whose pod-UID is unknown to the apiserver and
+      reaps them automatically. (Filed as follow-up — not blocking.)
+
+## Lessons
+
+- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
+  Keel injection, but the helm_release was unpinned — so a `terraform apply`
+  alone could re-trigger the same break. Both layers needed locking down.
+- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
+  signal init, kubelet restart is the next escalation step before SSH/reboot.
+- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
+  excluded.** CSI was correctly excluded — but the chart version itself was
+  still a moving target via plain `terraform apply`. The exclude-list catches
+  Keel; the version pin catches everything else.
--- a/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
+++ b/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
@ -0,0 +1,198 @@
+# Post-Mortem: Immich anca-elements Bulk Ingest IO Storm → k8s-node1 Reboot
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-25 |
+| **Duration** | ~9h from job start (2026-05-24 23:55Z) to node1 reboot (2026-05-25 09:33Z); service restoration ongoing (~1h elapsed at time of writing). |
+| **Severity** | SEV2 — k8s-node1 went down, ~33 deployments lost their only replica, DNS partially degraded (Technitium primary on node1), Loki down, GPU stack down, backup pipeline timed out for the day. No data loss. |
+| **Affected Services** | 33 deployments, 3 StatefulSets, 9 DaemonSets across ~30 namespaces. Most concentrated on k8s-node1 (the only GPU node and home of several pinned services). |
+| **Issue** | TBD (no GitHub issue filed yet) |
+| **Status** | Draft — recovery in progress |
+| **Recurrence count** | **3rd IO-pressure-induced incident in 17 days** at time of writing; **recurred 2026-05-26** (Alloy log-read storm, mem id=2726) and **2026-06-01** (Immich Duplicate Detection ML/thumbnail backfill — see [Update 2026-06-01](#update-2026-06-01--recurrence-immich-duplicate-detection)) |
+
+## Summary
+
+The `anca-elements-import` Kubernetes Job — a one-shot bulk import of ~34k photos (770 GB) from `/srv/nfs/anca-elements` into Immich — ran with `immich-go --concurrent-tasks 20` and no CPU/IO limits. The 20 parallel NFS readers combined with the Immich ML pipeline saturated sdc (the 10.7 TB HDD thin pool holding all VM disks) for hours. Sustained disk contention starved the k8s-node1 VM's IO until the VM rebooted at 09:33 UTC. ~60 pods on node1 went zombie; the proxmox CSI driver lost its registration; the NVIDIA driver DaemonSet entered CrashLoopBackOff; the daily-backup pipeline was killed by its 4h systemd timeout while waiting on post-reboot ext4 orphan-inode cleanup.
+
+This is the third time in 17 days that an IO event has taken meaningful slice of the cluster offline. We cannot keep treating each one as a one-off.
+
+## Impact
+
+- **User-facing**: DNS resolution degraded (1 of 3 Technitium replicas down on node1). 20 self-hosted apps (changedetection, freshrss, frigate, navidrome, wealthfolio, etc.) returned 502 or hung. GPU-dependent services (Frigate ML, Immich ML, nvidia-exporter) had no GPU available.
+- **Blast radius**: ~60 zombie pods on k8s-node1; 33 deployment replicas missing cluster-wide; 1 StatefulSet (Loki) unavailable. Multi-Attach errors on ~8 proxmox-lvm PVCs prevented reschedule onto healthy nodes for ~30 min.
+- **Duration**: Initial IO degradation ~01:30Z (job ran ~85 min then ended). VM stayed alive but degraded for ~8 hours after the job ended (likely due to filesystem journal recovery / page cache pressure tail). Hard reboot at 09:33Z. Service restoration began at 10:00Z.
+- **Data loss**: None. All PVCs intact; no failed writes detected.
+- **Monitoring gap**: We had **no alert** for "VM is about to crash from sustained IO pressure." `NodeHighIOWait` fired but didn't escalate, and PVE-host-level IO PSI metrics aren't scraped into Prometheus.
+
+## Pattern — this is the third time
+
+| Incident | Date | Root IO source | Outcome |
+|----------|------|----------------|---------|
+| 1 | 2026-05-09 | Stale NFS kthread to decommissioned TrueNAS (`/usr/local/bin/weekly-backup` artifact) wedged in `rpc_wait_bit_killable` | PVE loadavg ~15 sustained, IO PSI stall on node3, no user-visible outage |
+| 2 | 2026-05-16/17 | kured stuck + GPU driver Ubuntu 26.04 mismatch + NFS-CSI Keel upgrade race | Multi-issue cluster degradation; required manual recovery |
+| 3 | 2026-05-25 (**this incident**) | Immich `anca-elements-import` Job with 20 parallel uncapped readers | k8s-node1 VM reboot, ~33 deployments down, backup pipeline broken |
+| 4 | 2026-05-26 | Grafana Alloy DaemonSet read 12.18 TB of logs in ~24h (silently lost its `controller.resources` limit) | sdc 97% util, all VMs + NFS starved (mem id=2726) |
+| 5 | 2026-06-01 | Immich library-wide **Duplicate Detection** → ML/thumbnail backfill read originals at server-side `thumbnailGeneration` concurrency **8** | sdc ~100% util, 64 `nfsd` threads D-state, **etcd starved → kube-apiserver down ~30 min** |
+
+**Common pattern**: a single uncontrolled IO-heavy workload (or a stale connection) saturates the shared sdc thin pool, which hosts **all VM disks** for the entire cluster. There is currently no IO budget enforcement between workloads, no PVE-level IO QoS between VMs, and no alerting that fires *before* a node crashes.
+
+We have a single point of contention (sdc). Every storm finds it.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **2026-05-24 23:55** | `anca-elements-import` Job starts. immich-go v0.31.0, `--concurrent-tasks 20`, no resource limits. Anca's 770 GB photo archive begins streaming from NFS to immich-server. |
+| **2026-05-25 01:21** | Job marks `Complete` (85 min runtime). 34k photos uploaded. Immich-server ML pipeline (face detection / thumbnail generation) keeps the IO load going for hours after. |
+| **05:02** | `daily-backup.service` (systemd timer) starts. It runs LVM thin-snapshot → LUKS-decrypt → mount → rsync per PVC. The competing IO from the still-saturated thin pool stretches every per-PVC step. |
+| **08:24** | `daily-backup` `matrix-data-proxmox` rsync hits its per-PVC 30-min timeout — first warning. |
+| **08:31–09:02** | 30 LUKS-encrypted PVC mounts log `Failed to mount snapshot` because ext4 orphan-inode cleanup exceeds the 30s `timeout 30 mount` guard (one volume took 109s). |
+| **09:02:28** | systemd kills `daily-backup` with `TimeoutStartSec=14400` (4h). Script was nowhere near complete — alphabetically still on letter 'm'. Snapshots from today's run are left, but that's the *designed* 7-day retention pattern. |
+| **09:07** | `nfs-mirror.service` also times out. |
+| **09:33:24** | k8s-node1 VM **rebooted** (cause: best-guess IO starvation triggered Proxmox watchdog or qemu IO timeout; not directly observable). All pods on node1 enter `Unknown`. |
+| **09:33:36** | node1 kubelet posts Ready. Pods on node1 begin churn: calico-node, csi-node-driver, kured, alloy, loki-canary, nvidia stack all restart. proxmox-csi-plugin-node fails to re-register the `csi.proxmox.sinextra.dev` driver in CSINode. |
+| **09:40** | User session impacted: `~/.zsh_history` left with 154-byte NUL padding from interrupted write. |
+| **09:41** | Incident detected by user; `/cluster-health` invoked. Healthcheck reports 33 PASS / 3 WARN / 64 FAIL. |
+| **09:50** | Force-deleted 47 Failed pods + 22 stuck Terminating + 6 zombie DS pods on node1. |
+| **09:55** | 4 recovery sub-agents dispatched in parallel: csi-recovery, gpu-recovery, dns-monitoring-recovery, backup-recovery. |
+| **~10:15** | proxmox CSI re-registered on node1 (csi-recovery). Multi-Attach errors clearing. Loki StatefulSet recovers to 1/1. Calico fully back to 5/5. |
+| **~10:30** | daily-backup re-started manually (currently still running, ETA ~2h). |
+| **(ongoing)** | nvidia stack recovery ongoing; 20 deployments still recovering. |
+
+## Root Cause
+
+### Direct (the trigger)
+`anca-elements-import` Job in `stacks/immich/main.tf` runs `immich-go upload` with `--concurrent-tasks 20`, no CPU limit, and no IO throttling. Twenty parallel NFS readers against `/srv/nfs/anca-elements` (mounted from PVE host) plus immich-server's ML pipeline (CUDA-accelerated face detection + thumbnail generation) saturated the read queue on sdc. The job itself only ran for 85 min, but the after-effects (ML processing, filesystem cache eviction, dirty-page writeback) persisted for hours.
+
+### Recovery-side cascade (why the cluster stayed broken after node1 booted)
+Once node1 rebooted, the kubelet posted Ready within 12s — but `csi.proxmox.sinextra.dev` failed to re-register, blocking ~30 pods. The actual cascade (discovered by the csi-recovery agent during today's investigation):
+
+1. **Calico CNI on node1 entered a crash loop.** The `calico-node` pod's BIRD BGP daemon takes a few seconds to create `/var/run/bird/bird.ctl` on startup. The container's liveness probe was killing the process before the socket appeared, restarting it before it could stabilize.
+2. **Without functional Calico, no new pod on node1 could reach the kube-apiserver service IP** (`10.96.0.1:443`).
+3. **The proxmox-csi-plugin-node pod therefore crash-looped** with "no route to host" trying to talk to the apiserver, and **never created its CSI socket** for kubelet to discover.
+4. **node-driver-registrar (sidecar) therefore never registered the driver with kubelet**, so CSINode for k8s-node1 lacked `csi.proxmox.sinextra.dev`.
+5. **Every PVC mount on node1 failed** with the `driver not found` error we observed; meanwhile, **VolumeAttachments for those PVCs still pointed at node1** from before the reboot, so reschedule onto healthy nodes hit `Multi-Attach error`.
+
+Fix order matters: **Calico first, then CSI, then stale VolumeAttachments**. Doing them out of order leaves the cascade broken. This is now a P3 runbook (below).
+
+### Second cascade (3.5h into recovery) — Proxmox CSI 30-LUN-per-VM hard cap
+
+During the recovery, a second cascade was discovered that compounded the outage:
+
+1. **k8s-driver-manager init container cordons + drains node1** as part of GPU driver re-install. This evicted GPU-tagged pods (and incidentally triggered descheduler rebalancing) onto other nodes.
+2. **Simultaneously the dns-monitoring-recovery agent killed an orphaned containerd holding a boltdb lock on k8s-node4**, evicting all node4 pods.
+3. **The combined eviction wave scheduled ~60 PVC-using pods through the scheduler in a short window.** Many landed on node1 (largest node, least cordoned-ish), pushing its SCSI LUN slot count from ~17 (pre-Immich-import baseline) to **29 of 30 in use**.
+4. **proxmox-csi-plugin's hard-coded `MaxVolumesPerNode = 30`** (per the upstream `csi-driver-proxmox` source: it scans scsi0…scsi30 and errors `no free lun found` when none are free) blocked further attaches.
+5. **vault-0, mysql-standalone-0, claude-memory, grafana, nextcloud** etc. all could not start because their PVCs couldn't attach to node1 (their scheduled target). Multi-Attach errors compounded when their previous-node attachments hadn't been released cleanly.
+6. **Daily-backup was running concurrently** — adding ~120 MB/s read load on sdc, slowing every CSI attach/detach operation by 3-5×, prolonging the queue.
+
+**Resolution (manual, 2026-05-25):** `systemctl stop daily-backup`, `kubectl cordon k8s-node1`, force-delete stuck Pending pods. They rescheduled to nodes with LUN headroom (node2/3/4 had ~12-15 free slots each).
+
+### Structural (why it took down a node)
+1. **Single shared IO domain**: sdc is one LVM thin pool serving all 9 VMs. No Proxmox-level `bwlimit` or `iothrottle` between VMs. Any VM can starve the others.
+2. **No IO budget at workload level**: the K8s job had `resources: {}`. There is no cluster-wide cgroup-IO budget enforced.
+3. **NFS reads bypass per-VM accounting**: anca-elements is read via the PVE host's NFS export. The reads happen *on the PVE host*, charged to the host's IO scheduler, not to the k8s-node1 VM. So even if we capped node1's VM IO, the storm would still happen.
+4. **node1 is also the only GPU node** — Immich-ML pods are pinned there. The reader (immich-server) and consumer (immich-ml) are both fighting for the same node's resources during ingestion.
+5. **ext4 orphan-inode cleanup is unaware of `noload`**: `daily-backup.sh` uses `mount -o ro,noload` to skip journal replay, but `noload` doesn't skip orphan-inode cleanup. When a node reboots with dirty filesystem state on the source PVC, snapshot mounts can take 100+s — exceeding the script's 30s timeout. Confirmed by `dmesg`: 20+ volumes logged `INFO: recovery required on readonly filesystem` during today's backup window.
+6. **Calico BIRD liveness probe is racy on cold start.** The probe doesn't tolerate the 3-5s BIRD initialization window, so any cold-start of Calico tends to crash-loop briefly. Usually it self-recovers on the 2nd or 3rd restart — today it didn't, because the apiserver was unreachable from the brand-new pod (chicken-and-egg).
+
+## Contributing Factors
+
+1. **The job's IO profile was never measured before running**. `immich-go --concurrent-tasks 20` is the upstream default; nobody validated it against our hardware.
+2. **No staging window**. anca-elements-import is the second of two intentional one-shot ingestion runs (1st was Viktor's library months ago). The first run also caused load — but didn't crash a node, so it was treated as "loud but fine."
+3. **Daily-backup overlap**. The 05:00 backup timer fired while the IO tail of the Immich job was still in flight. The two competing workloads triggered the LUKS mount timeouts.
+4. **No PVE-level IO QoS** between VMs (Proxmox supports `iops_rd/wr` throttle groups on disk specs; we've never set them).
+5. **No alert for "node1 about to crash"**. `NodeHighIOWait` fires at a fixed threshold but doesn't trigger any automated mitigation or paging.
+
+## Detection Gaps
+
+| Gap | Impact | Fix |
+|-----|--------|-----|
+| No PVE host IO PSI scraped into Prometheus | We can see node1 IO PSI but not the PVE-host-level pressure that's the actual leading indicator | Add node_exporter PSI scrape on PVE (already running) to Prometheus targets, expose `pressure_io_*` |
+| No alert on sustained sdc utilization > 80% | The IO storm built up for hours without any signal escalating | Add `PVEThinPoolIOSaturated` rule: `irate(node_disk_io_time_seconds_total{device="sdc",instance="pve"}[5m]) > 0.85 for 30m` |
+| No alert on Proxmox-host loadavg > 20 | Sustained loadavg 13–15 was visible only through cluster healthcheck #44 | Add `PVEHostLoadHigh` rule (1m loadavg > 25 for 10m) |
+| No alert on K8s Job IO throughput | An uncapped K8s job can do unlimited IO without alerting | Add `JobHighIOThroughput`: alert if container_fs_reads_bytes_total rate over 5m > 100 MB/s for >10m |
+| Backup timeout fires silently | systemd kill of daily-backup at 4h didn't alert anyone — we'd have noticed after 48h via the `backup_per_db=FAIL mysql=33h pg=33h` healthcheck | Add Alertmanager rule on daily-backup unit failure (probe systemd unit state via node-exporter textfile collector) |
+| LUKS mount step inflation post-reboot is silent | 30 mount failures logged as WARN, no aggregate alert | Add count-of-WARN alert from the daily-backup log |
+
+## Prevention Plan
+
+### P0 — Prevent this exact failure
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P0 | Cap concurrency on future `*-elements-import` jobs | Config | In `stacks/immich/main.tf` (`kubernetes_job_v1.anca_elements_import` and future siblings): set `--concurrent-tasks 4` (down from 20). Also set `resources.limits.cpu = "2000m"` and `activeDeadlineSeconds = 21600` (6h cap). Add a `nodeSelector` to keep the job *off* node1 (move read-side onto a non-GPU node so the GPU node only does ML). | TODO |
+| P0 | Bump LUKS mount timeout in `daily-backup` | Config | `infra/scripts/daily-backup.sh` line 243: change `timeout 30 mount …` → `timeout 180 mount …` (covers observed 109s worst case). Add a comment explaining the ext4 orphan-cleanup exception. | TODO |
+| P0 | Schedule big-data ingests outside backup window | Config | Forbid Job/CronJob scheduling between 04:30–08:30 EEST (when daily-backup runs). Either via a Kyverno policy on `*-import` named jobs, or a documented convention enforced at PR review. | TODO |
+| P0 | **Raise Proxmox CSI LUN limit on each k8s-node VM** | Architecture | The default `virtio-scsi-pci` controller exposes 30 LUN slots; proxmox-csi hard-caps at this. **Resolution path**: add a 2nd `virtio-scsi-pci` controller (`scsihw1`) to each k8s-node VM via Proxmox, OR migrate VMs to `virtio-scsi-single` which allows 256+ LUNs per disk. Either requires a brief per-node reboot. Without this, every future cluster-churn event can re-hit "no free lun found" on whichever node ends up overloaded. **Permanent fix — must land before next ingest run.** | TODO |
+| P0 | Document the `MaxVolumesPerNode=30` limit in storage architecture | Runbook | Add to `docs/architecture/storage.md` — currently the 30-LUN cap is invisible to operators until they hit it. Include `kubectl get pods -A --field-selector spec.nodeName=NODE` and the 30-cap as a sizing check before any cluster-wide rebalance / drain operation. | TODO |
+| P0 | Add startupProbe to mysql-standalone | Config | `stacks/dbaas/modules/dbaas/main.tf` (the `kubernetes_stateful_set_v1` for `mysql-standalone`): add `startupProbe` with `failureThreshold=120, periodSeconds=15, timeoutSeconds=10` (≈30 min budget for InnoDB recovery). Also bump liveness `initialDelaySeconds=120, failureThreshold=10`. Today MySQL spun in a CrashLoopBackOff for ~30 min — each restart's InnoDB recovery aborted when the existing 30s liveness probe fired, never finishing. Resolved manually via `kubectl patch sts mysql-standalone` — must Terraform-codify. | Done (kubectl, needs TF) |
+| P0 | Add startupProbe to goauthentik-server | Config | Similar issue: Authentik Django migrations + clip/face index rebuilds take 5-10 min after PG restart, but the startup probe budget is too short → restart loop. Add `startupProbe: failureThreshold=180, periodSeconds=10` (30 min) on `goauthentik-server`. Source: `stacks/authentik/modules/authentik/main.tf` (or equivalent Helm values). | TODO |
+| P0 | Disable daily-backup.timer when manually stopping daily-backup.service | Runbook | During this incident, `systemctl stop daily-backup.service` alone wasn't enough — the timer kept it queued for re-fire. The recovery sequence is: `systemctl stop daily-backup.timer; systemctl stop daily-backup.service`. Document in `docs/runbooks/cluster-recovery.md` (to-create) as the canonical sequence. | TODO |
+
+### P1 — Reduce blast radius
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P1 | Proxmox per-VM IO throttle for the Immich workload host | Architecture | Set `iops_rd=2000,iops_wr=1000,mbps_rd=200,mbps_wr=100` on the k8s-node1 VM disk via Proxmox API. Pick numbers based on baseline `iostat` measurement. Same for non-prod VMs (devvm, registry). | TODO |
+| P1 | Move NFS reads off the PVE host hot path | Architecture | Currently the PVE host *itself* reads `/srv/nfs/anca-elements` when an NFS client mounts that path — but the *reads* happen on PVE because it's the NFS server. Consider mounting anca-elements via a dedicated NFS export with a `wsize/rsize` cap, OR put bulk-ingest source data on a separate physical disk (sdb SSD has headroom). | Investigation |
+| P1 | Add cgroup IOLimit to Kyverno mutating webhook for namespaces | Config | Auto-attach a `cgroupv2 io.max` annotation to pods in known-high-IO namespaces (immich, frigate, ollama). Requires kernel ≥5.13 + cgroupv2 (we have both). | TODO |
+| P1 | Separate Immich `library` + `upload` NFS exports onto different LVs | Architecture | Currently `/srv/nfs/immich/{library,upload}` share the `pve/nfs-data` LV. Splitting upload onto its own thinly-provisioned LV would let us throttle upload-side independently. Cost: ~30 min PV churn. | Architecture |
+
+### P2 — Detect faster
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P2 | Alert on sustained PVE sdc utilization | Alert | New PrometheusRule `PVEThinPoolIOSaturated`: `irate(node_disk_io_time_seconds_total{device="sdc",instance=~"pve.*"}[5m]) > 0.85` for 30m, severity=warning. | TODO |
+| P2 | Alert on PVE loadavg high | Alert | New PrometheusRule `PVEHostLoadHigh`: `node_load1{instance=~"pve.*"} > 25` for 10m. severity=warning. | TODO |
+| P2 | Alert on Kubernetes Job high IO rate | Alert | `JobHighIOThroughput`: `sum by (namespace, pod) (irate(container_fs_reads_bytes_total{container!=""}[5m])) > 100*1024*1024` for 10m → warning. | TODO |
+| P2 | Alert on daily-backup systemd unit failure | Alert | Add node-exporter textfile collector entry that runs `systemctl is-failed daily-backup.service` every 1m and writes 0/1 to `/var/lib/node_exporter/textfile/backup_unit_state.prom`. PrometheusRule fires on value=1 for 5m. | TODO |
+| P2 | Alert on Multi-Attach VolumeAttachment hung > 5m | Alert | These are the smoking gun whenever a node reboots. New rule on `kube_volumeattachment_status_attached == 0 and time() - kube_volumeattachment_metadata_created > 300`. | TODO |
+
+### P3 — Improve resilience
+
+| Priority | Action | Type | Details | Status |
+|----------|--------|------|---------|--------|
+| P3 | Move k8s-node1 OS disk off sdc | Architecture | If node1 OS disk lived on the SSD (sdb VG `ssd`, 475 GB free), an sdc IO storm wouldn't starve the VM's own root filesystem and we'd avoid the reboot trigger. Cost: VM migration, ~1h downtime for node1. | Architecture |
+| P3 | Spread GPU + pinned services off node1 | Architecture | Today node1 carries the GPU + Loki + Technitium primary + claude-agent + many app deployments. When it goes down, the blast radius is huge. Re-evaluate pin constraints — only Immich-ML and Frigate genuinely need node1. | Investigation |
+| P3 | Document recovery runbook for "node1 hard reboot" | Runbook | New `docs/runbooks/node1-reboot-recovery.md` capturing the **strict order** discovered today: (1) force-cleanup Failed/Unknown/stuck-Terminating zombies, (2) force-delete the calico-node pod on the rebooted node so BIRD restarts cleanly, (3) wait for calico-node Ready, (4) force-delete the proxmox-csi-plugin-node pod, (5) verify `csi.proxmox.sinextra.dev` appears in `kubectl get csinode <node> -o yaml`, (6) delete stale VolumeAttachments referencing the rebooted node where consumers have already rescheduled, (7) verify nvidia driver recovery (separate cascade). | TODO |
+| P3 | Make Calico BIRD liveness probe cold-start tolerant | Config | Bump `initialDelaySeconds` on `calico-node`'s liveness probe by 15s, or switch to an `exec` probe that checks BIRD socket existence rather than HTTP. Prevents the cold-start crash loop after node reboots. | Investigation |
+| P3 | Pre-flight script for bulk ingest jobs | Runbook + Config | Wrapper around `immich-go` that (a) checks PVE loadavg < 10, (b) checks sdc IO util < 50%, (c) checks daily-backup not running, before allowing the job to start. Refuses otherwise. | TODO |
+
+## Lessons Learned
+
+1. **One shared physical disk is one shared failure domain**. sdc serves all VMs; any uncapped workload can take down the cluster. We've now hit this three times in 17 days. Continuing to treat each as a one-off is no longer credible — we need IO budget enforcement (P1), not just better alerting (P2).
+2. **NFS reads bypass per-VM accounting**. We assumed throttling the workload's VM would protect us. It doesn't — the reads physically happen on the PVE host's IO scheduler.
+3. **The "complete" state of a Job doesn't mean its IO is gone**. anca-elements-import finished in 85 min, but the IO tail (ML pipeline + filesystem cache eviction) ran for hours. Future ingest jobs need to either run during off-hours OR be sized so that even their tail is benign.
+4. **Backup pipeline depends on a clean cluster state**. When node1 was unhealthy, daily-backup couldn't complete LUKS mounts in time. Backups should be more resilient to upstream IO degradation OR we should treat backup failure as a SEV signal in real time.
+5. **The 30s `timeout` in `daily-backup.sh` was set without considering post-reboot recovery time**. Defaults like this need to be reviewed in light of actual observed worst case.
+6. **Recovery requires a known runbook**. Today's recovery worked because we knew which order to do things: force-delete zombies → re-register CSI → clear VAs → wait for daemonsets → restart deployments. Codifying that as a runbook means the next incident is 5x faster.
+
+## Update 2026-06-01 — recurrence (Immich Duplicate Detection)
+
+**5th IO-pressure incident.** A user-triggered library-wide **Duplicate Detection** run on the 163,989-asset Immich library cascaded into ML/thumbnail backfill for the ~5,150 assets missing CLIP embeddings (largely a fresh `anca-elements` import that had completed ~90 min earlier). The Immich **server-side** job `thumbnailGeneration` was set to concurrency **8** (plus `metadataExtraction=4`, `library=4`), so the backfill read originals off sdc 8-wide → ~92 MB/s, queue depth ~99, sdc ~100% util. 64 `nfsd` threads went D-state on `folio_wait_bit_common`; **etcd on k8s-master was starved → kube-apiserver down ~30 min** (different blast radius from 2026-05-25's node1 reboot — same root cause: the shared sdc spindle).
+
+**New finding:** the 2026-05-25 P0 capped only the *import-side* concurrency (`immich-go --concurrent-tasks`). The Immich **server-side** job concurrency (`job.*.concurrency` in DB system-config) was never capped and had been tuned for speed (8/4/4). So **any** library-wide operation (dedup, smart-search backfill, thumbnail regen) re-triggers the storm independent of the import job.
+
+**Mitigation applied (2026-06-01):** capped the HDD-original-reading server jobs to `thumbnailGeneration=2, metadataExtraction=2, library=2` in `system_metadata` `system-config` JSONB + `immich-server` recreate. Verified: dedup resumed with sdc at 2–3% util, queue depth ~0.05, apiserver healthy. Documented in `infra/.claude/CLAUDE.md` Immich row.
+
+**Still the real fix (from this PM, still TODO):** the P0 import-side cap, and especially the **IO-isolation** items — move k8s-master **etcd** + node OS disks off sdc onto SSD (generalize P3), and/or give the Immich library its own spindle (P1). Concurrency caps are a band-aid; sdc remains a single shared failure domain that every storm finds. Tracked in beads (see Follow-up Implementation).
+
+## Related
+
+- 2026-05-09 IO post-mortem: `docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md`
+- 2026-05-16 kured/anubis post-mortem: `docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md`
+- 2026-05-17 GPU driver post-mortem: `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
+- Storage architecture: `docs/architecture/storage.md`
+- Backup pipeline: `docs/architecture/backup-dr.md`
+- Storage hardware mapping: memory `id=464` (sdc thin pool, sda backup, sdb SSD)
+- 3-2-1 backup strategy: memory `id=609`
+- Immich storage layout: memory `id=674`
+- Memory entries for this incident: 2682-2686 (Immich storm), 2687-2692 (LUKS mount timing)
+
+## Follow-up Implementation
+
+_This section is auto-populated by the postmortem-todo-resolver agent._
+
+| Date | Action | Priority | Type | Commit | Implemented By |
+|------|--------|----------|------|--------|----------------|
--- a/docs/post-mortems/2026-05-30-redis-split-brain.md
+++ b/docs/post-mortems/2026-05-30-redis-split-brain.md
@ -0,0 +1,93 @@
+# Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30)
+
+**Severity:** SEV2 (degraded — no data loss in Redis; queue processing stalled
+cluster-wide). **Status:** Resolved.
+
+## Summary
+
+The 3-node Sentinel HA Redis (`redis-v2`) split-brained: two pods both held
+`role:master`. HAProxy — which routes to any backend reporting `role:master` —
+round-robined client connections across **both** masters. Immich enqueued
+BullMQ jobs on one master while its workers blocked-popped on the other, so
+every queue stalled. User-visible symptom: **newly uploaded Immich photos
+returned HTTP 404 for their thumbnails** (the generation job never ran). Celery
+apps (real-estate-crawler, trading-bot, paperless) and other queue users were
+affected the same way.
+
+## Impact
+
+- Immich: thumbnail/preview/face/ML jobs not processing. `facialRecognition`
+  backlog reached ~30k waiting; new uploads showed broken images in the web UI.
+- All ~15 shared-Redis consumers had inconsistent reads/writes (connections
+  split across two diverging masters).
+- No Redis data lost — the larger dataset (`redis-v2-0`, ~30k keys) was
+  preserved through the fix.
+
+## Timeline (UTC+1 local)
+
+- **~2026-05-26/27**: `redis-v2` pods recreated (node2 unclean reboot era).
+  `redis-v2-0` came up partitioned; its Sentinel saw 0 peers and it declared
+  itself master via the init script's deterministic "pod-0 = bootstrap master"
+  fallback. Sentinels on `-1`/`-2` independently elected `redis-v2-2`.
+  Split-brain formed and persisted (~3-4 days) as the network healed but the
+  topology never reconciled.
+- **2026-05-30 ~16:58**: investigating "Immich images with no thumbnails."
+  Found thumbnail jobs failing on missing/zeroed originals (separate pre-existing
+  data-loss issue) AND a stuck job queue.
+- **2026-05-30 ~17:00**: user manually restarted immich-server; namespace
+  `tier-quota` (24Gi) briefly blocked the replacement pod → ~1 min Immich
+  outage. Recovered. (Red herring — not the root cause.)
+- **2026-05-30 ~17:1x**: identified two `role:master` redis pods
+  (`redis-v2-0` dbsize 30320, isolated, 0 connected slaves; `redis-v2-2` dbsize
+  442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out
+  IPv6 (cluster is single-stack IPv4) and eviction (`evicted_keys=0`).
+- **2026-05-30 ~17:30**: reverted `redis-v2` to a single standalone instance.
+  Queues drained immediately; newest Immich assets served HTTP 200.
+
+## Root cause
+
+`redis-v2`'s init container (`generate-sentinel-conf`) falls through to
+"Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer
+Sentinels/Redis. During a network partition, `redis-v2-0` hit that fallback and
+became a second master. HAProxy's health check (`tcp-check expect rstring
+role:master`) matches **any** master, so with two masters it placed both in
+rotation and round-robined writes/reads across diverging datasets. BullMQ's
+enqueue (LPUSH) and worker consume (BRPOPLPUSH) landed on different instances →
+jobs never consumed.
+
+This is the **third** Sentinel-class incident (after 2026-04-19 PM quorum drift
+and 2026-04-22 flap cascade). The 3-sentinel design was built to *prevent*
+split-brain, but the bootstrap fallback re-introduced it.
+
+## Resolution
+
+Reverted `redis-v2` to a **single standalone instance** (`replicas=1`, Sentinel
+ HAProxy removed), collapsing onto `redis-v2-0`'s dataset (preserved Immich's
+queued jobs). Eviction policy changed `allkeys-lru` → **`volatile-lru`** so the
+shared cache+queue workload is served correctly by one instance (evict only
+TTL'd cache keys; never TTL-less queue keys). `redis-master` service name/DNS
+unchanged → no consumer edits. Decision rationale: a homelab cache/broker does
+not need HA; a few-seconds restart blip beats chasing Sentinel correctness.
+Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion.
+
+## Follow-ups
+
+- [ ] Re-upload the ~99 Immich images + 12 timeline videos whose **originals**
+      are missing/zero-filled on disk (pre-existing data loss, unrelated to the
+      split-brain — re-running jobs can't regenerate them). Owner: Viktor.
+- [ ] `requirepass` auth on Redis + creds rollout to all consumers (carried over
+      from the 2026-04-19 rework; still open).
+- [ ] Consider whether any queue user (Immich/Celery) warrants its own dedicated
+      Redis if the shared instance's memory ever becomes contended (currently
+      ~30MB / 640MB — not a concern).
+
+## Lessons
+
+- HA that re-introduces its own failure class is worse than no HA. For a
+  single-node-tolerant homelab, prefer a standalone instance + a small accepted
+  downtime window.
+- `allkeys-lru` on a shared cache+queue Redis silently drops queue jobs under
+  pressure; `volatile-lru` is the correct single-instance policy (Immich even
+  logs `IMPORTANT! Eviction policy ... should be "noeviction"`).
+- A "bootstrap master" fallback that fires under partition is a split-brain
+  generator — avoid deterministic self-promotion when peers are unreachable.
--- a/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
+++ b/docs/post-mortems/2026-05-31-kured-sentinel-gate-oom.md
@ -0,0 +1,89 @@
+# Post-Mortem: kured-sentinel-gate OOM while k8s-master stuck pending-reboot
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-31 |
+| **Duration** | OOMs began 2026-05-30 ~03:33, escalating until fixed 2026-05-31 14:40 UTC |
+| **Severity** | SEV4 — no user-facing impact; noisy + latent risk (a wedged gate pod could eventually mis-gate reboots) |
+| **Affected** | `kured-sentinel-gate` pod on k8s-master only |
+| **Status** | Fixed (gate hardened). Two contributing alerts still open, tracked separately. |
+
+## Summary
+
+Noticed by the operator during a routine cluster health check ("an app OOMing
+periodically"). The `kured-sentinel-gate` pod on k8s-master was the *only*
+container in the cluster with OOM events: `container_oom_events_total` showed
+0/day through May 29, **15 on May 30, 134 on May 31** (by 08:21). The kernel
+OOM-killer was killing child `kubectl` processes inside the pod's cgroup; PID 1
+(bash) survived, so the pod never restarted — restartCount stayed at 1 despite
+149 oom_events in 7d. Symptom: the gate's check cycle stretched from 5 min to
+~25 min.
+
+## Root cause (chain)
+
+```
+hermes-agent deploy = 0/0 (parked 2026-04-22, PVC-perms bug) → PVC WaitForFirstConsumer
+  never binds → PVCStuckPending fires; its dead external monitor → ExternalAccessDivergence
+/mnt/synology-backup (192.168.1.13 offsite NAS) at 96% → NodeFilesystemFull fires
+        │  none of these 3 are in kured's alert ignore-list
+        ▼
+kured halts ALL reboots (correct fail-safe)
+        ▼
+k8s-master got /var/run/reboot-required on 2026-05-30 03:33 (kernel update) but can't reboot
+        ▼
+master's gate pod is now the ONLY one running the kubectl-heavy hot path every cycle
+(the other 6 hit the early "no reboot required → continue" at ~3 MiB)
+        ▼
+the immortal `while true` bash loop slowly leaks (repeated kubectl forks + the
+Check-4 `< <(kubectl ...)` process substitution), crosses the 64Mi cgroup limit
+~5 days in, and the OOM-killer culls child kubectls — accelerating as it wedges
+```
+
+The 64Mi limit was the proximate misconfiguration: each `kubectl` fork is a
+~30-50Mi Go binary, and the hot path runs up to 3 per cycle.
+
+## Why hard to spot
+
+- The pod showed `Running` / `1 restart` the whole time — the OOMs hit child
+  processes, not PID 1. Only `container_oom_events_total` (cAdvisor) revealed it;
+  `kube_pod_container_status_*` restart metrics did not.
+- Logs looked clean ("ALL CHECKS PASSED") — the gate kept producing correct
+  decisions, just slowly.
+- Same blind spot as the 2026-05-16 PM: there is still no Prometheus signal for
+  "a node has been pending-reboot too long" (the deferred `KuredRebootBacklog`
+  alert). That alert would have surfaced the stuck-master state on May 30.
+
+## Fix (`stacks/kured/main.tf`, applied + committed 2026-05-31)
+
+1. **Immediate**: deleted the leaking pod (DaemonSet recreated it at ~3 MiB).
+2. **Durable**: memory limit `64Mi → 256Mi` (headroom for kubectl forks) **plus**
+   a self-restart guard — the loop counts iterations and `exit 0`s every
+   `MAX_ITER=72` cycles (~6h at 300s), so kubelet restarts the pod fresh and the
+   slow leak can never accumulate, regardless of how long a node stays
+   pending-reboot. Verified: all 7 pods at 256Mi, `iter N/72` loop live, OOMs
+   stopped.
+
+## Contributing items (open — being addressed separately)
+
+- **hermes-agent** parked at `replicas=0` since 2026-04-22 (PVC `/opt/data` perms
+  mismatch). Its orphaned `WaitForFirstConsumer` PVC drives PVCStuckPending +
+  ExternalAccessDivergence. Resolve = fix perms + scale up, OR remove the PVC and
+  external monitor while parked, OR scope PVCStuckPending to ignore 0-replica
+  consumers.
+- **Synology offsite backup at 96%** (5.0T/5.3T, 265G free; `#recycle` holds 17G).
+  Resolve = prune retention / empty recycle / expand volume. NodeFilesystemFull
+  cannot be blanket-ignored in kured (a full *node* disk SHOULD block reboots) —
+  if scoped, scope to the offsite mount only.
+
+Until at least the first two clear, kured will keep (correctly) refusing to
+reboot master — but the gate pod is now leak-proof either way.
+
+## Lessons
+
+1. **`container_oom_events_total` is the canonical "is anything OOMing" signal** —
+   not restart counts. A cgroup can OOM-kill children while PID 1 lives.
+2. **Immortal in-pod loops that fork heavy binaries need either a generous limit
+   or a periodic self-restart.** A periodic task is really a CronJob; the
+   self-exit guard is the minimal fix within the DaemonSet model.
+3. **The `KuredRebootBacklog` alert (deferred from 2026-05-16) is now twice-implicated.**
+   Worth promoting from the backlog: `kured_reboot_required == 1 for > 24h`.
--- a/docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
+++ b/docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md
@ -0,0 +1,78 @@
+# Post-Mortem: Out-of-Band Tunnel Repoint Reverted by Terraform → Full External 502
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-06-01 |
+| **Duration** | Drift present 2026-05-30 → 2026-06-01. Actual external outage began when a `terragrunt apply` reverted the tunnel origin on 2026-06-01 (cloudflared errors visible from ≥20:58Z); root-caused and fixed at 21:15Z; pods converged 21:16Z. |
+| **Severity** | SEV1 — *every* Cloudflare-proxied hostname (`viktorbarzin.me` + all `*.viktorbarzin.me`) returned HTTP 502 to external clients. Internal/LAN access (split-horizon → Traefik direct) was unaffected, which is why it stayed hidden. |
+| **Affected Services** | All external ingress: viktorbarzin.me, nextcloud, vault, authentik, vaultwarden, immich, linkwarden, nas, technitium, terminal, speedtest, and every other proxied app. |
+| **Issue** | None filed (diagnosed and fixed in-session). |
+| **Status** | Resolved. |
+| **Recurrence count** | 1st of this exact kind. Same `.200→.203` migration family as the 2026-06-01 forgejo-registry containerd-redirect fix (`a382683c`). |
+
+## Summary
+
+On 2026-05-30 Traefik was moved off the shared MetalLB IP `10.0.20.200` onto a dedicated `10.0.20.203`. The migration plan correctly identified that the Cloudflare tunnel had to be repointed away from `10.0.20.200:443` **first**, and it was — but the repoint was done **out-of-band via the Cloudflare Global API Key**, not in Terraform. The Terraform source (`stacks/cloudflared/modules/cloudflared/cloudflare.tf`) was left pointing at `https://10.0.20.200:443`, creating silent drift between live (correct: service DNS) and code (stale: `.200`).
+
+External ingress kept working for ~2 days on the manual config. Then on 2026-06-01 a `terragrunt apply` of the cloudflared stack reconciled live back to the stale Terraform value `https://10.0.20.200:443`. Nothing serves HTTPS on `.200:443` after the Traefik move, so cloudflared could not reach its origin (`connect: no route to host` / `i/o timeout`) and Cloudflare returned 502 across the entire public surface.
+
+Fix: codify the correct origin in Terraform — both ingress rules now point at `https://traefik.traefik.svc.cluster.local:443` (in-cluster Traefik Service DNS). This both restores ingress and makes it permanent (TF and live agree; future applies can't revert it; the origin is decoupled from the Traefik LB IP entirely).
+
+## Impact
+
+- **User-facing**: 100% of externally-reachable services returned 502 via Cloudflare. LAN/internal access (which resolves `*.viktorbarzin.me` → `10.0.20.203` via Technitium split-horizon, bypassing Cloudflare) kept working — this masked the outage.
+- **Blast radius**: every proxied hostname. Origin (Traefik) was healthy the whole time — purely a tunnel-origin routing fault.
+- **Data loss**: none.
+- **Collateral**: Vault's own public hostname (`vault.viktorbarzin.me`) was also 502, creating a bootstrap problem — `terragrunt apply` needs Vault for the PG state-backend creds, but Vault was only reachable from the dev box via the broken tunnel. Worked around with a temporary `/etc/hosts` entry pointing `vault.viktorbarzin.me` → `10.0.20.203` (internal Traefik), removed after the apply.
+
+## Root Cause
+
+**A manual (out-of-band) fix was never codified in Terraform, and a later Terraform apply reverted it.** This is a direct violation of the repo's "Terraform Only — all infra changes go through Terraform" rule. The 2026-05-30 plan applied the tunnel repoint via the Cloudflare API for speed/safety during the migration but did not land the equivalent change in `cloudflare.tf`. Terraform's authority over the resource guaranteed the manual change would eventually be reverted; it was, on the next apply.
+
+Contributing factors:
+- **No drift alarm tied this to user impact.** The TF/live divergence on `cloudflare_zero_trust_tunnel_cloudflared_config` existed for ~2 days; drift-detection (if it ran) didn't escalate it as outage-risk.
+- **Detection gap (masking).** Split-horizon means LAN users never see external-only breakage. The `[External]` Uptime-Kuma monitors + `ExternalAccessDivergence` alert are the only signal for this failure mode, and they did not prompt action.
+- **Docs vs code.** CLAUDE.md described cloudflared as targeting the service DNS — true of live (post-manual-fix) but not of the TF source. The doc masked the drift.
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **2026-05-30 ~08:09** | Traefik Service moved to `10.0.20.203` (`ETP=Local`). Plan step 1 repoints the tunnel `https://10.0.20.200:443` → `https://traefik.traefik.svc.cluster.local:443` **via the CF Global API Key (out-of-band)**. Ingress works. `cloudflare.tf` still says `.200` → **drift**. |
+| **2026-05-30 → 06-01** | External ingress healthy on the manual config. Drift sits unnoticed. |
+| **2026-06-01 (during the day)** | A `terragrunt apply` of the cloudflared stack reconciles the tunnel origin back to the stale TF value `https://10.0.20.200:443`. External ingress breaks → 502. |
+| **2026-06-01 ~20:51** | Keel auto-patches the cloudflared image; pods roll (coincidental, not causal). |
+| **~20:58** | cloudflared logs show every proxied hostname failing against `https://10.0.20.200:443` (`no route to host` / `i/o timeout`). |
+| **21:08** | User reports "no ingress coming in." Investigation starts. |
+| **21:09** | Isolated: origin healthy (direct to `.203` → 200/302), public path → 502; logs pin origin to dead `.200:443`. |
+| **21:13** | Vault unreachable via public name (circular dep); temp `/etc/hosts` → `.203`. `tg init -reconfigure` (rotated PG backend creds). |
+| **21:15:25** | Targeted apply: both ingress origins → service DNS. `Apply complete! 1 changed`. |
+| **21:16** | 10/10 curls to `viktorbarzin.me` → 200; 0 `.200` errors across all pods; `vault.viktorbarzin.me` via real Cloudflare → 200. Temp hosts entry removed. Resolved + committed (`f807050e`). |
+
+## Resolution
+
+Changed both `ingress_rule` blocks in `cloudflare.tf` from `https://10.0.20.200:443` to `https://traefik.traefik.svc.cluster.local:443` (`no_tls_verify` retained), making the Terraform source match the intended (and previously-manual) live config. Applied surgically with `-target` on the tunnel config resource only, to avoid touching two pre-existing, unrelated drift items the full plan surfaced (below). Committed `[ci skip]` since live already matched after the targeted apply.
+
+## Pre-existing drift (NOT part of this incident, left untouched)
+
+The full `cloudflared` stack plan showed two extra in-place changes, deliberately **not** applied:
+1. `kubernetes_deployment.cloudflared` — TF would strip Keel's runtime annotations (`keel.sh/policy|pollSchedule|trigger|update-time`). The deployment ignores `dns_config` but not `metadata.annotations`. Self-healing (Keel re-adds within its 1h poll); clean fix is to add `metadata[0].annotations` (+ template equivalent) to `ignore_changes`.
+2. `cloudflare_record.mail_domainkey_rspamd` — cosmetic re-chunking of the DKIM TXT record (identical key). Benign.
+
+## Action Items
+
+- [x] Codify tunnel origin (service DNS) in `cloudflare.tf` (this fix) — drift eliminated.
+- [x] Fix stale `10.0.20.200:443` Traefik reference in `docs/runbooks/kms-public-exposure.md` (→ `.203`).
+- [x] Post-mortem written.
+- [ ] **Audit for other out-of-band changes from the 2026-05-30 migration** that were applied via CF API / kubectl / pfSense but not landed in code — they will all revert on the next apply.
+- [ ] **Make `ExternalAccessDivergence` trustworthy and seen** — it is the only signal for external-only outages and did not prompt action here.
+- [ ] **Drift detection should flag tunnel-origin divergence as outage-risk**, not just generic drift.
+- [ ] (Optional) Pin the exact reverting-apply time via Woodpecker pipeline history for the cloudflared stack on 2026-06-01.
+- [ ] (Optional) Fix the cloudflared Keel-annotation drift so the stack plans clean.
+
+## Lessons
+
+- **Codify out-of-band fixes immediately.** A manual change to a Terraform-managed resource is a time bomb — Terraform *will* revert it on the next apply. The "Terraform Only" rule exists for exactly this; the 05-30 manual tunnel repoint should have been mirrored into `cloudflare.tf` the same day.
+- **Reference shared infra (Traefik) by stable Service DNS, not LB IP**, from anything that can use cluster DNS. The service-DNS origin also happens to survive LB-IP moves.
+- **External-only outages are invisible from the LAN** (split-horizon). The `[External]` divergence signal is load-bearing — it must be trustworthy and actually seen.
+- **Keep docs honest about source-of-truth.** "Live is correct" is not the same as "code is correct"; a doc that conflates them hides drift.
--- a/docs/post-mortems/2026-06-01-keel-match-tag-image-swap.md
+++ b/docs/post-mortems/2026-06-01-keel-match-tag-image-swap.md
@ -0,0 +1,117 @@
+# Post-Mortem: Keel `match-tag` cross-assigned the blog's container images (site down ~6 days)
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-06-01 |
+| **Duration** | 2026-05-26 19:47 UTC → 2026-06-01 ~16:00 UTC (~6 days) |
+| **Severity** | SEV3 — `viktorbarzin.me` (public blog) fully down; user-facing, but a personal blog with no SLA |
+| **Affected** | `website/blog` Deployment (acute outage). Latent: 194 enrolled workloads carried the same stale annotation; 16 were multi-image swap-risk |
+| **Status** | Fixed — images un-swapped, `keel.sh/match-tag` stripped fleet-wide, `inject-keel-annotations` policy hardened to strip it on admission |
+
+## Summary
+
+Reported by the operator as "blog is crashlooping." The `website/blog` pod was
+`1/2 CrashLoopBackOff`. The two container images had been **swapped**: the
+container named `nginx-exporter` was running the nginx blog image
+(`viktorbarzin/blog:cfd39d6f`) and receiving the exporter's
+`-nginx.scrape-uri` arg — nginx's entrypoint rejected `-n` (`illegal option`)
+and crashed; while the container named `blog` was running the exporter image
+(`nginx/nginx-prometheus-exporter:1.5.1`), listening on `:9113` instead of
+serving the site on `:80`. **Nothing served `:80`, so the blog was fully down**
+(Anubis → `blog:80` → connection refused), not merely a crashing sidecar.
+
+The swap happened **2026-05-26 19:47 UTC** (rollout revisions 28–35, all stamped
+`keel automated update, version latest -> 1.5.1`) and went unnoticed for ~6 days.
+
+## Root cause (chain)
+
+1. The `inject-keel-annotations` Kyverno policy stamps Keel control annotations
+   on every workload in `keel.sh/enrolled=true` namespaces. Before 2026-05-26
+   the default was `keel.sh/policy: force` + `keel.sh/match-tag: "true"`.
+2. The `blog` Deployment runs **two containers with two different images that
+   both float on tag `latest`**: `viktorbarzin/blog:latest` and
+   `nginx/nginx-prometheus-exporter` (→ `:latest`).
+3. On 2026-05-26 `nginx/nginx-prometheus-exporter` published semver `1.5.1`.
+   Under `force + match-tag`, Keel rewrote the deployment and **cross-assigned
+   the two images** — the exact class of failure the same-day incident
+   documented (uptime-kuma `:2→:1`, n8n `:1.80.5→:0.1.2`, etc.). The blog was a
+   casualty of that incident but was **not on the cleanup list**.
+4. Same day, the policy default was switched `force → patch` and `match-tag` was
+   dropped from the patch — but **Kyverno's add-only `patchStrategicMerge`
+   cannot remove an annotation that's no longer listed**. So ~194 pre-migration
+   workloads (the blog included) kept a stale `keel.sh/match-tag=true`.
+5. Because the blog's images are in Terraform `ignore_changes` (Keel/Woodpecker
+   own them) and the keel annotations are policy-managed (not in the stack), a
+   `terraform apply` would not have corrected either field — the broken state
+   was invisible to the normal apply/drift loop.
+
+## Why hard to spot
+
+- **No crash on most swaps.** A swap only hard-crashes when a container's args
+  are rejected by the wrong image. The blog crashed because nginx got
+  `-nginx.scrape-uri`. The sibling `travel_blog` has `match-tag` too but its
+  exporter sidecar is commented out (single container — nothing to cross-wire),
+  so it was fine. `changedetection` shows crossed images but both boot without
+  conflicting args, so it ran 2/2 for days — silently mis-wired, no alert.
+- **No external monitor caught it.** The Anubis challenge page returns 200
+  without reaching the backend, so a naive front-door check looks healthy.
+- The acute symptom (`CrashLoopBackOff`) was only visible via `kubectl`, and the
+  blog has no SLA, so nothing paged.
+
+## Fix (applied + committed 2026-06-01)
+
+1. **Un-swapped the blog images** via `kubectl set image` (the same path
+   Woodpecker uses for this TF-ignored image): `blog=viktorbarzin/blog:cfd39d6f`,
+   `nginx-exporter=nginx/nginx-prometheus-exporter:1.5.1`. Pod is 2/2; site
+   serves 200 internally and externally (`/net-diag.sh` via the Anubis-bypass
+   carve-out returned the real 40 KB script).
+2. **Removed the orphaned annotation** from the blog (`kubectl annotate …
+   keel.sh/match-tag-`).
+3. **Hardened the policy** (`stacks/kyverno/modules/kyverno/keel-annotations.tf`):
+   added `keel.sh/match-tag = null` to the `patchStrategicMerge`, so the
+   annotation is stripped on admission for every enrolled workload and can never
+   be re-added.
+4. **Swept the fleet.** `mutateExistingOnPolicyUpdate` did *not* regenerate
+   UpdateRequests for a removal-only change (Kyverno re-mutates existing
+   resources for add/set, not deletions), so the 194 pre-existing workloads were
+   swept once with `kubectl annotate <kind>/<name> -n <ns> keel.sh/match-tag-`.
+   Annotation-only ⇒ no pod restarts (verified: vault/CSI/monitoring pod ages
+   unchanged). Remaining `match-tag=true`: 0.
+
+## Lessons
+
+- **Add-only mutation can't undo itself.** Dropping a key from a Kyverno
+  `patchStrategicMerge` does not remove it from already-mutated resources — you
+  must set it to `null` *and* sweep existing ones. The 2026-05-26 migration did
+  neither, leaving 194 landmines.
+- **Multi-image pods + a shared floating tag + `force`/`match-tag` = swap risk.**
+  Keep third-party sidecars on explicit pinned tags, not `latest`, so they never
+  share a tag with the app image.
+- **State that Terraform `ignore_changes` is invisible to drift detection.**
+  Image fields and policy-managed annotations won't show up in `plan`; they need
+  their own verification (a synthetic backend probe, not just the front door).
+
+## Audit result (completed 2026-06-01)
+
+All 16 multi-image swap-risk workloads were checked. **Only two were actually
+swapped:**
+
+- `website/blog` — acute crash (fixed, un-swapped).
+- `changedetection/changedetection` — *silent* swap: it ran 2/2 for days
+  because pod containers share a network namespace (each process still bound its
+  own port), but the app was running **without its `/datastore` PVC, without
+  `PLAYWRIGHT_DRIVER_URL`/`BASE_URL`, and at a 128Mi cap** — config was ephemeral
+  and one restart from total loss. Un-swapped; `/datastore` (watch config back to
+  Feb 2026) re-mounted; app confirmed serving `200` with watches loaded.
+
+The other 14 are NOT swapped: `insta2spotify` and `priority-pass` (the other
+custom app+helper pairs) verified correctly mapped; the rest are upstream Helm
+charts (grafana, prometheus, loki, alloy, vault, the CSI controllers/nodes,
+mysql) with fixed image→container mappings, all healthy. `match-tag` is now
+stripped from all of them, so none can swap again.
+
+## Recommendation (not yet actioned)
+
+- An **external monitor that hits the bare blog backend** (bypassing Anubis)
+  would have caught this: the Anubis challenge page returns `200` without
+  reaching the backend, so the front-door monitor stayed green for 6 days.
--- a/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md
+++ b/docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md
@ -0,0 +1,85 @@
+# Post-Mortem: immich-ml VRAM hog (MODEL_TTL=0) starved llama-swap → recruiter-responder silently down
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-06-02 |
+| **Duration** | Triage failing 17:41 → ~20:08 EEST (~2.5 h confirmed in retained logs; first 502 at 17:41) |
+| **Severity** | SEV3 — one pipeline (recruiter-responder) fully down; no data loss (emails preserved unseen); no other user-facing impact |
+| **Affected** | `recruiter-responder` (triage). Root cause in `immich-machine-learning` + shared T4 GPU. |
+| **Status** | Fixed — `immich` `MACHINE_LEARNING_MODEL_TTL` 0 → 600; immich-ml VRAM dropped 10.7 GB → ~1.9 GB; qwen3-8b loads again; backlog reprocessed. |
+
+## Summary
+
+Reported by the operator: "receiving recruiter emails but seeing no responses."
+The recruiter-responder IMAP IDLE reader was healthy and fetching mail, but every
+email failed at the triage step with `502 Bad Gateway` from llama-swap. llama-swap
+could not load its `qwen3-8b` model because the shared Tesla T4 (16 GB) had only
+~2.2 GB free — `immich-machine-learning` was holding **10.7 GB** and never released
+it. Because triage *raised* (not swallowed), each email was left **unseen** and
+retried, so no mail was lost — but no draft/event/Telegram notification was ever
+produced.
+
+## Root cause (chain)
+
+```
+immich-ml runs with MACHINE_LEARNING_MODEL_TTL=0  →  ModelCache(revalidate=False),
+  per-model TTL eviction + idle-shutdown both DISABLED → nothing ever unloads
+        ▼
+heavy immich library job ~17:17 (metadata + smartSearch + OCR + face) runs OCR
+  (PP-OCRv5, dynamic input shapes) → onnxruntime BFC CUDA arena inflates to ~10.7 GB
+        ▼
+TTL=0 → the arena floor is permanent (onnxruntime doesn't cudaFree between runs;
+  only a process restart reclaims it)
+        ▼
+T4 free VRAM ~2.2 GB  (T4 is time-sliced across immich-ml / immich-server /
+  frigate / llama-swap with NO memory isolation)
+        ▼
+llama-swap gets a qwen3-8b request → llama-server: cudaMalloc 4455 MiB OOM →
+  "exiting due to model loading error" → llama-swap returns 502
+        ▼
+recruiter-responder triage.py raise_for_status() → orchestrator raises →
+  imap_idle leaves the message UNSEEN (BODY.PEEK) → no draft/event → no Telegram
+```
+
+## Why it was hard to spot
+
+- **Everything showed `Running`/healthy**: the recruiter-responder, llama-swap, and
+  immich-ml pods were all `1/1 Running` with 0 restarts. The failure was a runtime
+  502, not a crash.
+- **`nvidia-smi` inside a container shows "No running processes found"** (PID-namespace
+  isolation) — per-process VRAM attribution needed the host-PID `gpu-pod-exporter`
+  (`nvidia-smi --query-compute-apps`), which pinned the 10.7 GB on `immich_ml.main`.
+- **Silent**: triage errors only landed in recruiter-responder logs; no alert fired
+  on llama-swap 5xx or on low GPU free-VRAM. ~440 triage attempts failed before the
+  operator noticed organically.
+
+## Resolution
+
+- `stacks/immich/main.tf`: `MACHINE_LEARNING_MODEL_TTL` `0` → `600` (targeted apply of
+  `kubernetes_deployment.immich-machine-learning`). The Recreate rollout cleared the
+  stuck arena immediately; going forward, idle ad-hoc models (OCR, face) unload after
+  600 s and return VRAM, while preloaded CLIP (smart search) stays warm.
+- Verified: T4 used 12571 → 3785 MiB (11.1 GB free); immich-ml 10726 → 1940 MiB;
+  `qwen3-8b` chat completion returns HTTP 200; recruiter-responder reprocessed its
+  unseen backlog with triage `200 OK`.
+
+## Why MODEL_TTL=0 was set (and the correction)
+
+`MODEL_TTL=0` was almost certainly chosen to keep the smart-search model permanently
+warm for snappy search. The unintended consequence: it *also* pins every ad-hoc model
+(OCR/face) and lets onnxruntime's arena grow unbounded on a GPU it doesn't own alone.
+immich has **no per-model TTL** (a single global knob; the idle path kills the whole
+worker via `os.kill(getpid(), SIGINT)` and respawns), so the practical compromise is a
+moderate global TTL + CLIP preload: CLIP reloads in ~10 s on the rare idle miss, while
+OCR/face free their VRAM.
+
+## Follow-ups (not yet done — operator declined hardening this session)
+
+- **Alerting** on (a) GPU free-VRAM below a threshold and (b) llama-swap 5xx /
+  recruiter-responder triage failure rate, so a future starvation doesn't sit silent.
+  (Operator believes existing alerts cover it — unverified here.)
+- **Optional** recruiter-responder resilience: fall back to a smaller model
+  (`qwen3vl-4b`) or the Tier-1 GPT relay when llama-swap 502s.
+- **Separate pre-existing issue** surfaced in immich-server logs: repeated
+  `AssetExtractMetadata` `ENOENT` on `upload/upload/...` paths (missing originals) —
+  unrelated to this incident; worth a look.
--- a/docs/post-mortems/index.html
+++ b/docs/post-mortems/index.html
@ -0,0 +1,138 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Post-Mortems — viktorbarzin.me</title>
+<link rel="preconnect" href="https://fonts.googleapis.com">
+<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=IBM+Plex+Sans:wght@300;400;500&display=swap" rel="stylesheet">
+<style>
+:root {
+  --bg: #f5f3f0;
+  --surface: #ffffff;
+  --text: #1a1215;
+  --text-secondary: #6b5e64;
+  --border: #ddd5d0;
+  --accent: #b91c1c;
+}
+@media (prefers-color-scheme: dark) {
+  :root {
+    --bg: #0f0b0d;
+    --surface: #1e1719;
+    --text: #ede8ea;
+    --text-secondary: #a89da2;
+    --border: #332b2e;
+    --accent: #ef4444;
+  }
+}
+* { margin: 0; padding: 0; box-sizing: border-box; }
+body {
+  font-family: 'IBM Plex Sans', sans-serif;
+  background: var(--bg);
+  color: var(--text);
+  padding: 60px 24px;
+  max-width: 800px;
+  margin: 0 auto;
+}
+h1 {
+  font-family: 'Space Grotesk', sans-serif;
+  font-size: 2rem;
+  font-weight: 700;
+  margin-bottom: 8px;
+  letter-spacing: -0.02em;
+}
+.subtitle {
+  color: var(--text-secondary);
+  margin-bottom: 40px;
+  font-size: 0.95rem;
+}
+.incident-list { list-style: none; }
+.incident-item {
+  background: var(--surface);
+  border: 1px solid var(--border);
+  border-radius: 10px;
+  padding: 20px 24px;
+  margin-bottom: 12px;
+  transition: border-color 0.2s;
+}
+.incident-item:hover { border-color: var(--accent); }
+.incident-item a {
+  text-decoration: none;
+  color: var(--text);
+  display: block;
+}
+.incident-date {
+  font-family: 'Space Grotesk', sans-serif;
+  font-size: 0.8rem;
+  color: var(--text-secondary);
+  font-weight: 500;
+  letter-spacing: 0.04em;
+}
+.incident-title {
+  font-family: 'Space Grotesk', sans-serif;
+  font-size: 1.15rem;
+  font-weight: 600;
+  margin: 4px 0;
+}
+.incident-desc {
+  font-size: 0.85rem;
+  color: var(--text-secondary);
+}
+.sev-tag {
+  display: inline-block;
+  font-family: 'Space Grotesk', sans-serif;
+  font-size: 0.7rem;
+  font-weight: 600;
+  padding: 2px 8px;
+  border-radius: 4px;
+  background: rgba(185, 28, 28, 0.1);
+  color: var(--accent);
+  border: 1px solid var(--accent);
+  text-transform: uppercase;
+  letter-spacing: 0.04em;
+  margin-left: 8px;
+  vertical-align: middle;
+}
+footer {
+  margin-top: 40px;
+  padding-top: 20px;
+  border-top: 1px solid var(--border);
+  font-size: 0.7rem;
+  color: var(--text-secondary);
+  text-align: center;
+}
+</style>
+</head>
+<body>
+<h1>Post-Mortems</h1>
+<p class="subtitle">Incident reviews for the viktorbarzin.me Kubernetes cluster</p>
+<ul class="incident-list">
+  <li class="incident-item">
+    <a href="2026-04-14-nfs-fsid0-dns-vault-outage.md">
+      <span class="incident-date">2026-04-14</span>
+      <span class="sev-tag">SEV 1</span>
+      <div class="incident-title">NFS fsid=0 Cascade &mdash; DNS + Vault + Multi-Service Outage</div>
+      <div class="incident-desc">5h outage: fsid=0 in PVE /etc/exports broke NFSv4 subdirectory mounts &rarr; Technitium primary I/O errors &rarr; Vault lost quorum &rarr; Alertmanager blind &rarr; 25+ pods affected across 15+ namespaces.</div>
+    </a>
+  </li>
+  <li class="incident-item">
+    <a href="2026-03-16-nfs-csi-cascade-failure.md">
+      <span class="incident-date">2026-03-16</span>
+      <span class="sev-tag">SEV 1</span>
+      <div class="incident-title">NFS CSI Cascade Failure</div>
+      <div class="incident-desc">47h outage: NFS CSI driver liveness-probe port conflict &rarr; all NFS mounts fail &rarr; 40+ pods stuck across 20+ namespaces.</div>
+    </a>
+  </li>
+  <li class="incident-item">
+    <a href="2026-03-16-kured-containerd-cascade-outage.html">
+      <span class="incident-date">2026-03-16</span>
+      <span class="sev-tag">SEV 1</span>
+      <div class="incident-title">Kured + Containerd Cascade Outage</div>
+      <div class="incident-desc">26h cluster outage: unattended-upgrades kernel update &rarr; kured reboot &rarr; containerd overlayfs snapshotter corruption &rarr; calico down &rarr; cascading failure across all 5 nodes.</div>
+    </a>
+  </li>
+</ul>
+<footer>viktorbarzin.me infrastructure</footer>
+</body>
+</html>