fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-09 08:45:33 +00:00
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,130 @@
# Post-Mortem: NFS CSI Cascade Failure
| Field | Value |
|-------|-------|
| **Date** | 2026-03-16 |
| **Duration** | ~47h (ongoing) |
| **Severity** | SEV1 |
| **Affected Services** | 40+ pods across 20+ namespaces |
| **Status** | Draft |
## Summary
The NFS CSI driver entered a crash-loop due to a liveness-probe port conflict (~47h ago), causing all NFS-backed PV mounts to fail. This cascaded into 15+ pods stuck in ContainerCreating, MySQL InnoDB data file lock failures, Vault Raft storage timeouts, and MetalLB speaker port conflicts. Critical path services (Traefik, PostgreSQL) remained partially operational.
## Impact
- **User-facing**: Services dependent on NFS storage (calibre, forgejo, pgadmin, uptime-kuma, etc.) completely unavailable. Grafana down (MySQL dependency). Vault unavailable.
- **Services affected**: 40+ pods across 20+ namespaces — storage layer, databases, monitoring, CI/CD
- **Duration**: ~47h and ongoing at time of investigation
- **Data loss**: None confirmed; MySQL ibdata1 lock issue may indicate risk if not resolved cleanly
## Timeline (UTC)
| Time | Event | Source |
|------|-------|--------|
| ~47h ago | NFS CSI driver starts crash-looping — liveness-probe port 29653 conflict | cluster-health-checker (pod age + restart count) |
| ~30h ago | iSCSI CSI controller deployed (stable) | pod age |
| ~27h ago | Vault, MySQL, Headscale, Woodpecker, CrowdSec agents start failing | pod ages, events |
| ~26h ago | mysql-cluster-0 enters ContainerStatusUnknown | sre investigation |
| ~20-22h ago | Cascade of service restarts across cluster | pod restart timestamps |
| ~1h ago | Latest wave of pod rescheduling/restarts | events |
## Root Cause
**NFS CSI driver liveness-probe port conflict**: The liveness-probe containers on all worker nodes fail with `listen tcp 127.0.0.1:29653: bind: address already in use`. The port conflict suggests a previous liveness-probe process did not cleanly terminate, or pods were restarted while old processes lingered in the network namespace.
**Impact chain**: NFS CSI not registered on nodes → all NFS PV mounts fail → 15+ pods stuck in ContainerCreating with "driver name nfs.csi.k8s.io not found in the list of registered CSI drivers"
## Contributing Factors
- **MySQL InnoDB data corruption**: Cannot open `ibdata1` (OS error 11 — EAGAIN). Likely caused by NFS storage instability or stale lock from mysql-cluster-0 in ContainerStatusUnknown
- **Vault Raft lock timeout**: vault-0 fails with "failed to open bolt file: timeout" — BoltDB locked by previous instance. vault-2 cannot mount NFS volume at all
- **MetalLB speaker port conflicts**: 3/4 speakers fail with memberlist port 7946 already in use — same pattern as NFS CSI, suggesting containerd instability ~47h ago
- **Node3 memory pressure**: 80% utilization, hosting both mysql-cluster-1 and mysql-cluster-2
## Detection
- **How detected**: Manual investigation (this post-mortem)
- **Time to detect**: ~47h from start of NFS CSI crash-loop
- **Gap analysis**: No alerting on CSI driver health. Existing pod alerts likely firing but root cause (CSI driver) not surfaced. Need CSI-specific health alerts.
## Resolution
Not yet resolved at time of investigation. Recommended steps:
1. Delete NFS CSI node pods one at a time (DaemonSet will recreate with clean port allocation)
2. Delete MetalLB crash-looping speaker pods (same approach)
3. Force-delete mysql-cluster-0 (ContainerStatusUnknown) to release ibdata1 lock
4. Once NFS healthy, vault-2 should start; vault-0 may need BoltDB lock file cleared
## Action Items
### Preventive (stop recurrence)
| Priority | Action | Type | Details |
|----------|--------|------|---------|
| P1 | Investigate containerd health on worker nodes | Investigation | Port conflicts across NFS CSI + MetalLB suggest containerd restart/instability ~47h ago |
| P1 | Fix NFS CSI liveness probe port allocation | Config | Use ephemeral ports or add port uniqueness checks to avoid stale port conflicts |
| P2 | Add node anti-affinity for MySQL replicas | Terraform | mysql-cluster-1 and -2 both on node3 (80% memory) — spread across nodes |
### Detective (catch faster)
| Priority | Action | Type | Details |
|----------|--------|------|---------|
| P1 | Add CSI driver health alerting | Alert | PrometheusRule for CSI driver pod crash-loops — `kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace="democratic-csi"}` |
| P1 | Add NFS mount failure alerting | Alert | Alert on pods stuck in ContainerCreating > 10min with volume mount errors |
| P2 | Add Uptime Kuma monitor for Vault | Monitor | Vault health endpoint check |
### Mitigative (reduce blast radius)
| Priority | Action | Type | Details |
|----------|--------|------|---------|
| P2 | Add PDB for NFS CSI node DaemonSet | Config | Ensure at least N-1 nodes always have healthy CSI driver |
| P2 | Document NFS CSI recovery runbook | Runbook | Steps to recover from port conflict scenario |
| P3 | Evaluate moving MySQL off NFS to iSCSI | Architecture | iSCSI remained stable throughout; MySQL on NFS is fragile |
## Lessons Learned
- **Went well**: Critical path services (Traefik 3/3, Authentik 2/3, PostgreSQL) survived due to not depending on NFS CSI
- **Went poorly**: 47h detection gap — no alerting on CSI driver health despite it being a single point of failure for all NFS workloads
- **Got lucky**: iSCSI CSI remained stable, keeping PostgreSQL (CNPG) operational. If both CSI drivers had failed, the entire cluster would have been fully down
## Raw Investigation Data
<details>
<summary>Cluster State Summary</summary>
- **Nodes**: All 5 Ready. k8s-node2 metrics `<unknown>`. k8s-node3 at 80% memory.
- **Tier 1 (Critical)**: Traefik OK (3/3), Authentik degraded (2/3), PostgreSQL degraded (1/2), Vault DOWN, Redis starting, MetalLB degraded (1/4)
- **Tier 2 (Storage)**: NFS CSI FAILING (port conflict on all workers), iSCSI CSI OK
- **Tier 3 (Apps)**: 15+ ContainerCreating (NFS), 5+ Pending (GPU/memory), 10+ CrashLoopBackOff (DB deps)
- **Tier 4 (Databases)**: PostgreSQL recovering, MySQL DOWN (ibdata1 lock)
</details>
<details>
<summary>NFS CSI Error Details</summary>
Controller pods (2): CrashLoopBackOff — `listen tcp 127.0.0.1:29653: bind: address already in use`
Node DaemonSet: 4/5 crash-looping (same error). Only k8s-master healthy.
Age: ~47h with 96+ restarts.
</details>
<details>
<summary>MySQL Error Details</summary>
mysql-cluster-0: ContainerStatusUnknown
mysql-cluster-1, mysql-cluster-2: CrashLoopBackOff — `Cannot open datafile './ibdata1'` (OS error 11: Resource temporarily unavailable)
mysql-cluster-router: crash-looping (no healthy backend)
</details>
<details>
<summary>Vault Error Details</summary>
vault-0: CrashLoopBackOff — "failed to open bolt file: timeout" (Raft BoltDB lock)
vault-2: ContainerCreating — NFS mount failure
</details>

View file

@ -0,0 +1,223 @@
# Post-Mortem: NFS fsid=0 Cascade — DNS + Vault + Multi-Service Outage
| Field | Value |
|-------|-------|
| **Date** | 2026-04-14 |
| **Duration** | ~5h initial (05:3710:40 EEST), then ~2h secondary (NFS restart broke NFSv3 + DNS zone sync gap) |
| **Severity** | SEV1 |
| **Affected Services** | 48+ pods across 20+ namespaces. DNS (all instances), Vault, MySQL, Grafana, Uptime Kuma, ebooks, phpipam, immich, servarr, and more |
| **Status** | Complete |
## Summary
An `fsid=0` flag in the PVE host's `/etc/exports` caused all NFSv4 subdirectory mount paths from k8s to fail. Combined with a `lockd` configuration failure that broke NFSv3 fallback, this made ALL new NFS mounts impossible and ALL existing NFS mounts stale after the NFS server had been restarted on Apr 11. The failure cascaded into DNS (primary unreachable), Vault (lost Raft quorum), Alertmanager (monitoring blind), and 20+ other services.
## Impact
- **User-facing**: DNS intermittent failures for 192.168.1.x network users (primary down, secondary/tertiary covered ~66% of queries). Vault-dependent services unable to rotate secrets.
- **Blast radius**: 57,405 NFS error messages across 4 k8s nodes + PVE host. 53 NFS-backed PVs at risk.
- **Duration**: ~5h of active NFS failure. Some services (ebooks) were in CreateContainerError for 2d22h before detection.
- **Data loss**: None. NFS data remained intact on disk; only the NFS service was broken.
- **Monitoring gap**: Alertmanager itself was on NFS storage, so it couldn't alert about the NFS failure.
## Timeline (EEST)
| Time | Event |
|------|-------|
| **Apr 11 00:44:52** | NFS server restarts on PVE. Logs `exportfs: can't open /etc/exports for reading` and `nfsdctl: lockd configuration failure`. Server starts with broken export configuration. |
| **Apr 11 (later)** | `/etc/exports` recreated with `fsid=0` on `/srv/nfs` and `fsid=1` on `/srv/nfs-ssd`. Exports reloaded. Existing k8s NFS mounts continue working (cached file handles). |
| **Apr 13 23:52** | TrueNAS (10.0.10.15) goes completely unreachable. PVE host's hard NFS mounts to TrueNAS start blocking with D-state kernel process. |
| **Apr 14 05:04** | `daily-backup.service` starts on PVE. 25/88 LUKS PVC snapshots fail to mount. |
| **Apr 14 05:37** | **CASCADE BEGINS**: k8s-node1, node3, node4 simultaneously start reporting `nfs: server 192.168.1.127 not responding, timed out`. Existing cached file handles expire; new mount operations hit the fsid=0 + lockd issues. |
| **Apr 14 ~07:30** | DNS outage reported by users on 192.168.1.x network. Investigation begins. |
| **Apr 14 10:40** | **FIX**: `fsid=0` removed from `/etc/exports`, NFS server restarted. New mounts work. |
| **Apr 14 10:41** | Vault pods restarted — all 3 come up 2/2 Running. Raft quorum restored. |
| **Apr 14 10:42** | Technitium primary pod restarted with fresh NFS mount. DNS fully restored. |
| **Apr 14 10:4411:04** | Technitium primary migrated from NFS to `proxmox-lvm-encrypted` PVC. Data restored (104.9M including all 11 zones). Terraform state reconciled. |
## Root Cause Chain
```
[1] fsid=0 in /etc/exports (introduced Apr 11)
└─> NFSv4 treats /srv/nfs as pseudo-root
└─> Subdirectory paths (/srv/nfs/technitium) resolve incorrectly → ENOENT
├─> [2] lockd configuration failure (since Apr 11)
│ └─> NFSv3 fallback fails (statd not running, no 'nolock' in CSI mount options)
│ └─> ALL new NFS mounts fail with "No such file or directory"
└─> [3] Existing mounts go stale (cached file handles expire ~Apr 14 05:37)
├─> Technitium primary: I/O errors on /etc/dns → degraded DNS
├─> Vault 0+1: CreateContainerError → lost Raft quorum → Vault down
├─> Alertmanager: I/O errors → monitoring blind spot
├─> Grafana: CrashLoopBackOff (MySQL password rotation failed during Vault outage)
├─> ebooks: CreateContainerError (2d22h undetected)
└─> 20+ CronJobs and services: Error/FailedMount
```
### Why fsid=0 was there
During the TrueNAS → Proxmox NFS migration on Apr 11, the NFS server was restarted. At startup, `/etc/exports` was missing/unreadable. When the exports file was recreated, `fsid=0` was included — likely copied from an NFSv4 example. This flag is only appropriate for dedicated NFSv4-only exports, not for NFS CSI dynamic provisioning which mounts subdirectories.
### Why the failure was delayed 3 days
Existing NFS mounts from before the Apr 11 restart continued working because:
- NFSv3 mounts cache file handles and don't re-negotiate protocol on every I/O
- The `soft,timeo=30` mount options return errors after timeout, but cached operations succeed
- Only when cached handles expired or new mounts were needed did the failure manifest
The trigger was likely the `daily-backup.service` at 05:04 on Apr 14, which accesses NFS exports and may have caused the NFS server to recycle state, invalidating cached handles cluster-wide.
## Contributing Factors
1. **TrueNAS unreachable since Apr 13 23:52**: PVE host has `hard` NFS mounts to TrueNAS that will retry forever. D-state kernel process stuck since Apr 13. This may have contributed to NFS thread contention on the PVE host.
2. **Alertmanager on NFS storage**: The very system meant to alert about storage failures was stored on the failing storage. Circular dependency.
3. **`/etc/exports` not managed by Terraform or git**: Changes to this critical configuration file are untracked, making it impossible to audit when `fsid=0` was introduced.
4. **No NFS-specific health alerts**: While CLAUDE.md mentions "NFS responsiveness" alerts, no alert fired during this 5+ hour outage. The Prometheus rule may not cover mount failures from the k8s node perspective.
5. **CSI mount options lack `nfsvers=3` and `nolock`**: The NFS CSI driver uses default mount options that rely on version auto-negotiation. When NFSv4 fails (fsid=0) and NFSv3 fails (lockd), there's no fallback path.
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| No alert on NFS mount failures from k8s nodes | 5h to detection | Add PrometheusRule: `node_nfs_requests_total` error rate |
| Alertmanager on NFS storage | Alerting blind during NFS outage | Move Alertmanager to `proxmox-lvm-encrypted` |
| `/etc/exports` not in git/Terraform | Can't audit config changes | Manage via Ansible or TF `remote-exec` |
| No TrueNAS reachability alert | 11h unnoticed before cascade | Add ping/ICMP monitor in Uptime Kuma |
| No CSI mount failure alert | Pods stuck for days unnoticed | Alert on `kube_pod_container_status_waiting_reason{reason="ContainerCreating"}` > 10m |
| ebooks pods broken for 2d22h | Zero notification | Above alert covers this |
| Grafana down 37h | Dashboard monitoring blind | Uptime Kuma HTTP check already exists; verify it alerts |
## Prevention Plan
### P0 — Prevent this exact failure from recurring
| Action | Owner | Status |
|--------|-------|--------|
| Remove `fsid=0` from `/etc/exports` on PVE host | Done | Completed Apr 14 |
| Fix `lockd configuration failure` on PVE NFS server | Done | Disabled NFSv3 entirely (`vers3=n`). lockd is an nfsdctl bug on kernel 6.14 — not fixable without Proxmox patch. |
| Force-unmount hung TrueNAS NFS mounts on PVE | Done | `umount -l /mnt/truenas-src /mnt/truenas-ssd`. Not in fstab — won't recur. |
| Manage `/etc/exports` in git (add to `infra/scripts/` and deploy via PVE provisioning) | Done | `scripts/pve-nfs-exports` added with fsid=0 safety comments. Deploy: scp + exportfs -ra |
| Migrate all NFS PVs to NFSv4 | Done | Patched 52 PVs to `nfsvers=4`. Updated TF module + StorageClass. Applied all 20 stacks. |
| Add DNS zone sync CronJob | Done | `technitium-zone-sync` runs every 30min, replicates all primary zones to secondary/tertiary via AXFR |
### P1 — Eliminate the NFS single point of failure for critical services
| Action | Owner | Status |
|--------|-------|--------|
| Migrate Technitium primary to `proxmox-lvm-encrypted` | Done | Completed Apr 14 |
| Migrate Alertmanager PV from NFS to `proxmox-lvm-encrypted` | Done | `storage-prometheus-alertmanager-0` now on `proxmox-lvm-encrypted`. Helm `persistence.storageClass` set. |
| Migrate Vault PVCs from `nfs-proxmox` to `proxmox-lvm-encrypted` | TODO | Vault is too critical for NFS dependency |
| Review all 53 NFS PVs — identify which are critical-path and migrate | TODO | Reduce NFS blast radius |
### P2 — Detect NFS failures before users notice
| Action | Owner | Status |
|--------|-------|--------|
| Add PrometheusRule: NFS mount errors from node kernel logs | Done | `NFSHighRPCRetransmissions` alert: rate > 5/s for 5m → warning |
| Add PrometheusRule: Pods in ContainerCreating > 10 minutes | Done | `NFSMountFailures` + `NFSCSINodeDown` alerts added to Prometheus |
| Add Uptime Kuma monitor: TrueNAS ping (10.0.10.15) | Done | Already existed (id=66, PING, 60s). Verified active. |
| Add Uptime Kuma monitor: PVE NFS port 2049 TCP check | Done | Already existed (id=96 + id=328, PORT 2049, active). Verified. |
| Verify Grafana Uptime Kuma alert actually fires | Done | Monitor id=32 has notificationIDList=[1] (Slack). Root cause of 37h gap: Uptime Kuma itself on NFS. |
### P3 — Improve NFS resilience
| Action | Owner | Status |
|--------|-------|--------|
| Remove hung TrueNAS `hard` mounts from PVE fstab (TrueNAS is sunset) | TODO | Eliminates D-state kernel process risk |
| Add NFS export health check to daily-backup script | Done | `check_nfs_exports()` added to `scripts/daily-backup.sh`: checks fsid=0, nfs-server active, exportfs count |
| Document NFS CSI mount option requirements in CLAUDE.md | Done | Added NFS CSI section: nfsvers=4 mandatory, fsid=0 forbidden, critical services → proxmox-lvm-encrypted |
## Phase 2: NFS Restart Broke NFSv3 + DNS Zone Sync Gap
### What happened after the 10:40 fix
The NFS server restart at 10:40 (which fixed the fsid=0 issue) introduced two new problems:
#### Problem 1: NFSv3 completely broken after restart
After the restart, **ALL NFSv3 mount(2) system calls returned EIO** from k8s worker nodes, even though:
- NFS port 2049 was reachable from all nodes
- `showmount -e` listed correct exports
- NFSv4 mounts worked perfectly from all nodes
- NFSv3 mounts worked from k8s-master (which had no prior NFS mounts — clean kernel state)
**Root cause**: `nfsdctl: lockd configuration failure` on PVE kernel 6.14.11-4-pve — a bug where nfsdctl's `autostart` command tries to call a non-existent `lockd` subcommand. This warning was present since Apr 11 but NFSv3 only broke after the restart. Worker nodes retained corrupted NFS client kernel state from the stale mount period that could not be cleared without a reboot.
**Resolution**: Patched all 52 NFS PVs to add `nfsvers=4` mount option via `kubectl patch pv`. Updated Terraform `nfs_volume` module and `nfs-csi` StorageClass. Disabled NFSv3 on PVE (`vers3=n` in `/etc/nfs.conf`). Applied to all 20+ Terraform stacks.
#### Problem 2: DNS zone sync gap — .lan resolution failures
**Finding**: Technitium secondary and tertiary instances had **only 5 default zones** (localhost, arpa). Custom zones (`viktorbarzin.lan`, `viktorbarzin.me`, reverse lookup zones, etc.) only existed on the primary. This was a **pre-existing gap** — the zone setup was a one-time Kubernetes Job that ran at initial deployment and never synced new zones created afterward.
**Impact**: The MetalLB VIP (10.0.20.201) load-balances across all 3 instances. 2/3 of queries hit secondary/tertiary → NXDOMAIN for `.lan` → cached for 300s by CoreDNS → ExternalName services (e.g., `ha-sofia.viktorbarzin.lan`) returned 502 Bad Gateway.
**Why it surfaced now**: The NFS outage restarted the primary Technitium pod, flushing client DNS caches. This increased the visible failure rate for `.lan` queries.
**Resolution**:
1. Created `viktorbarzin.lan` and `viktorbarzin.me` as Secondary zones on both secondary and tertiary via Technitium API
2. **Converted one-time setup Job to a CronJob** (`technitium-zone-sync`) running every 30 minutes that:
- Gets all zones from primary
- Enables zone transfer (AXFR) on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones
3. 20 zones were synced to tertiary that were previously missing
### Phase 2 Timeline
| Time | Event |
|------|-------|
| **10:40** | NFS server restarted (fsid=0 fix). NFSv3 breaks. 48+ stale mounts across all workers. |
| **10:4110:50** | Vault, DNS primary come back. But NFSv3 mounts stay stale. |
| **11:0011:40** | Investigation: NFSv3 mount(2) returns EIO on workers, NFSv4 works. Patched 52 PVs to nfsvers=4. |
| **11:4012:00** | Restarted all NFS-dependent pods. Fixed MySQL (Vault rotation mismatch), Redis HAProxy, Woodpecker DB. |
| **12:0012:15** | Users report .lan resolution failures. Discovered secondary/tertiary missing all custom zones. |
| **12:15** | Created secondary zones on secondary/tertiary via API. .lan resolution restored. |
| **12:1512:20** | Converted one-time setup Job to zone-sync CronJob. Applied via Terraform. |
### Additional collateral damage fixed during Phase 2
| Issue | Root Cause | Fix |
|-------|-----------|-----|
| Grafana, realestate-crawler, shlink MySQL access denied | Vault rotated passwords during NFS outage, credentials mismatched | Force-rotated Vault DB roles, manually synced Grafana password |
| Uptime Kuma `mysql_native_password` plugin not loaded | MySQL 8.4 disabled plugin by default | Enabled via `mysql-native-password=ON` in mycnf, changed user auth plugin |
| Redis HAProxy all backends DOWN | Health check timeout during cluster turbulence | Restarted HAProxy pods |
| Woodpecker missing PostgreSQL database | DB init Job ran at deploy, DB dropped during cluster recreation | Manually created database |
| Nextcloud PVC deleted | `nextcloud-data-proxmox` was Terminating and got garbage collected | Rebound Released PV to new PVC |
| MySQL InnoDB cluster OFFLINE | rollout restart invalidated operator state, kopf finalizer blocked deletion | Removed finalizer, recreated cluster via `dba.createCluster()` |
## Lessons Learned
1. **NFSv4 `fsid=0` is dangerous for CSI subdirectory mounts**: It changes path resolution semantics in non-obvious ways. Never use it on exports that serve dynamic subdirectory mounts.
2. **Critical monitoring infrastructure must not depend on the thing it monitors**: Alertmanager on NFS cannot alert about NFS failures. This is the same anti-pattern as "DNS depends on DNS" or "monitoring depends on the monitored database".
3. **Stale NFS mounts have delayed-action failure modes**: The 3-day gap between the config change (Apr 11) and the outage (Apr 14) made root cause analysis much harder. Cached file handles mask configuration errors.
4. **`/etc/exports` is a single-point-of-configuration**: Unmanaged, unversioned, no review process. A single flag caused a cluster-wide outage.
5. **This is the SECOND DNS outage related to NFS migration** (first was Apr 6 — unbound PVC). Storage migrations for DNS infrastructure need extra scrutiny and pre-migration testing.
6. **DNS HA requires zone replication, not just pod replication**: Having 3 Technitium pods with a PDB is useless if only the primary has the zone data. A one-time setup Job is insufficient — zones created after initial deployment are never synced. This is now fixed with a recurring CronJob.
7. **NFSv3 client kernel state survives mount cleanup**: Force-unmounting all NFS mounts from a node does NOT clear the kernel's per-server NFS client state. The only reliable fix was switching to NFSv4 (different protocol path). NFSv3 is now disabled on the PVE server.
8. **`kubectl rollout restart statefulset` is dangerous for operator-managed StatefulSets**: The MySQL InnoDB operator lost track of its cluster state after the rollout restart changed the pod template. Recovery required manually removing kopf finalizers, recreating the InnoDB cluster, and re-bootstrapping the router.
## Follow-up Implementation
| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
| 2026-04-14 | Add `NFSHighRPCRetransmissions` PrometheusRule (`node_nfs_rpc_retransmissions_total` rate > 5/s for 5m) | P2 | Alert | [`35646841`](https://github.com/ViktorBarzin/infra/commit/35646841) | postmortem-todo-resolver |
| 2026-04-14 | Migrate Alertmanager PV from NFS to `proxmox-lvm-encrypted` (eliminates circular alerting dependency) | P2 | Alert | [`35646841`](https://github.com/ViktorBarzin/infra/commit/35646841) | postmortem-todo-resolver |
| 2026-04-14 | Add `/etc/exports` to git as `scripts/pve-nfs-exports` with fsid=0 safety documentation | P2 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
| 2026-04-14 | Add `check_nfs_exports()` to `scripts/daily-backup.sh` (detects fsid=0, nfs-server down, no active exports) | P3 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
| 2026-04-14 | Document NFS CSI mount requirements in `.claude/CLAUDE.md` (nfsvers=4 mandatory, fsid=0 forbidden) | P3 | Config | [`3da4812c`](https://github.com/ViktorBarzin/infra/commit/3da4812c) | postmortem-todo-resolver |
| 2026-04-14 | TrueNAS ICMP ping monitor (id=66) — already existed, verified active | P2 | Monitor | — | Verified existing |
| 2026-04-14 | PVE NFS port 2049 TCP monitor (id=96, 328) — already existed, verified active | P2 | Monitor | — | Verified existing |
| 2026-04-14 | Grafana Uptime Kuma monitor — alert wired to Slack (notificationIDList=[1]). Root cause of 37h gap: Uptime Kuma itself on NFS storage | P2 | Alert | — | Verified existing |
| — | Migrate Vault PVCs from `nfs-proxmox` to `proxmox-lvm-encrypted` | P2 | Migration | — | Needs human review |
| — | Review all 53 NFS PVs — identify critical-path and migrate | P2 | Migration | — | Needs human review |
| — | Remove hung TrueNAS `hard` mounts from PVE fstab (TrueNAS is sunset) | P2 | Migration | — | Needs human review |

View file

@ -0,0 +1,36 @@
# Post-Mortem: Pipeline E2E Test
| Field | Value |
|-------|-------|
| **Date** | 2026-04-14 |
| **Duration** | N/A |
| **Severity** | SEV3 |
| **Affected Services** | None (test) |
| **Status** | Draft |
## Summary
Test post-mortem to validate the automated TODO implementation pipeline end-to-end.
## Prevention Plan
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Add Uptime Kuma monitor for Proxmox web UI port 8006 | Monitor | TCP check on 192.168.1.127:8006 to detect PVE management plane down | TODO |
| P2 | Migrate Alertmanager to encrypted storage | Architecture | Move from NFS to proxmox-lvm-encrypted to avoid circular alerting dependency | TODO |
## Lessons Learned
1. Automated post-mortem pipelines reduce mean time to remediation.
## Follow-up Implementation
_This section is auto-populated by the postmortem-todo-resolver agent._
| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
# E2E test 17:12
# E2E validation 17:27:45
# Final E2E test Tue Apr 14 05:43:38 PM UTC 2026
# 1776188690

View file

@ -0,0 +1,150 @@
# Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
| Field | Value |
|-------|-------|
| **Date** | 2026-04-18 |
| **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
| **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
| **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
| **Status** | Root cause fixed; permanent mitigation applied; alerting still TODO |
## Summary
The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
## Impact
- **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
- **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
- **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
- **Data loss**: None. Session state in tmpfs is ephemeral by design.
- **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
| **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
| **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
| **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
| **Apr 18 09:0012:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
| **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
| **Apr 18 12:34** | Viktor reports he can no longer log in either. |
| **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
| **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
| **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
| **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
| **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
| **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
## Root Cause Chain
```
[1] goauthentik/proxy outpost uses gorilla/sessions FileStore
└─> each forward-auth request that has no valid session cookie writes
/dev/shm/session_<random> (~1500 bytes/file)
├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
│ └─> each file's MaxAge = 7 days
│ └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
│ delete files whose MaxAge has EXPIRED, not whose age exceeds any
│ shorter threshold
├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
│ real user traffic)
│ └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
└─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
└─> 64 MB / 1500 bytes ≈ 44,000 files maximum
└─> Full in approx 44,000 / (18 × 60) min ≈ 41 hours
└─> Actual observed time: pod started Apr 15 ~09:21,
first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
(some excess from Uptime-Kuma bursts)
[ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
-> Traefik forwards the 400 to the browser
-> user sees "400 Bad Request" on every protected site
```
## Why Diagnosis Took So Long
The initial report was framed as "Emil can't access terminal" — a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, all of which turned out to be falsified.
Contributing distractions:
1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
## Contributing Factors
1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
| No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
| Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
| Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
## Prevention Plan
### P0 — Prevent this exact failure
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
### P1 — Detect this next time
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | Prometheus alerts on outpost `/dev/shm` fill (two thresholds) | Alert | Group `Authentik Outpost` added in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`. `AuthentikOutpostMemoryHigh` (warning, working set > 1.5 GiB for 15m) + `AuthentikOutpostMemoryCritical` (critical, > 1.8 GiB for 5m) + `AuthentikOutpostRestarts` (warning, > 2 restarts in 30m). Applied 2026-04-18 13:16 UTC; loaded in Prometheus, state=inactive. | **DONE** |
| P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
| P1 | Bump tmpfs `sizeLimit` from 512Mi → 2Gi + set explicit container memory limit 2560Mi | Config | Patched outpost `kubernetes_json_patches` via Authentik API. 2026-04-18 13:06 UTC (sizeLimit), 13:22 UTC (container limit). **Gotcha**: `sizeLimit` alone is insufficient — writes to tmpfs count against container cgroup memory, and Kyverno's `tier-defaults` LimitRange sets a default `limits.memory: 256Mi` which OOM-kills the container before tmpfs fills. Fix is to also set `containers[0].resources.limits.memory``sizeLimit + working_set_headroom`. Verified 1.5 GB file write succeeds on the configured pod; df reports 2.0 GB tmpfs. Gives ~8× growth headroom at current probe rate. | **DONE** |
### P2 — Codify the fix so it survives drift
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. **Done 2026-05-10**: `authentik_outpost.embedded` resource + `authentik_provider_proxy.catchall.access_token_validity` codified, plan-to-zero on the whole stack. The `Outpost.managed` field is server-set (not in provider schema) and preserved across applies because TF only writes known fields. Same-day work also flipped the outpost's session backend from filesystem (`/dev/shm`) to PostgreSQL — see `.claude/reference/authentik-state.md`. | **DONE** |
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
### P3 — Upstream + architectural
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Original idea: shrink steady-state session file count (~7× reduction) at the cost of daily re-auth. **Resolved differently 2026-05-10**: switched the outpost to the PostgreSQL session backend (`Outpost.managed = goauthentik.io/outposts/embedded` + `AUTHENTIK_POSTGRESQL__*` envFrom), which makes session count irrelevant for tmpfs sizing and lets us BUMP `access_token_validity` to `weeks=4` for better UX without cost. | **DONE (alt)** |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | Original framing: external, multi-replica outpost with Redis-backed sessions. **Resolved 2026-05-10** by enabling the postgres-backed session store on the embedded outpost itself (PR goauthentik/authentik#16628). Sessions now persist across pod restarts; the original "in-memory state" concern is moot. Multi-replica still requires a goauthentik upstream fix (PgBouncer-friendly session migration), but the loss-of-state class of failures is gone. | **DONE (alt)** |
## Lessons Learned
1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context**`curl` with no cookie jar is the fastest way.
4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
## References
- Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
- Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
- Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
- Beads task: `code-zru` (P1 bug).

View file

@ -0,0 +1,246 @@
# Post-Mortem: Private Registry Orphan OCI-Index — Repeat Incident
| Field | Value |
|-------|-------|
| **Date** | 2026-04-19 (first occurrence 2026-04-13) |
| **Duration** | ~40 min of blocked CI each time; only detected via pipeline failures |
| **Severity** | SEV2 — all infra CI pipelines using `infra-ci:latest` failed (P366 → P376 all exit 126 "image can't be pulled") |
| **Affected Services** | Every Woodpecker pipeline that starts with `image: registry.viktorbarzin.me:5050/infra-ci:latest``default.yml`, `build-cli.yml`, `renew-tls.yml`, `drift-detection.yml`, `provision-user.yml`, `k8s-portal.yml`, `postmortem-todos.yml`, `issue-automation.yml`, `pve-nfs-exports-sync.yml` |
| **Status** | Hot fix green (three commits: `a05d63ee`, `6371e75e`, `c113be4d` — URL fix + rebuild). This doc captures the permanent fix landed in the same branch. |
## Summary
On 2026-04-19 ~09:00 UTC, every infra CI pipeline started failing at the
`clone` step with "image can't be pulled". The image in question — the CI
toolchain image `registry.viktorbarzin.me:5050/infra-ci:latest` — resolved
to an OCI image index whose `linux/amd64` platform manifest
(`sha256:98f718c8…`) and its in-toto attestation
(`sha256:27d5ab83…`) returned **HTTP 404** from the private registry.
The index record itself still existed — it's the children that had been
garbage-collected out from under it.
This is the **second identical incident**: the same failure mode occurred
on 2026-04-13 against a different image. Both times the immediate fix was
to rebuild the image from scratch; both times the root cause was left
unaddressed.
## Impact
- **User-facing**: all CI pipelines failed. No automated Terraform applies,
no TLS renewal, no drift detection. Manual workflows (Woodpecker UI
reruns) all failed with the same error.
- **Blast radius**: every pipeline that pulls `infra-ci`. Does NOT affect
k8s workloads (those pull via containerd, which goes through the
pull-through proxy on :5000/:5010 — a completely different code path).
- **Duration on 2026-04-19**: from first P366 failure to the hot-fix
commit `c113be4d` — roughly 40 min. Pipelines that had already been
triggered queued up until the rebuild restored `:latest`.
- **Data loss**: none. The registry has the index object; the child
manifests are re-producible by rebuilding the source image.
- **Monitoring gap**: nothing alerted. The only signal was the individual
pipeline failures from Woodpecker. No Prometheus alert fires on "the
registry served a 404 for a tag that exists".
## Timeline (UTC, 2026-04-19)
| Time | Event |
|------|-------|
| ~09:00 | P366 (`default.yml` on master) fails with exit 126. |
| 09:0011:00 | P367, P368, … P376 all fail with the same error. Nobody pages — there's no alert configured. |
| 11:15 | User notices and investigates: `skopeo inspect` reveals the missing platform manifest. |
| 11:20 | Hot fix phase begins: `a05d63ee` fixes a push-URL misalignment, `6371e75e` and `c113be4d` trigger a full rebuild. |
| 11:40 | Rebuild completes; `infra-ci:latest` resolves to a fresh, complete index. Pipelines green from P377 onward. |
| 11:45 | User requests a proper root-cause fix: "this is the second time — what's actually broken?" |
| 12:00 | Investigation begins (this document's work). |
## Root Cause Chain
```
[1] cleanup-tags.sh runs daily at 02:00 on the registry VM
└─> For each repository, keeps the last 10 tags by mtime, rmtrees the rest.
This walks `_manifests/tags/<tag>` directly, bypassing the registry API.
├─> [2] Subtle on-disk asymmetry: a registry:2 tag rmtree removes
│ BOTH the `_manifests/tags/<tag>/` dir AND — on 2.8.x — the
│ per-repo revision-link files under
`<repo>/_manifests/revisions/sha256/<child-digest>/link` for
│ every child referenced by that tag's index. The raw blob data
│ under `/var/lib/registry/docker/registry/v2/blobs/sha256/<.>/data`
│ is NOT touched — GC owns that, and GC only runs Sunday.
├─> [3] If ANOTHER tag's index still references one of those same
│ children (common — successive rebuilds share layers), the child
│ blob survives. But the revision-link is gone, so the registry
│ API can no longer map `<repo>/manifests/sha256:<child>` back
│ to the blob. HEAD → 404, even though the bytes are on disk.
│ distribution/distribution#3324 is the upstream class of this bug.
└─> [4] Result: the surviving index (e.g. `infra-ci:5319f03e`) is
intact on disk, its children's blob data files are intact on
disk, but HEAD `/v2/infra-ci/manifests/sha256:98f718c8…`
returns 404. The registry has the bytes, but cannot find them
through the API because the per-repo link bridge is gone.
[pull] containerd resolves `infra-ci:latest`
├─> GET /v2/infra-ci/manifests/latest → 200 OK, returns the index
└─> GET /v2/infra-ci/manifests/sha256:98f718c8… → 404 Not Found
└─> containerd fails the pull with "manifest unknown"
└─> woodpecker exit 126
```
> **Detection-gotcha** uncovered 2026-04-19 while implementing
> `fix-broken-blobs.sh`: a scan that checks `/blobs/sha256/<child>/data` for
> presence is NOT equivalent to "can the registry serve this child?" The
> authoritative check is whether
> `<repo>/_manifests/revisions/sha256/<child>/link` exists. The script
> was rewritten to check the per-repo link file after the HTTP probe
> caught 38 real orphans the filesystem scan had reported clean.
## Why Existing Remediation Missed It
1. **`fix-broken-blobs.sh` only scans layer links.** The existing cron
walks `_layers/sha256/` and removes link files whose blob `data` is
missing. It does NOT inspect `_manifests/revisions/sha256/` to see
whether an image-index's referenced children still exist. That's
exactly the class of orphan this incident represents.
2. **`registry:2` image tag was floating.** `docker-compose.yml` pinned
only to `registry:2`. Whatever Docker Inc. last rebuilt as
"v2-current" was running, with no version pin. Any regression in
the upstream walker would silently swap in.
3. **No integrity monitoring.** Prometheus alerted on cache hit rate
and registry-down, but nothing probes "are the manifests the registry
advertises actually fetchable?"
4. **CI pipeline didn't verify its own push.** `buildx --push` returns
success as soon as it uploads. If a child blob upload 0-byted or
the client disconnected mid-push (distinct from the GC mode but the
same on-disk symptom), nothing would notice until the next pull.
## Permanent Fix — Three Phases
### Phase 1 — Detection (ship today)
1. **Post-push integrity check** in `.woodpecker/build-ci-image.yml`.
After `build-and-push`, a new step walks the just-pushed manifest
(and every child of an image index) and HEADs every referenced blob.
Any non-200 fails the pipeline immediately, catching broken pushes at
the source rather than leaking them to consumers.
2. **Prometheus alert `RegistryManifestIntegrityFailure`.** A new
CronJob (`registry-integrity-probe`, every 15m, in the `monitoring`
namespace) walks the private registry's catalog, HEADs every tag's
manifest, follows each image index's children, and pushes
`registry_manifest_integrity_failures` to Pushgateway. Accompanying
alerts: `RegistryIntegrityProbeStale`, `RegistryCatalogInaccessible`.
3. **Post-mortem** — this document. Linked from
`.claude/reference/service-catalog.md` via the new runbook.
### Phase 2 — Prevention
4. **Pin `registry:2` → `registry:2.8.3`** in
`modules/docker-registry/docker-compose.yml` (all six registry
services). Removes the floating-tag footgun.
5. **Extend `fix-broken-blobs.sh`** to scan every
`_manifests/revisions/sha256/<digest>` that is an image index and
flag children whose blob `data` file is missing. The script prints a
loud WARNING per orphan; it does not auto-delete the index, because
deleting a published image is a conscious decision, not an automated
repair.
### Phase 3 — Recovery tooling
6. **Manual event trigger** on `build-ci-image.yml`. Rebuilds no longer
need a cosmetic Dockerfile edit — POST to the Woodpecker API or
click "Run manually" in the UI.
7. **Runbook** `docs/runbooks/registry-rebuild-image.md` — exact
command sequence for the next time this happens, plus fallback paths.
## Out of Scope
- **Pull-through caches.** The DockerHub / GHCR mirrors on
`:5000` / `:5010` are healthy (74.5% cache hit rate, no 404s). The
orphan problem is private-registry-only. No changes to nginx or
containerd `hosts.toml`.
- **Registry HA / replication.** Single-VM SPOF is a known
architectural choice. Harbor or a replicated registry would solve
more than this incident requires, at multi-day cost. Synology offsite
snapshots already give RPO < 1 day.
- **Disabling `cleanup-tags.sh`.** Keeping storage bounded is still
necessary; the fix is detection + rebuild, not "stop cleaning up".
## Lessons
- **Repeat incidents deserve root-cause work, not a third hot-fix.** The
2026-04-13 incident was closed when CI turned green. Without a probe
and without a scan for orphan indexes, the next incident was
inevitable — and it happened six days later against a different image.
- **"No alert fired, so it wasn't detected" is a monitoring gap, not an
outage feature.** The registry was serving 404s for 2+ hours before
anyone noticed, because our only signal was "pipeline failures" and
our eyes were elsewhere. The new probe closes that gap.
- **CI pipelines should verify their own output.** The `buildx --push`
"success" exit code is not a guarantee of pulled-back integrity — as
this incident proves. A 30-second post-push HEAD walk is cheap
insurance.
## Related
- **Prior incident (same failure mode, different image)**: memory `709`
/ `710` — 2026-04-13.
- **Runbook**: `docs/runbooks/registry-rebuild-image.md` (new).
- **Hot-fix commits**: `a05d63ee`, `6371e75e`, `c113be4d`.
- **Upstream bug class**: `distribution/distribution#3324`.
## 2026-04-19 — Bulk cleanup sweep (beads code-8hk + code-jh3c)
Same failure class, broader scope. The `registry-integrity-probe`
surfaced 38 broken manifest references persisting after the 04-19
infra-ci fix. `beads-dispatcher` + `beads-reaper` CronJobs were stuck
`ImagePullBackOff` on `claude-agent-service:0c24c9b6` for >6h. All 34
affected `repo:tag` pairs were OCI indexes whose `linux/amd64` child
manifests were absent from blob storage (same orphan pattern).
**Action taken**:
1. Bumped `beads-server/main.tf` var default `claude_agent_service_image_tag`
from `0c24c9b6``2fd7670d` (the canonical tag in
`claude-agent-service/main.tf`), reused — same image already healthy
on the registry. `scripts/tg apply` on `beads-server`. Deleted the
stuck Jobs so new CronJob ticks could fire.
2. Enumerated 34 broken `(repo, tag, parent_digest)` triples via HTTP
probe using `registry-probe-credentials` K8s Secret. Deleted each
via `DELETE /v2/<repo>/manifests/<digest>` (33× 202, 1× 404 —
claude-agent-service:latest pointed at an already-deleted digest).
3. Ran `docker exec registry-private /bin/registry garbage-collect
/etc/docker/registry/config.yml` — reclaimed ~3GB of orphan blob
storage.
4. Rebuilt the 3 in-use broken tags (all 3 OCI-index parents pointed
at missing children, so no cached copies would survive pod
reschedule):
- `freedify:latest` / `freedify:c803de02` — built on registry VM
directly (no CI pipeline exists for this image; python FastAPI).
- `beadboard:17a38e43` / `beadboard:latest` — GHA
`workflow_dispatch` failed at registry login (missing
`REGISTRY_USERNAME`/`REGISTRY_PASSWORD` GH secrets). Built on
registry VM directly as the fallback. GitHub secret gap is a
follow-up — beads `code-8hk` notes it.
- `priority-pass-backend:ae1420a0` / `priority-pass-frontend:ae1420a0`
— Woodpecker pipeline #8 on repo 81. Pipeline `kubectl set image`'d
the Deployment to `ae1420a0` (drift vs TF `v5`/`v8` defaults, but
that drift is pre-existing, not introduced by this cleanup).
- `wealthfolio-sync:latest`**not rebuilt**. Monthly CronJob (next
run 2026-05-01), no source tree or CI pipeline available in the
monorepo; deferred for separate follow-up.
**Post-cleanup state**:
- Probe: 39 tags, 0 failures. `registry_manifest_integrity_failures{} = 0`.
- Alert `RegistryManifestIntegrityFailure` cleared (was firing for
5h 32m).
- No `ImagePullBackOff` pods anywhere in the cluster.
- 28 of 34 deleted manifests were **dangling tags not referenced by any
workload** — old `382d6b1*`, `v2`-`v7`, `yt-fallback`, etc. Safe
deletes, no rebuilds needed.
**Permanent fix still in flight**: Phase 2/3 of this post-mortem
(post-push verification in CI, atomic `cleanup-tags.sh`) — not
addressed by this cleanup. The probe continues to be the
authoritative detector.

View file

@ -0,0 +1,155 @@
# Post-Mortem: Vault Raft Leader Deadlock + NFS Kernel Client Corruption Cascade
> **Resolution status (2026-04-25):** Resolved structurally by code-gy7h
> migration. All 3 vault voters now on `proxmox-lvm-encrypted` block
> storage; the NFS fsync incompatibility that triggered the original
> raft hang is no longer reachable. See
> `docs/plans/2026-04-25-nfs-hostile-migration-plan.md` Phase 2.
| Field | Value |
|-------|-------|
| **Date** | 2026-04-22 |
| **Duration** | External endpoint 503 from ~09:00 UTC to ~11:43 UTC (~2h 43m). vault-2 became active leader 11:43:28 UTC. |
| **Severity** | SEV1 (Vault — single source of secrets for 40+ services) |
| **Affected Services** | All ESO-backed services (password rotation paused). CronJobs that read plan-time secrets (14 stacks). Woodpecker CI (blocked pipeline `d39770b3`). Everything with `ExternalSecret` refresh interval ≤ 2h. |
| **Status** | Vault HA operational with vault-0 + vault-2 quorum. vault-1 still stuck ContainerCreating on node2 (third node2 reboot pending; workload can accept 2/3 quorum). Terraform fix committed as `2f1f9107`; apply pending. |
## Summary
A Vault raft leader (`vault-2`) entered a stuck goroutine state where its cluster port (8201) accepted TCP but never completed msgpack RPC. Standbys could not detect leader death because the TCP layer looked healthy, so no re-election fired. The only recovery was to kill the leader. During recovery, abrupt `kubectl delete --force` of the stuck Vault pods left kernel-side NFS client state on k8s-node1/node3/node4 in a corrupted state — **all new NFS mounts from those nodes timed out at 110s**, while existing mounts kept working. This created a cascade: the stuck leader blocked quorum, killing the leader broke NFS on the destination node for the recreated pod, force-killing the stuck pods left zombie `containerd-shim` processes kubelet couldn't clean up, and the resulting volume-manager loops pegged kubelet into 2-minute timeouts. Recovery required a VM hard-reset for node2 and node3 (kubelet was zombie on both). vault-0 remains down pending node4 reboot.
## Impact
- **User-facing**: `vault.viktorbarzin.me` returned HTTP 503 for ~2h. Any service that needed a Vault token during that window was degraded; Woodpecker CI pipeline blocked.
- **Blast radius**: 3/3 Vault pods affected (raft deadlock blocked re-election even with standbys up). Three k8s nodes degraded simultaneously with kernel NFS client stuck state (node1, node3, node4). Two nodes required VM hard-reset to recover kubelet (node2, node3).
- **Duration**: Degraded ~2h; resolution required sequential hard reboots.
- **Data loss**: None. Raft data integrity preserved on NFS. vault-1 came up with index 2475732, caught up to 2476009+ once leader was elected.
- **Observability gap**: No alert fired for the stuck raft leader. Standbys report `HA Mode: standby, Active Node Address: <leader IP>` as if healthy even when leader is hung.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **~09:00** | `vault-2` (original raft leader) enters hung state — port 8201 open but msgpack RPCs hang. Its own logs go silent. Standbys continue heartbeat/appendEntries with `msgpack decode error [pos 0]: i/o timeout`. Neither standby triggers re-election because raft transport does not distinguish "TCP open + silent" from "TCP open + healthy". |
| **~09:15** | External endpoint starts serving 503. Woodpecker CI pipeline `d39770b3` blocks waiting for Vault. |
| **09:59** | Operator force-deletes `vault-2` pod — replacement comes up on node3 and enters candidate loop (term=32), cannot get quorum because DNS for `vault-0` is NXDOMAIN (ContainerCreating) and vault-1 does not respond (its raft goroutine also hung). |
| **10:07** | Operator force-deletes `vault-1` — new `vault-1` gets scheduled to node2. Its raft would be fine, but kubelet on node2 hangs in the pod cleanup path for the old pod's NFS mount. Concurrently, a new `vault-0` pod is attempted on node4, but **NFS mount from node4 times out at 110s** — the host kernel NFS client is in a degraded state that blocks all new mounts (including to completely different NFS paths like `/srv/nfs/ytdlp`). |
| **10:09** | Diagnostic test: from node1 and node4 CSI pods, `mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs/ytdlp /tmp/test` times out. From node2 and node3 the same mount succeeds. NFS server is healthy (`showmount -e` works; `rpcinfo` shows all programs registered). The common factor on the broken nodes: they had a force-terminated Vault pod earlier in the session, leaving stuck `mount.nfs` processes in D-state. |
| **10:18** | Manual unmount of stale NFS mount from the force-deleted old vault-0 pod on node4. New mount attempts from CSI still time out — clearing the old mount did not recover kernel NFS client state. |
| **10:22** | Workaround discovered: mounting with `nfsvers=4.0` or `nfsvers=4.1` (instead of default `nfsvers=4` which negotiates to 4.2) succeeds on broken nodes. Confirms the stuck state is version-specific (NFSv4.2 session state), not a general NFS issue. Decision: rather than change CSI mount options cluster-wide (risk of remounting existing 48+ PVs), fix the nodes directly. |
| **10:31** | Investigated node2 kubelet state: old `vault-1` container shows `vault` process in **Z (zombie)** state with its `sh` wrapper stuck in `do_wait` in kernel (`zap_pid_ns_processes`). Containerd-shim PID killed manually — `sh` and zombie reparented to init but remained stuck (uninterruptible kernel wait tied to NFS). |
| **10:34** | Attempted `systemctl restart kubelet` on node2 — kubelet itself went into Z (zombie) with 2 tasks still attached. Classic NFS-related kernel deadlock. |
| **10:42** | **Decision: hard-reset node2 VM** (`qm reset 202`). Disruption: 22 pods evicted. |
| **10:43** | node2 back up (Ready). CSI registered. New `vault-1` scheduled to node2. NFS mount succeeded (fresh kernel state). Kubelet began chowning volume — **extremely slow, ~3 files per minute over NFS**. |
| **10:48** | `vault-1` (2/2 Running) unsealed. **Raft leader elected: `vault-2` wins term 32, election tally=2** (vault-1 voted yes once it came up, vault-0 unreachable). However vault-2's vault-layer (HA active/standby) never transitioned to active — raft leader with `active_time: 0001-01-01T00:00:00Z` and `/sys/ha-status` returning 500. |
| **10:50** | Restarted `vault-2` pod to force clean leader transition. New `vault-2` stuck in chown loop on node3 (same pattern as node2 earlier). |
| **10:54** | Patched the Vault `StatefulSet` with `fsGroupChangePolicy: OnRootMismatch` so subsequent recreations skip the recursive chown. |
| **10:57** | Force-deleted `vault-2` and `06fa940b` pod directory on node3. New pod spawned but kubelet again stuck on phantom state from the old pod. |
| **11:01** | **Hard-reset node3 VM** (`qm reset 203`). |
| **11:03** | First 200 response: vault-1 elected leader, vault-2 standby. Premature celebration — vault-1's audit log on node2 NFS starts timing out; `/sys/ha-status` returns 500 even though raft thinks vault-1 is active. |
| **~11:18** | Service regresses. `vault-1` audit writes hanging (`event not processed by enough 'sink' nodes, context deadline exceeded`). Readiness probe fails; pod goes 1/2; `vault-active` endpoint stays pointed at vault-1's IP but backend unresponsive → 503. |
| **11:22** | Force-restart `vault-1` to trigger re-election with new pod. Delete + containerd-shim cleanup leaves yet another zombie on node2. Same pattern: force-delete → zombie. |
| **11:29** | **Hard-reset node4 VM** (`qm reset 204`). Rationale: vault-0 was still blocked there; 74 pods on node4 contribute to NFS server load (load avg 16 on PVE). After reboot, vault-0 mounts its PVCs on fresh kernel state and comes up 2/2 Running 11:31. |
| **11:31** | Increased PVE NFS threads from 16 to 64 (`echo 64 > /proc/fs/nfsd/threads`). Did not help immediate mount failures — the stuck state is per-client kernel, not server capacity. |
| **11:38** | Discover DNS resolution issue: vault-2's Go resolver returns NXDOMAIN for short names `vault-0.vault-internal` even though glibc resolver works. CoreDNS restart issued earlier didn't fix. Restart vault-2 pod to force fresh resolver state. |
| **11:42** | **Second hard-reset of node3 VM** (`qm reset 203`). Kubelet+CSI re-register; vault-2 scheduled, NFS mounts finally succeed on fresh kernel state. |
| **11:43:28** | **vault-2 becomes active leader.** External endpoint returns 200 and stays there. vault-0 follower, catches up to index 2477632+. vault-1 still stuck on node2; left for later recovery. |
## Root Cause Chain
```
[1] Vault-2 raft goroutine hang (root cause — upstream Vault bug or infra-induced)
└─> Cluster port 8201 accepts TCP but never responds to msgpack RPCs
└─> Standbys' appendEntries calls return `msgpack decode error [pos 0]: i/o timeout`
└─> Raft protocol: no re-election because leader is heartbeating at the TCP level
└─> External endpoint returns 503 because HA layer has no active leader
[2] Recovery complication — abrupt pod termination
└─> `kubectl delete --force --grace-period=0` on vault-0/1/2
└─> containerd-shim fails to kill container cleanly (NFS I/O in D-state)
└─> vault process ends as zombie; sh wrapper stuck in do_wait
└─> Kubelet retries forever, cannot tear down old pod volumes
└─> NFS-CSI unmount requests succeed at the NFS layer but kubelet's
volume state-machine never marks the volume as unmounted
(stale 0000-mode mount directory blocks teardown completion)
[3] Kernel NFS client corruption on node1/node4
└─> Force-terminated Vault pod left stuck `mount.nfs` processes in D-state
└─> Kernel NFS4.2 client session state corrupted (held open mount slot)
└─> All subsequent mount syscalls for nfsvers=4 block 110s+ waiting for
session slot that will never be freed
└─> Manual workaround: nfsvers=4.1 bypasses the corrupted session state
[4] Kubelet starvation
└─> Combination of (2) and (3) means kubelet is stuck in a 2-minute volume-setup
context deadline loop — each iteration times out, new iteration restarts,
infinite loop
└─> Hard VM reset is the only exit
└─> After reset, kubelet starts clean, CSI re-registers, mounts succeed
[5] Slow recursive chown amplifies impact
└─> Default fsGroupChangePolicy: Always (Vault Helm chart 0.29.1 default)
└─> Kubelet walks every file on NFS setting gid=1000
└─> Over a 1GB audit log and a 47MB raft.db on NFS with timeo=30,retrans=3,
each chown syscall takes seconds; kubelet 2-minute deadline runs out
before the walk finishes
└─> Loop never exits even when ownership is already correct
```
## Why This Failed
1. **Raft transport does not detect stuck leaders.** If TCP is open and the process is alive enough to hold the port, standbys assume the leader is healthy. A stuck goroutine that never responds to RPCs appears to raft as "leader with high RTT" and does not trigger re-election. This is an upstream Vault bug (or at least a missing liveness check).
2. **Abrupt pod termination + NFS = kernel-level zombie.** When a Vault pod holding an NFS mount is force-killed before it cleanly closes file handles, the kernel's NFS4.2 client session state enters a corrupted state. This blocks all new mounts from that node — not just to the same NFS path, but to ANY NFS path on the same server. The fix is a kernel reboot; there is no userspace recovery.
3. **Vault data on NFS violates the documented rule.** `infra/.claude/CLAUDE.md` explicitly states: *"Critical services MUST NOT use NFS storage — circular dependency risk."* Vault currently uses `nfs-proxmox` for both `dataStorage` and `auditStorage`. If Vault had been on `proxmox-lvm-encrypted`, none of the NFS corruption cascade would have happened.
4. **fsGroupChangePolicy: Always is the Helm default.** Every pod restart walks every file over NFS. On a 1GB audit log with degraded NFS RTT, this takes longer than kubelet's internal 2-minute deadline, causing infinite restart loops. `OnRootMismatch` makes chown a no-op when the root is already correct (which it always is after first setup).
5. **No alert for this failure mode.** Prometheus alerts exist for `VaultSealed`, `VaultDown` (`up` metric), and backup staleness, but none for "raft leader has been running without advancing commit index" or "standby reports leader but leader's `/sys/ha-status` returns 500".
## Remediation (Applied)
- [x] Hard-reset node2 and node3 VMs to clear kernel NFS state and kubelet zombies.
- [x] Manually patched live `StatefulSet vault/vault` with `fsGroupChangePolicy: OnRootMismatch` to stop the chown loop.
- [x] Lazy-unmounted stale NFS mounts from force-deleted pod directories on node2 and node3.
- [x] Removed stale kubelet pod directories (`/var/lib/kubelet/pods/<UID>`) that had 0000-mode mount subdirectories blocking teardown.
- [x] Updated `stacks/vault/main.tf` with the `fsGroupChangePolicy` setting so the next `scripts/tg apply vault` makes it durable.
## Remediation (Pending)
- [ ] **Hard-reset node4** to recover vault-0 (same NFS kernel corruption pattern).
- [ ] **Run `scripts/tg apply` on the vault stack** to persist the fsGroupChangePolicy change.
- [ ] **Add Prometheus alert `VaultRaftLeaderStuck`** — fire when `vault_raft_last_index_gauge` (or derivation from `vault_runtime_total_gc_runs`) stops advancing for >2 minutes while `vault_core_active` is 1.
- [ ] **Add Prometheus alert `VaultHAStatusUnavailable`** — fire when `vault_core_active{}` reports 0 across all pods but `up{job="vault"}` reports 1 (HA layer broken but pods alive).
- [ ] **Migrate Vault to `proxmox-lvm-encrypted` block storage** — eliminates the entire NFS failure class. This follows the rule already documented in `infra/.claude/CLAUDE.md`. Tracked as beads task (open after Dolt is back up; currently down on node4).
- [ ] **Consider raising kubelet volume-manager deadline** for large-volume chown scenarios, or document the `fsGroupChangePolicy: OnRootMismatch` requirement for all NFS-backed StatefulSets.
- [ ] **Runbook**: `docs/runbooks/vault-raft-leader-deadlock.md` — how to detect stuck leader, safe force-restart procedure that avoids zombie pods, NFS kernel state recovery.
## Contributing Factors
1. **NFS mount options use bare `nfsvers=4`**. This negotiates to the highest version the server supports (NFSv4.2). When 4.2 session state corrupts, mounts fail; 4.1 works. Pinning to `nfsvers=4.1` in the `nfs-proxmox` StorageClass would make the failure mode recoverable without node reboot, but would also require recreating 48+ existing PVs (volumeAttributes are immutable). Deferred.
2. **`kubectl delete --force` is the default for stuck pods**. Operators reach for force-delete when a pod won't terminate, but this leaves containerd in an inconsistent state when the underlying storage is hung. Better approach: identify the stuck process (typically `mount.nfs` or a kernel NFS callback) and fix the root cause before force-deleting.
3. **Beads / Dolt server was on node4**, so beads task tracking went offline during this incident and couldn't be used to log progress cross-session.
4. **node1 was cordoned mid-incident** to prevent rescheduling to a node with confirmed NFS issues, but this reduced the scheduling surface for anti-affinity-sensitive StatefulSets.
## Learnings
1. **NFS for stateful critical services is structurally unsafe.** When NFS breaks, the recovery involves killing pods → which can break NFS further → until a reboot. The rule exists for a reason; Vault should never have been on NFS.
2. **Raft liveness needs application-layer probing, not TCP.** Every time we've seen a "stuck leader" issue in the homelab, TCP was fine and the app was unresponsive. A lightweight RPC probe with a short timeout and Prometheus alert would catch this in minutes instead of hours.
3. **kubelet volume-manager is fragile against stuck NFS.** Once kubelet enters a chown loop with a context deadline shorter than the chown duration, it cannot make progress — even when the filesystem is otherwise healthy. `OnRootMismatch` is effectively mandatory for any pod with `fsGroup` and a volume >100MB.
4. **VM hard-reset is cheap but disruptive.** The two reboots took ~60 seconds each but evicted 22+44 = 66 pods. Doing this twice in one session is a lot of churn. A post-mortem-driven improvement: pre-prepare "hot-standby" capacity so we can cordon+drain instead of hard-reset when kubelet zombies appear.
5. **Documentation of this rule is worth more than the rule itself.** The CLAUDE.md already says "critical services must not use NFS". The vault stack violates it. The rule without enforcement (validation, linting, CI) is ignored during the rush to ship.
## References
- Related: `docs/post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md` — previous Vault+NFS incident (different root cause, similar blast pattern).
- Vault helm chart 0.29.1 default `fsGroupChangePolicy` is unset (behaves as `Always`).
- Upstream Vault HA layer: raft leader → vault-active transition is in `vault/external_tests/raft`. Stuck goroutine pattern not documented as a known issue.

View file

@ -0,0 +1,56 @@
# Post-Mortem: IO Pressure Stalls from Stale NFS Client to Decommissioned TrueNAS
| Field | Value |
|-------|-------|
| **Date** | 2026-05-09 (issue first observable in journal at 2026-05-08 00:00:04) |
| **Duration** | Intermittent IO PSI stalls and kubectl TLS handshake timeouts during the session; PVE host loadavg ~15 sustained. No user-visible outage. |
| **Severity** | SEV3 (degraded host I/O, no service down) |
| **Affected Components** | PVE host (192.168.1.127), `node_exporter` (PID 1479, D-state), kernel NFS kthread `[10.0.10.15-manager]`, k8s-node3 (downstream IO PSI). |
| **Status** | Resolved structurally. Stale connection source removed; recurring trigger eliminated. Wedged kthread persists in kernel queue — clears on next PVE reboot. |
## Summary
The PVE host's NFS client was retaining a wedged connection to `10.0.10.15` — the IP of the TrueNAS VM that was operationally decommissioned 2026-04-13 (storage migrated to `192.168.1.127:/srv/nfs`). The connection was created by `/usr/local/bin/weekly-backup`, a legacy script left over from before the NFS migration that had never been removed. Its kernel kthread `[10.0.10.15-manager]` parked itself in `rpc_wait_bit_killable` and stayed there. Any process that touched `/proc/mountstats` — including `node_exporter` — got dragged into D-state alongside it, which in turn fed back into IO pressure metrics. cluster-health surfaced this as `k8s-node3 full avg10=23%` and PVE loadavg sustained at ~15.
## Impact
- **User-facing**: None directly. Intermittent kubectl TLS handshake timeouts during the session, attributable to the elevated PVE loadavg.
- **Blast radius**: Single PVE host. node_exporter (PID 1479) wedged in D-state with the kthread. k8s-node3 downstream IO PSI peaked at `full avg10=23%`.
- **Data loss**: None.
- **Observability gap**: No alert fired for "stale NFS connection to decommissioned host". The IO PSI watchdog caught the symptom, not the cause.
## Root Cause
`/usr/local/bin/weekly-backup` was an artifact of the pre-2026-04-13 backup pipeline (when TrueNAS at `10.0.10.15` was the NFS server). After the TrueNAS decommission and migration to host NFS at `192.168.1.127`, the script was never deleted. It executed at least once recently (manually, or via a cron entry that has since been pruned), opening an NFS RPC session to `10.0.10.15`. With no peer answering, the kernel's RPC retry timer parked the manager kthread in `rpc_wait_bit_killable`. The kthread holds a lock that any reader of `/proc/mountstats` must take — `node_exporter` reads that file every scrape interval, so its scrape goroutine wedged in D-state too.
## Resolution
1. `lvextend -L +1T /dev/pve/nfs-data` + `resize2fs``/srv/nfs` 2 TiB → 3 TiB (90% → 60% used). Unrelated to the IO issue but bundled because `/srv/nfs` was at 90% and the user picked "grow LV" over "diet Immich". Thinpool (sdc) had ~4.6 TiB free.
2. `rm /usr/local/bin/weekly-backup` — eliminates the trigger. Backup pipeline is now `daily-backup.service` + `offsite-sync-backup.service` + per-app CronJobs (mysql/postgres/vault/etc.); `weekly-backup` was fully redundant.
3. `systemctl restart node_exporter` — replaces the wedged process. New PID 183319 healthy, `:9100/metrics` responsive.
4. `mysql-standalone` memory bump 2 Gi → 4 Gi limit, 1.5 Gi → 3 Gi request (commit forthcoming). Coincident May 8 18:05 OOM, not caused by this incident — `innodb_buffer_pool_size=1Gi` plus connection buffers and InnoDB internals didn't fit in 2 Gi.
## Open / Out-of-Scope
- **Wedged kthread `[10.0.10.15-manager]` (PID 3796184)** persists in the kernel queue. The kernel will eventually reap it once the RPC retry timer gives up, or it clears at next PVE reboot. With the script gone, no new ops queue against it. **Plan**: if PVE host PSI does not fully clear within 24 h, fold a PVE reboot into the next maintenance window. Not done in this change.
- **Transient OOMs unrelated to this incident**:
- `mysql-standalone-0` May 8 18:05 (anon-rss 2 GB at 2 Gi limit) — addressed by the limit bump above.
- postgres helpers May 9 12:37 — anon-rss <8 MB, pods no longer exist, no recurrence. No action.
- python pod May 9 13:36 (anon-rss 518 MB on k8s-node2) — pod no longer exists, no recurrence. No action.
- **Pre-existing TF drift**: `null_resource.pg_job_hunter_db` in `stacks/dbaas/modules/dbaas/main.tf` execs against `pg-cluster-1`, but the current CNPG primary is `pg-cluster-2`. Unrelated to this incident; surfaced during the targeted MySQL apply. Fix is a separate ticket — should resolve the primary dynamically (e.g., via the `cnpg.io/instanceRole=primary` selector) instead of hardcoding pod ordinal.
## Action Items
- [x] Delete `/usr/local/bin/weekly-backup` on PVE host.
- [x] Restart `node_exporter.service` on PVE host.
- [x] Grow `pve/nfs-data` LV to 3 TiB; online `resize2fs`.
- [x] Bump `mysql-standalone` memory request/limit to 3 Gi / 4 Gi.
- [x] Update `docs/architecture/storage.md` to record the new LV size.
- [ ] Reboot PVE host at next maintenance window if `[10.0.10.15-manager]` kthread does not clear within 24 h.
- [ ] (Separate ticket) Fix `null_resource.pg_*_db` resources to target the actual CNPG primary instead of hardcoding `pg-cluster-1`.
## Related
- TrueNAS decommission: memory `id=674` (2026-04-13).
- Prior LV grow on `pve/nfs-data` (2 TiB out-of-band): memory `id=691` (2026-04-12).
- Architecture: `docs/architecture/storage.md`, `docs/architecture/backup-dr.md`.

View file

@ -0,0 +1,164 @@
# Post-Mortem: kured Reboots Silently Stalled for 6 Days + Anubis HA Lift
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 |
| **Duration** | 6 days of unbooted pending-reboot packages (2026-05-10 → 2026-05-16) |
| **Severity** | SEV3 — no user-facing impact; latent risk (kernel/libc CVEs queued, not landing) |
| **Affected Services** | None directly; OS-reboot pipeline halted on all 5 K8s nodes |
| **Status** | Root cause fixed (kured Helm value), defensive defaults added (Anubis HA, kured drain-timeout, CNPG 3 instances) |
## Summary
After unattended-upgrades was re-enabled on the K8s nodes on 2026-05-10,
kured was supposed to drive rolling node reboots within the MonFri
02:0006:00 London window. Instead, kured logged "Reboot not required"
every hour for six straight days while the `kured-sentinel-gate`
DaemonSet on every host happily reported "ALL CHECKS PASSED — creating
/var/run/gated-reboot-required". The gate WAS open. kured was looking
in the wrong place.
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. The stack set
`rebootSentinel = "/sentinel/gated-reboot-required"` — which pointed
the chart at hostPath `/sentinel/` (an empty auto-created directory).
The sentinel-gate writes to `/var/run/gated-reboot-required` on the
host. Two different host directories. kured silently skipped reboots
for six days.
Found on 2026-05-16 while auditing why "automatic upgrades aren't
happening" alongside the K8s version-upgrade Job-chain (PM
2026-05-11). Fixed in one commit; took the opportunity to also
eliminate three latent drain-time hazards (Anubis single-replica PDB
deadlock, kured unbounded drain timeout, CNPG-only-2-instances).
## Impact
- **User-facing**: None. Existing kernels, libc, and userspace kept running. CVEs queued in `/var/run/reboot-required.pkgs` on every node but were never exploited.
- **Backlog**: All 5 nodes accumulated `linux-image-*` + `libc6` queued for reboot. Largest gap was master at ~6 days. Workers also 56 days.
- **Detection gap**: kured exposes no Prometheus signal for "I checked but said no". The hourly "Reboot not required" line in stdout is the only trace, and nobody was tailing it. The architecture had two layers (sentinel-gate gate + kured sentinel check) but no verification that the two layers were looking at the same path.
- **Side discovery**: 8 Anubis instances would have stalled drain anyway via single-replica + `PDB minAvailable=1` (the same trap that stalled the manual K8s upgrade on 2026-05-11). Even if the kured path bug were fixed in isolation, Monday's first reboot would have hit the Anubis trap and idled forever (kured default `--drain-timeout=0` = unlimited).
## Timeline (UTC)
| Time | Event |
|------|-------|
| **Mar 16 21:26** | kured-sentinel-gate DaemonSet introduced after the 26h overlayfs cascade outage. Original sentinel cool-down 30m. |
| **May 10 ~16:57** | Last successful kured pod restart picked up new Helm values. `rebootSentinel = "/sentinel/gated-reboot-required"`. Same commit re-enabled unattended-upgrades in cloud_init and stretched the sentinel cool-down 30m → 24h. |
| **May 10 ~17:00 → May 15 06:16** | unattended-upgrades on every node successfully installs kernel + libc patches, writes `/var/run/reboot-required`. |
| **May 1015** | sentinel-gate Check 14 all pass every 5 min on every host. Touches `/var/run/gated-reboot-required`. Logs "ALL CHECKS PASSED". |
| **May 1015** | kured polls `/sentinel/gated-reboot-required` (empty dir, file does not exist). Returns "Reboot not required" every hour. No reboots happen. |
| **May 11 20:4021:00** | Separate K8s-version-upgrade incident (master upgraded to v1.34.7, workers stalled mid-rollout because the upgrade agent drained its own host). Manual recovery 5/115/12. **kured stall noticed but not investigated**: cluster healthy, K8sVersionSkew firing was tracked as the urgent issue. |
| **May 11 22:47 → May 12 00:01** | Manual worker drains hit the Anubis single-replica PDB trap (drain loops). Resolved by direct-deleting Anubis pods to bypass eviction API. This was the first signal that single-replica `minAvailable=1` patterns deadlock drains. |
| **May 16 10:56 UTC** | While auditing "what runs the upgrades" for the user, the kured + sentinel-gate log/path mismatch became visible. |
| **May 16 11:13 UTC** | `stacks/kured/main.tf`: `rebootSentinel = "/sentinel/..."``"/var/run/gated-reboot-required"`. Re-init, plan, apply. |
| **May 16 11:14 UTC** | kured DaemonSet rolls out the new spec. Volume hostPath becomes `/var/run`. kured pod can now see `/sentinel/reboot-required` (32B, from uu) AND `/sentinel/gated-reboot-required` (0B, from gate). Confirmed via `kubectl exec` listing. |
| **May 16 11:44 UTC** | Anubis HA module change deployed: `shared_store_url` variable → `store: { backend: valkey }` block appended to policy YAML, default replicas 2, PDB `maxUnavailable=1`, topology `DoNotSchedule`. Cyberchef applied as canary. Confirmed: Redis DB 5 starts receiving challenge state. |
| **May 16 11:4811:53 UTC** | Remaining 7 Anubis stacks applied (DBs 612). 8/8 deployments at 2/2 Ready, replicas spread on different nodes. Smoke-tested 6 of 8 public URLs return 200. |
| **May 16 12:05 UTC** | kured `drainTimeout: "30m"` added + applied. pg-cluster bumped from 2 → 3 instances. |
| **May 16 12:11 UTC** | pg-cluster phase = "Cluster in healthy state", 3/3 ready. |
## Root Cause
The Helm chart `kured-5.11.0` computes:
```
{{- $sentinel_dir := dir .Values.configuration.rebootSentinel -}}
# template renders both volume mount and hostPath using $sentinel_dir
```
So `rebootSentinel` is doubly-purposed: it's both the **CLI arg path inside
the pod** AND the **hostPath on the node**. Setting it to `/sentinel/...`
caused:
- pod arg: `--reboot-sentinel=/sentinel/gated-reboot-required` (looks at `/sentinel/` inside the pod)
- hostPath: `/sentinel/` (auto-created empty directory by `type: Directory`)
- mountPath inside pod: `/sentinel/` (mapped from hostPath above)
Meanwhile the gate DaemonSet was configured with hostPath `/var/run`
mountPath `/host/var-run`, and wrote `gated-reboot-required` to its local
`/host/var-run/` which became the host's `/var/run/gated-reboot-required`.
The two daemons never touched the same directory.
**Why this was hard to spot**:
1. Both layers logged success: sentinel-gate said "ALL CHECKS PASSED", kured said "Reboot not required". Neither claimed an error.
2. No Prometheus alert exists for "kured polled, gate is open, kured still didn't act". The Upgrade Gates alert group catches firing-alert-during-rollout, not silently-skipped-rollout.
3. The Helm chart's auto-derivation of hostPath from a config value is undocumented surprising behavior. The mental model is "rebootSentinel is just the in-pod path"; the hostPath co-mutation is invisible.
## Remediation
### Primary fix
- `stacks/kured/main.tf`: `rebootSentinel = "/var/run/gated-reboot-required"`. Both the chart-derived hostPath and the kured CLI arg now align with where the gate writes.
### Defensive companion changes (same session)
| Change | Purpose | Stack |
|---|---|---|
| `drainTimeout = "30m"` on kured | Fail closed instead of looping forever if a future PDB or finalizer stalls drain. Node stays Schedulable (no silent capacity loss). | `stacks/kured/main.tf` |
| Anubis: shared-state Valkey/Redis backend | Eliminate the single-replica drain deadlock + provide real HA. PDB changed `minAvailable=1``maxUnavailable=1`. Replicas 1 → 2 with `topologySpreadConstraint: DoNotSchedule`. | `modules/kubernetes/anubis_instance/main.tf` + 8 callers |
| pg-cluster: 2 → 3 instances | Failover during primary's node drain no longer depends on the lone replica being caught up. CNPG always has a fully-current candidate. | `stacks/dbaas/modules/dbaas/main.tf` |
| Orphan `mysql-standalone` PDB deleted | Helm-stamped leftover (selector required 4 labels, pod has 3 → matched 0 pods). Was dead code; deletion is safe. | `kubectl` (not TF-managed) |
### Verified post-fix
- `kubectl -n kured exec deploy/kured -- ls /sentinel/` lists both `reboot-required` and `gated-reboot-required` on every node.
- 8 Anubis Deployments at 2/2 Ready; pods spread across different nodes (verified via `kubectl get pods -o wide`).
- Redis DBs 5, 7, 8, 10 receiving challenge state from real public traffic post-apply (Palo Alto Networks scanner hit blog).
- pg-cluster 3/3 healthy, phase = "Cluster in healthy state".
- kured args show `--drain-timeout=30m`.
## Lessons
1. **Auto-derivation in Helm charts is invisible drift surface.** The chart's
habit of deriving hostPath from a CLI-arg-shaped value is the kind of
"convenient default" that hides during normal review. Mitigation:
pin `hostFilePath` explicitly in `configuration` so the host path is
declared, not derived. (Did not do this in the fix because the
single-config approach is now correct; flagging as future improvement.)
2. **"Silently skipped" needs a Prometheus signal.** The Upgrade Gates
alerts cover "rollout in progress + something went wrong". They don't
cover "we haven't rolled in 7 days when we should have". Suggested:
add `KuredRebootBacklog` — fires when `kured_reboot_required ==
1` (kured exposes this) for more than 24h continuously. The kured
chart already serves `/metrics`; just needs a rule. (Deferred.)
3. **Single-replica `PDB: minAvailable=1` is a deadlock pattern.** It
reads as "protect this pod" but actually means "block all voluntary
disruption forever". Manifested in 9 places (8 Anubis + mysql-standalone
with broken selector). The Anubis fix is now in place via shared-store
replicas=2; the `mysql-standalone` selector was already broken so it
matched 0 pods (and was deleted as cruft). Worth auditing the cluster
periodically for any new pattern of the same shape.
4. **k8s-node1 containerd source drift** (Ubuntu archive's `containerd`
vs Docker's `containerd.io`) is benign but should be documented.
Audited during this session: not a blocker for kured because both
variants are in the Package-Blacklist and both are apt-held. The
version skew with master (1.6.22 vs 1.7.24/1.7.27) is what the
K8s version-upgrade Stage 3 "containerd bump" exists to fix.
5. **CNPG drain handling at 2 replicas is fragile.** Switchover works
but the lone replica must be caught up; in practice this means
on a busy cluster, a primary-node drain could stall for tens of
seconds while CNPG promotes. 3 instances eliminates this. Worth
considering for every long-running multi-instance stateful workload.
## Detection / Prevention Followups
- [ ] `KuredRebootBacklog` Prometheus alert. Spec: `kured_reboot_required == 1 and (time() - timestamp(kured_reboot_required)) > 86400`.
- [ ] Add a `hostFilePath` value to the kured Helm release for explicit declaration (current setup is correct but undocumented).
- [ ] Audit periodically for new single-replica + `minAvailable=1` PDB patterns (could be a Kyverno warn policy).
- [ ] Phase 4: clean up the InnoDB Cluster CR + remaining `mysql-cluster-pdb` once the bitnami legacy is fully decommissioned.
## File pointers
| What | Where | Commit |
|---|---|---|
| kured sentinel path fix | `infra/stacks/kured/main.tf` | c17d87e1 |
| Anubis HA (module + 8 callers) | `infra/modules/kubernetes/anubis_instance/` + 8 `stacks/<app>/main.tf` | 6e920f96 |
| kured drainTimeout + CNPG 3-replica | `infra/stacks/kured/main.tf` + `infra/stacks/dbaas/modules/dbaas/main.tf` | a726e963 |
| K8s version-upgrade Job-chain (related context) | `infra/stacks/k8s-version-upgrade/` | 01bc16d5 (5/11) |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` | (updated 5/11) |
| Runbook | `infra/docs/runbooks/k8s-version-upgrade.md` | (updated 5/11) |
| Deprecated agent prompt (self-preemption history) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | 01bc16d5 |

View file

@ -0,0 +1,212 @@
# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
## Summary
`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
LTS (Resolute Raccoon)** at some point, putting the running kernel at
`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
noble repositories (the container's base OS), which don't carry
`linux-headers-7.0.0-15-generic`, so the build aborts with:
Could not resolve Linux kernel version
Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
and `kernelModuleType: open`) succeeded at the chart level but produced
a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
and constructs the image tag `<version>-ubuntu26.04`, which 404s on
pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
to-working state pending an upstream fix or kernel rollback.
## Impact
- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
- `frigate/frigate`
- `immich/immich-machine-learning`
- `llama-cpp/llama-swap`
- `nvidia/nvidia-exporter`
- `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
(Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
firing (node is schedulable, just no GPU advertised)
- No data loss; no user-facing service degradation outside the GPU stack
## Timeline (Europe/Sofia, UTC+3)
- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
the upgrade (rotated by `do-release-upgrade`).
- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
reports `system-os_release.VERSION_ID = 26.04`,
`kernel-version.full = 7.0.0-15-generic`.
- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
on every kubelet restart cycle. Error: "Could not resolve Linux kernel
version".
- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
cleanly but driver pod ImagePullBackOff on
`driver:580.105.08-ubuntu26.04`.
- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
the gotcha, file the kernel rollback as the next step.
## Root Causes
1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
support window. NVIDIA typically lags new Ubuntu LTS releases by
weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF
`helm_release` had `version` commented out, so any apply could
re-resolve to the latest chart and follow its OS-auto-detection
logic. With v25.10.1, the operator fell back to ubuntu24.04 image
suffix (which pulls successfully but fails to compile against kernel
7.0). With v26.3.1, the operator picks the correct (per-NFD)
ubuntu26.04 suffix — which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
fires only when the metrics exporter itself stops scraping, not when
the operator's driver pod is unhealthy.
## What We Changed in This Session
- `stacks/nvidia/modules/nvidia/main.tf` — pinned
`helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
the situation; driver version stays at `570.195.03` as the last-known
config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
this file.
## What We Did NOT Do (Pending User Decision)
- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
to `6.8.0-117-generic`. The 6.8 kernel is still installed at
`/lib/modules/6.8.0-117-generic` and the matching headers at
`/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
the driver image's apt sources (Ubuntu 24.04 noble) carry
`linux-headers-6.8.0-117-generic`. This would require draining the
node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
and rebooting — needs explicit user OK.
- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
GPU node. Should fire within 10 minutes of the operator failing to
publish the resource, regardless of which sub-pod failed.
## Recovery Procedure (next time)
### If the driver-installer fails with "Could not resolve Linux kernel version"
1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print([t for t in d['Tags'] if '<distro>' in t][:5])"
3. If yes, point the chart at the right version + ensure NFD reports
the matching OS.
4. If no (and a kernel rollback is acceptable):
- `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
- `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
- `nsenter -t 1 -m -p -u update-grub`
- `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
- Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
- After boot: `kubectl uncordon <node>` and wait for the GPU
daemonset to come Ready
## Action Items
- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
kernel rollback succeeded and NFD now reports
`kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
(50 min) in TF `values.yaml` — 2026-05-25. On this hardware the
full install sequence (apt headers + gcc compilation + file copy) takes
~21min which exactly exhausted the old 120×10s window.
- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
tags — file as a quarterly checkup once we see the first 26.04
tag, unpin the chart and revert this post-mortem's mitigation
- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
memory of the March 2026 outage says we disabled
`unattended-upgrades``do-release-upgrade` is a separate path
that should be gated too
## Follow-up Incident: Driver install hang (2026-05-25)
**Date**: 2026-05-25
**Status**: Resolved
After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
Installing Linux kernel headers..." with no progress for 1520 min.
**Actual root causes (two compounding issues)**:
1. **Deadlock between k8s-driver-manager and operator-validator**: The
`k8s-driver-manager` init container waits for `nvidia-operator-validator`
to shut down before it can begin the install sequence. The validator's
`driver-validation` init container was in an infinite retry loop polling
`/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
ready). Since the driver never finished, the validator never exited. The
validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
it — the container received SIGTERM but remained in "Terminating" state
indefinitely, blocking the new driver from starting.
**Fix**: Force-deleted the stuck validator pod
(`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
This broke the deadlock immediately.
2. **Startup probe timeout**: The full driver install sequence on this hardware
(6 vCPUs, 16Gi RAM) takes ~21 minutes:
- `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
- `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
nvidia-peermem): ~12 min
- nvidia-installer file copy + archive integrity check: ~7 min
The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
This caused a SIGKILL (exit 137) at 21 minutes even when the install was
progressing normally.
**Fix**: Patched `driver.startupProbe.failureThreshold` from 120 → 300
in `stacks/nvidia/modules/nvidia/values.yaml` (gives 51 min headroom).
**Key observation**: "Installing Linux kernel headers..." is NOT a hang — the
apt install just takes 2+ min and produces no log output during execution. The
log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
the container to confirm apt/dpkg are actively running.
## Lessons
- **Operator-style charts that auto-detect host OS can silently break
when the host fleet leapfrogs upstream image support.** Pin the chart
version + driver version, and treat upstream support gaps as a hard
blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when
upstream image lags.** Make sure the previous kernel and its headers
stay installed (don't aggressively purge old kernels in apt
autoremove).
- **NFD labels are authoritative for the operator's image-tag
construction.** If you need to lie about OS version (e.g., to force a
24.04 image on a 26.04 host), edit the NFD label — but only as a last
resort; the chart upgrade made clear the operator will eventually
reconcile this.
- **A k8s-driver-manager deadlock on a stuck Terminating validator pod is
indistinguishable from an apt hang** — `ps auxf` inside the container is
the key diagnostic. Force-deleting a stuck Terminating pod with no
finalizers is safe and immediately resolves the deadlock.
- **Driver startup probe must be sized for the full install wall-clock time**,
not just apt or just compilation. On slow hardware, 21 min is tight.

View file

@ -0,0 +1,133 @@
# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
**Duration:** ~2 hours from first detection to all-green
## Summary
The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
used `hostNetwork: true` and tried to bind the same host ports
(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
controller pod CrashLoopBackOff'd with `bind: address already in use`. The
upgrade also left behind multiple orphan controller pods in containerd that
kubelet could no longer reconcile — they held the host ports even after the
helm rollback removed them from K8s state.
The `csi-nfs-node` DaemonSet pod on master then could not start either: its
own `node-driver-registrar` and `liveness-probe` containers tried to bind
the same host ports and lost to the zombies.
## Impact
- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
- CSI plugin unregistered on master → no NFS volumes could be mounted on
master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
failed)
- Controller flap (2 replicas fighting) → intermittent
`csi-resizer`/`csi-snapshotter` failure for the whole cluster
- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
csi-node-driver (Calico) all bounced on master while kubelet thrashed
No data loss; no production-facing outages observed (CSI mounts on the four
worker nodes kept working).
## Timeline (Europe/Sofia, UTC+3)
- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
scheduled (no `affinity` block → both replicas land on `k8s-master`)
- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
CrashLoopBackOff
- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
begins
- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
on master deleted via K8s, but containerd retains them as live sandboxes
- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
added to the controller Deployment via patch (controllers now correctly
schedule on node1+node3)
- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
plus 5× csi-provisioner from zombie controller pods)
- 09:00 — Privileged pkill via `hostPID: true` pod failed
(`permission denied` from runc — containerd refused to signal init in the
zombie containers)
- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
the orphan containers via cgroup GC; ports freed
- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
`controller.affinity` to values
## Root Causes
1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
no `version = ...` field, so it floated to whatever the chart repo
advertised. Keel polled this and rolled forward.
2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
shipped with. Without it, the K8s scheduler chose master for both
controller replicas.
3. **Two controller replicas + hostNetwork = port conflict on the same node.**
The chart did not add `podAntiAffinity` between the replicas. Live state
has it now; TF now does too.
4. **Helm rollback does not always clean containerd sandboxes.** When the
prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
containerd can keep multiple sandbox instances for the same pod-UID.
Kubelet GC is the only thing that reliably reaps these — restarting it
forces a reconciliation pass that drops orphans.
## What We Fixed
- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
- `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
is already excluded from Kyverno-Keel injection, but the chart could still
drift on a `terraform apply` without a pin)
- `controller.affinity` block with `podAntiAffinity` (different hosts for
replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
- Inline comments explaining both decisions
- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
list (decision from authentik incident 2026-05-17). Verified still there
in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
## Recovery Procedure (next time)
If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
1. **Find which host ports are bound**`lsof -i :19809`, `lsof -i :29653`
(from a privileged hostPID pod on the affected node).
2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
tracks). Will fail with `unable to signal init: permission denied` if
the containers are sufficiently stuck.
3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
reconciles containerd state and reaps the orphans.
4. **Force-delete the DaemonSet pod** to clear the back-off
(`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
DaemonSet recreates it; with the ports free, containers start cleanly.
## Action Items
- [x] Pin `csi-driver-nfs` chart version in TF
- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
- [x] Document recovery procedure (this post-mortem)
- [ ] Audit other unpinned `helm_release` blocks — every chart used in
Kyverno-excluded namespaces should still be pinned to prevent
`terraform apply` drift. (Filed as follow-up — not blocking.)
- [ ] Consider adding a `kured` or daily script that detects orphan
containerd sandboxes whose pod-UID is unknown to the apiserver and
reaps them automatically. (Filed as follow-up — not blocking.)
## Lessons
- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
Keel injection, but the helm_release was unpinned — so a `terraform apply`
alone could re-trigger the same break. Both layers needed locking down.
- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
signal init, kubelet restart is the next escalation step before SSH/reboot.
- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
excluded.** CSI was correctly excluded — but the chart version itself was
still a moving target via plain `terraform apply`. The exclude-list catches
Keel; the version pin catches everything else.

View file

@ -0,0 +1,198 @@
# Post-Mortem: Immich anca-elements Bulk Ingest IO Storm → k8s-node1 Reboot
| Field | Value |
|-------|-------|
| **Date** | 2026-05-25 |
| **Duration** | ~9h from job start (2026-05-24 23:55Z) to node1 reboot (2026-05-25 09:33Z); service restoration ongoing (~1h elapsed at time of writing). |
| **Severity** | SEV2 — k8s-node1 went down, ~33 deployments lost their only replica, DNS partially degraded (Technitium primary on node1), Loki down, GPU stack down, backup pipeline timed out for the day. No data loss. |
| **Affected Services** | 33 deployments, 3 StatefulSets, 9 DaemonSets across ~30 namespaces. Most concentrated on k8s-node1 (the only GPU node and home of several pinned services). |
| **Issue** | TBD (no GitHub issue filed yet) |
| **Status** | Draft — recovery in progress |
| **Recurrence count** | **3rd IO-pressure-induced incident in 17 days** at time of writing; **recurred 2026-05-26** (Alloy log-read storm, mem id=2726) and **2026-06-01** (Immich Duplicate Detection ML/thumbnail backfill — see [Update 2026-06-01](#update-2026-06-01--recurrence-immich-duplicate-detection)) |
## Summary
The `anca-elements-import` Kubernetes Job — a one-shot bulk import of ~34k photos (770 GB) from `/srv/nfs/anca-elements` into Immich — ran with `immich-go --concurrent-tasks 20` and no CPU/IO limits. The 20 parallel NFS readers combined with the Immich ML pipeline saturated sdc (the 10.7 TB HDD thin pool holding all VM disks) for hours. Sustained disk contention starved the k8s-node1 VM's IO until the VM rebooted at 09:33 UTC. ~60 pods on node1 went zombie; the proxmox CSI driver lost its registration; the NVIDIA driver DaemonSet entered CrashLoopBackOff; the daily-backup pipeline was killed by its 4h systemd timeout while waiting on post-reboot ext4 orphan-inode cleanup.
This is the third time in 17 days that an IO event has taken meaningful slice of the cluster offline. We cannot keep treating each one as a one-off.
## Impact
- **User-facing**: DNS resolution degraded (1 of 3 Technitium replicas down on node1). 20 self-hosted apps (changedetection, freshrss, frigate, navidrome, wealthfolio, etc.) returned 502 or hung. GPU-dependent services (Frigate ML, Immich ML, nvidia-exporter) had no GPU available.
- **Blast radius**: ~60 zombie pods on k8s-node1; 33 deployment replicas missing cluster-wide; 1 StatefulSet (Loki) unavailable. Multi-Attach errors on ~8 proxmox-lvm PVCs prevented reschedule onto healthy nodes for ~30 min.
- **Duration**: Initial IO degradation ~01:30Z (job ran ~85 min then ended). VM stayed alive but degraded for ~8 hours after the job ended (likely due to filesystem journal recovery / page cache pressure tail). Hard reboot at 09:33Z. Service restoration began at 10:00Z.
- **Data loss**: None. All PVCs intact; no failed writes detected.
- **Monitoring gap**: We had **no alert** for "VM is about to crash from sustained IO pressure." `NodeHighIOWait` fired but didn't escalate, and PVE-host-level IO PSI metrics aren't scraped into Prometheus.
## Pattern — this is the third time
| Incident | Date | Root IO source | Outcome |
|----------|------|----------------|---------|
| 1 | 2026-05-09 | Stale NFS kthread to decommissioned TrueNAS (`/usr/local/bin/weekly-backup` artifact) wedged in `rpc_wait_bit_killable` | PVE loadavg ~15 sustained, IO PSI stall on node3, no user-visible outage |
| 2 | 2026-05-16/17 | kured stuck + GPU driver Ubuntu 26.04 mismatch + NFS-CSI Keel upgrade race | Multi-issue cluster degradation; required manual recovery |
| 3 | 2026-05-25 (**this incident**) | Immich `anca-elements-import` Job with 20 parallel uncapped readers | k8s-node1 VM reboot, ~33 deployments down, backup pipeline broken |
| 4 | 2026-05-26 | Grafana Alloy DaemonSet read 12.18 TB of logs in ~24h (silently lost its `controller.resources` limit) | sdc 97% util, all VMs + NFS starved (mem id=2726) |
| 5 | 2026-06-01 | Immich library-wide **Duplicate Detection** → ML/thumbnail backfill read originals at server-side `thumbnailGeneration` concurrency **8** | sdc ~100% util, 64 `nfsd` threads D-state, **etcd starved → kube-apiserver down ~30 min** |
**Common pattern**: a single uncontrolled IO-heavy workload (or a stale connection) saturates the shared sdc thin pool, which hosts **all VM disks** for the entire cluster. There is currently no IO budget enforcement between workloads, no PVE-level IO QoS between VMs, and no alerting that fires *before* a node crashes.
We have a single point of contention (sdc). Every storm finds it.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **2026-05-24 23:55** | `anca-elements-import` Job starts. immich-go v0.31.0, `--concurrent-tasks 20`, no resource limits. Anca's 770 GB photo archive begins streaming from NFS to immich-server. |
| **2026-05-25 01:21** | Job marks `Complete` (85 min runtime). 34k photos uploaded. Immich-server ML pipeline (face detection / thumbnail generation) keeps the IO load going for hours after. |
| **05:02** | `daily-backup.service` (systemd timer) starts. It runs LVM thin-snapshot → LUKS-decrypt → mount → rsync per PVC. The competing IO from the still-saturated thin pool stretches every per-PVC step. |
| **08:24** | `daily-backup` `matrix-data-proxmox` rsync hits its per-PVC 30-min timeout — first warning. |
| **08:3109:02** | 30 LUKS-encrypted PVC mounts log `Failed to mount snapshot` because ext4 orphan-inode cleanup exceeds the 30s `timeout 30 mount` guard (one volume took 109s). |
| **09:02:28** | systemd kills `daily-backup` with `TimeoutStartSec=14400` (4h). Script was nowhere near complete — alphabetically still on letter 'm'. Snapshots from today's run are left, but that's the *designed* 7-day retention pattern. |
| **09:07** | `nfs-mirror.service` also times out. |
| **09:33:24** | k8s-node1 VM **rebooted** (cause: best-guess IO starvation triggered Proxmox watchdog or qemu IO timeout; not directly observable). All pods on node1 enter `Unknown`. |
| **09:33:36** | node1 kubelet posts Ready. Pods on node1 begin churn: calico-node, csi-node-driver, kured, alloy, loki-canary, nvidia stack all restart. proxmox-csi-plugin-node fails to re-register the `csi.proxmox.sinextra.dev` driver in CSINode. |
| **09:40** | User session impacted: `~/.zsh_history` left with 154-byte NUL padding from interrupted write. |
| **09:41** | Incident detected by user; `/cluster-health` invoked. Healthcheck reports 33 PASS / 3 WARN / 64 FAIL. |
| **09:50** | Force-deleted 47 Failed pods + 22 stuck Terminating + 6 zombie DS pods on node1. |
| **09:55** | 4 recovery sub-agents dispatched in parallel: csi-recovery, gpu-recovery, dns-monitoring-recovery, backup-recovery. |
| **~10:15** | proxmox CSI re-registered on node1 (csi-recovery). Multi-Attach errors clearing. Loki StatefulSet recovers to 1/1. Calico fully back to 5/5. |
| **~10:30** | daily-backup re-started manually (currently still running, ETA ~2h). |
| **(ongoing)** | nvidia stack recovery ongoing; 20 deployments still recovering. |
## Root Cause
### Direct (the trigger)
`anca-elements-import` Job in `stacks/immich/main.tf` runs `immich-go upload` with `--concurrent-tasks 20`, no CPU limit, and no IO throttling. Twenty parallel NFS readers against `/srv/nfs/anca-elements` (mounted from PVE host) plus immich-server's ML pipeline (CUDA-accelerated face detection + thumbnail generation) saturated the read queue on sdc. The job itself only ran for 85 min, but the after-effects (ML processing, filesystem cache eviction, dirty-page writeback) persisted for hours.
### Recovery-side cascade (why the cluster stayed broken after node1 booted)
Once node1 rebooted, the kubelet posted Ready within 12s — but `csi.proxmox.sinextra.dev` failed to re-register, blocking ~30 pods. The actual cascade (discovered by the csi-recovery agent during today's investigation):
1. **Calico CNI on node1 entered a crash loop.** The `calico-node` pod's BIRD BGP daemon takes a few seconds to create `/var/run/bird/bird.ctl` on startup. The container's liveness probe was killing the process before the socket appeared, restarting it before it could stabilize.
2. **Without functional Calico, no new pod on node1 could reach the kube-apiserver service IP** (`10.96.0.1:443`).
3. **The proxmox-csi-plugin-node pod therefore crash-looped** with "no route to host" trying to talk to the apiserver, and **never created its CSI socket** for kubelet to discover.
4. **node-driver-registrar (sidecar) therefore never registered the driver with kubelet**, so CSINode for k8s-node1 lacked `csi.proxmox.sinextra.dev`.
5. **Every PVC mount on node1 failed** with the `driver not found` error we observed; meanwhile, **VolumeAttachments for those PVCs still pointed at node1** from before the reboot, so reschedule onto healthy nodes hit `Multi-Attach error`.
Fix order matters: **Calico first, then CSI, then stale VolumeAttachments**. Doing them out of order leaves the cascade broken. This is now a P3 runbook (below).
### Second cascade (3.5h into recovery) — Proxmox CSI 30-LUN-per-VM hard cap
During the recovery, a second cascade was discovered that compounded the outage:
1. **k8s-driver-manager init container cordons + drains node1** as part of GPU driver re-install. This evicted GPU-tagged pods (and incidentally triggered descheduler rebalancing) onto other nodes.
2. **Simultaneously the dns-monitoring-recovery agent killed an orphaned containerd holding a boltdb lock on k8s-node4**, evicting all node4 pods.
3. **The combined eviction wave scheduled ~60 PVC-using pods through the scheduler in a short window.** Many landed on node1 (largest node, least cordoned-ish), pushing its SCSI LUN slot count from ~17 (pre-Immich-import baseline) to **29 of 30 in use**.
4. **proxmox-csi-plugin's hard-coded `MaxVolumesPerNode = 30`** (per the upstream `csi-driver-proxmox` source: it scans scsi0…scsi30 and errors `no free lun found` when none are free) blocked further attaches.
5. **vault-0, mysql-standalone-0, claude-memory, grafana, nextcloud** etc. all could not start because their PVCs couldn't attach to node1 (their scheduled target). Multi-Attach errors compounded when their previous-node attachments hadn't been released cleanly.
6. **Daily-backup was running concurrently** — adding ~120 MB/s read load on sdc, slowing every CSI attach/detach operation by 3-5×, prolonging the queue.
**Resolution (manual, 2026-05-25):** `systemctl stop daily-backup`, `kubectl cordon k8s-node1`, force-delete stuck Pending pods. They rescheduled to nodes with LUN headroom (node2/3/4 had ~12-15 free slots each).
### Structural (why it took down a node)
1. **Single shared IO domain**: sdc is one LVM thin pool serving all 9 VMs. No Proxmox-level `bwlimit` or `iothrottle` between VMs. Any VM can starve the others.
2. **No IO budget at workload level**: the K8s job had `resources: {}`. There is no cluster-wide cgroup-IO budget enforced.
3. **NFS reads bypass per-VM accounting**: anca-elements is read via the PVE host's NFS export. The reads happen *on the PVE host*, charged to the host's IO scheduler, not to the k8s-node1 VM. So even if we capped node1's VM IO, the storm would still happen.
4. **node1 is also the only GPU node** — Immich-ML pods are pinned there. The reader (immich-server) and consumer (immich-ml) are both fighting for the same node's resources during ingestion.
5. **ext4 orphan-inode cleanup is unaware of `noload`**: `daily-backup.sh` uses `mount -o ro,noload` to skip journal replay, but `noload` doesn't skip orphan-inode cleanup. When a node reboots with dirty filesystem state on the source PVC, snapshot mounts can take 100+s — exceeding the script's 30s timeout. Confirmed by `dmesg`: 20+ volumes logged `INFO: recovery required on readonly filesystem` during today's backup window.
6. **Calico BIRD liveness probe is racy on cold start.** The probe doesn't tolerate the 3-5s BIRD initialization window, so any cold-start of Calico tends to crash-loop briefly. Usually it self-recovers on the 2nd or 3rd restart — today it didn't, because the apiserver was unreachable from the brand-new pod (chicken-and-egg).
## Contributing Factors
1. **The job's IO profile was never measured before running**. `immich-go --concurrent-tasks 20` is the upstream default; nobody validated it against our hardware.
2. **No staging window**. anca-elements-import is the second of two intentional one-shot ingestion runs (1st was Viktor's library months ago). The first run also caused load — but didn't crash a node, so it was treated as "loud but fine."
3. **Daily-backup overlap**. The 05:00 backup timer fired while the IO tail of the Immich job was still in flight. The two competing workloads triggered the LUKS mount timeouts.
4. **No PVE-level IO QoS** between VMs (Proxmox supports `iops_rd/wr` throttle groups on disk specs; we've never set them).
5. **No alert for "node1 about to crash"**. `NodeHighIOWait` fires at a fixed threshold but doesn't trigger any automated mitigation or paging.
## Detection Gaps
| Gap | Impact | Fix |
|-----|--------|-----|
| No PVE host IO PSI scraped into Prometheus | We can see node1 IO PSI but not the PVE-host-level pressure that's the actual leading indicator | Add node_exporter PSI scrape on PVE (already running) to Prometheus targets, expose `pressure_io_*` |
| No alert on sustained sdc utilization > 80% | The IO storm built up for hours without any signal escalating | Add `PVEThinPoolIOSaturated` rule: `irate(node_disk_io_time_seconds_total{device="sdc",instance="pve"}[5m]) > 0.85 for 30m` |
| No alert on Proxmox-host loadavg > 20 | Sustained loadavg 1315 was visible only through cluster healthcheck #44 | Add `PVEHostLoadHigh` rule (1m loadavg > 25 for 10m) |
| No alert on K8s Job IO throughput | An uncapped K8s job can do unlimited IO without alerting | Add `JobHighIOThroughput`: alert if container_fs_reads_bytes_total rate over 5m > 100 MB/s for >10m |
| Backup timeout fires silently | systemd kill of daily-backup at 4h didn't alert anyone — we'd have noticed after 48h via the `backup_per_db=FAIL mysql=33h pg=33h` healthcheck | Add Alertmanager rule on daily-backup unit failure (probe systemd unit state via node-exporter textfile collector) |
| LUKS mount step inflation post-reboot is silent | 30 mount failures logged as WARN, no aggregate alert | Add count-of-WARN alert from the daily-backup log |
## Prevention Plan
### P0 — Prevent this exact failure
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | Cap concurrency on future `*-elements-import` jobs | Config | In `stacks/immich/main.tf` (`kubernetes_job_v1.anca_elements_import` and future siblings): set `--concurrent-tasks 4` (down from 20). Also set `resources.limits.cpu = "2000m"` and `activeDeadlineSeconds = 21600` (6h cap). Add a `nodeSelector` to keep the job *off* node1 (move read-side onto a non-GPU node so the GPU node only does ML). | TODO |
| P0 | Bump LUKS mount timeout in `daily-backup` | Config | `infra/scripts/daily-backup.sh` line 243: change `timeout 30 mount …``timeout 180 mount …` (covers observed 109s worst case). Add a comment explaining the ext4 orphan-cleanup exception. | TODO |
| P0 | Schedule big-data ingests outside backup window | Config | Forbid Job/CronJob scheduling between 04:3008:30 EEST (when daily-backup runs). Either via a Kyverno policy on `*-import` named jobs, or a documented convention enforced at PR review. | TODO |
| P0 | **Raise Proxmox CSI LUN limit on each k8s-node VM** | Architecture | The default `virtio-scsi-pci` controller exposes 30 LUN slots; proxmox-csi hard-caps at this. **Resolution path**: add a 2nd `virtio-scsi-pci` controller (`scsihw1`) to each k8s-node VM via Proxmox, OR migrate VMs to `virtio-scsi-single` which allows 256+ LUNs per disk. Either requires a brief per-node reboot. Without this, every future cluster-churn event can re-hit "no free lun found" on whichever node ends up overloaded. **Permanent fix — must land before next ingest run.** | TODO |
| P0 | Document the `MaxVolumesPerNode=30` limit in storage architecture | Runbook | Add to `docs/architecture/storage.md` — currently the 30-LUN cap is invisible to operators until they hit it. Include `kubectl get pods -A --field-selector spec.nodeName=NODE` and the 30-cap as a sizing check before any cluster-wide rebalance / drain operation. | TODO |
| P0 | Add startupProbe to mysql-standalone | Config | `stacks/dbaas/modules/dbaas/main.tf` (the `kubernetes_stateful_set_v1` for `mysql-standalone`): add `startupProbe` with `failureThreshold=120, periodSeconds=15, timeoutSeconds=10` (≈30 min budget for InnoDB recovery). Also bump liveness `initialDelaySeconds=120, failureThreshold=10`. Today MySQL spun in a CrashLoopBackOff for ~30 min — each restart's InnoDB recovery aborted when the existing 30s liveness probe fired, never finishing. Resolved manually via `kubectl patch sts mysql-standalone` — must Terraform-codify. | Done (kubectl, needs TF) |
| P0 | Add startupProbe to goauthentik-server | Config | Similar issue: Authentik Django migrations + clip/face index rebuilds take 5-10 min after PG restart, but the startup probe budget is too short → restart loop. Add `startupProbe: failureThreshold=180, periodSeconds=10` (30 min) on `goauthentik-server`. Source: `stacks/authentik/modules/authentik/main.tf` (or equivalent Helm values). | TODO |
| P0 | Disable daily-backup.timer when manually stopping daily-backup.service | Runbook | During this incident, `systemctl stop daily-backup.service` alone wasn't enough — the timer kept it queued for re-fire. The recovery sequence is: `systemctl stop daily-backup.timer; systemctl stop daily-backup.service`. Document in `docs/runbooks/cluster-recovery.md` (to-create) as the canonical sequence. | TODO |
### P1 — Reduce blast radius
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | Proxmox per-VM IO throttle for the Immich workload host | Architecture | Set `iops_rd=2000,iops_wr=1000,mbps_rd=200,mbps_wr=100` on the k8s-node1 VM disk via Proxmox API. Pick numbers based on baseline `iostat` measurement. Same for non-prod VMs (devvm, registry). | TODO |
| P1 | Move NFS reads off the PVE host hot path | Architecture | Currently the PVE host *itself* reads `/srv/nfs/anca-elements` when an NFS client mounts that path — but the *reads* happen on PVE because it's the NFS server. Consider mounting anca-elements via a dedicated NFS export with a `wsize/rsize` cap, OR put bulk-ingest source data on a separate physical disk (sdb SSD has headroom). | Investigation |
| P1 | Add cgroup IOLimit to Kyverno mutating webhook for namespaces | Config | Auto-attach a `cgroupv2 io.max` annotation to pods in known-high-IO namespaces (immich, frigate, ollama). Requires kernel ≥5.13 + cgroupv2 (we have both). | TODO |
| P1 | Separate Immich `library` + `upload` NFS exports onto different LVs | Architecture | Currently `/srv/nfs/immich/{library,upload}` share the `pve/nfs-data` LV. Splitting upload onto its own thinly-provisioned LV would let us throttle upload-side independently. Cost: ~30 min PV churn. | Architecture |
### P2 — Detect faster
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Alert on sustained PVE sdc utilization | Alert | New PrometheusRule `PVEThinPoolIOSaturated`: `irate(node_disk_io_time_seconds_total{device="sdc",instance=~"pve.*"}[5m]) > 0.85` for 30m, severity=warning. | TODO |
| P2 | Alert on PVE loadavg high | Alert | New PrometheusRule `PVEHostLoadHigh`: `node_load1{instance=~"pve.*"} > 25` for 10m. severity=warning. | TODO |
| P2 | Alert on Kubernetes Job high IO rate | Alert | `JobHighIOThroughput`: `sum by (namespace, pod) (irate(container_fs_reads_bytes_total{container!=""}[5m])) > 100*1024*1024` for 10m → warning. | TODO |
| P2 | Alert on daily-backup systemd unit failure | Alert | Add node-exporter textfile collector entry that runs `systemctl is-failed daily-backup.service` every 1m and writes 0/1 to `/var/lib/node_exporter/textfile/backup_unit_state.prom`. PrometheusRule fires on value=1 for 5m. | TODO |
| P2 | Alert on Multi-Attach VolumeAttachment hung > 5m | Alert | These are the smoking gun whenever a node reboots. New rule on `kube_volumeattachment_status_attached == 0 and time() - kube_volumeattachment_metadata_created > 300`. | TODO |
### P3 — Improve resilience
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | Move k8s-node1 OS disk off sdc | Architecture | If node1 OS disk lived on the SSD (sdb VG `ssd`, 475 GB free), an sdc IO storm wouldn't starve the VM's own root filesystem and we'd avoid the reboot trigger. Cost: VM migration, ~1h downtime for node1. | Architecture |
| P3 | Spread GPU + pinned services off node1 | Architecture | Today node1 carries the GPU + Loki + Technitium primary + claude-agent + many app deployments. When it goes down, the blast radius is huge. Re-evaluate pin constraints — only Immich-ML and Frigate genuinely need node1. | Investigation |
| P3 | Document recovery runbook for "node1 hard reboot" | Runbook | New `docs/runbooks/node1-reboot-recovery.md` capturing the **strict order** discovered today: (1) force-cleanup Failed/Unknown/stuck-Terminating zombies, (2) force-delete the calico-node pod on the rebooted node so BIRD restarts cleanly, (3) wait for calico-node Ready, (4) force-delete the proxmox-csi-plugin-node pod, (5) verify `csi.proxmox.sinextra.dev` appears in `kubectl get csinode <node> -o yaml`, (6) delete stale VolumeAttachments referencing the rebooted node where consumers have already rescheduled, (7) verify nvidia driver recovery (separate cascade). | TODO |
| P3 | Make Calico BIRD liveness probe cold-start tolerant | Config | Bump `initialDelaySeconds` on `calico-node`'s liveness probe by 15s, or switch to an `exec` probe that checks BIRD socket existence rather than HTTP. Prevents the cold-start crash loop after node reboots. | Investigation |
| P3 | Pre-flight script for bulk ingest jobs | Runbook + Config | Wrapper around `immich-go` that (a) checks PVE loadavg < 10, (b) checks sdc IO util < 50%, (c) checks daily-backup not running, before allowing the job to start. Refuses otherwise. | TODO |
## Lessons Learned
1. **One shared physical disk is one shared failure domain**. sdc serves all VMs; any uncapped workload can take down the cluster. We've now hit this three times in 17 days. Continuing to treat each as a one-off is no longer credible — we need IO budget enforcement (P1), not just better alerting (P2).
2. **NFS reads bypass per-VM accounting**. We assumed throttling the workload's VM would protect us. It doesn't — the reads physically happen on the PVE host's IO scheduler.
3. **The "complete" state of a Job doesn't mean its IO is gone**. anca-elements-import finished in 85 min, but the IO tail (ML pipeline + filesystem cache eviction) ran for hours. Future ingest jobs need to either run during off-hours OR be sized so that even their tail is benign.
4. **Backup pipeline depends on a clean cluster state**. When node1 was unhealthy, daily-backup couldn't complete LUKS mounts in time. Backups should be more resilient to upstream IO degradation OR we should treat backup failure as a SEV signal in real time.
5. **The 30s `timeout` in `daily-backup.sh` was set without considering post-reboot recovery time**. Defaults like this need to be reviewed in light of actual observed worst case.
6. **Recovery requires a known runbook**. Today's recovery worked because we knew which order to do things: force-delete zombies → re-register CSI → clear VAs → wait for daemonsets → restart deployments. Codifying that as a runbook means the next incident is 5x faster.
## Update 2026-06-01 — recurrence (Immich Duplicate Detection)
**5th IO-pressure incident.** A user-triggered library-wide **Duplicate Detection** run on the 163,989-asset Immich library cascaded into ML/thumbnail backfill for the ~5,150 assets missing CLIP embeddings (largely a fresh `anca-elements` import that had completed ~90 min earlier). The Immich **server-side** job `thumbnailGeneration` was set to concurrency **8** (plus `metadataExtraction=4`, `library=4`), so the backfill read originals off sdc 8-wide → ~92 MB/s, queue depth ~99, sdc ~100% util. 64 `nfsd` threads went D-state on `folio_wait_bit_common`; **etcd on k8s-master was starved → kube-apiserver down ~30 min** (different blast radius from 2026-05-25's node1 reboot — same root cause: the shared sdc spindle).
**New finding:** the 2026-05-25 P0 capped only the *import-side* concurrency (`immich-go --concurrent-tasks`). The Immich **server-side** job concurrency (`job.*.concurrency` in DB system-config) was never capped and had been tuned for speed (8/4/4). So **any** library-wide operation (dedup, smart-search backfill, thumbnail regen) re-triggers the storm independent of the import job.
**Mitigation applied (2026-06-01):** capped the HDD-original-reading server jobs to `thumbnailGeneration=2, metadataExtraction=2, library=2` in `system_metadata` `system-config` JSONB + `immich-server` recreate. Verified: dedup resumed with sdc at 23% util, queue depth ~0.05, apiserver healthy. Documented in `infra/.claude/CLAUDE.md` Immich row.
**Still the real fix (from this PM, still TODO):** the P0 import-side cap, and especially the **IO-isolation** items — move k8s-master **etcd** + node OS disks off sdc onto SSD (generalize P3), and/or give the Immich library its own spindle (P1). Concurrency caps are a band-aid; sdc remains a single shared failure domain that every storm finds. Tracked in beads (see Follow-up Implementation).
## Related
- 2026-05-09 IO post-mortem: `docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md`
- 2026-05-16 kured/anubis post-mortem: `docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md`
- 2026-05-17 GPU driver post-mortem: `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
- Storage architecture: `docs/architecture/storage.md`
- Backup pipeline: `docs/architecture/backup-dr.md`
- Storage hardware mapping: memory `id=464` (sdc thin pool, sda backup, sdb SSD)
- 3-2-1 backup strategy: memory `id=609`
- Immich storage layout: memory `id=674`
- Memory entries for this incident: 2682-2686 (Immich storm), 2687-2692 (LUKS mount timing)
## Follow-up Implementation
_This section is auto-populated by the postmortem-todo-resolver agent._
| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|

View file

@ -0,0 +1,93 @@
# Post-mortem: Redis split-brain wedged BullMQ/Celery queues (2026-05-30)
**Severity:** SEV2 (degraded — no data loss in Redis; queue processing stalled
cluster-wide). **Status:** Resolved.
## Summary
The 3-node Sentinel HA Redis (`redis-v2`) split-brained: two pods both held
`role:master`. HAProxy — which routes to any backend reporting `role:master`
round-robined client connections across **both** masters. Immich enqueued
BullMQ jobs on one master while its workers blocked-popped on the other, so
every queue stalled. User-visible symptom: **newly uploaded Immich photos
returned HTTP 404 for their thumbnails** (the generation job never ran). Celery
apps (real-estate-crawler, trading-bot, paperless) and other queue users were
affected the same way.
## Impact
- Immich: thumbnail/preview/face/ML jobs not processing. `facialRecognition`
backlog reached ~30k waiting; new uploads showed broken images in the web UI.
- All ~15 shared-Redis consumers had inconsistent reads/writes (connections
split across two diverging masters).
- No Redis data lost — the larger dataset (`redis-v2-0`, ~30k keys) was
preserved through the fix.
## Timeline (UTC+1 local)
- **~2026-05-26/27**: `redis-v2` pods recreated (node2 unclean reboot era).
`redis-v2-0` came up partitioned; its Sentinel saw 0 peers and it declared
itself master via the init script's deterministic "pod-0 = bootstrap master"
fallback. Sentinels on `-1`/`-2` independently elected `redis-v2-2`.
Split-brain formed and persisted (~3-4 days) as the network healed but the
topology never reconciled.
- **2026-05-30 ~16:58**: investigating "Immich images with no thumbnails."
Found thumbnail jobs failing on missing/zeroed originals (separate pre-existing
data-loss issue) AND a stuck job queue.
- **2026-05-30 ~17:00**: user manually restarted immich-server; namespace
`tier-quota` (24Gi) briefly blocked the replacement pod → ~1 min Immich
outage. Recovered. (Red herring — not the root cause.)
- **2026-05-30 ~17:1x**: identified two `role:master` redis pods
(`redis-v2-0` dbsize 30320, isolated, 0 connected slaves; `redis-v2-2` dbsize
442, quorum master). HAProxy fan-out across both = wedged queues. Ruled out
IPv6 (cluster is single-stack IPv4) and eviction (`evicted_keys=0`).
- **2026-05-30 ~17:30**: reverted `redis-v2` to a single standalone instance.
Queues drained immediately; newest Immich assets served HTTP 200.
## Root cause
`redis-v2`'s init container (`generate-sentinel-conf`) falls through to
"Priority 3: pod-0 is always the bootstrap master" when it cannot reach peer
Sentinels/Redis. During a network partition, `redis-v2-0` hit that fallback and
became a second master. HAProxy's health check (`tcp-check expect rstring
role:master`) matches **any** master, so with two masters it placed both in
rotation and round-robined writes/reads across diverging datasets. BullMQ's
enqueue (LPUSH) and worker consume (BRPOPLPUSH) landed on different instances →
jobs never consumed.
This is the **third** Sentinel-class incident (after 2026-04-19 PM quorum drift
and 2026-04-22 flap cascade). The 3-sentinel design was built to *prevent*
split-brain, but the bootstrap fallback re-introduced it.
## Resolution
Reverted `redis-v2` to a **single standalone instance** (`replicas=1`, Sentinel
+ HAProxy removed), collapsing onto `redis-v2-0`'s dataset (preserved Immich's
queued jobs). Eviction policy changed `allkeys-lru`**`volatile-lru`** so the
shared cache+queue workload is served correctly by one instance (evict only
TTL'd cache keys; never TTL-less queue keys). `redis-master` service name/DNS
unchanged → no consumer edits. Decision rationale: a homelab cache/broker does
not need HA; a few-seconds restart blip beats chasing Sentinel correctness.
Mirrors the 2026-04-16 MySQL InnoDB-Cluster → standalone reversion.
## Follow-ups
- [ ] Re-upload the ~99 Immich images + 12 timeline videos whose **originals**
are missing/zero-filled on disk (pre-existing data loss, unrelated to the
split-brain — re-running jobs can't regenerate them). Owner: Viktor.
- [ ] `requirepass` auth on Redis + creds rollout to all consumers (carried over
from the 2026-04-19 rework; still open).
- [ ] Consider whether any queue user (Immich/Celery) warrants its own dedicated
Redis if the shared instance's memory ever becomes contended (currently
~30MB / 640MB — not a concern).
## Lessons
- HA that re-introduces its own failure class is worse than no HA. For a
single-node-tolerant homelab, prefer a standalone instance + a small accepted
downtime window.
- `allkeys-lru` on a shared cache+queue Redis silently drops queue jobs under
pressure; `volatile-lru` is the correct single-instance policy (Immich even
logs `IMPORTANT! Eviction policy ... should be "noeviction"`).
- A "bootstrap master" fallback that fires under partition is a split-brain
generator — avoid deterministic self-promotion when peers are unreachable.

View file

@ -0,0 +1,89 @@
# Post-Mortem: kured-sentinel-gate OOM while k8s-master stuck pending-reboot
| Field | Value |
|-------|-------|
| **Date** | 2026-05-31 |
| **Duration** | OOMs began 2026-05-30 ~03:33, escalating until fixed 2026-05-31 14:40 UTC |
| **Severity** | SEV4 — no user-facing impact; noisy + latent risk (a wedged gate pod could eventually mis-gate reboots) |
| **Affected** | `kured-sentinel-gate` pod on k8s-master only |
| **Status** | Fixed (gate hardened). Two contributing alerts still open, tracked separately. |
## Summary
Noticed by the operator during a routine cluster health check ("an app OOMing
periodically"). The `kured-sentinel-gate` pod on k8s-master was the *only*
container in the cluster with OOM events: `container_oom_events_total` showed
0/day through May 29, **15 on May 30, 134 on May 31** (by 08:21). The kernel
OOM-killer was killing child `kubectl` processes inside the pod's cgroup; PID 1
(bash) survived, so the pod never restarted — restartCount stayed at 1 despite
149 oom_events in 7d. Symptom: the gate's check cycle stretched from 5 min to
~25 min.
## Root cause (chain)
```
hermes-agent deploy = 0/0 (parked 2026-04-22, PVC-perms bug) → PVC WaitForFirstConsumer
never binds → PVCStuckPending fires; its dead external monitor → ExternalAccessDivergence
/mnt/synology-backup (192.168.1.13 offsite NAS) at 96% → NodeFilesystemFull fires
│ none of these 3 are in kured's alert ignore-list
kured halts ALL reboots (correct fail-safe)
k8s-master got /var/run/reboot-required on 2026-05-30 03:33 (kernel update) but can't reboot
master's gate pod is now the ONLY one running the kubectl-heavy hot path every cycle
(the other 6 hit the early "no reboot required → continue" at ~3 MiB)
the immortal `while true` bash loop slowly leaks (repeated kubectl forks + the
Check-4 `< <(kubectl ...)` process substitution), crosses the 64Mi cgroup limit
~5 days in, and the OOM-killer culls child kubectls — accelerating as it wedges
```
The 64Mi limit was the proximate misconfiguration: each `kubectl` fork is a
~30-50Mi Go binary, and the hot path runs up to 3 per cycle.
## Why hard to spot
- The pod showed `Running` / `1 restart` the whole time — the OOMs hit child
processes, not PID 1. Only `container_oom_events_total` (cAdvisor) revealed it;
`kube_pod_container_status_*` restart metrics did not.
- Logs looked clean ("ALL CHECKS PASSED") — the gate kept producing correct
decisions, just slowly.
- Same blind spot as the 2026-05-16 PM: there is still no Prometheus signal for
"a node has been pending-reboot too long" (the deferred `KuredRebootBacklog`
alert). That alert would have surfaced the stuck-master state on May 30.
## Fix (`stacks/kured/main.tf`, applied + committed 2026-05-31)
1. **Immediate**: deleted the leaking pod (DaemonSet recreated it at ~3 MiB).
2. **Durable**: memory limit `64Mi → 256Mi` (headroom for kubectl forks) **plus**
a self-restart guard — the loop counts iterations and `exit 0`s every
`MAX_ITER=72` cycles (~6h at 300s), so kubelet restarts the pod fresh and the
slow leak can never accumulate, regardless of how long a node stays
pending-reboot. Verified: all 7 pods at 256Mi, `iter N/72` loop live, OOMs
stopped.
## Contributing items (open — being addressed separately)
- **hermes-agent** parked at `replicas=0` since 2026-04-22 (PVC `/opt/data` perms
mismatch). Its orphaned `WaitForFirstConsumer` PVC drives PVCStuckPending +
ExternalAccessDivergence. Resolve = fix perms + scale up, OR remove the PVC and
external monitor while parked, OR scope PVCStuckPending to ignore 0-replica
consumers.
- **Synology offsite backup at 96%** (5.0T/5.3T, 265G free; `#recycle` holds 17G).
Resolve = prune retention / empty recycle / expand volume. NodeFilesystemFull
cannot be blanket-ignored in kured (a full *node* disk SHOULD block reboots) —
if scoped, scope to the offsite mount only.
Until at least the first two clear, kured will keep (correctly) refusing to
reboot master — but the gate pod is now leak-proof either way.
## Lessons
1. **`container_oom_events_total` is the canonical "is anything OOMing" signal** —
not restart counts. A cgroup can OOM-kill children while PID 1 lives.
2. **Immortal in-pod loops that fork heavy binaries need either a generous limit
or a periodic self-restart.** A periodic task is really a CronJob; the
self-exit guard is the minimal fix within the DaemonSet model.
3. **The `KuredRebootBacklog` alert (deferred from 2026-05-16) is now twice-implicated.**
Worth promoting from the backlog: `kured_reboot_required == 1 for > 24h`.

View file

@ -0,0 +1,78 @@
# Post-Mortem: Out-of-Band Tunnel Repoint Reverted by Terraform → Full External 502
| Field | Value |
|-------|-------|
| **Date** | 2026-06-01 |
| **Duration** | Drift present 2026-05-30 → 2026-06-01. Actual external outage began when a `terragrunt apply` reverted the tunnel origin on 2026-06-01 (cloudflared errors visible from ≥20:58Z); root-caused and fixed at 21:15Z; pods converged 21:16Z. |
| **Severity** | SEV1 — *every* Cloudflare-proxied hostname (`viktorbarzin.me` + all `*.viktorbarzin.me`) returned HTTP 502 to external clients. Internal/LAN access (split-horizon → Traefik direct) was unaffected, which is why it stayed hidden. |
| **Affected Services** | All external ingress: viktorbarzin.me, nextcloud, vault, authentik, vaultwarden, immich, linkwarden, nas, technitium, terminal, speedtest, and every other proxied app. |
| **Issue** | None filed (diagnosed and fixed in-session). |
| **Status** | Resolved. |
| **Recurrence count** | 1st of this exact kind. Same `.200→.203` migration family as the 2026-06-01 forgejo-registry containerd-redirect fix (`a382683c`). |
## Summary
On 2026-05-30 Traefik was moved off the shared MetalLB IP `10.0.20.200` onto a dedicated `10.0.20.203`. The migration plan correctly identified that the Cloudflare tunnel had to be repointed away from `10.0.20.200:443` **first**, and it was — but the repoint was done **out-of-band via the Cloudflare Global API Key**, not in Terraform. The Terraform source (`stacks/cloudflared/modules/cloudflared/cloudflare.tf`) was left pointing at `https://10.0.20.200:443`, creating silent drift between live (correct: service DNS) and code (stale: `.200`).
External ingress kept working for ~2 days on the manual config. Then on 2026-06-01 a `terragrunt apply` of the cloudflared stack reconciled live back to the stale Terraform value `https://10.0.20.200:443`. Nothing serves HTTPS on `.200:443` after the Traefik move, so cloudflared could not reach its origin (`connect: no route to host` / `i/o timeout`) and Cloudflare returned 502 across the entire public surface.
Fix: codify the correct origin in Terraform — both ingress rules now point at `https://traefik.traefik.svc.cluster.local:443` (in-cluster Traefik Service DNS). This both restores ingress and makes it permanent (TF and live agree; future applies can't revert it; the origin is decoupled from the Traefik LB IP entirely).
## Impact
- **User-facing**: 100% of externally-reachable services returned 502 via Cloudflare. LAN/internal access (which resolves `*.viktorbarzin.me``10.0.20.203` via Technitium split-horizon, bypassing Cloudflare) kept working — this masked the outage.
- **Blast radius**: every proxied hostname. Origin (Traefik) was healthy the whole time — purely a tunnel-origin routing fault.
- **Data loss**: none.
- **Collateral**: Vault's own public hostname (`vault.viktorbarzin.me`) was also 502, creating a bootstrap problem — `terragrunt apply` needs Vault for the PG state-backend creds, but Vault was only reachable from the dev box via the broken tunnel. Worked around with a temporary `/etc/hosts` entry pointing `vault.viktorbarzin.me``10.0.20.203` (internal Traefik), removed after the apply.
## Root Cause
**A manual (out-of-band) fix was never codified in Terraform, and a later Terraform apply reverted it.** This is a direct violation of the repo's "Terraform Only — all infra changes go through Terraform" rule. The 2026-05-30 plan applied the tunnel repoint via the Cloudflare API for speed/safety during the migration but did not land the equivalent change in `cloudflare.tf`. Terraform's authority over the resource guaranteed the manual change would eventually be reverted; it was, on the next apply.
Contributing factors:
- **No drift alarm tied this to user impact.** The TF/live divergence on `cloudflare_zero_trust_tunnel_cloudflared_config` existed for ~2 days; drift-detection (if it ran) didn't escalate it as outage-risk.
- **Detection gap (masking).** Split-horizon means LAN users never see external-only breakage. The `[External]` Uptime-Kuma monitors + `ExternalAccessDivergence` alert are the only signal for this failure mode, and they did not prompt action.
- **Docs vs code.** CLAUDE.md described cloudflared as targeting the service DNS — true of live (post-manual-fix) but not of the TF source. The doc masked the drift.
## Timeline (UTC)
| Time | Event |
|------|-------|
| **2026-05-30 ~08:09** | Traefik Service moved to `10.0.20.203` (`ETP=Local`). Plan step 1 repoints the tunnel `https://10.0.20.200:443``https://traefik.traefik.svc.cluster.local:443` **via the CF Global API Key (out-of-band)**. Ingress works. `cloudflare.tf` still says `.200`**drift**. |
| **2026-05-30 → 06-01** | External ingress healthy on the manual config. Drift sits unnoticed. |
| **2026-06-01 (during the day)** | A `terragrunt apply` of the cloudflared stack reconciles the tunnel origin back to the stale TF value `https://10.0.20.200:443`. External ingress breaks → 502. |
| **2026-06-01 ~20:51** | Keel auto-patches the cloudflared image; pods roll (coincidental, not causal). |
| **~20:58** | cloudflared logs show every proxied hostname failing against `https://10.0.20.200:443` (`no route to host` / `i/o timeout`). |
| **21:08** | User reports "no ingress coming in." Investigation starts. |
| **21:09** | Isolated: origin healthy (direct to `.203` → 200/302), public path → 502; logs pin origin to dead `.200:443`. |
| **21:13** | Vault unreachable via public name (circular dep); temp `/etc/hosts``.203`. `tg init -reconfigure` (rotated PG backend creds). |
| **21:15:25** | Targeted apply: both ingress origins → service DNS. `Apply complete! 1 changed`. |
| **21:16** | 10/10 curls to `viktorbarzin.me` → 200; 0 `.200` errors across all pods; `vault.viktorbarzin.me` via real Cloudflare → 200. Temp hosts entry removed. Resolved + committed (`f807050e`). |
## Resolution
Changed both `ingress_rule` blocks in `cloudflare.tf` from `https://10.0.20.200:443` to `https://traefik.traefik.svc.cluster.local:443` (`no_tls_verify` retained), making the Terraform source match the intended (and previously-manual) live config. Applied surgically with `-target` on the tunnel config resource only, to avoid touching two pre-existing, unrelated drift items the full plan surfaced (below). Committed `[ci skip]` since live already matched after the targeted apply.
## Pre-existing drift (NOT part of this incident, left untouched)
The full `cloudflared` stack plan showed two extra in-place changes, deliberately **not** applied:
1. `kubernetes_deployment.cloudflared` — TF would strip Keel's runtime annotations (`keel.sh/policy|pollSchedule|trigger|update-time`). The deployment ignores `dns_config` but not `metadata.annotations`. Self-healing (Keel re-adds within its 1h poll); clean fix is to add `metadata[0].annotations` (+ template equivalent) to `ignore_changes`.
2. `cloudflare_record.mail_domainkey_rspamd` — cosmetic re-chunking of the DKIM TXT record (identical key). Benign.
## Action Items
- [x] Codify tunnel origin (service DNS) in `cloudflare.tf` (this fix) — drift eliminated.
- [x] Fix stale `10.0.20.200:443` Traefik reference in `docs/runbooks/kms-public-exposure.md` (→ `.203`).
- [x] Post-mortem written.
- [ ] **Audit for other out-of-band changes from the 2026-05-30 migration** that were applied via CF API / kubectl / pfSense but not landed in code — they will all revert on the next apply.
- [ ] **Make `ExternalAccessDivergence` trustworthy and seen** — it is the only signal for external-only outages and did not prompt action here.
- [ ] **Drift detection should flag tunnel-origin divergence as outage-risk**, not just generic drift.
- [ ] (Optional) Pin the exact reverting-apply time via Woodpecker pipeline history for the cloudflared stack on 2026-06-01.
- [ ] (Optional) Fix the cloudflared Keel-annotation drift so the stack plans clean.
## Lessons
- **Codify out-of-band fixes immediately.** A manual change to a Terraform-managed resource is a time bomb — Terraform *will* revert it on the next apply. The "Terraform Only" rule exists for exactly this; the 05-30 manual tunnel repoint should have been mirrored into `cloudflare.tf` the same day.
- **Reference shared infra (Traefik) by stable Service DNS, not LB IP**, from anything that can use cluster DNS. The service-DNS origin also happens to survive LB-IP moves.
- **External-only outages are invisible from the LAN** (split-horizon). The `[External]` divergence signal is load-bearing — it must be trustworthy and actually seen.
- **Keep docs honest about source-of-truth.** "Live is correct" is not the same as "code is correct"; a doc that conflates them hides drift.

View file

@ -0,0 +1,117 @@
# Post-Mortem: Keel `match-tag` cross-assigned the blog's container images (site down ~6 days)
| Field | Value |
|-------|-------|
| **Date** | 2026-06-01 |
| **Duration** | 2026-05-26 19:47 UTC → 2026-06-01 ~16:00 UTC (~6 days) |
| **Severity** | SEV3 — `viktorbarzin.me` (public blog) fully down; user-facing, but a personal blog with no SLA |
| **Affected** | `website/blog` Deployment (acute outage). Latent: 194 enrolled workloads carried the same stale annotation; 16 were multi-image swap-risk |
| **Status** | Fixed — images un-swapped, `keel.sh/match-tag` stripped fleet-wide, `inject-keel-annotations` policy hardened to strip it on admission |
## Summary
Reported by the operator as "blog is crashlooping." The `website/blog` pod was
`1/2 CrashLoopBackOff`. The two container images had been **swapped**: the
container named `nginx-exporter` was running the nginx blog image
(`viktorbarzin/blog:cfd39d6f`) and receiving the exporter's
`-nginx.scrape-uri` arg — nginx's entrypoint rejected `-n` (`illegal option`)
and crashed; while the container named `blog` was running the exporter image
(`nginx/nginx-prometheus-exporter:1.5.1`), listening on `:9113` instead of
serving the site on `:80`. **Nothing served `:80`, so the blog was fully down**
(Anubis → `blog:80` → connection refused), not merely a crashing sidecar.
The swap happened **2026-05-26 19:47 UTC** (rollout revisions 2835, all stamped
`keel automated update, version latest -> 1.5.1`) and went unnoticed for ~6 days.
## Root cause (chain)
1. The `inject-keel-annotations` Kyverno policy stamps Keel control annotations
on every workload in `keel.sh/enrolled=true` namespaces. Before 2026-05-26
the default was `keel.sh/policy: force` + `keel.sh/match-tag: "true"`.
2. The `blog` Deployment runs **two containers with two different images that
both float on tag `latest`**: `viktorbarzin/blog:latest` and
`nginx/nginx-prometheus-exporter` (→ `:latest`).
3. On 2026-05-26 `nginx/nginx-prometheus-exporter` published semver `1.5.1`.
Under `force + match-tag`, Keel rewrote the deployment and **cross-assigned
the two images** — the exact class of failure the same-day incident
documented (uptime-kuma `:2→:1`, n8n `:1.80.5→:0.1.2`, etc.). The blog was a
casualty of that incident but was **not on the cleanup list**.
4. Same day, the policy default was switched `force → patch` and `match-tag` was
dropped from the patch — but **Kyverno's add-only `patchStrategicMerge`
cannot remove an annotation that's no longer listed**. So ~194 pre-migration
workloads (the blog included) kept a stale `keel.sh/match-tag=true`.
5. Because the blog's images are in Terraform `ignore_changes` (Keel/Woodpecker
own them) and the keel annotations are policy-managed (not in the stack), a
`terraform apply` would not have corrected either field — the broken state
was invisible to the normal apply/drift loop.
## Why hard to spot
- **No crash on most swaps.** A swap only hard-crashes when a container's args
are rejected by the wrong image. The blog crashed because nginx got
`-nginx.scrape-uri`. The sibling `travel_blog` has `match-tag` too but its
exporter sidecar is commented out (single container — nothing to cross-wire),
so it was fine. `changedetection` shows crossed images but both boot without
conflicting args, so it ran 2/2 for days — silently mis-wired, no alert.
- **No external monitor caught it.** The Anubis challenge page returns 200
without reaching the backend, so a naive front-door check looks healthy.
- The acute symptom (`CrashLoopBackOff`) was only visible via `kubectl`, and the
blog has no SLA, so nothing paged.
## Fix (applied + committed 2026-06-01)
1. **Un-swapped the blog images** via `kubectl set image` (the same path
Woodpecker uses for this TF-ignored image): `blog=viktorbarzin/blog:cfd39d6f`,
`nginx-exporter=nginx/nginx-prometheus-exporter:1.5.1`. Pod is 2/2; site
serves 200 internally and externally (`/net-diag.sh` via the Anubis-bypass
carve-out returned the real 40 KB script).
2. **Removed the orphaned annotation** from the blog (`kubectl annotate …
keel.sh/match-tag-`).
3. **Hardened the policy** (`stacks/kyverno/modules/kyverno/keel-annotations.tf`):
added `keel.sh/match-tag = null` to the `patchStrategicMerge`, so the
annotation is stripped on admission for every enrolled workload and can never
be re-added.
4. **Swept the fleet.** `mutateExistingOnPolicyUpdate` did *not* regenerate
UpdateRequests for a removal-only change (Kyverno re-mutates existing
resources for add/set, not deletions), so the 194 pre-existing workloads were
swept once with `kubectl annotate <kind>/<name> -n <ns> keel.sh/match-tag-`.
Annotation-only ⇒ no pod restarts (verified: vault/CSI/monitoring pod ages
unchanged). Remaining `match-tag=true`: 0.
## Lessons
- **Add-only mutation can't undo itself.** Dropping a key from a Kyverno
`patchStrategicMerge` does not remove it from already-mutated resources — you
must set it to `null` *and* sweep existing ones. The 2026-05-26 migration did
neither, leaving 194 landmines.
- **Multi-image pods + a shared floating tag + `force`/`match-tag` = swap risk.**
Keep third-party sidecars on explicit pinned tags, not `latest`, so they never
share a tag with the app image.
- **State that Terraform `ignore_changes` is invisible to drift detection.**
Image fields and policy-managed annotations won't show up in `plan`; they need
their own verification (a synthetic backend probe, not just the front door).
## Audit result (completed 2026-06-01)
All 16 multi-image swap-risk workloads were checked. **Only two were actually
swapped:**
- `website/blog` — acute crash (fixed, un-swapped).
- `changedetection/changedetection`*silent* swap: it ran 2/2 for days
because pod containers share a network namespace (each process still bound its
own port), but the app was running **without its `/datastore` PVC, without
`PLAYWRIGHT_DRIVER_URL`/`BASE_URL`, and at a 128Mi cap** — config was ephemeral
and one restart from total loss. Un-swapped; `/datastore` (watch config back to
Feb 2026) re-mounted; app confirmed serving `200` with watches loaded.
The other 14 are NOT swapped: `insta2spotify` and `priority-pass` (the other
custom app+helper pairs) verified correctly mapped; the rest are upstream Helm
charts (grafana, prometheus, loki, alloy, vault, the CSI controllers/nodes,
mysql) with fixed image→container mappings, all healthy. `match-tag` is now
stripped from all of them, so none can swap again.
## Recommendation (not yet actioned)
- An **external monitor that hits the bare blog backend** (bypassing Anubis)
would have caught this: the Anubis challenge page returns `200` without
reaching the backend, so the front-door monitor stayed green for 6 days.

View file

@ -0,0 +1,85 @@
# Post-Mortem: immich-ml VRAM hog (MODEL_TTL=0) starved llama-swap → recruiter-responder silently down
| Field | Value |
|-------|-------|
| **Date** | 2026-06-02 |
| **Duration** | Triage failing 17:41 → ~20:08 EEST (~2.5 h confirmed in retained logs; first 502 at 17:41) |
| **Severity** | SEV3 — one pipeline (recruiter-responder) fully down; no data loss (emails preserved unseen); no other user-facing impact |
| **Affected** | `recruiter-responder` (triage). Root cause in `immich-machine-learning` + shared T4 GPU. |
| **Status** | Fixed — `immich` `MACHINE_LEARNING_MODEL_TTL` 0 → 600; immich-ml VRAM dropped 10.7 GB → ~1.9 GB; qwen3-8b loads again; backlog reprocessed. |
## Summary
Reported by the operator: "receiving recruiter emails but seeing no responses."
The recruiter-responder IMAP IDLE reader was healthy and fetching mail, but every
email failed at the triage step with `502 Bad Gateway` from llama-swap. llama-swap
could not load its `qwen3-8b` model because the shared Tesla T4 (16 GB) had only
~2.2 GB free — `immich-machine-learning` was holding **10.7 GB** and never released
it. Because triage *raised* (not swallowed), each email was left **unseen** and
retried, so no mail was lost — but no draft/event/Telegram notification was ever
produced.
## Root cause (chain)
```
immich-ml runs with MACHINE_LEARNING_MODEL_TTL=0 → ModelCache(revalidate=False),
per-model TTL eviction + idle-shutdown both DISABLED → nothing ever unloads
heavy immich library job ~17:17 (metadata + smartSearch + OCR + face) runs OCR
(PP-OCRv5, dynamic input shapes) → onnxruntime BFC CUDA arena inflates to ~10.7 GB
TTL=0 → the arena floor is permanent (onnxruntime doesn't cudaFree between runs;
only a process restart reclaims it)
T4 free VRAM ~2.2 GB (T4 is time-sliced across immich-ml / immich-server /
frigate / llama-swap with NO memory isolation)
llama-swap gets a qwen3-8b request → llama-server: cudaMalloc 4455 MiB OOM →
"exiting due to model loading error" → llama-swap returns 502
recruiter-responder triage.py raise_for_status() → orchestrator raises →
imap_idle leaves the message UNSEEN (BODY.PEEK) → no draft/event → no Telegram
```
## Why it was hard to spot
- **Everything showed `Running`/healthy**: the recruiter-responder, llama-swap, and
immich-ml pods were all `1/1 Running` with 0 restarts. The failure was a runtime
502, not a crash.
- **`nvidia-smi` inside a container shows "No running processes found"** (PID-namespace
isolation) — per-process VRAM attribution needed the host-PID `gpu-pod-exporter`
(`nvidia-smi --query-compute-apps`), which pinned the 10.7 GB on `immich_ml.main`.
- **Silent**: triage errors only landed in recruiter-responder logs; no alert fired
on llama-swap 5xx or on low GPU free-VRAM. ~440 triage attempts failed before the
operator noticed organically.
## Resolution
- `stacks/immich/main.tf`: `MACHINE_LEARNING_MODEL_TTL` `0``600` (targeted apply of
`kubernetes_deployment.immich-machine-learning`). The Recreate rollout cleared the
stuck arena immediately; going forward, idle ad-hoc models (OCR, face) unload after
600 s and return VRAM, while preloaded CLIP (smart search) stays warm.
- Verified: T4 used 12571 → 3785 MiB (11.1 GB free); immich-ml 10726 → 1940 MiB;
`qwen3-8b` chat completion returns HTTP 200; recruiter-responder reprocessed its
unseen backlog with triage `200 OK`.
## Why MODEL_TTL=0 was set (and the correction)
`MODEL_TTL=0` was almost certainly chosen to keep the smart-search model permanently
warm for snappy search. The unintended consequence: it *also* pins every ad-hoc model
(OCR/face) and lets onnxruntime's arena grow unbounded on a GPU it doesn't own alone.
immich has **no per-model TTL** (a single global knob; the idle path kills the whole
worker via `os.kill(getpid(), SIGINT)` and respawns), so the practical compromise is a
moderate global TTL + CLIP preload: CLIP reloads in ~10 s on the rare idle miss, while
OCR/face free their VRAM.
## Follow-ups (not yet done — operator declined hardening this session)
- **Alerting** on (a) GPU free-VRAM below a threshold and (b) llama-swap 5xx /
recruiter-responder triage failure rate, so a future starvation doesn't sit silent.
(Operator believes existing alerts cover it — unverified here.)
- **Optional** recruiter-responder resilience: fall back to a smaller model
(`qwen3vl-4b`) or the Tier-1 GPT relay when llama-swap 502s.
- **Separate pre-existing issue** surfaced in immich-server logs: repeated
`AssetExtractMetadata` `ENOENT` on `upload/upload/...` paths (missing originals) —
unrelated to this incident; worth a look.

View file

@ -0,0 +1,138 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Post-Mortems — viktorbarzin.me</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&family=IBM+Plex+Sans:wght@300;400;500&display=swap" rel="stylesheet">
<style>
:root {
--bg: #f5f3f0;
--surface: #ffffff;
--text: #1a1215;
--text-secondary: #6b5e64;
--border: #ddd5d0;
--accent: #b91c1c;
}
@media (prefers-color-scheme: dark) {
:root {
--bg: #0f0b0d;
--surface: #1e1719;
--text: #ede8ea;
--text-secondary: #a89da2;
--border: #332b2e;
--accent: #ef4444;
}
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: 'IBM Plex Sans', sans-serif;
background: var(--bg);
color: var(--text);
padding: 60px 24px;
max-width: 800px;
margin: 0 auto;
}
h1 {
font-family: 'Space Grotesk', sans-serif;
font-size: 2rem;
font-weight: 700;
margin-bottom: 8px;
letter-spacing: -0.02em;
}
.subtitle {
color: var(--text-secondary);
margin-bottom: 40px;
font-size: 0.95rem;
}
.incident-list { list-style: none; }
.incident-item {
background: var(--surface);
border: 1px solid var(--border);
border-radius: 10px;
padding: 20px 24px;
margin-bottom: 12px;
transition: border-color 0.2s;
}
.incident-item:hover { border-color: var(--accent); }
.incident-item a {
text-decoration: none;
color: var(--text);
display: block;
}
.incident-date {
font-family: 'Space Grotesk', sans-serif;
font-size: 0.8rem;
color: var(--text-secondary);
font-weight: 500;
letter-spacing: 0.04em;
}
.incident-title {
font-family: 'Space Grotesk', sans-serif;
font-size: 1.15rem;
font-weight: 600;
margin: 4px 0;
}
.incident-desc {
font-size: 0.85rem;
color: var(--text-secondary);
}
.sev-tag {
display: inline-block;
font-family: 'Space Grotesk', sans-serif;
font-size: 0.7rem;
font-weight: 600;
padding: 2px 8px;
border-radius: 4px;
background: rgba(185, 28, 28, 0.1);
color: var(--accent);
border: 1px solid var(--accent);
text-transform: uppercase;
letter-spacing: 0.04em;
margin-left: 8px;
vertical-align: middle;
}
footer {
margin-top: 40px;
padding-top: 20px;
border-top: 1px solid var(--border);
font-size: 0.7rem;
color: var(--text-secondary);
text-align: center;
}
</style>
</head>
<body>
<h1>Post-Mortems</h1>
<p class="subtitle">Incident reviews for the viktorbarzin.me Kubernetes cluster</p>
<ul class="incident-list">
<li class="incident-item">
<a href="2026-04-14-nfs-fsid0-dns-vault-outage.md">
<span class="incident-date">2026-04-14</span>
<span class="sev-tag">SEV 1</span>
<div class="incident-title">NFS fsid=0 Cascade &mdash; DNS + Vault + Multi-Service Outage</div>
<div class="incident-desc">5h outage: fsid=0 in PVE /etc/exports broke NFSv4 subdirectory mounts &rarr; Technitium primary I/O errors &rarr; Vault lost quorum &rarr; Alertmanager blind &rarr; 25+ pods affected across 15+ namespaces.</div>
</a>
</li>
<li class="incident-item">
<a href="2026-03-16-nfs-csi-cascade-failure.md">
<span class="incident-date">2026-03-16</span>
<span class="sev-tag">SEV 1</span>
<div class="incident-title">NFS CSI Cascade Failure</div>
<div class="incident-desc">47h outage: NFS CSI driver liveness-probe port conflict &rarr; all NFS mounts fail &rarr; 40+ pods stuck across 20+ namespaces.</div>
</a>
</li>
<li class="incident-item">
<a href="2026-03-16-kured-containerd-cascade-outage.html">
<span class="incident-date">2026-03-16</span>
<span class="sev-tag">SEV 1</span>
<div class="incident-title">Kured + Containerd Cascade Outage</div>
<div class="incident-desc">26h cluster outage: unattended-upgrades kernel update &rarr; kured reboot &rarr; containerd overlayfs snapshotter corruption &rarr; calico down &rarr; cascading failure across all 5 nodes.</div>
</a>
</li>
</ul>
<footer>viktorbarzin.me infrastructure</footer>
</body>
</html>