[docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)

## Context On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP 400 for all users. Reported first as a per-user issue affecting Emil since 2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached session stopped being enough. Duration: ~44h for the first-affected user, ~30 min from cluster-wide report to unblocked. ## Root cause The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions FileStore. Every forward-auth request with no valid cookie creates one session-state file; with `access_token_validity=7d` and measured ~18 files/min, steady-state accumulation (~180k files) vastly exceeds the default tmpfs. Once full, every new `store.Save()` returned ENOSPC and the outpost replied HTTP 400 instead of the usual 302 to login. ## What's captured - Full timeline, impact, affected services - Root-cause chain diagram (request rate → retention → ENOSPC → 400) - Why diagnosis took 2 days (misattribution of a Viktor event to Emil, red-herring suspicion of the new Rybbit Worker, cached sessions masking the outage) - Contributing factors + detection gaps - Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream - Lessons learned (check outpost logs first; cookie-less `curl` disproves per-user symptoms fast; UI-managed Authentik config is invisible to git) ## Follow-ups not in this commit - Prometheus alert for outpost /dev/shm usage > 80% - Meta-alert for correlated Uptime Kuma external-monitor failures - Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction (see discussion in beads code-zru) Closes: code-zru Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00 · 2026-04-18 13:12:27 +00:00 · b41528e564
commit b41528e564
parent 6e19dce99e
1 changed files with 150 additions and 0 deletions
--- a/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
+++ b/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
@ -0,0 +1,150 @@
 # Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
 | Field | Value |
 |-------|-------|
 | **Date** | 2026-04-18 |
 | **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
 | **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
 | **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
 | **Status** | Root cause fixed; permanent mitigation applied; alerting still TODO |
 ## Summary
 The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
 Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
 ## Impact
 - **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
 - **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
 - **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
 - **Data loss**: None. Session state in tmpfs is ephemeral by design.
 - **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
 ## Timeline (UTC)
 | Time | Event |
 |------|-------|
 | **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
 | **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
 | **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
 | **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
 | **Apr 18 09:00–12:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
 | **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
 | **Apr 18 12:34** | Viktor reports he can no longer log in either. |
 | **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
 | **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
 | **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
 | **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
 | **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
 | **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
 ## Root Cause Chain
 ```
 [1] goauthentik/proxy outpost uses gorilla/sessions FileStore
 └─> each forward-auth request that has no valid session cookie writes
     /dev/shm/session_<random> (~1500 bytes/file)
         │
         ├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
         │    └─> each file's MaxAge = 7 days
         │        └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
         │            delete files whose MaxAge has EXPIRED, not whose age exceeds any
         │            shorter threshold
         │
         ├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
         │    real user traffic)
         │    └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
         │
         └─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
              └─> 64 MB / 1500 bytes ≈ 44,000 files maximum
                  └─> Full in approx 44,000 / (18 × 60) min ≈ 41 hours
                      └─> Actual observed time: pod started Apr 15 ~09:21,
                          first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
                          (some excess from Uptime-Kuma bursts)
 [ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
         -> Traefik forwards the 400 to the browser
         -> user sees "400 Bad Request" on every protected site
 ```
 ## Why Diagnosis Took So Long
 The initial report was framed as "Emil can't access terminal" — a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, all of which turned out to be falsified.
 Contributing distractions:
 1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
 2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
 3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
 4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
 ## Contributing Factors
 1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
 2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
 3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
 4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
 5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
 6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
 ## Detection Gaps
 | Gap | Impact | Fix |
 |-----|--------|-----|
 | No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
 | No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
 | Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
 | Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
 ## Prevention Plan
 ### P0 — Prevent this exact failure
 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
 | P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
 ### P1 — Detect this next time
 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
 | P1 | Prometheus alert on outpost `/dev/shm` usage > 80% | Alert | Metric: `container_fs_usage_bytes{container!="",namespace="authentik",pod=~"ak-outpost-.*"} / container_fs_limit_bytes > 0.8`. Firing threshold 15 min, severity warning. | TODO |
 | P1 | Prometheus alert on sustained 400 rate on forward-auth middleware | Alert | `increase(traefik_service_requests_total{code="400",service=~".*-viktorbarzin-me@.*"}[15m]) > 100` — catches mass-failure patterns at the Traefik level before the outpost is silently broken. | TODO |
 | P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
 ### P2 — Codify the fix so it survives drift
 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
 | P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
 | P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
 ### P3 — Upstream + architectural
 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
 | P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
 | P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
 | P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
 ## Lessons Learned
 1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
 2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
 3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context** — `curl` with no cookie jar is the fastest way.
 4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
 5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
 6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
 ## References
 - Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
 - Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
 - Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
 - Beads task: `code-zru` (P1 bug).