From 4e882989769913060f099a262f6201c92cc0740c Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Thu, 11 Jun 2026 00:26:52 +0000
Subject: [PATCH] authentik: incident hardening after the signin-speedup rollout storm

The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5
---
 .claude/reference/authentik-state.md | 3 +
 ...26-06-10-authentik-downgrade-boot-storm.md | 76 +++++++++++++++++++
 stacks/authentik/authentik_provider.tf | 7 --
 .../authentik/modules/authentik/pgbouncer.ini | 4 +
 .../authentik/modules/authentik/pgbouncer.tf | 5 ++
 .../authentik/modules/authentik/values.yaml | 38 +++++++---
 stacks/traefik/modules/traefik/error-pages.tf | 13 +++-
 stacks/traefik/modules/traefik/main.tf | 33 +++++++-
 8 files changed, 156 insertions(+), 23 deletions(-)
 create mode 100644 docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md

diff --git a/.claude/reference/authentik-state.md b/.claude/reference/authentik-state.md
index 919e286b..125d1a71 100644
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -179,6 +179,9 @@ Notes:
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
+- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
+- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
+- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
 - **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
 
 ## Upgrade Validation Checklist
diff --git a/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md b/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md
new file mode 100644
index 00000000..3d5d8750
--- /dev/null
+++ b/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md
@@ -0,0 +1,76 @@
+# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
+
+**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
+ingresses and every OIDC app) degraded/unavailable for ~50 minutes
+(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
+prompts during outpost-check failures. The shared CNPG primary failed over
+(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
+tenant.
+
+**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
+signin speedup work — env tuning, outpost config, static-asset ingress).
+
+## Root causes (three stacked)
+
+1. **Helm/Keel version split → silent downgrade.** Keel (namespace
+   `keel.sh/enrolled` + diun annotations) had upgraded the live authentik
+   image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
+   appVersion drives the image tag). The values-only apply therefore rolled
+   every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
+   database. Cores never came up healthy (`failed to proxy to backend`, plus
+   Django cross-version serialized-cache warnings), and mid-storm Keel
+   re-upgraded the image, adding a third ReplicaSet to the churn.
+
+2. **Liveness budget too small for authentik's boot.** The chart-default
+   liveness probe (3×10s, 3s timeout) kills a pod ~30s after the go layer
+   passes the startup probe — but during a rolling restart the Python core
+   still waits on authentik's DB **migration advisory lock** (60–120s+ under
+   contention). kubelet kill-looped every booting pod, and each kill increased
+   lock contention for the rest (thundering herd).
+
+3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
+   server connections `idle in transaction` still **holding the migration
+   advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
+   idle 2+ min). Every subsequent boot serialized behind a dead client.
+   PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
+
+**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
+every Django thread hold its connection persistently; with PgBouncer in
+*session* mode each one pins a server connection 1:1, so the restart churn
+saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
+75 of 108 connections on the new primary). The shared primary's
+restart/failover at 22:40 fits this storm window.
+
+## Resolution
+
+- Scaled workers to 0 (transient) to free pool capacity; rollout converged
+  once, then re-degraded when workers returned.
+- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
+  6×10s/5s) — final state codified in Helm values in the same session.
+- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
+  (twice).
+- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
+  to 3 — converged cleanly (51s boots, zero restarts).
+- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
+  removed, liveness in values, pgbouncer reaper config).
+
+## Prevention (all landed in this change)
+
+| Cause | Fix |
+|---|---|
+| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
+| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
+| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
+| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
+
+## Lessons
+
+- **Check the live image tag against the chart pin before ANY helm-managed
+  apply on a Keel-enrolled namespace.** `kubectl get deploy -o
+  jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
+  is a version change, not a config change.
+- A "stuck rollout" of authentik is usually the migration advisory lock:
+  check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
+  holders before blaming probes or resources.
+- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
+  Access path); without it every protected app would have hard-failed.
diff --git a/stacks/authentik/authentik_provider.tf b/stacks/authentik/authentik_provider.tf
index e69657cc..573adcca 100644
--- a/stacks/authentik/authentik_provider.tf
+++ b/stacks/authentik/authentik_provider.tf
@@ -217,11 +217,6 @@ resource "authentik_stage_user_login" "default_login" {
 # screen and bypass the password field.
 # -----------------------------------------------------------------------------
 
-import {
-  to = authentik_stage_identification.default_identification
-  id = "32aca5ab-106e-43f4-a4cc-4513d80e57f3"
-}
-
 data "authentik_stage" "default_authentication_password" {
   name = "default-authentication-password"
 }
@@ -243,8 +238,6 @@ resource "authentik_stage_identification" "default_identification" {
       passwordless_flow,
       pretend_user_exists,
       captcha_stage,
-      webauthn_stage,
-      enable_remember_me,
     ]
   }
 }
diff --git a/stacks/authentik/modules/authentik/pgbouncer.ini b/stacks/authentik/modules/authentik/pgbouncer.ini
index 8148ab53..74fbbf10 100644
--- a/stacks/authentik/modules/authentik/pgbouncer.ini
+++ b/stacks/authentik/modules/authentik/pgbouncer.ini
@@ -12,3 +12,7 @@ default_pool_size = 20
 reserve_pool_size = 5
 reserve_pool_timeout = 5
 ignore_startup_parameters = extra_float_digits
+; Reap server connections stuck "idle in transaction" (e.g. an authentik pod
+; killed mid-migration leaves a ghost transaction holding the migration
+; advisory lock, serializing every subsequent pod boot — 2026-06-10 incident).
+idle_transaction_timeout = 300
diff --git a/stacks/authentik/modules/authentik/pgbouncer.tf b/stacks/authentik/modules/authentik/pgbouncer.tf
index 3029852f..a501b950 100644
--- a/stacks/authentik/modules/authentik/pgbouncer.tf
+++ b/stacks/authentik/modules/authentik/pgbouncer.tf
@@ -48,6 +48,11 @@ resource "kubernetes_deployment" "pgbouncer" {
         labels = {
           app = "pgbouncer"
         }
+        annotations = {
+          # pgbouncer reads its ini only at startup (subPath mount never
+          # propagates updates anyway) — roll the pods on config change.
+          "checksum/pgbouncer-config" = sha1(kubernetes_config_map.pgbouncer_config.data["pgbouncer.ini"])
+        }
       }
 
       spec {
diff --git a/stacks/authentik/modules/authentik/values.yaml b/stacks/authentik/modules/authentik/values.yaml
index 56f19d40..9a7c0c1a 100644
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@@ -47,13 +47,20 @@ server:
       value: "1800"
     - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
       value: "900"
-    # Persistent client-side DB connections (safe with PgBouncer session mode;
-    # must stay < pgbouncer server_idle_timeout=600s). Cuts per-request Django
-    # connection setup off the auth hot path.
-    - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
-      value: "60"
-    - name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
-      value: "true"
+    # Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE here. With PgBouncer in
+    # session mode every persistent Django connection pins a server connection
+    # 1:1, so the 3x(20+5) pool saturated during the 2026-06-10 rolling
+    # restart (58s pool waits, readiness flapping, and the shared CNPG primary
+    # failed over mid-storm). The ~1-2ms/request connection-setup saving is
+    # not worth that risk on the shared PG substrate.
+  # Liveness budget sized for slow boots (2026-06-10 incident): during a
+  # rolling restart pods queue on authentik's DB migration lock; the go layer
+  # answers /-/health/live before the core is up, so with the default 3x10s
+  # budget kubelet kill-looped every booting pod and amplified the contention.
+  # Startup probe still bounds total boot time (60x10s).
+  livenessProbe:
+    failureThreshold: 6
+    timeoutSeconds: 5
   strategy:
     type: RollingUpdate
     rollingUpdate:
@@ -84,6 +91,16 @@ server:
     minAvailable: 2
 global:
   addPrometheusAnnotations: true
+  image:
+    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
+    # namespace) bumps the IMAGE between chart releases, while helm defaults
+    # the tag to the chart appVersion — so any helm upgrade silently
+    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
+    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
+    # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
+    # boot-storm.md). Keep this tag in sync with what Keel has deployed when
+    # touching this chart; clear it only when bumping the chart version itself.
+    tag: "2026.2.4"
 
 worker:
   # 2 replicas: workers handle background tasks (LDAP sync, email,
@@ -98,15 +115,12 @@ worker:
     # Dramatiq worker threads per pod (default 2).
     - name: AUTHENTIK_WORKER__THREADS
       value: "4"
-    # Keep cache + DB-connection settings in lockstep with server.env.
+    # Keep cache settings in lockstep with server.env. (No CONN_MAX_AGE —
+    # see the server.env note: session-mode PgBouncer pins persistent conns.)
     - name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
       value: "1800"
     - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
       value: "900"
-    - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
-      value: "60"
-    - name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
-      value: "true"
   strategy:
     type: RollingUpdate
     rollingUpdate:
diff --git a/stacks/traefik/modules/traefik/error-pages.tf b/stacks/traefik/modules/traefik/error-pages.tf
index 54346a07..34c1e997 100644
--- a/stacks/traefik/modules/traefik/error-pages.tf
+++ b/stacks/traefik/modules/traefik/error-pages.tf
@@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
   }
 
   lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+    ]
   }
 }
 
diff --git a/stacks/traefik/modules/traefik/main.tf b/stacks/traefik/modules/traefik/main.tf
index 1dbc793e..7b3d3f2c 100644
--- a/stacks/traefik/modules/traefik/main.tf
+++ b/stacks/traefik/modules/traefik/main.tf
@@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
 
@@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {
 
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
 
@@ -922,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
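
Reviewer note (not part of the patch): the post-mortem's first lesson, comparing the live image tag with the chart's appVersion before any helm-managed apply, could be automated as a pre-apply guard. Below is a minimal Python sketch, assuming tags are dotted numeric versions such as `2026.2.4`. The names `parse_calver` and `classify_apply` are hypothetical and do not exist in this repository, and fetching the two tags (for example from `kubectl` and the chart metadata) is left out.

```python
def parse_calver(tag: str) -> tuple[int, ...]:
    """Turn a dotted numeric tag like '2026.2.4' into (2026, 2, 4).

    Assumption: tags are purely numeric after an optional leading 'v';
    suffixes such as '-rc1' are not handled and raise ValueError.
    """
    return tuple(int(part) for part in tag.strip().lstrip("v").split("."))


def classify_apply(live_tag: str, chart_tag: str) -> str:
    """Classify what a helm apply would do to the running image.

    live_tag  -- tag currently running in the cluster (e.g. moved by Keel)
    chart_tag -- tag helm would render (pinned value or chart appVersion)
    """
    live, chart = parse_calver(live_tag), parse_calver(chart_tag)
    if chart < live:
        return "downgrade"    # the 2026-06-10 case: 2026.2.4 -> 2026.2.2
    if chart > live:
        return "upgrade"
    return "config-only"      # tags match: the apply changes config only


if __name__ == "__main__":
    # Values from the incident: live image ahead of the chart pin.
    print(classify_apply("2026.2.4", "2026.2.2"))
```

A wrapper around `tg apply` might abort on `"downgrade"` and ask for explicit confirmation on `"upgrade"`; that policy is a suggestion, not something this change implements.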