The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
Impact: Authentik (and therefore forward-auth for all ~67 auth="required"
ingresses and every OIDC app) degraded/unavailable for ~50 minutes
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
prompts during outpost-check failures. The shared CNPG primary failed over
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
tenant.
Trigger: a routine values-only tg apply on stacks/authentik (first-time
signin speedup work — env tuning, outpost config, static-asset ingress).
Root causes (three stacked)
- Helm/Keel version split → silent downgrade. Keel (namespace `keel.sh/enrolled` + diun annotations) had upgraded the live authentik image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose appVersion drives the image tag). The values-only apply therefore rolled every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated database. Cores never came up healthy (`failed to proxy to backend`, plus Django cross-version serialized-cache warnings), and mid-storm Keel re-upgraded the image, adding a third ReplicaSet to the churn.
- Liveness budget too small for authentik's boot. The chart-default liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer passes the startup probe, but during a rolling restart the Python core still waits on authentik's DB migration advisory lock (60–120s+ under contention). kubelet kill-looped every booting pod, and each kill increased lock contention for the rest (thundering herd).
- Ghost lock holders. Pods killed mid-migration-check left PgBouncer server connections `idle in transaction` still holding the migration advisory lock (observed twice: `SELECT * FROM authentik_version_history` idle 2+ min). Every subsequent boot serialized behind a dead client. PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
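The liveness arithmetic in the second root cause can be sketched as below. This is a rough illustration, assuming kubelet restarts a container after about `failureThreshold × periodSeconds` of consecutive probe failures (probe timeouts and jitter are ignored). The function names and the sample wait values are hypothetical, chosen from the figures quoted in this post-mortem.

```python
def liveness_budget_s(failure_threshold: int, period_s: int) -> int:
    """Approximate seconds of continuous probe failure before a restart."""
    return failure_threshold * period_s


def outlives(budget_s: int, wait_s: int) -> bool:
    """True if a pod that stays unhealthy for wait_s is not killed first."""
    return wait_s <= budget_s


chart_default = liveness_budget_s(3, 10)  # 3 failures x 10s period
hardened = liveness_budget_s(6, 10)       # 6 failures x 10s period

# Sample waits (illustrative): an uncontended boot and the contended range.
for wait_s in (51, 60, 120):
    print(wait_s, outlives(chart_default, wait_s), outlives(hardened, wait_s))
```

Under these assumptions the chart default covers none of the sampled waits, and the hardened probe covers only the lower end, which is consistent with the probe change being paired with fixes that shorten the lock wait itself.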
Aggravator: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60 (newly made live) made
every Django thread hold its connection persistently; with PgBouncer in
session mode each one pins a server connection 1:1, so the restart churn
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
75 of 108 connections on the new primary). The shared primary's
restart/failover at 22:40 fits this storm window.
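The pool figures above can be checked with a few lines of arithmetic. This sketch assumes the `3×(20+5)` expression means three PgBouncer instances, each with 20 regular slots plus 5 reserve slots; the text implies that reading but does not state it, and the function name is hypothetical.

```python
def pool_slots(instances: int, pool_size: int, reserve: int) -> int:
    """Total server connections PgBouncer can open across all instances."""
    return instances * (pool_size + reserve)


slots = pool_slots(3, 20, 5)
# Share of the new primary's connections held by authentik (75 of 108).
share = 75 / 108

print(f"{slots} pool slots; authentik share on the primary: {share:.0%}")
# prints "75 pool slots; authentik share on the primary: 69%"
```

With persistent connections in session mode, every pinned Django connection consumes one of those 75 slots for as long as the thread lives, so old and new pods overlapping during a rollout can exhaust the pool.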
Resolution
- Scaled workers to 0 (transient) to free pool capacity; rollout converged once, then re-degraded when workers returned.
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s → 6×10s/5s); final state codified in Helm values in the same session.
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders (twice).
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back to 3; converged cleanly (51s boots, zero restarts).
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age removed, liveness in values, pgbouncer reaper config).
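The ghost-holder hunt in the steps above can be sketched as follows. The query string is an assumed shape built from standard `pg_locks` and `pg_stat_activity` columns, not the exact statement run during the incident. The `ghost_pids` helper, the 2-minute threshold, and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed query: advisory-lock holders whose session is idle in transaction.
GHOST_QUERY = """
SELECT a.pid, a.state, a.state_change, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.granted
  AND a.state = 'idle in transaction'
"""


def ghost_pids(rows, now, min_idle=timedelta(minutes=2)):
    """Return pids that have sat idle in transaction for at least min_idle."""
    return [
        r["pid"]
        for r in rows
        if r["state"] == "idle in transaction"
        and now - r["state_change"] >= min_idle
    ]


# Hypothetical rows, shaped like the result of GHOST_QUERY.
now = datetime(2026, 6, 10, 22, 45, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "idle in transaction",
     "state_change": now - timedelta(minutes=3),
     "query": "SELECT * FROM authentik_version_history"},
    {"pid": 102, "state": "idle in transaction",
     "state_change": now - timedelta(seconds=5),
     "query": "SELECT 1"},
]
print(ghost_pids(rows, now))
```

Each returned pid is a candidate for `pg_terminate_backend(pid)`. The idle threshold keeps a pod that is legitimately mid-check from being terminated along with the ghosts.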
Prevention (all landed in this change)
| Cause | Fix |
|---|---|
| Helm/Keel version split | global.image.tag pinned in values.yaml to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
| Liveness kill loop | server.livenessProbe 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
| Ghost advisory-lock holders | idle_transaction_timeout = 300 in pgbouncer.ini + config-checksum annotation so ini changes actually roll pgbouncer pods. |
| Pool saturation | CONN_MAX_AGE removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
Lessons
- Check the live image tag against the chart pin before ANY helm-managed apply on a Keel-enrolled namespace. `kubectl get deploy <x> -o jsonpath='{..image}'` vs the chart's appVersion: a mismatch means the apply is a version change, not a config change.
- A "stuck rollout" of authentik is usually the migration advisory lock: check `pg_locks` joined to `pg_stat_activity` for `idle in transaction` holders before blaming probes or resources.
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency Access path); without it every protected app would have hard-failed.
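The pre-apply tag check from the first lesson could be automated along these lines. This is a minimal sketch that takes the space-separated output of the `jsonpath='{..image}'` query as a string. The function names and the sample image reference are illustrative assumptions, and digest-only references carry no tag, so they would always be flagged as a mismatch.

```python
def image_tag(image_ref: str) -> str:
    """Return the tag of an image reference, or '' if it has none."""
    name = image_ref.rsplit("/", 1)[-1]   # drop registry and repository path
    name = name.split("@", 1)[0]          # drop any @sha256 digest suffix
    return name.split(":", 1)[1] if ":" in name else ""


def apply_changes_version(jsonpath_output: str, chart_app_version: str) -> bool:
    """True when any live image tag differs from the chart's appVersion."""
    return any(
        image_tag(ref) != chart_app_version
        for ref in jsonpath_output.split()
    )


# Illustrative values matching the incident: live 2026.2.4, chart 2026.2.2.
live = "ghcr.io/goauthentik/server:2026.2.4"
if apply_changes_version(live, "2026.2.2"):
    print("apply would change the image version, not just config")
```

A check like this could gate the apply in CI or a wrapper script, turning the silent downgrade into an explicit failure.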