authentik: incident hardening after the signin-speedup rollout storm

The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-11 00:26:52 +00:00
parent 97ccdbecb8
commit 4e88298976
8 changed files with 156 additions and 23 deletions

@ -179,6 +179,9 @@ Notes:
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
## Upgrade Validation Checklist

@ -0,0 +1,76 @@
# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)
**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
ingresses and every OIDC app) degraded/unavailable for ~50 minutes
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
prompts during outpost-check failures. The shared CNPG primary failed over
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
tenant.
**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
signin speedup work — env tuning, outpost config, static-asset ingress).
## Root causes (three stacked)
1. **Helm/Keel version split → silent downgrade.** Keel (namespace
`keel.sh/enrolled` + diun annotations) had upgraded the live authentik
image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
appVersion drives the image tag). The values-only apply therefore rolled
every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
database. Cores never came up healthy (`failed to proxy to backend`, plus
Django cross-version serialized-cache warnings), and mid-storm Keel
re-upgraded the image, adding a third ReplicaSet to the churn.
2. **Liveness budget too small for authentik's boot.** The chart-default
liveness probe (3×10s, 3s timeout) kills a pod ~30s after the go layer
passes the startup probe — but during a rolling restart the Python core
still waits on authentik's DB **migration advisory lock** (60–120s+ under
contention). kubelet kill-looped every booting pod, and each kill increased
lock contention for the rest (thundering herd).
3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
server connections `idle in transaction` still **holding the migration
advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
idle 2+ min). Every subsequent boot serialized behind a dead client.
PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.
**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
every Django thread hold its connection persistently; with PgBouncer in
*session* mode each one pins a server connection 1:1, so the restart churn
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
75 of 108 connections on the new primary). The shared primary's
restart/failover at 22:40 fits this storm window.
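
The saturation arithmetic above can be sketched as follows. This is a rough illustration, not a measurement: it assumes one persistent connection per gunicorn or Dramatiq thread, takes the replica and thread counts from the values in this change (3 server pods with 3 workers × 4 threads, 2 worker pods with 4 threads), and uses a hypothetical `generations` parameter to model old and new pods coexisting during a rolling restart.

```python
# Pool capacity as quoted in the post-mortem: 3 x (20 + 5).
POOL_FACTOR = 3          # the leading "3" from the text
DEFAULT_POOL_SIZE = 20   # pgbouncer.ini default_pool_size
RESERVE_POOL_SIZE = 5    # pgbouncer.ini reserve_pool_size


def pool_slots(factor=POOL_FACTOR, default=DEFAULT_POOL_SIZE, reserve=RESERVE_POOL_SIZE):
    """Total server connections PgBouncer can hand out."""
    return factor * (default + reserve)


def pinned_connections(server_pods=3, web_workers=3, web_threads=4,
                       worker_pods=2, worker_threads=4, generations=1):
    """Upper bound on persistent connections, ASSUMING one per thread.

    `generations` is a hypothetical knob: 2 means a full old and a full new
    set of pods are alive at the same time during a rollout.
    """
    per_generation = server_pods * web_workers * web_threads + worker_pods * worker_threads
    return generations * per_generation


if __name__ == "__main__":
    slots = pool_slots()
    for generations in (1, 2):
        demand = pinned_connections(generations=generations)
        print(f"{generations} generation(s): {demand} pinned vs {slots} slots, "
              f"saturated={demand >= slots}")
```

Under these assumptions a single generation fits in the pool, while two overlapping generations do not, which is consistent with authentik holding all 75 slots during the churn.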
## Resolution
- Scaled workers to 0 (transient) to free pool capacity; rollout converged
once, then re-degraded when workers returned.
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
6×10s/5s) — final state codified in Helm values in the same session.
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
(twice).
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
to 3 — converged cleanly (51s boots, zero restarts).
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
removed, liveness in values, pgbouncer reaper config).
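
For future incidents, the ghost-holder cleanup step could be scripted roughly as below. This is a minimal sketch, not the procedure that was run: the query text, the `ghost_holders` helper, the row shape (dicts with `pid`, `state`, `state_change`) and the 2-minute threshold are all assumptions for illustration. The query has to run against Postgres itself, and the returned PIDs would then be passed to `pg_terminate_backend()`.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical query: sessions that hold a granted advisory lock while
# sitting "idle in transaction". Run it on the Postgres primary.
GHOST_HOLDER_SQL = """
SELECT a.pid, a.state, a.state_change, a.query
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.granted
  AND a.state = 'idle in transaction'
"""


def ghost_holders(rows, now, min_idle=timedelta(minutes=2)):
    """Return PIDs that have been idle in transaction for at least `min_idle`.

    `rows` is an iterable of dicts shaped like the query result above.
    """
    return [
        row["pid"]
        for row in rows
        if row["state"] == "idle in transaction"
        and now - row["state_change"] >= min_idle
    ]


if __name__ == "__main__":
    # Illustrative rows only, not data captured during the incident.
    now = datetime.now(timezone.utc)
    sample = [
        {"pid": 101, "state": "idle in transaction", "state_change": now - timedelta(minutes=3)},
        {"pid": 102, "state": "idle in transaction", "state_change": now - timedelta(seconds=20)},
    ]
    for pid in ghost_holders(sample, now):
        print(f"SELECT pg_terminate_backend({pid});")
```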
## Prevention (all landed in this change)
| Cause | Fix |
|---|---|
| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |
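
The probe budgets in the table can be checked with simple arithmetic. The sketch below approximates the time kubelet tolerates a failing probe as `failureThreshold × periodSeconds`; it ignores `timeoutSeconds` and probe scheduling jitter, and the helper names are hypothetical.

```python
def probe_budget_seconds(failure_threshold, period_seconds):
    """Approximate seconds of consecutive probe failures before kubelet acts."""
    return failure_threshold * period_seconds


def covers(budget_seconds, boot_seconds):
    """True if a boot of `boot_seconds` finishes inside the probe budget."""
    return boot_seconds <= budget_seconds


if __name__ == "__main__":
    chart_default = probe_budget_seconds(3, 10)   # chart-default liveness
    hardened = probe_budget_seconds(6, 10)        # liveness after this change
    startup = probe_budget_seconds(60, 10)        # startup probe bound

    print(f"chart default liveness budget: {chart_default}s")
    print(f"hardened liveness budget:      {hardened}s")
    print(f"startup probe bound:           {startup}s")
    # 51s is the clean boot time observed after the incident.
    print(f"51s boot fits default?  {covers(chart_default, 51)}")
    print(f"51s boot fits hardened? {covers(hardened, 51)}")
```

On this approximation the hardened budget covers the observed 51s clean boot, but a 120s wait on the advisory lock would still exceed it, so the probe change relies on the lock reaper and the `CONN_MAX_AGE` removal to keep boot-time contention down.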
## Lessons
- **Check the live image tag against the chart pin before ANY helm-managed
apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply
is a version change, not a config change.
- A "stuck rollout" of authentik is usually the migration advisory lock:
check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
holders before blaming probes or resources.
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
Access path); without it every protected app would have hard-failed.
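
The first lesson could be turned into a pre-apply guard along these lines. This is a sketch under assumptions: the function names are hypothetical, tags are assumed to be plain dotted integers such as `2026.2.4`, and the live tag is assumed to have been read separately (for example with the `kubectl` command above).

```python
def parse_tag(tag):
    """Turn a dotted tag like '2026.2.4' into a comparable tuple of ints."""
    return tuple(int(part) for part in tag.split("."))


def classify_apply(live_tag, chart_tag):
    """Classify what a helm apply would do to the running image version."""
    live, chart = parse_tag(live_tag), parse_tag(chart_tag)
    if chart < live:
        return "downgrade"
    if chart > live:
        return "upgrade"
    return "config-only"


if __name__ == "__main__":
    # The tags involved in this incident: live image vs chart appVersion.
    verdict = classify_apply(live_tag="2026.2.4", chart_tag="2026.2.2")
    print(f"apply would be a {verdict}")
    if verdict == "downgrade":
        raise SystemExit("refusing to apply: chart pin is older than the live image")
```

Comparing parsed tuples rather than raw strings avoids lexical surprises such as `2026.2.10` sorting before `2026.2.4`.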

@ -217,11 +217,6 @@ resource "authentik_stage_user_login" "default_login" {
# screen and bypass the password field.
# -----------------------------------------------------------------------------
import {
to = authentik_stage_identification.default_identification
id = "32aca5ab-106e-43f4-a4cc-4513d80e57f3"
}
data "authentik_stage" "default_authentication_password" {
name = "default-authentication-password"
}
@ -243,8 +238,6 @@ resource "authentik_stage_identification" "default_identification" {
passwordless_flow,
pretend_user_exists,
captcha_stage,
webauthn_stage,
enable_remember_me,
]
}
}

@ -12,3 +12,7 @@ default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 5
ignore_startup_parameters = extra_float_digits
; Reap server connections stuck "idle in transaction" (e.g. an authentik pod
; killed mid-migration leaves a ghost transaction holding the migration
; advisory lock, serializing every subsequent pod boot — 2026-06-10 incident).
idle_transaction_timeout = 300

@ -48,6 +48,11 @@ resource "kubernetes_deployment" "pgbouncer" {
labels = {
app = "pgbouncer"
}
annotations = {
# pgbouncer reads its ini only at startup (subPath mount never
# propagates updates anyway), so roll the pods on config change.
"checksum/pgbouncer-config" = sha1(kubernetes_config_map.pgbouncer_config.data["pgbouncer.ini"])
}
}
spec {

@ -47,13 +47,20 @@ server:
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
# Persistent client-side DB connections (safe with PgBouncer session mode;
# must stay < pgbouncer server_idle_timeout=600s). Cuts per-request Django
# connection setup off the auth hot path.
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
value: "60"
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
value: "true"
# Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE here. With PgBouncer in
# session mode every persistent Django connection pins a server connection
# 1:1, so the 3x(20+5) pool saturated during the 2026-06-10 rolling
# restart (58s pool waits, readiness flapping, and the shared CNPG primary
# failed over mid-storm). The ~1-2ms/request connection-setup saving is
# not worth that risk on the shared PG substrate.
# Liveness budget sized for slow boots (2026-06-10 incident): during a
# rolling restart pods queue on authentik's DB migration lock; the go layer
# answers /-/health/live before the core is up, so with the default 3x10s
# budget kubelet kill-looped every booting pod and amplified the contention.
# Startup probe still bounds total boot time (60x10s).
livenessProbe:
failureThreshold: 6
timeoutSeconds: 5
strategy:
type: RollingUpdate
rollingUpdate:
@ -84,6 +91,16 @@ server:
minAvailable: 2
global:
addPrometheusAnnotations: true
image:
# Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
# namespace) bumps the IMAGE between chart releases, while helm defaults
# the tag to the chart appVersion — so any helm upgrade silently
# DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
# apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
# DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
# boot-storm.md). Keep this tag in sync with what Keel has deployed when
# touching this chart; clear it only when bumping the chart version itself.
tag: "2026.2.4"
worker:
# 2 replicas: workers handle background tasks (LDAP sync, email,
@ -98,15 +115,12 @@ worker:
# Dramatiq worker threads per pod (default 2).
- name: AUTHENTIK_WORKER__THREADS
value: "4"
# Keep cache + DB-connection settings in lockstep with server.env.
# Keep cache settings in lockstep with server.env. (No CONN_MAX_AGE —
# see the server.env note: session-mode PgBouncer pins persistent conns.)
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
value: "60"
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
value: "true"
strategy:
type: RollingUpdate
rollingUpdate:

@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
}
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
]
}
}

@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}
@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}
@ -922,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
ignore_changes = [
spec[0].template[0].spec[0].dns_config,
# KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
# live object (keel enrollment / resource-governance); don't strip them.
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/match-tag"],
metadata[0].labels["tier"],
]
}
}