authentik: incident hardening after the signin-speedup rollout storm

The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

parent 97ccdbecb8
commit 4e88298976
8 changed files with 156 additions and 23 deletions

@@ -179,6 +179,9 @@ Notes:
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack; anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live; before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace`, which added per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1, a single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
+- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion, so an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
+- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
+- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE`: session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
 - **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).

 ## Upgrade Validation Checklist

@@ -0,0 +1,76 @@
# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10)

**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"`
ingresses and every OIDC app) was degraded or unavailable for ~50 minutes
(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access
prompts during outpost-check failures. The shared CNPG primary failed over
(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed
tenant.

**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time
signin speedup work: env tuning, outpost config, static-asset ingress).

## Root causes (three stacked)

1. **Helm/Keel version split → silent downgrade.** Keel (namespace
   `keel.sh/enrolled` + diun annotations) had upgraded the live authentik
   image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose
   appVersion drives the image tag). The values-only apply therefore rolled
   every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated
   database. Cores never came up healthy (`failed to proxy to backend`, plus
   Django cross-version serialized-cache warnings), and mid-storm Keel
   re-upgraded the image, adding a third ReplicaSet to the churn.

2. **Liveness budget too small for authentik's boot.** The chart-default
   liveness probe (3×10s, 3s timeout) kills a pod ~30s after the Go layer
   passes the startup probe, but during a rolling restart the Python core
   still waits on authentik's DB **migration advisory lock** (60–120s+ under
   contention). kubelet kill-looped every booting pod, and each kill increased
   lock contention for the rest (thundering herd).

3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer
   server connections `idle in transaction` still **holding the migration
   advisory lock** (observed twice: `SELECT * FROM authentik_version_history`
   idle 2+ min). Every subsequent boot serialized behind a dead client.
   PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired.

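The liveness arithmetic behind root cause 2 can be sketched as follows. This is a minimal illustration, assuming the kubelet restarts a container after roughly `failureThreshold × periodSeconds` of consecutive probe failures; the helper names are hypothetical, and treating the whole lock wait as failing-probe time is a simplification.

```python
def liveness_budget_s(failure_threshold: int, period_s: int) -> int:
    """Approximate seconds of consecutive probe failures before a restart."""
    return failure_threshold * period_s


def outlasts_wait(budget_s: int, wait_s: int) -> bool:
    """True if the pod is still alive when a wait of wait_s seconds ends.

    Simplifying assumption: the liveness probe fails for the whole wait.
    """
    return wait_s < budget_s


chart_default = liveness_budget_s(3, 10)  # chart default: 3 failures x 10s
hardened = liveness_budget_s(6, 10)       # hardened values: 6 failures x 10s

for name, budget in (("chart default", chart_default), ("hardened", hardened)):
    # 51s is the uncontended boot time reported in the Resolution section.
    print(f"{name}: {budget}s budget, survives 51s boot: {outlasts_wait(budget, 51)}")
```

Under these assumptions the larger budget covers an uncontended boot but not the longest contended lock waits, which is why the fix is paired with reaping ghost lock holders.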
**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made
every Django thread hold its connection persistently; with PgBouncer in
*session* mode each one pins a server connection 1:1, so the restart churn
saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held
75 of 108 connections on the new primary). The shared primary's
restart/failover at 22:40 fits this storm window.

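The pool arithmetic in the aggravator can be checked with a short sketch. The pool sizes come from `pgbouncer.ini` (`default_pool_size = 20`, `reserve_pool_size = 5`); what the factor of 3 counts is not stated here, so it is taken as given, and the function name is hypothetical.

```python
DEFAULT_POOL_SIZE = 20  # pgbouncer.ini: default_pool_size
RESERVE_POOL_SIZE = 5   # pgbouncer.ini: reserve_pool_size
POOL_FACTOR = 3         # the "3x" in 3x(20+5); its meaning is an assumption


def pool_slots(factor: int, default_size: int, reserve_size: int) -> int:
    """Total server-connection slots available through PgBouncer."""
    return factor * (default_size + reserve_size)


total_slots = pool_slots(POOL_FACTOR, DEFAULT_POOL_SIZE, RESERVE_POOL_SIZE)
held_by_authentik = 75  # observed on the new primary during the incident
free_slots = total_slots - held_by_authentik

print(f"slots={total_slots} held={held_by_authentik} free={free_slots}")
```

In session mode every persistent client connection occupies one of these slots for its lifetime, so with zero free slots each new client has to queue.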
## Resolution

- Scaled workers to 0 (transient) to free pool capacity; rollout converged
  once, then re-degraded when workers returned.
- Emergency `kubectl patch` of the server liveness probe (3×10s/3s →
  6×10s/5s); the final state was codified in Helm values in the same session.
- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders
  (twice).
- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back
  to 3; this converged cleanly (51s boots, zero restarts).
- Final `tg apply` reconciled everything (image tag pinned, conn_max_age
  removed, liveness in values, pgbouncer reaper config).

## Prevention (all landed in this change)

| Cause | Fix |
|---|---|
| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). |
| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). |
| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. |
| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. |

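The config-checksum fix in the table relies on a simple property: any change to the ini text changes its hash, which changes the pod template annotation and so triggers a rollout. A minimal sketch of that property, assuming a hex SHA-1 of the UTF-8 config text (which is what Terraform's `sha1()` computes for a string); the function name and sample ini fragments are illustrative only.

```python
import hashlib


def config_checksum(ini_text: str) -> str:
    """Hex SHA-1 digest of the config text, usable as an annotation value."""
    return hashlib.sha1(ini_text.encode("utf-8")).hexdigest()


# Illustrative fragments, not the full pgbouncer.ini.
before = "reserve_pool_size = 5\nreserve_pool_timeout = 5\n"
after = before + "idle_transaction_timeout = 300\n"

print(config_checksum(before))
print(config_checksum(after))
print("rollout needed:", config_checksum(before) != config_checksum(after))
```

An unchanged config yields the same digest on every apply, so the annotation only causes a rollout when the ini actually changes.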
## Lessons

- **Check the live image tag against the chart pin before ANY helm-managed
  apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o
  jsonpath='{..image}'` vs the chart's appVersion: a mismatch means the apply
  is a version change, not a config change.
- A "stuck rollout" of authentik is usually the migration advisory lock:
  check `pg_locks` joined to `pg_stat_activity` for `idle in transaction`
  holders before blaming probes or resources.
- The auth-proxy basicAuth fallback worked as designed throughout (Emergency
  Access path); without it every protected app would have hard-failed.

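The first lesson can be turned into a pre-apply check. The sketch below assumes tags are dot-separated integers such as `2026.2.4`, and that the live tag and the chart appVersion have already been read by other means (for example the `kubectl` command above); `classify_apply` is a hypothetical helper, not part of any existing tooling.

```python
from typing import Tuple


def parse_tag(tag: str) -> Tuple[int, ...]:
    """Split a dotted numeric tag like '2026.2.4' into comparable integers."""
    return tuple(int(part) for part in tag.split("."))


def classify_apply(live_tag: str, chart_app_version: str) -> str:
    """Describe what a helm apply would do to the running image version."""
    live, pinned = parse_tag(live_tag), parse_tag(chart_app_version)
    if pinned < live:
        return "downgrade"
    if pinned > live:
        return "upgrade"
    return "config-only"


# The 2026-06-10 situation: Keel had moved the image ahead of the chart pin.
print(classify_apply("2026.2.4", "2026.2.2"))
```

Anything other than `config-only` means the apply is a version change and deserves the same care as a planned upgrade.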
@@ -217,11 +217,6 @@ resource "authentik_stage_user_login" "default_login" {
 # screen and bypass the password field.
 # -----------------------------------------------------------------------------

-import {
-  to = authentik_stage_identification.default_identification
-  id = "32aca5ab-106e-43f4-a4cc-4513d80e57f3"
-}
-
 data "authentik_stage" "default_authentication_password" {
   name = "default-authentication-password"
 }
@@ -243,8 +238,6 @@ resource "authentik_stage_identification" "default_identification" {
passwordless_flow,
pretend_user_exists,
captcha_stage,
webauthn_stage,
enable_remember_me,
]
}
}

@@ -12,3 +12,7 @@ default_pool_size = 20
 reserve_pool_size = 5
 reserve_pool_timeout = 5
 ignore_startup_parameters = extra_float_digits
+; Reap server connections stuck "idle in transaction" (e.g. an authentik pod
+; killed mid-migration leaves a ghost transaction holding the migration
+; advisory lock, serializing every subsequent pod boot; 2026-06-10 incident).
+idle_transaction_timeout = 300

@@ -48,6 +48,11 @@ resource "kubernetes_deployment" "pgbouncer" {
         labels = {
           app = "pgbouncer"
         }
+        annotations = {
+          # pgbouncer reads its ini only at startup (subPath mount never
+          # propagates updates anyway), so roll the pods on config change.
+          "checksum/pgbouncer-config" = sha1(kubernetes_config_map.pgbouncer_config.data["pgbouncer.ini"])
+        }
       }

       spec {

@@ -47,13 +47,20 @@ server:
       value: "1800"
     - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
       value: "900"
-    # Persistent client-side DB connections (safe with PgBouncer session mode;
-    # must stay < pgbouncer server_idle_timeout=600s). Cuts per-request Django
-    # connection setup off the auth hot path.
-    - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
-      value: "60"
-    - name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
-      value: "true"
+    # Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE here. With PgBouncer in
+    # session mode every persistent Django connection pins a server connection
+    # 1:1, so the 3x(20+5) pool saturated during the 2026-06-10 rolling
+    # restart (58s pool waits, readiness flapping, and the shared CNPG primary
+    # failed over mid-storm). The ~1-2ms/request connection-setup saving is
+    # not worth that risk on the shared PG substrate.
+  # Liveness budget sized for slow boots (2026-06-10 incident): during a
+  # rolling restart pods queue on authentik's DB migration lock; the go layer
+  # answers /-/health/live before the core is up, so with the default 3x10s
+  # budget kubelet kill-looped every booting pod and amplified the contention.
+  # Startup probe still bounds total boot time (60x10s).
+  livenessProbe:
+    failureThreshold: 6
+    timeoutSeconds: 5
   strategy:
     type: RollingUpdate
     rollingUpdate:
@@ -84,6 +91,16 @@ server:
     minAvailable: 2
 global:
   addPrometheusAnnotations: true
+  image:
+    # Pin to the Keel-managed live tag. Keel (diun-annotated, keel.sh/enrolled
+    # namespace) bumps the IMAGE between chart releases, while helm defaults
+    # the tag to the chart appVersion, so any helm upgrade silently
+    # DOWNGRADES the running pods to the chart pin (2026-06-10: a values-only
+    # apply rolled live 2026.2.4 back to 2026.2.2 against a 2026.2.4-migrated
+    # DB → boot storm, see docs/post-mortems/2026-06-10-authentik-downgrade-
+    # boot-storm.md). Keep this tag in sync with what Keel has deployed when
+    # touching this chart; clear it only when bumping the chart version itself.
+    tag: "2026.2.4"

 worker:
   # 2 replicas: workers handle background tasks (LDAP sync, email,
@@ -98,15 +115,12 @@ worker:
     # Dramatiq worker threads per pod (default 2).
     - name: AUTHENTIK_WORKER__THREADS
       value: "4"
-    # Keep cache + DB-connection settings in lockstep with server.env.
+    # Keep cache settings in lockstep with server.env. (No CONN_MAX_AGE;
+    # see the server.env note: session-mode PgBouncer pins persistent conns.)
     - name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
       value: "1800"
     - name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
       value: "900"
-    - name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
-      value: "60"
-    - name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
-      value: "true"
   strategy:
     type: RollingUpdate
     rollingUpdate:

@@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
   }

   lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
+    ]
   }
 }

@@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }

@@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {

   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }

@@ -922,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
