The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag
  (2026.2.4) so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
18 lines
639 B
INI
[databases]
authentik = host=postgresql.dbaas port=5432 dbname=authentik user=authentik password=${password}

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session
max_client_conn = 200
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 5
ignore_startup_parameters = extra_float_digits
; Reap server connections stuck "idle in transaction" (e.g. an authentik pod
; killed mid-migration leaves a ghost transaction holding the migration
; advisory lock, serializing every subsequent pod boot, as in the 2026-06-10
; incident).
idle_transaction_timeout = 300