authentik: incident hardening after the signin-speedup rollout storm

The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state
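
For reference, the pgbouncer part of the hardening (idle_transaction_timeout
plus the config-checksum annotation) could look roughly like the sketch below.
This is a minimal illustration, not the actual pgbouncer.tf: the resource
names, namespace, labels, image, mount path and the "checksum/config"
annotation key are all assumptions, and the ini is reduced to the lines that
matter here.

```hcl
# Hypothetical sketch; names, namespace and image are placeholders.
resource "kubernetes_config_map" "pgbouncer" {
  metadata {
    name      = "pgbouncer-config"
    namespace = "authentik"
  }

  data = {
    "pgbouncer.ini" = <<-EOT
      [pgbouncer]
      pool_mode = session
      ; Disconnect clients that sit idle inside a transaction for more
      ; than 300 seconds, so a ghost session cannot hold the lock forever.
      idle_transaction_timeout = 300
    EOT
  }
}

resource "kubernetes_deployment" "pgbouncer" {
  metadata {
    name      = "pgbouncer"
    namespace = "authentik"
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "pgbouncer" }
    }

    template {
      metadata {
        labels = { app = "pgbouncer" }
        annotations = {
          # Any edit to pgbouncer.ini changes this hash, which changes the
          # pod template and therefore rolls the pods.
          "checksum/config" = sha256(kubernetes_config_map.pgbouncer.data["pgbouncer.ini"])
        }
      }

      spec {
        container {
          name  = "pgbouncer"
          image = "example.invalid/pgbouncer:placeholder"

          volume_mount {
            name       = "config"
            mount_path = "/etc/pgbouncer"
          }
        }

        volume {
          name = "config"
          config_map {
            name = kubernetes_config_map.pgbouncer.metadata[0].name
          }
        }
      }
    }
  }
}
```

Without the annotation, a ConfigMap-only change leaves the pod template
untouched, so running pods keep the old ini until something else restarts them.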

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-11 00:26:52 +00:00
parent 97ccdbecb8
commit 4e88298976
8 changed files with 156 additions and 23 deletions

@@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
   }
   lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE: Keel manages tag updates
+    ]
   }
 }

@@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
@@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
@@ -922,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance); don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }