authentik: incident hardening after the signin-speedup rollout storm

The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent 97ccdbecb8
commit 4e88298976
8 changed files with 156 additions and 23 deletions
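
Note on the config-checksum annotation mentioned in the message: the pgbouncer.tf hunk is not shown on this page, so the following is only a minimal sketch of how such an annotation is commonly wired, not the change this commit made. The resource names, namespace, labels, image, mount path, and the annotation key `checksum/config` are all hypothetical.

```hcl
# Hypothetical sketch: names, namespace, image, and annotation key are
# illustrative and are not taken from this commit.
resource "kubernetes_config_map" "pgbouncer" {
  metadata {
    name      = "pgbouncer-config"
    namespace = "authentik"
  }

  data = {
    "pgbouncer.ini" = file("${path.module}/pgbouncer.ini")
  }
}

resource "kubernetes_deployment" "pgbouncer" {
  metadata {
    name      = "pgbouncer"
    namespace = "authentik"
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "pgbouncer" }
    }

    template {
      metadata {
        labels = { app = "pgbouncer" }

        annotations = {
          # The hash changes whenever the ini content changes. That alters
          # the pod template, so the Deployment rolls its pods.
          "checksum/config" = sha256(file("${path.module}/pgbouncer.ini"))
        }
      }

      spec {
        container {
          name  = "pgbouncer"
          image = "example.invalid/pgbouncer:latest" # placeholder image

          volume_mount {
            name       = "config"
            mount_path = "/etc/pgbouncer"
          }
        }

        volume {
          name = "config"

          config_map {
            name = kubernetes_config_map.pgbouncer.metadata[0].name
          }
        }
      }
    }
  }
}
```

Without an annotation of this kind, an edit to a mounted ini file updates the ConfigMap but leaves the pod template unchanged, so running pods would typically keep the old settings until something else restarts them.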
@@ -89,7 +89,18 @@ resource "kubernetes_deployment" "error_pages" {
   }
 
   lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+      spec[0].template[0].metadata[0].annotations["keel.sh/update-time"],
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+    ]
   }
 }
 
@@ -494,7 +494,16 @@ resource "kubernetes_deployment" "bot_block_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
 
@@ -653,7 +662,16 @@ resource "kubernetes_deployment" "x402_gateway" {
 
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
 
@@ -922,7 +940,16 @@ resource "kubernetes_deployment" "auth_proxy" {
   }
   lifecycle {
     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config,
+      # KEEL_LIFECYCLE_V1: keel.sh annotations + tier label are stamped on the
+      # live object (keel enrollment / resource-governance) — don't strip them.
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/match-tag"],
+      metadata[0].labels["tier"],
+    ]
   }
 }
 