authentik: fix episodic blank-screen + 30s-hang login (reliability R2)

The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-28 09:17:05 +00:00
parent b84b0021c2
commit 385dfff0e7
4 changed files with 77 additions and 16 deletions

View file

@ -86,10 +86,29 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.