authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time. Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3 goauthentik-server pods dropped out of the Service at once, so Traefik had no healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` — so live ran the chart-default 25%/25% and dropped a pod out of rotation on every roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on PostgreSQL and request-serving is coupled to PG — verified there is no external-cache option to put back, so a SHORT transient is now survived but a total CNPG outage still takes authentik down.) Reliability package (R2, approved): - readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover reconnect without dropping the whole fleet from the Service. - rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key) and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready. - gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9 workers' recycles don't cluster on a DB blip. - / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000) from the previous commit (skip_default_rate_limit) — fixes the cold-load 429 blank screen. Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200, so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md (also corrected a stale "60s persistent DB connections" note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
b84b0021c2
commit
385dfff0e7
4 changed files with 77 additions and 16 deletions
|
|
@ -86,10 +86,29 @@ Signin latency is dominated by screen count and round trips, not server time
|
|||
use the explicit-consent flow (it re-prompted every 4 weeks per app).
|
||||
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
|
||||
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
|
||||
15m policy cache, 60s persistent DB connections.
|
||||
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
|
||||
hardening — decorrelates the 9 workers' recycles from PG blips). **No
|
||||
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
|
||||
1:1 and saturate the session-mode pool (reverted 2026-06-10).
|
||||
- **Static assets cached immutable**: `/static` ingress carve-out adds
|
||||
`Cache-Control: public, max-age=31536000, immutable` (assets are
|
||||
version-fingerprinted; authentik itself sends no max-age).
|
||||
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
|
||||
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
|
||||
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
|
||||
burst 429'd the tail and a failed ES-module import left a blank login screen.
|
||||
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
|
||||
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
|
||||
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
|
||||
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
|
||||
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
|
||||
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
|
||||
option), so request-serving is coupled to PG — this survives a short transient,
|
||||
not a total CNPG outage.
|
||||
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
|
||||
(the repo's old `strategy:` key was silently inert → live ran the chart-default
|
||||
25%/25% and dropped a server pod out of rotation on every roll). Now
|
||||
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
|
||||
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
|
||||
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
|
||||
TCP setup on the forward-auth subrequest path.
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue