infra/docs/architecture
Viktor Barzin 385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
..
agent-task-tracking.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
authentication.md authentik: fix episodic blank-screen + 30s-hang login (reliability R2) 2026-06-28 09:17:05 +00:00
automated-upgrades.md k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases 2026-06-21 16:57:44 +00:00
backup-dr.md monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup 2026-06-10 09:10:46 +00:00
chrome-service.md chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals 2026-06-27 08:03:29 +00:00
ci-cd.md docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram 2026-06-27 15:49:58 +00:00
compute.md apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] 2026-06-11 18:00:08 +00:00
databases.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
dns.md pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere 2026-06-10 18:41:07 +00:00
homepage.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
incident-response.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
llama-cpp.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
mailserver.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
monitoring.md monitoring: consolidate all Slack alerting to #alerts, abandon #security 2026-06-26 13:29:44 +00:00
multi-tenancy.md fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc 2026-06-26 08:25:33 +00:00
networking.md authentik: dedicated rate-limit carve-out + per-router 5xx observability 2026-06-28 09:10:34 +00:00
overview.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
secrets.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
security.md docs(security): note crowdsec-cf-sync rate-limit resilience 2026-06-27 15:27:44 +00:00
storage.md docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] 2026-06-11 17:50:43 +00:00
vpn.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
wave1-egress-observation-2026-05-22.md fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00