infra

Viktor Barzin 385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)		2026-06-28 09:17:05 +00:00

The login screen would sometimes hang/blank for everyone for ~30s at a time. Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3 goauthentik-server pods dropped out of the Service at once, so Traefik had no healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:`, so live ran the chart-default 25%/25% and dropped a pod out of rotation on every roll.

(Redis was removed upstream in authentik 2026.2, so sessions+cache are on PostgreSQL and request-serving is coupled to PG. Verified there is no external-cache option to put back, so a SHORT transient is now survived but a total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s): absorbs a full CNPG failover reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key) and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9 workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000) from the previous commit (skip_default_rate_limit); fixes the cold-load 429 blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200, so it can't kill a DB-blocked pod).
CONN_MAX_AGE intentionally NOT set (pins the pgbouncer pool, reverted 2026-06-10).

Docs: .claude/CLAUDE.md + authentication.md (also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
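The readiness-probe arithmetic behind the "3->8 (~80s)" change in the commit above can be sketched roughly as below. This is only an illustration: the 10s probe period is an assumption inferred from the commit's "~30s" and "~80s" figures, the function names are hypothetical, and the 45s blip is a made-up example value, not a measured failover time.

```python
# Rough model: a pod leaves the Service after `failure_threshold`
# consecutive failed readiness probes, spaced `period_seconds` apart.
# Probe timeouts and scheduling skew are ignored here.

PERIOD_SECONDS = 10  # assumption: implied by 3 -> ~30s and 8 -> ~80s


def readiness_tolerance_seconds(failure_threshold: int,
                                period_seconds: int = PERIOD_SECONDS) -> int:
    """Approximate time a pod can keep failing readiness before removal."""
    return failure_threshold * period_seconds


def survives_blip(failure_threshold: int, blip_seconds: int,
                  period_seconds: int = PERIOD_SECONDS) -> bool:
    """True if a DB blip of `blip_seconds` ends before the pod is removed."""
    return blip_seconds < readiness_tolerance_seconds(failure_threshold,
                                                      period_seconds)


if __name__ == "__main__":
    hypothetical_blip = 45  # seconds; illustrative only
    for threshold in (3, 8):
        print(threshold,
              readiness_tolerance_seconds(threshold),
              survives_blip(threshold, hypothetical_blip))
```

Under these assumptions, a threshold of 3 tolerates about 30s and a threshold of 8 about 80s, so a blip between those two values would drop every pod at once with the old setting but not with the new one.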
..
agent-task-tracking.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
authentication.md	authentik: fix episodic blank-screen + 30s-hang login (reliability R2)	2026-06-28 09:17:05 +00:00
automated-upgrades.md	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases	2026-06-21 16:57:44 +00:00
backup-dr.md	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup	2026-06-10 09:10:46 +00:00
chrome-service.md	chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals	2026-06-27 08:03:29 +00:00
ci-cd.md	docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram	2026-06-27 15:49:58 +00:00
compute.md	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]	2026-06-11 18:00:08 +00:00
databases.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
dns.md	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere	2026-06-10 18:41:07 +00:00
homepage.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
incident-response.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
llama-cpp.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
mailserver.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
monitoring.md	monitoring: consolidate all Slack alerting to #alerts, abandon #security	2026-06-26 13:29:44 +00:00
multi-tenancy.md	fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc	2026-06-26 08:25:33 +00:00
networking.md	authentik: dedicated rate-limit carve-out + per-router 5xx observability	2026-06-28 09:10:34 +00:00
overview.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
secrets.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
security.md	docs(security): note crowdsec-cf-sync rate-limit resilience	2026-06-27 15:27:44 +00:00
storage.md	docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip]	2026-06-11 17:50:43 +00:00
vpn.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00
wave1-egress-observation-2026-05-22.md	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip]	2026-06-09 08:45:33 +00:00