authentik: fix episodic blank-screen + 30s-hang login (reliability R2)

The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-28 09:17:05 +00:00
parent b84b0021c2
commit 385dfff0e7
4 changed files with 77 additions and 16 deletions

View file

@ -219,7 +219,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. | | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version | | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. | | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). | | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |

View file

@ -86,10 +86,29 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app). use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values - **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache, are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections. 15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
- **Static assets cached immutable**: `/static` ingress carve-out adds - **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are `Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age). version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`). - **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request - **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path. TCP setup on the forward-auth subrequest path.

View file

@ -82,6 +82,11 @@ module "ingress" {
service_name = "goauthentik-server" service_name = "goauthentik-server"
tls_secret_name = var.tls_secret_name tls_secret_name = var.tls_secret_name
anti_ai_scraping = false anti_ai_scraping = false
# Swap the shared 10/50 default limiter for a dedicated 100/1000 carve-out:
# the login SPA + flow-executor API burst on a cold load otherwise 429s into
# a blank screen (see traefik middleware "authentik-rate-limit").
skip_default_rate_limit = true
extra_middlewares = ["traefik-authentik-rate-limit@kubernetescrd"]
extra_annotations = { extra_annotations = {
"gethomepage.dev/enabled" = "true" "gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Authentik" "gethomepage.dev/name" = "Authentik"
@ -140,14 +145,21 @@ module "ingress-static" {
# Same-host path carve-out of the public authentik UI ingress above, only # Same-host path carve-out of the public authentik UI ingress above, only
# adding the cache-headers middleware for the static asset prefix. # adding the cache-headers middleware for the static asset prefix.
# auth = "none": versioned static assets of the (already public) Authentik login UI. # auth = "none": versioned static assets of the (already public) Authentik login UI.
auth = "none" auth = "none"
namespace = kubernetes_namespace.authentik.metadata[0].name namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik-static" name = "authentik-static"
host = "authentik" host = "authentik"
service_name = "goauthentik-server" service_name = "goauthentik-server"
ingress_path = ["/static"] ingress_path = ["/static"]
tls_secret_name = var.tls_secret_name tls_secret_name = var.tls_secret_name
anti_ai_scraping = false anti_ai_scraping = false
homepage_enabled = false homepage_enabled = false
extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"] # /static serves ALL the SPA JS/CSS chunks; the default 10/50 limiter 429s the
# cold-load fan-out blank screen. Dedicated 100/1000 carve-out (note the two
# namespaces: cache-headers is in ns authentik, rate-limit is in ns traefik).
skip_default_rate_limit = true
extra_middlewares = [
"authentik-static-cache-headers@kubernetescrd",
"traefik-authentik-rate-limit@kubernetescrd",
]
} }

View file

@ -39,6 +39,16 @@ server:
value: "3" value: "3"
- name: AUTHENTIK_WEB__THREADS - name: AUTHENTIK_WEB__THREADS
value: "4" value: "4"
# Gunicorn worker recycle hardening (defaults max_requests=1000/jitter=50).
# A worker recycle that coincides with a transient PG/pgbouncer blip stalls
# in-flight requests (sessions+cache are on PostgreSQL since Redis was removed
# in 2026.2), and with 9 workers recycling on a tight 50-jitter window the
# recycles cluster — feeding the episodic all-pods-NotReady 502/504 cascade.
# 10x rarer recycles + 20x wider jitter (1000) decorrelate them from DB blips.
- name: AUTHENTIK_WEB__MAX_REQUESTS
value: "10000"
- name: AUTHENTIK_WEB__MAX_REQUESTS_JITTER
value: "1000"
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s). # Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a # Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-planning the flow # SELECT — but a single indexed lookup beats re-planning the flow
@ -87,11 +97,28 @@ server:
livenessProbe: livenessProbe:
failureThreshold: 6 failureThreshold: 6
timeoutSeconds: 5 timeoutSeconds: 5
strategy: # Readiness widened from the chart default (3x10s/3s ~= 30s) to ~80s. The
# readiness probe (/-/health/ready/) queries the DB, so a sub-~60s PG/pgbouncer
# transient otherwise returns 503 and drops ALL 3 server pods from the Service
# at once -> Traefik has no healthy backend -> 502/504 (the episodic blank
# screen + 30s hang). 80s absorbs a full CNPG failover reconnect; liveness
# still reaps a truly hung pod. Partial override — the chart deep-merges the
# httpGet path /-/health/ready/ (same as the livenessProbe override above).
readinessProbe:
failureThreshold: 8
periodSeconds: 10
timeoutSeconds: 5
# RollingUpdate strategy. The chart key is `deploymentStrategy`, NOT `strategy`
# (authentik.server reads .Values.server.deploymentStrategy) — the old
# `strategy:` key was silently ignored, so live ran the chart default 25%/25%
# and every rolling event dropped a server pod out of rotation, amplifying the
# NotReady cascade. maxSurge:1 + maxUnavailable:0 keeps all 3 ready throughout
# a roll (PDB minAvailable:2 + ResourceQuota headroom allow the transient pod).
deploymentStrategy:
type: RollingUpdate type: RollingUpdate
rollingUpdate: rollingUpdate:
maxSurge: 0 maxSurge: 1
maxUnavailable: 1 maxUnavailable: 0
resources: resources:
requests: requests:
cpu: 100m cpu: 100m
@ -166,7 +193,10 @@ worker:
secretKeyRef: secretKeyRef:
name: authentik-email name: authentik-email
key: AUTHENTIK_EMAIL__PASSWORD key: AUTHENTIK_EMAIL__PASSWORD
strategy: # Chart key is `deploymentStrategy`, not `strategy` (see server above). Workers
# serve no user traffic, so maxSurge:0/maxUnavailable:1 is fine — this is just
# the dead-key cleanup so the declared intent actually takes effect.
deploymentStrategy:
type: RollingUpdate type: RollingUpdate
rollingUpdate: rollingUpdate:
maxSurge: 0 maxSurge: 0