authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)

Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-10 21:58:10 +00:00
parent 93ba67c84a
commit 97ccdbecb8
8 changed files with 232 additions and 55 deletions


@ -135,7 +135,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
## Database Host
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.
@ -159,7 +159,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |


@ -5,17 +5,26 @@
## Applications (11)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Cloudflare Access | OAuth2/OIDC | implicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | explicit consent |
| Forgejo | OAuth2/OIDC | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Headscale | OAuth2/OIDC | implicit consent |
| Immich | OAuth2/OIDC | implicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| linkwarden | OAuth2/OIDC | implicit consent |
| Vault | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
> and Vault (53) were switched from
> `default-provider-authorization-explicit-consent` via the API (these
> providers are UI-managed, not in TF). All are first-party apps; the
> expiring consent screen (re-shown every 4 weeks per app) only slowed
> first-time signin.
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
@ -60,8 +69,27 @@
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
## Authentication Flow (single-screen login, 2026-06-10)
`default-authentication-flow` bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (`default-authentication-identification`, pk
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
`default-authentication-password`, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
was DELETED via the API — authentik requires removing it when the
identification stage embeds the password field. `password_stage` is pinned in
Terraform (`authentik_stage_identification.default_identification` in
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
UI-managed via `ignore_changes`. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users will see a second password prompt until the binding is
deleted again.
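
The binding rule above can be checked mechanically. The following is a minimal sketch, not authentik's own validation: the `Binding` type and `redundant_password_bindings` helper are hypothetical names introduced for illustration, and the short stage labels `mfa-validation` and `user-login` are taken from the flow description rather than from the live stage names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    """One stage binding on a flow (hypothetical model, not the authentik API)."""
    order: int
    stage: str


def redundant_password_bindings(
    bindings: List[Binding],
    identification_stage: str,
    embedded_password_stage: Optional[str],
) -> List[Binding]:
    """Return bindings that bind the embedded password stage a second time.

    If the identification stage embeds a password stage, that password stage
    must not also be bound to the flow on its own.
    """
    if embedded_password_stage is None:
        return []
    if not any(b.stage == identification_stage for b in bindings):
        return []
    return [b for b in bindings if b.stage == embedded_password_stage]


# Flow as described above: identification (10), mfa-validation (30), user-login (100).
current = [
    Binding(10, "default-authentication-identification"),
    Binding(30, "mfa-validation"),
    Binding(100, "user-login"),
]

# What the flow would look like if an upgrade re-added the order-20 binding.
regressed = current + [Binding(20, "default-authentication-password")]
```

Calling `redundant_password_bindings(regressed, "default-authentication-identification", "default-authentication-password")` returns the order-20 binding, which is the one to delete; the same call on `current` returns an empty list.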
## Invitation Enrollment Flow
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
@ -149,7 +177,9 @@ Notes:
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
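
As a rough illustration of what the static-asset middleware is expected to produce, the sketch below parses a `Cache-Control` value and checks for a one-year, immutable policy. The helper names `parse_cache_control` and `is_long_lived_immutable` are hypothetical; only the header value itself comes from the note above.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into {directive: argument-or-None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (public, immutable) map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def is_long_lived_immutable(value: str, min_age_seconds: int = 31536000) -> bool:
    """True if the header is immutable and max-age is at least min_age_seconds."""
    directives = parse_cache_control(value)
    max_age = directives.get("max-age")
    return (
        "immutable" in directives
        and max_age is not None
        and max_age.isdigit()
        and int(max_age) >= min_age_seconds
    )


# 31536000 seconds is 365 days, the value set by the middleware.
EXPECTED = "public, max-age=31536000, immutable"
```

A check like this could be pointed at the response headers of any `/static` asset after an apply; a value without `max-age`, which is what authentik sends on its own, fails it.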
## Upgrade Validation Checklist
@ -161,8 +191,9 @@ Run after **any** of these:
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
# 1. Service routes to the outpost pods (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes


@ -149,7 +149,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed
**CNPG** / **pg-cluster**:
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
### Secrets


@ -40,10 +40,10 @@ graph TB
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
When `auth = "required"`, an unauthenticated request flows:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
3. Authentik checks for valid session cookie
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
5. User authenticates via social provider (Google/GitHub/Facebook)
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
6. Authentik creates session, sets cookie, redirects back to original URL
7. Subsequent requests include session cookie, pass auth check, reach backend
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
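
The header stripping described above amounts to a prefix filter. This is a minimal sketch under stated assumptions: the `x-authentik-` prefix and the `strip_auth_headers` helper are illustrative and are not taken from this repository's middleware definition.

```python
def strip_auth_headers(headers: dict, prefix: str = "x-authentik-") -> dict:
    """Return a copy of headers without the identity headers.

    Header names are compared case-insensitively, since HTTP header names
    are not case-sensitive. The prefix is an assumption for illustration.
    """
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(prefix)
    }


forwarded = {
    "X-authentik-username": "alice",
    "X-Authentik-Email": "alice@example.com",
    "Accept": "text/html",
}
cleaned = strip_auth_headers(forwarded)
```

Here `cleaned` keeps only the `Accept` header, so the backend never sees identity headers it did not ask for.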
### First-time signin performance (2026-06-10)
Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:
- **Single-screen login**: the identification stage carries `password_stage`,
so username+password is one round trip. The separate password-stage binding
was removed from `default-authentication-flow` (required by authentik when
embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
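
The tuning values in the list above carry two constraints that are easy to break in a later edit: the DB connection age has to stay under PgBouncer's `server_idle_timeout` (600s), and the gunicorn workers have to fit the 2Gi server memory limit at roughly 500Mi each. The sketch below encodes both; `check_tuning` and `request_slots_per_pod` are hypothetical helpers, and the per-worker memory figure is the approximate one quoted in `values.yaml`.

```python
from typing import Dict, List

# Values as set in server.env by this change.
SERVER_ENV: Dict[str, str] = {
    "AUTHENTIK_WEB__WORKERS": "3",
    "AUTHENTIK_WEB__THREADS": "4",
    "AUTHENTIK_CACHE__TIMEOUT_FLOWS": "1800",
    "AUTHENTIK_CACHE__TIMEOUT_POLICIES": "900",
    "AUTHENTIK_POSTGRESQL__CONN_MAX_AGE": "60",
}


def check_tuning(
    env: Dict[str, str],
    pgbouncer_idle_timeout_s: int = 600,
    worker_mem_mib: int = 500,      # approximate, per the values.yaml comment
    pod_limit_mib: int = 2048,      # 2Gi server memory limit
) -> List[str]:
    """Return a list of human-readable problems; empty means the values fit."""
    problems: List[str] = []
    if int(env["AUTHENTIK_POSTGRESQL__CONN_MAX_AGE"]) >= pgbouncer_idle_timeout_s:
        problems.append("CONN_MAX_AGE must stay below PgBouncer server_idle_timeout")
    if int(env["AUTHENTIK_WEB__WORKERS"]) * worker_mem_mib > pod_limit_mib:
        problems.append("gunicorn workers exceed the pod memory limit")
    return problems


def request_slots_per_pod(env: Dict[str, str]) -> int:
    """Concurrent request slots per server pod: workers times threads."""
    return int(env["AUTHENTIK_WEB__WORKERS"]) * int(env["AUTHENTIK_WEB__THREADS"])
```

With the values from this change, `check_tuning(SERVER_ENV)` returns an empty list and each server pod has 12 request slots.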
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
### Social Login & Invitation Flow


@ -91,14 +91,21 @@ resource "authentik_outpost" "embedded" {
protocol_providers = [authentik_provider_proxy.catchall.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "trace"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
# info, not trace: the outpost sits on the hot path of every request to
# every auth="required" ingress, and trace logging is per-request overhead
# with no operational value (request access lines are emitted at info).
log_level = "info"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
# 2 replicas: removes the single-pod hot path for all forward-auth
# subrequests. Safe since sessions moved to the shared Postgres backend
# (authentik_providers_proxy_proxysession, 2026-05-10), so there is no
# pod-local session state anymore.
kubernetes_replicas = 2
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
@ -198,3 +205,46 @@ resource "authentik_stage_user_login" "default_login" {
]
}
}
# -----------------------------------------------------------------------------
# Default Identification stage: adopted 2026-06-10 to embed the password
# field on the identification screen (single-screen login: one round trip and
# one screen instead of two). Per authentik docs, when an Identification stage
# carries a password stage the Password stage must NOT be bound separately, so
# the redundant order-20 binding on default-authentication-flow (pk
# 0fc677db-a23f-4ee7-8648-da342e14573b) was deleted via the API in the same
# change. Social-login users are unaffected: source buttons stay on the same
# screen and bypass the password field.
# -----------------------------------------------------------------------------
import {
to = authentik_stage_identification.default_identification
id = "32aca5ab-106e-43f4-a4cc-4513d80e57f3"
}
data "authentik_stage" "default_authentication_password" {
name = "default-authentication-password"
}
resource "authentik_stage_identification" "default_identification" {
name = "default-authentication-identification"
password_stage = data.authentik_stage.default_authentication_password.id
lifecycle {
# Pin only password_stage; everything else stays UI-managed (same pattern
# as authentik_stage_user_login.default_login above).
ignore_changes = [
user_fields,
case_insensitive_matching,
show_matched_user,
show_source_labels,
sources,
enrollment_flow,
recovery_flow,
passwordless_flow,
pretend_user_exists,
captcha_stage,
webauthn_stage,
enable_remember_me,
]
}
}


@ -29,7 +29,7 @@ resource "kubernetes_namespace" "authentik" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -111,3 +111,44 @@ module "ingress-outpost" {
anti_ai_scraping = false
exclude_crowdsec = true
}
# Immutable caching for the flow-executor static assets. Authentik serves
# /static/dist/* with version-fingerprinted filenames (e.g. poly-2026.2.4.js)
# but no max-age, so browsers re-validate the login JS bundle on every signin
# and split-horizon internal users (direct to Traefik, no Cloudflare) get no
# edge cache at all. Long-lived immutable caching is safe: every authentik
# upgrade changes the asset URLs.
resource "kubernetes_manifest" "static_cache_headers" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "static-cache-headers"
namespace = kubernetes_namespace.authentik.metadata[0].name
}
spec = {
headers = {
customResponseHeaders = {
"Cache-Control" = "public, max-age=31536000, immutable"
}
}
}
}
}
module "ingress-static" {
source = "../../../../modules/kubernetes/ingress_factory"
# Same-host path carve-out of the public authentik UI ingress above, only
# adding the cache-headers middleware for the static asset prefix.
# auth = "none": versioned static assets of the (already public) Authentik login UI.
auth = "none"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik-static"
host = "authentik"
service_name = "goauthentik-server"
ingress_path = ["/static"]
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
homepage_enabled = false
extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
}


@ -1,4 +1,10 @@
authentik:
# NOTE: because we set existingSecret below, the chart does NOT render the
# authentik.* values into an AUTHENTIK_* env Secret — the live env comes
# from the orphaned, helm-keep-policy `goauthentik` Secret created by chart
# 2025.10.3. Anything under authentik.* here is effectively INERT. All new
# or tuned config MUST go through server.env / worker.env instead (see
# .claude/reference/authentik-state.md).
log_level: warning
# log_level: trace
secret_key: ""
@ -14,38 +20,40 @@ authentik:
port: 6432
user: authentik
password: ""
# Persistent client-side connections (safe with PgBouncer session mode;
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
# setup overhead off the ~70 sequential ORM ops per flow stage.
conn_max_age: 60
conn_health_checks: true
cache:
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
# moved cache storage from Redis to Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
timeout_flows: 1800
timeout_policies: 900
web:
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
workers: 3
threads: 4
worker:
# Celery-equivalent worker threads per pod (default 2, renamed from
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
threads: 4
server:
replicas: 3
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
env:
# Anonymous Django sessions (no completed login: bots, healthcheckers,
# partial flows) expire in 2h. Default is days=1. Once login completes,
# UserLoginStage.session_duration takes over via request.session.set_expiry.
# Injected via server.env (not authentik.sessions.*) because we use
# authentik.existingSecret.secretName, which makes the chart skip
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
# Gunicorn: 3 workers × 4 threads per server pod (defaults 2×4).
# Pairs with the server memory limit of 2Gi (each worker preloads
# Django ~500Mi).
- name: AUTHENTIK_WEB__WORKERS
value: "3"
- name: AUTHENTIK_WEB__THREADS
value: "4"
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
# SELECT — but a single indexed lookup beats re-planning the flow
# (~70 sequential ORM ops per flow stage POST).
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
# Persistent client-side DB connections (safe with PgBouncer session mode;
# must stay < pgbouncer server_idle_timeout=600s). Cuts per-request Django
# connection setup off the auth hot path.
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
value: "60"
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
value: "true"
strategy:
type: RollingUpdate
rollingUpdate:
@ -82,11 +90,23 @@ worker:
# certificate renewal) — no user-facing traffic, so 2-of-3 isn't
# needed for availability. Drop saves ~100m sustained CPU.
replicas: 2
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
value: "hours=2"
# Dramatiq worker threads per pod (default 2).
- name: AUTHENTIK_WORKER__THREADS
value: "4"
# Keep cache + DB-connection settings in lockstep with server.env.
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
value: "1800"
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
value: "900"
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
value: "60"
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
value: "true"
strategy:
type: RollingUpdate
rollingUpdate:


@ -720,6 +720,11 @@ resource "kubernetes_config_map" "auth_proxy_config" {
"default.conf" = <<-EOT
upstream authentik {
server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
# Reuse connections to the outpost. Without this every forward-auth
# subrequest (= every request to every auth="required" ingress) opens
# a fresh TCP connection. Requires HTTP/1.1 + cleared Connection
# header on the proxy_pass locations below.
keepalive 32;
}
server {
listen 9000;
@ -734,6 +739,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
location /outpost.goauthentik.io/auth/traefik {
proxy_pass http://authentik;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
@ -764,6 +771,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
location /outpost.goauthentik.io/ {
proxy_pass http://authentik;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 3s;
proxy_read_timeout 10s;
proxy_set_header Host $host;
@ -820,6 +829,11 @@ resource "kubernetes_deployment" "auth_proxy" {
labels = {
app = "auth-proxy"
}
annotations = {
# nginx only reads its config at startup, so roll the pods whenever
# the ConfigMap content changes.
"checksum/auth-proxy-config" = sha1(kubernetes_config_map.auth_proxy_config.data["default.conf"])
}
}
spec {
topology_spread_constraint {