authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time.

Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering, so pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API; authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age, so internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent. All are first-party; the 4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
parent 93ba67c84a
commit 97ccdbecb8
8 changed files with 232 additions and 55 deletions
@ -135,7 +135,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
|
||||
## Database Host
|
||||
|
||||
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks.
|
||||
**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias: its selector is `cnpg.io/instanceRole=primary`, so it also reaches the primary, and authentik's PgBouncer still points at it. Use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks.
|
||||
|
||||
**CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi.
|
||||
|
||||
|
|
@ -159,7 +159,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
|
||||
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
|
||||
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
|
||||
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
|
||||
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
|
||||
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
|
||||
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
|
||||
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
|
||||
|
|
|
|||
|
|
@ -5,17 +5,26 @@
|
|||
## Applications (11)
|
||||
| Application | Provider Type | Auth Flow |
|
||||
|-------------|--------------|-----------|
|
||||
| Cloudflare Access | OAuth2/OIDC | explicit consent |
|
||||
| Cloudflare Access | OAuth2/OIDC | implicit consent |
|
||||
| Domain wide catch all | Proxy (forward auth) | implicit consent |
|
||||
| Forgejo | OAuth2/OIDC | explicit consent |
|
||||
| Forgejo | OAuth2/OIDC | implicit consent |
|
||||
| Grafana | OAuth2/OIDC | implicit consent |
|
||||
| Headscale | OAuth2/OIDC | explicit consent |
|
||||
| Immich | OAuth2/OIDC | explicit consent |
|
||||
| Headscale | OAuth2/OIDC | implicit consent |
|
||||
| Immich | OAuth2/OIDC | implicit consent |
|
||||
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
|
||||
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
|
||||
| linkwarden | OAuth2/OIDC | explicit consent |
|
||||
| linkwarden | OAuth2/OIDC | implicit consent |
|
||||
| Vault | OAuth2/OIDC | implicit consent |
|
||||
| wrongmove | OAuth2/OIDC | implicit consent |
|
||||
|
||||
> **2026-06-10: every provider now uses implicit consent.** Cloudflare
> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
> and Vault (53) were switched from
> `default-provider-authorization-explicit-consent` via the API (these
> providers are UI-managed, not in TF). All are first-party apps; the
> expiring consent screen (re-shown every 4 weeks per app) only slowed
> first-time signin.
|
||||
|
||||
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
|
||||
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
|
||||
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
|
||||
|
|
@ -60,8 +69,27 @@
|
|||
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
|
||||
|
||||
## Authorization Flows
|
||||
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
|
||||
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
|
||||
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10
|
||||
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers
|
||||
|
||||
## Authentication Flow (single-screen login, 2026-06-10)
|
||||
|
||||
`default-authentication-flow` bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (`default-authentication-identification`, pk
`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to
`default-authentication-password`, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`)
was DELETED via the API because authentik requires removing it when the
identification stage embeds the password field. `password_stage` is pinned in
Terraform (`authentik_stage_identification.default_identification` in
`stacks/authentik/authentik_provider.tf`); all other stage fields stay
UI-managed via `ignore_changes`. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users would briefly see a second password prompt; delete the
binding again.
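The binding rule described above (no separate password binding once the identification stage embeds the password stage) can be expressed as a small check. This is a minimal Python sketch, not authentik's API: the `Binding` type, the `redundant_password_bindings` helper, and the short stage labels are illustrative assumptions that only mirror the orders quoted in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    """Hypothetical, simplified view of one flow-stage binding."""
    order: int
    stage: str  # shorthand label, not the real stage name or pk


def redundant_password_bindings(
    bindings: List[Binding], embedded_password_stage: Optional[str]
) -> List[Binding]:
    """Return bindings that bind the embedded password stage a second time.

    When the identification stage embeds no password stage, a separate
    password binding is expected, so nothing is reported.
    """
    if embedded_password_stage is None:
        return []
    return [b for b in bindings if b.stage == embedded_password_stage]


# Bindings as documented after the 2026-06-10 change (orders 10, 30, 100).
current = [
    Binding(10, "identification"),
    Binding(30, "mfa-validation"),
    Binding(100, "user-login"),
]

# Simulated regression: an upgrade or blueprint re-adds the order-20 binding.
regressed = current + [Binding(20, "password")]

if __name__ == "__main__":
    print(redundant_password_bindings(current, "password"))
    print(redundant_password_bindings(regressed, "password"))
```

A check of this shape could run after an authentik upgrade, fed with whatever the flow-bindings listing returns, to flag a re-added order-20 binding before users notice a second password prompt.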
|
||||
|
||||
## Invitation Enrollment Flow
|
||||
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
|
||||
|
|
@ -149,7 +177,9 @@ Notes:
|
|||
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster. This is codified via a `kubernetes_json_patches.deployment` patch that adds an `envFrom` reference to the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (it matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
|
||||
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
|
||||
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on Terraform's "write only the fields it knows about" semantics: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Re-verify this assumption before changing the provider version or the resource schema.
|
||||
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
|
||||
- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
|
||||
- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
|
||||
- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
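The env names in the notes above follow a regular pattern: an `AUTHENTIK_` prefix, with nested keys upper-cased and joined by a double underscore. The Python sketch below shows how a nested tuning dict maps onto `server.env`-style entries. The `to_env_entries` helper is hypothetical and is not part of the Helm chart, which renders no such entries in this stack.

```python
from typing import Any, Dict, List


def to_env_entries(values: Dict[str, Any], prefix: str = "AUTHENTIK") -> List[Dict[str, str]]:
    """Flatten a nested settings dict into Helm-style env entries.

    Nested keys are upper-cased and joined with a double underscore, e.g.
    {"web": {"workers": 3}} becomes AUTHENTIK_WEB__WORKERS="3".
    """
    entries: List[Dict[str, str]] = []

    def walk(node: Dict[str, Any], path: List[str]) -> None:
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, path + [key])
                continue
            name = prefix + "_" + "__".join(p.upper() for p in path + [key])
            # Env values are always strings; booleans use lowercase true/false.
            if isinstance(value, bool):
                text = "true" if value else "false"
            else:
                text = str(value)
            entries.append({"name": name, "value": text})

    walk(values, [])
    return entries


# The tuning values quoted in the notes above.
tuning = {
    "web": {"workers": 3, "threads": 4},
    "cache": {"timeout_flows": 1800, "timeout_policies": 900},
    "postgresql": {"conn_max_age": 60, "conn_health_checks": True},
}

if __name__ == "__main__":
    for entry in to_env_entries(tuning):
        print(f'- name: {entry["name"]}\n  value: "{entry["value"]}"')
```

Keeping the tuning in one dict and generating both env arrays from it would be one way to keep `server.env` and `worker.env` in lockstep, though the stack currently lists the entries by hand.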
|
||||
|
||||
## Upgrade Validation Checklist
|
||||
|
||||
|
|
@ -161,8 +191,9 @@ Run after **any** of these:
|
|||
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
|
||||
|
||||
```bash
|
||||
# 1. Service routes to the outpost pod (NOT the server pods).
|
||||
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
|
||||
# 1. Service routes to the outpost pods (NOT the server pods).
|
||||
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
|
||||
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
|
||||
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
|
||||
|
||||
# 2. Service selector still excludes the server pods. Expected: includes
|
||||
|
|
|
|||
|
|
@ -149,7 +149,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed
|
|||
|
||||
**CNPG** / **pg-cluster**:
|
||||
**CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore.
|
||||
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages.
|
||||
_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages.
|
||||
|
||||
### Secrets
|
||||
|
||||
|
|
|
|||
|
|
@ -40,10 +40,10 @@ graph TB
|
|||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
|
||||
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) |
|
||||
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
|
||||
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
|
||||
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
|
||||
| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) |
|
||||
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
|
||||
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
|
||||
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
|
||||
|
|
@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module (
|
|||
When `auth = "required"`, an unauthenticated request flows:
|
||||
|
||||
1. Request hits Traefik ingress
|
||||
2. ForwardAuth middleware calls Authentik embedded outpost
|
||||
3. Authentik checks for valid session cookie
|
||||
2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool
|
||||
3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps)
|
||||
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
|
||||
5. User authenticates via social provider (Google/GitHub/Facebook)
|
||||
5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation
|
||||
6. Authentik creates session, sets cookie, redirects back to original URL
|
||||
7. Subsequent requests include session cookie, pass auth check, reach backend
|
||||
|
||||
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
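The header-stripping rule can be illustrated with a short filter. This is a sketch under stated assumptions, not the stack's actual mechanism: it assumes the identity headers share an `x-authentik-` prefix, and the `strip_identity_headers` helper and the sample header names are hypothetical.

```python
from typing import Dict

# Assumption: all identity headers share this prefix (matched case-insensitively).
IDENTITY_HEADER_PREFIX = "x-authentik-"


def strip_identity_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without any identity headers."""
    return {
        name: value
        for name, value in headers.items()
        if not name.lower().startswith(IDENTITY_HEADER_PREFIX)
    }


if __name__ == "__main__":
    incoming = {
        "Host": "app.example",
        "X-Authentik-Username": "alice",
        "x-authentik-groups": "admins",
        "Accept": "*/*",
    }
    print(strip_identity_headers(incoming))
```

Matching on a lowercase prefix rather than an explicit allow-list of header names means a newly added identity header is dropped by default.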
|
||||
|
||||
### First-time signin performance (2026-06-10)
|
||||
|
||||
Signin latency is dominated by screen count and round trips, not server time
(DB avg 1.6ms). Standing decisions:

- **Single-screen login**: the identification stage carries `password_stage`,
  so username+password is one round trip. The separate password-stage binding
  was removed from `default-authentication-flow` (required by authentik when
  embedding). Pinned in TF: `authentik_stage_identification.default_identification`.
- **Implicit consent everywhere**: all OIDC providers are first-party, so none
  use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
  are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
  15m policy cache, 60s persistent DB connections.
- **Static assets cached immutable**: `/static` ingress carve-out adds
  `Cache-Control: public, max-age=31536000, immutable` (assets are
  version-fingerprinted; authentik itself sends no max-age).
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1, so there is no
  per-request TCP setup on the forward-auth subrequest path.
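The numeric values in the list above are plain durations expressed in seconds. The Python sketch below derives them to show where `31536000`, `1800` and `900` come from. The helper names are illustrative only and nothing in the stack computes these at runtime.

```python
from datetime import timedelta


def seconds(duration: timedelta) -> int:
    """Whole seconds in a duration."""
    return int(duration.total_seconds())


def immutable_cache_control(lifetime: timedelta = timedelta(days=365)) -> str:
    """Build the Cache-Control value used for fingerprinted static assets."""
    return f"public, max-age={seconds(lifetime)}, immutable"


# Cache TTLs from the tuning bullet, as they appear in the env arrays.
FLOW_CACHE_TTL = seconds(timedelta(minutes=30))    # AUTHENTIK_CACHE__TIMEOUT_FLOWS
POLICY_CACHE_TTL = seconds(timedelta(minutes=15))  # AUTHENTIK_CACHE__TIMEOUT_POLICIES

if __name__ == "__main__":
    print(immutable_cache_control())
    print(FLOW_CACHE_TTL, POLICY_CACHE_TTL)
```

A one-year `max-age` with `immutable` is only safe because the asset URLs change on every authentik upgrade. The same header on an unversioned path would pin stale files in browsers.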
|
||||
|
||||
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
|
||||
|
||||
### Social Login & Invitation Flow
|
||||
|
|
|
|||
|
|
@ -91,14 +91,21 @@ resource "authentik_outpost" "embedded" {
|
|||
protocol_providers = [authentik_provider_proxy.catchall.id]
|
||||
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
|
||||
config = jsonencode({
|
||||
log_level = "trace"
|
||||
docker_labels = null
|
||||
authentik_host = "https://authentik.viktorbarzin.me/"
|
||||
docker_network = null
|
||||
container_image = null
|
||||
docker_map_ports = true
|
||||
refresh_interval = "minutes=5"
|
||||
kubernetes_replicas = 1
|
||||
# info, not trace: the outpost sits on the hot path of every request to
|
||||
# every auth="required" ingress — trace logging is per-request overhead
|
||||
# with no operational value (request access lines are emitted at info).
|
||||
log_level = "info"
|
||||
docker_labels = null
|
||||
authentik_host = "https://authentik.viktorbarzin.me/"
|
||||
docker_network = null
|
||||
container_image = null
|
||||
docker_map_ports = true
|
||||
refresh_interval = "minutes=5"
|
||||
# 2 replicas: removes the single-pod hot path for all forward-auth
|
||||
# subrequests. Safe since sessions moved to the shared Postgres backend
|
||||
# (authentik_providers_proxy_proxysession, 2026-05-10) — no pod-local
|
||||
# session state anymore.
|
||||
kubernetes_replicas = 2
|
||||
kubernetes_namespace = "authentik"
|
||||
authentik_host_browser = ""
|
||||
object_naming_template = "ak-outpost-%(name)s"
|
||||
|
|
@ -198,3 +205,46 @@ resource "authentik_stage_user_login" "default_login" {
|
|||
]
|
||||
}
|
||||
}
|
||||
|
||||
# -----------------------------------------------------------------------------
# Default Identification stage, adopted 2026-06-10 to embed the password
# field on the identification screen (single-screen login: one round trip and
# one screen instead of two). Per authentik docs, when an Identification stage
# carries a password stage the Password stage must NOT be bound separately:
# the redundant order-20 binding on default-authentication-flow (pk
# 0fc677db-a23f-4ee7-8648-da342e14573b) was deleted via the API in the same
# change. Social-login users are unaffected: source buttons stay on the same
# screen and bypass the password field.
# -----------------------------------------------------------------------------

import {
  to = authentik_stage_identification.default_identification
  id = "32aca5ab-106e-43f4-a4cc-4513d80e57f3"
}

data "authentik_stage" "default_authentication_password" {
  name = "default-authentication-password"
}

resource "authentik_stage_identification" "default_identification" {
  name           = "default-authentication-identification"
  password_stage = data.authentik_stage.default_authentication_password.id
  lifecycle {
    # Pin only password_stage; everything else stays UI-managed (same pattern
    # as authentik_stage_user_login.default_login above).
    ignore_changes = [
      user_fields,
      case_insensitive_matching,
      show_matched_user,
      show_source_labels,
      sources,
      enrollment_flow,
      recovery_flow,
      passwordless_flow,
      pretend_user_exists,
      captcha_stage,
      webauthn_stage,
      enable_remember_me,
    ]
  }
}
|
||||
|
|
|
|||
|
|
@ -29,7 +29,7 @@ resource "kubernetes_namespace" "authentik" {
|
|||
labels = {
|
||||
tier = var.tier
|
||||
"resource-governance/custom-quota" = "true"
|
||||
"keel.sh/enrolled" = "true"
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
|
|
@ -111,3 +111,44 @@ module "ingress-outpost" {
|
|||
anti_ai_scraping = false
|
||||
exclude_crowdsec = true
|
||||
}
|
||||
|
||||
# Immutable caching for the flow-executor static assets. Authentik serves
# /static/dist/* with version-fingerprinted filenames (e.g. poly-2026.2.4.js)
# but no max-age, so browsers re-validate the login JS bundle on every signin,
# and split-horizon internal users (direct to Traefik, no Cloudflare) get no
# edge cache at all. Long-lived immutable caching is safe: every authentik
# upgrade changes the asset URLs.
resource "kubernetes_manifest" "static_cache_headers" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "static-cache-headers"
      namespace = kubernetes_namespace.authentik.metadata[0].name
    }
    spec = {
      headers = {
        customResponseHeaders = {
          "Cache-Control" = "public, max-age=31536000, immutable"
        }
      }
    }
  }
}

module "ingress-static" {
  source = "../../../../modules/kubernetes/ingress_factory"
  # Same-host path carve-out of the public authentik UI ingress above, only
  # adding the cache-headers middleware for the static asset prefix.
  # auth = "none": versioned static assets of the (already public) Authentik login UI.
  auth              = "none"
  namespace         = kubernetes_namespace.authentik.metadata[0].name
  name              = "authentik-static"
  host              = "authentik"
  service_name      = "goauthentik-server"
  ingress_path      = ["/static"]
  tls_secret_name   = var.tls_secret_name
  anti_ai_scraping  = false
  homepage_enabled  = false
  extra_middlewares = ["authentik-static-cache-headers@kubernetescrd"]
}
|
||||
|
|
|
|||
|
|
@ -1,4 +1,10 @@
|
|||
authentik:
|
||||
# NOTE: because we set existingSecret below, the chart does NOT render the
|
||||
# authentik.* values into an AUTHENTIK_* env Secret — the live env comes
|
||||
# from the orphaned, helm-keep-policy `goauthentik` Secret created by chart
|
||||
# 2025.10.3. Anything under authentik.* here is effectively INERT. All new
|
||||
# or tuned config MUST go through server.env / worker.env instead (see
|
||||
# .claude/reference/authentik-state.md).
|
||||
log_level: warning
|
||||
# log_level: trace
|
||||
secret_key: ""
|
||||
|
|
@ -14,38 +20,40 @@ authentik:
|
|||
port: 6432
|
||||
user: authentik
|
||||
password: ""
|
||||
# Persistent client-side connections (safe with PgBouncer session mode;
|
||||
# must be < pgbouncer server_idle_timeout=600s). Cuts Django connection
|
||||
# setup overhead off the ~70 sequential ORM ops per flow stage.
|
||||
conn_max_age: 60
|
||||
conn_health_checks: true
|
||||
cache:
|
||||
# Cache flow plans for 30m and policy evaluations for 15m. Authentik 2026.2
|
||||
# moved cache storage from Redis to Postgres, so a TTL hit is still a
|
||||
# SELECT — but a single indexed lookup beats re-evaluating PolicyBindings.
|
||||
timeout_flows: 1800
|
||||
timeout_policies: 900
|
||||
web:
|
||||
# Gunicorn: 3 workers × 4 threads per server pod (default 2×4).
|
||||
# Pairs with the server memory bump to 2Gi (each worker preloads Django ~500Mi).
|
||||
workers: 3
|
||||
threads: 4
|
||||
worker:
|
||||
# Celery-equivalent worker threads per pod (default 2, renamed from
|
||||
# AUTHENTIK_WORKER__CONCURRENCY in 2025.8).
|
||||
threads: 4
|
||||
|
||||
server:
|
||||
replicas: 3
|
||||
# Anonymous Django sessions (no completed login: bots, healthcheckers,
|
||||
# partial flows) expire in 2h. Default is days=1. Once login completes,
|
||||
# UserLoginStage.session_duration takes over via request.session.set_expiry.
|
||||
# Injected via server.env (not authentik.sessions.*) because we use
|
||||
# authentik.existingSecret.secretName, which makes the chart skip
|
||||
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
|
||||
env:
|
||||
# Anonymous Django sessions (no completed login: bots, healthcheckers,
|
||||
# partial flows) expire in 2h. Default is days=1. Once login completes,
|
||||
# UserLoginStage.session_duration takes over via request.session.set_expiry.
|
||||
# Injected via server.env (not authentik.sessions.*) because we use
|
||||
# authentik.existingSecret.secretName, which makes the chart skip
|
||||
# rendering the AUTHENTIK_* secret — so the values block doesn't reach env.
|
||||
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
|
||||
value: "hours=2"
|
||||
# Gunicorn: 3 workers × 4 threads per server pod (defaults 2×4).
|
||||
# Pairs with the server memory limit of 2Gi (each worker preloads
|
||||
# Django ~500Mi).
|
||||
- name: AUTHENTIK_WEB__WORKERS
|
||||
value: "3"
|
||||
- name: AUTHENTIK_WEB__THREADS
|
||||
value: "4"
|
||||
# Cache flow plans for 30m and policy evaluations for 15m (defaults 300s).
|
||||
# Authentik 2026.2 stores cache in Postgres, so a TTL hit is still a
|
||||
# SELECT — but a single indexed lookup beats re-planning the flow
|
||||
# (~70 sequential ORM ops per flow stage POST).
|
||||
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
|
||||
value: "1800"
|
||||
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
|
||||
value: "900"
|
||||
# Persistent client-side DB connections (safe with PgBouncer session mode;
|
||||
# must stay < pgbouncer server_idle_timeout=600s). Cuts per-request Django
|
||||
# connection setup off the auth hot path.
|
||||
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
|
||||
value: "60"
|
||||
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
|
||||
value: "true"
|
||||
strategy:
|
||||
type: RollingUpdate
|
||||
rollingUpdate:
|
||||
|
|
@ -82,11 +90,23 @@ worker:
|
|||
# certificate renewal) — no user-facing traffic, so 2-of-3 isn't
|
||||
# needed for availability. Drop saves ~100m sustained CPU.
|
||||
replicas: 2
|
||||
# Same unauthenticated_age cap as server — both the server (Django session
|
||||
# middleware) and worker (cleanup tasks) need to see the value.
|
||||
env:
|
||||
# Same unauthenticated_age cap as server — both the server (Django session
|
||||
# middleware) and worker (cleanup tasks) need to see the value.
|
||||
- name: AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE
|
||||
value: "hours=2"
|
||||
# Dramatiq worker threads per pod (default 2).
|
||||
- name: AUTHENTIK_WORKER__THREADS
|
||||
value: "4"
|
||||
# Keep cache + DB-connection settings in lockstep with server.env.
|
||||
- name: AUTHENTIK_CACHE__TIMEOUT_FLOWS
|
||||
value: "1800"
|
||||
- name: AUTHENTIK_CACHE__TIMEOUT_POLICIES
|
||||
value: "900"
|
||||
- name: AUTHENTIK_POSTGRESQL__CONN_MAX_AGE
|
||||
value: "60"
|
||||
- name: AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS
|
||||
value: "true"
|
||||
strategy:
|
||||
type: RollingUpdate
|
||||
rollingUpdate:
|
||||
|
|
|
|||
|
|
@ -720,6 +720,11 @@ resource "kubernetes_config_map" "auth_proxy_config" {
|
|||
"default.conf" = <<-EOT
|
||||
upstream authentik {
|
||||
server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
|
||||
# Reuse connections to the outpost. Without this every forward-auth
|
||||
# subrequest (= every request to every auth="required" ingress) opens
|
||||
# a fresh TCP connection. Requires HTTP/1.1 + cleared Connection
|
||||
# header on the proxy_pass locations below.
|
||||
keepalive 32;
|
||||
}
|
||||
server {
|
||||
listen 9000;
|
||||
|
|
@ -734,6 +739,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
|
|||
|
||||
location /outpost.goauthentik.io/auth/traefik {
|
||||
proxy_pass http://authentik;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
proxy_connect_timeout 3s;
|
||||
proxy_read_timeout 5s;
|
||||
proxy_send_timeout 5s;
|
||||
|
|
@ -764,6 +771,8 @@ resource "kubernetes_config_map" "auth_proxy_config" {
|
|||
|
||||
location /outpost.goauthentik.io/ {
|
||||
proxy_pass http://authentik;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Connection "";
|
||||
proxy_connect_timeout 3s;
|
||||
proxy_read_timeout 10s;
|
||||
proxy_set_header Host $host;
|
||||
|
|
@ -820,6 +829,11 @@ resource "kubernetes_deployment" "auth_proxy" {
|
|||
labels = {
|
||||
app = "auth-proxy"
|
||||
}
|
||||
annotations = {
|
||||
# nginx only reads its config at startup — roll the pods whenever
|
||||
# the ConfigMap content changes.
|
||||
"checksum/auth-proxy-config" = sha1(kubernetes_config_map.auth_proxy_config.data["default.conf"])
|
||||
}
|
||||
}
|
||||
spec {
|
||||
topology_spread_constraint {
|
||||
|
|
|
|||