infra/.claude/reference/authentik-state.md
Viktor Barzin 4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00

18 KiB
Raw Permalink Blame History

Authentik Current State

Snapshot of applications, groups, users, and flows. Use authentik skill for management tasks.

Applications (11)

Application Provider Type Auth Flow
Cloudflare Access OAuth2/OIDC implicit consent
Domain wide catch all Proxy (forward auth) implicit consent
Forgejo OAuth2/OIDC implicit consent
Grafana OAuth2/OIDC implicit consent
Headscale OAuth2/OIDC implicit consent
Immich OAuth2/OIDC implicit consent
Kubernetes OAuth2/OIDC (public) implicit consent
Kubernetes Dashboard OAuth2/OIDC (confidential) implicit consent
linkwarden OAuth2/OIDC implicit consent
Vault OAuth2/OIDC implicit consent
wrongmove OAuth2/OIDC implicit consent

2026-06-10 — every provider now uses implicit consent. Cloudflare Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8) and Vault (53) were switched from default-provider-authorization-explicit-consent via the API (these providers are UI-managed, not in TF). All are first-party apps; the expiring consent screen (re-shown every 4 weeks per app) only slowed first-time signin.

Kubernetes Dashboard (TF-managed in stacks/k8s-dashboard/authentik.tf): confidential client k8s-dashboard, built for seamless dashboard SSO via oauth2-proxy. Currently IDLE — the apiserver rejects all OIDC tokens (see docs/plans/2026-06-04-k8s-dashboard-sso-design.md §12), so the dashboard runs on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a future SSO retry once apiserver OIDC is fixed.

admin-services-restriction policy (TF-managed in stacks/authentik/admin-services-restriction.tf, adopted 2026-06-04): gates the 15 admin-only hostnames to Home Server Admins, with a carve-out admitting the kubernetes-* RBAC groups to k8s.viktorbarzin.me (dashboard login page).

Groups (9)

Group Parent Superuser Purpose
Allow Login Users -- No Parent group for login-permitted users
authentik Admins -- Yes Full admin access
Headscale Users Allow Login Users No VPN access
Home Server Admins Allow Login Users No Server admin access
Wrongmove Users Allow Login Users No Real-estate app access
kubernetes-admins -- No K8s cluster-admin RBAC
kubernetes-power-users -- No K8s power-user RBAC
kubernetes-namespace-owners -- No K8s namespace-owner RBAC
Task Submitters -- No Task submission access

Users (8 real)

Username Name Type Groups
akadmin authentik Default Admin internal authentik Admins, Home Server Admins, Headscale Users
vbarzin@gmail.com Viktor Barzin internal authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users
emil.barzin@gmail.com Emil Barzin internal Home Server Admins, Headscale Users
ancaelena98@gmail.com Anca Milea external Wrongmove Users, Headscale Users
vabbit81@gmail.com GHEORGHE Milea external Headscale Users, kubernetes-namespace-owners, sops-vabbit81
valentinakolevabarzina@gmail.com Valentina internal Headscale Users
anca.r.cristian10@gmail.com -- internal Wrongmove Users
kadir.tugan@gmail.com Kadir internal Wrongmove Users

Login Sources

  • Google (OAuth) -- user matching by identifier
  • GitHub (OAuth) -- user matching by email_link
  • Facebook (OAuth) -- user matching by email_link
  • All sources use invitation-enrollment as enrollment flow (new users require invitation)

Authorization Flows

  • Explicit consent (default-provider-authorization-explicit-consent): Shows consent screen — no provider uses it since 2026-06-10
  • Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects — used by ALL providers

Authentication Flow (single-screen login, 2026-06-10)

default-authentication-flow bindings: identification (order 10) → mfa-validation (order 30) → user-login (order 100). The identification stage (default-authentication-identification, pk 32aca5ab-106e-43f4-a4cc-4513d80e57f3) has password_stage set to default-authentication-password, so username + password render on ONE screen (one round trip instead of two). The previously separate password-stage binding at order 20 (pk 0fc677db-a23f-4ee7-8648-da342e14573b) was DELETED via the API — authentik requires removing it when the identification stage embeds the password field. password_stage is pinned in Terraform (authentik_stage_identification.default_identification in stacks/authentik/authentik_provider.tf); all other stage fields stay UI-managed via ignore_changes. Social-login buttons remain on the same screen and bypass the password field, so Google/GitHub/Facebook users are unaffected. If a future authentik upgrade/blueprint re-adds the order-20 binding, users would briefly see a second password prompt — delete the binding again.

Invitation Enrollment Flow

Slug: invitation-enrollment | PK: 7d667321-2b02-4e16-8161-148078a8dac1

New users can only sign up via invitation link. Admins generate single-use invite links.

Stages (in order)

Order Stage Type Purpose
10 invitation-validation Invitation Validates ?itoken= parameter, blocks without valid token
20 enrollment-identification Identification Shows social login (Google/GitHub/Facebook) + passkey
30 enrollment-prompt Prompt Collects name and email (pre-filled from social login)
40 enrollment-user-write User Write Creates user in Allow Login Users group
50 enrollment-login User Login Auto-login after signup (policy: invitation-group-assignment adds user to target group from invitation fixed_data.group)

Invitation Management

Script: .claude/scripts/authentik-invite.sh

# Create invitation (single-use, no expiry)
./authentik-invite.sh create "Headscale Users"

# Create invitation with expiry
./authentik-invite.sh create "Wrongmove Users" --days 7

# Add user to group after enrollment
./authentik-invite.sh assign <username> "Headscale Users"

# List pending invitations
./authentik-invite.sh list

Invited users sign up via social login (Google/GitHub/Facebook) or passkey. No username/password enrollment. The target group (e.g. "Headscale Users") is auto-assigned on enrollment via the invitation-group-assignment expression policy. The assign command is available for manual post-enrollment group changes.

Cleanup Log (2026-03-13)

Deleted Flows

  • enrollment-inviation (typo) -- previous invitation attempt
  • headscale-authentication -- not used by any provider
  • headscale-authorization -- not used by any provider
  • default-enrollment-flow -- password-based, unused
  • oauth-enrollment -- replaced by invitation-enrollment

Deleted Stages

  • enrollment-invitation, enrollment-invitation-write (from old invitation flow)
  • invitation (unbound)
  • default-enrollment-prompt-first, default-enrollment-prompt-second (from default enrollment)
  • default-enrollment-user-write, default-enrollment-email-verification, default-enrollment-user-login

Deleted Groups

  • authentik Read-only -- 0 users, unused role

Deleted Policies

  • map github username to email -- unbound
  • Map Google Attributes -- unbound

Deleted Roles

  • authentik Read-only -- no group assignment

Policy Fix (2026-04-06)

Unbound brute-force-protection Policy

The brute-force-protection ReputationPolicy (PK: ac98cb11-31d3-46ab-8883-bf51e6b09a60, check_username=True, check_ip=True, threshold=-5) was bound to 3 authentication flows, causing "Flow does not apply to current user" for all unauthenticated users (no username to evaluate → failure_result=false → flow denied).

Removed bindings from:

  • default-authentication-flow (PK: 34618cf3) — username/password login
  • webauthn (PK: 0b60c2a5) — passkey login
  • default-source-authentication (PK: via policybindingmodel 1a779f24) — Google/GitHub/Facebook OAuth

Policy still exists with 0 bindings. If brute-force protection is needed, bind to the password stage (not the flow level).

Session Duration (2026-05-01)

Pinned via Terraform in stacks/authentik/:

Knob Value Surface Effect
UserLoginStage.session_duration on default-authentication-login weeks=4 authentik_stage_user_login.default_login in authentik_provider.tf Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login.
ProxyProvider.access_token_validity on Provider for Domain wide catch all weeks=4 authentik_provider_proxy.catchall.access_token_validity in authentik_provider.tf Cookie Max-Age on authentik_proxy_* and expires on rows in authentik_providers_proxy_proxysession. Bumped 2026-05-10 from hours=168. Bumping requires kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs "reusing existing session store" and skips rebuild.
AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (server + worker) hours=2 server.env + worker.env in modules/authentik/values.yaml Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default.

Notes:

  • There is no Brand.session_duration; UserLoginStage is the only correct lever for authenticated session lifetime.
  • Embedded outpost session storage: PostgreSQL table authentik_providers_proxy_proxysession in authentik 2025.10+ (PR #16628), but only when IsEmbedded() returns true (i.e. Outpost.managed == "goauthentik.io/outposts/embedded"). Our outpost record had managed=null until 2026-05-10, which silently kept it on the gorilla FilesystemStore at /dev/shm (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see authentik_outpost.embedded in authentik_provider.tf and post-mortem 2026-04-18-authentik-outpost-shm-full.md.
  • The proxy outpost service has a known goauthentik 2026.2.2 bug (internal/outpost/controllers/k8s/service.py:52): for embedded outposts the controller sets the Service selector to app.kubernetes.io/name=authentik (the server pods), not authentik-outpost-proxy. We work around it via a kubernetes_json_patches.service patch on the outpost record (replaces /spec/selector with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm Emergency Access.
  • The standalone embedded-outpost deployment needs AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} env vars to reach the dbaas cluster — codified via kubernetes_json_patches.deployment envFrom the shared goauthentik Secret. The app.kubernetes.io/component=server pod label is also injected via JSON patch (matches the component:server half of the Service selector that the controller adds for embedded outposts).
  • ProxyProvider.remember_me_offset stays UI-managed via ignore_changes.
  • The Authentik provider's resource schema does not expose the Outpost.managed field. We rely on TF's "write only fields it knows about" semantic: the server-set goauthentik.io/outposts/embedded value is preserved across applies because Terraform never writes managed. Don't change the resource provider schema expectations without verifying this assumption holds.
  • ALL tuned env vars are injected via server.env / worker.env (not the authentik.* values block) because we set authentik.existingSecret.secretName: goauthentik, which makes the chart skip rendering its own AUTHENTIK_* Secret. The authentik.* value block is therefore inert in this stack — anything new under authentik.* must use the *.env arrays instead. Live base values come from the orphaned, helm-keep-policy goauthentik Secret created by chart 2025.10.3 before existingSecret was introduced. 2026-06-10: the previously-inert tuning (AUTHENTIK_WEB__WORKERS=3, AUTHENTIK_WEB__THREADS=4, AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800, AUTHENTIK_CACHE__TIMEOUT_POLICIES=900, AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60, AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true, worker AUTHENTIK_WORKER__THREADS=4) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
  • Outpost (2026-06-10): log_level=info (was trace — per-request overhead on the forward-auth hot path) and kubernetes_replicas=2 (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in authentik_outpost.embedded config.
  • Image tag is PINNED in values (global.image.tag), 2026-06-10: Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md). Before touching this chart, check the live image tag and refresh the pin.
  • Liveness budget (2026-06-10): server.livenessProbe = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
  • PgBouncer (2026-06-10): idle_transaction_timeout=300 reaps ghost idle in transaction sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE — session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
  • Static assets (2026-06-10): a second ingress_factory (module.ingress-static, path /static on the authentik host) attaches the authentik-static-cache-headers middleware → Cache-Control: public, max-age=31536000, immutable. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).

Upgrade Validation Checklist

Run after any of these:

  • Authentik chart version bump in stacks/authentik/modules/authentik/main.tf (the version = "..." line on helm_release.authentik).
  • goauthentik/authentik Terraform provider version bump.
  • Outpost pod recreation (kured reboot, eviction, manual rollout restart, scheduler move).

The fragile surfaces are the kubernetes_json_patches and the Outpost.managed field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.

# 1. Service routes to the outpost pods (NOT the server pods).
#    Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
#    (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost

# 2. Service selector still excludes the server pods. Expected: includes
#    `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
#    `name: authentik`, the goauthentik upstream bug came back or our
#    JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'

# 3. Outpost mode + session backend. Expected log lines on startup:
#      {"embedded":true,"event":"Outpost mode",...}
#      {"event":"using PostgreSQL session backend",...}
#    If embedded=false or `using filesystem session backend`, the postgres
#    fix is broken — likely `Outpost.managed` got cleared, or the upstream
#    schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3

# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
#    A row count > a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'

# 5. Postgres session table is growing with traffic. Expected: rows with
#    `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"

# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
  | grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'

# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'

Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (AuthentikForwardAuthFallbackActive, AuthentikOutpostForwardAuth400Spike). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.

If step 2 shows the controller restored app.kubernetes.io/name=authentik, watch goauthentik/authentik issue tracker for fixes around internal/outpost/controllers/k8s/service.py:52 — the upstream patch might let us drop our kubernetes_json_patches.service workaround.