Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which is the symptom of the auth-proxy Emergency-Access fallback firing, in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`: kube-state-metrics endpoint metrics exist as series names but never have current values in this Prometheus pipeline (something is dropping them silently). Detecting the failure at the edge via Traefik is more reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex: the service label is `authentik-ak-outpost-...`, not `authentik-authentik-outpost-...`, so the alert never matched any series and never could have fired. Verified in Prometheus before/after the fix.

Add an "Upgrade Validation Checklist" section to `.claude/reference/authentik-state.md` with the seven-step smoke test to run after Authentik chart bumps, provider bumps, or outpost pod recreation. Covers the brittle surfaces (Service selector, JSON patches, postgres backend wiring, access_token_validity TTL, edge auth flow, plan-to-zero).
Authentik Current State
Snapshot of applications, groups, users, and flows. Use the authentik skill for management tasks.
Applications (10)
| Application | Provider Type | Auth Flow |
|---|---|---|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | explicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| Matrix | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
Groups (9)
| Group | Parent | Superuser | Purpose |
|---|---|---|---|
| Allow Login Users | -- | No | Parent group for login-permitted users |
| authentik Admins | -- | Yes | Full admin access |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | -- | No | K8s cluster-admin RBAC |
| kubernetes-power-users | -- | No | K8s power-user RBAC |
| kubernetes-namespace-owners | -- | No | K8s namespace-owner RBAC |
| Task Submitters | -- | No | Task submission access |
Users (8 real)
| Username | Name | Type | Groups |
|---|---|---|---|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users |
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
Login Sources
- Google (OAuth) -- user matching by identifier
- GitHub (OAuth) -- user matching by email_link
- Facebook (OAuth) -- user matching by email_link
- All sources use invitation-enrollment as the enrollment flow (new users require an invitation)
Authorization Flows
- Explicit consent (default-provider-authorization-explicit-consent): Shows consent screen
- Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects
Invitation Enrollment Flow
Slug: invitation-enrollment | PK: 7d667321-2b02-4e16-8161-148078a8dac1
New users can only sign up via invitation link. Admins generate single-use invite links.
Stages (in order)
| Order | Stage | Type | Purpose |
|---|---|---|---|
| 10 | invitation-validation | Invitation | Validates ?itoken= parameter, blocks without valid token |
| 20 | enrollment-identification | Identification | Shows social login (Google/GitHub/Facebook) + passkey |
| 30 | enrollment-prompt | Prompt | Collects name and email (pre-filled from social login) |
| 40 | enrollment-user-write | User Write | Creates user in Allow Login Users group |
| 50 | enrollment-login | User Login | Auto-login after signup (policy: invitation-group-assignment adds user to target group from invitation fixed_data.group) |
Invitation Management
Script: .claude/scripts/authentik-invite.sh
# Create invitation (single-use, no expiry)
./authentik-invite.sh create "Headscale Users"
# Create invitation with expiry
./authentik-invite.sh create "Wrongmove Users" --days 7
# Add user to group after enrollment
./authentik-invite.sh assign <username> "Headscale Users"
# List pending invitations
./authentik-invite.sh list
Invited users sign up via social login (Google/GitHub/Facebook) or passkey. No username/password enrollment.
The target group (e.g. "Headscale Users") is auto-assigned on enrollment via the invitation-group-assignment expression policy. The assign command is available for manual post-enrollment group changes.
Cleanup Log (2026-03-13)
Deleted Flows
- enrollment-inviation (typo) -- previous invitation attempt
- headscale-authentication -- not used by any provider
- headscale-authorization -- not used by any provider
- default-enrollment-flow -- password-based, unused
- oauth-enrollment -- replaced by invitation-enrollment
Deleted Stages
- enrollment-invitation, enrollment-invitation-write (from old invitation flow)
- invitation (unbound)
- default-enrollment-prompt-first, default-enrollment-prompt-second (from default enrollment)
- default-enrollment-user-write, default-enrollment-email-verification, default-enrollment-user-login
Deleted Groups
- authentik Read-only -- 0 users, unused role
Deleted Policies
- map github username to email -- unbound
- Map Google Attributes -- unbound
Deleted Roles
- authentik Read-only -- no group assignment
Policy Fix (2026-04-06)
Unbound brute-force-protection Policy
The brute-force-protection ReputationPolicy (PK: ac98cb11-31d3-46ab-8883-bf51e6b09a60, check_username=True, check_ip=True, threshold=-5) was bound to 3 authentication flows, causing "Flow does not apply to current user" for all unauthenticated users (no username to evaluate → failure_result=false → flow denied).
Removed bindings from:
- default-authentication-flow (PK: 34618cf3) -- username/password login
- webauthn (PK: 0b60c2a5) -- passkey login
- default-source-authentication (PK: via policybindingmodel 1a779f24) -- Google/GitHub/Facebook OAuth
The policy still exists with 0 bindings. If brute-force protection is needed, bind it to the password stage, not at the flow level.
Session Duration (2026-05-01)
Pinned via Terraform in stacks/authentik/:
| Knob | Value | Surface | Effect |
|---|---|---|---|
| UserLoginStage.session_duration on default-authentication-login | weeks=4 | authentik_stage_user_login.default_login in authentik_provider.tf | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh; resets on each login. |
| ProxyProvider.access_token_validity on Provider for Domain wide catch all | weeks=4 | authentik_provider_proxy.catchall.access_token_validity in authentik_provider.tf | Cookie Max-Age on authentik_proxy_* and expires on rows in authentik_providers_proxy_proxysession. Bumped 2026-05-10 from hours=168. Bumping requires kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost because the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs "reusing existing session store" and skips rebuild. |
| AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (server + worker) | hours=2 | server.env + worker.env in modules/authentik/values.yaml | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
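To compare the table's values against a cookie Max-Age or a session expiry, it helps to have them in seconds. The following is a minimal sketch; the helper name duration_to_seconds is hypothetical, and it assumes a single unit=value pair in the form used in the table.

```shell
# Hypothetical helper: convert a duration like "weeks=4" or "hours=168" to seconds.
# Assumes exactly one unit=value pair; combined durations are not handled.
duration_to_seconds() {
  local unit value
  unit="${1%%=*}"    # text before the first "="
  value="${1#*=}"    # text after the first "="
  case "$unit" in
    seconds) echo "$value" ;;
    minutes) echo $(( value * 60 )) ;;
    hours)   echo $(( value * 3600 )) ;;
    days)    echo $(( value * 86400 )) ;;
    weeks)   echo $(( value * 604800 )) ;;
    *) echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
}

duration_to_seconds weeks=4     # 2419200 (28 days, current access_token_validity)
duration_to_seconds hours=168   # 604800 (7 days, the value before 2026-05-10)
duration_to_seconds hours=2     # 7200 (unauthenticated session age)
```

A Max-Age near 604800 after a bump to weeks=4 would suggest the outpost was not restarted and is still using the old value.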
Notes:
- There is no Brand.session_duration; UserLoginStage is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage: PostgreSQL table authentik_providers_proxy_proxysession in authentik 2025.10+ (PR #16628), but only when IsEmbedded() returns true (i.e. Outpost.managed == "goauthentik.io/outposts/embedded"). Our outpost record had managed=null until 2026-05-10, which silently kept it on the gorilla FilesystemStore at /dev/shm (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see authentik_outpost.embedded in authentik_provider.tf and post-mortem 2026-04-18-authentik-outpost-shm-full.md.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (internal/outpost/controllers/k8s/service.py:52): for embedded outposts the controller sets the Service selector to app.kubernetes.io/name=authentik (the server pods), not authentik-outpost-proxy. We work around it via a kubernetes_json_patches.service patch on the outpost record (replaces /spec/selector with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm Emergency Access.
- The standalone embedded-outpost deployment needs AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} env vars to reach the dbaas cluster, codified via kubernetes_json_patches.deployment envFrom the shared goauthentik Secret. The app.kubernetes.io/component=server pod label is also injected via JSON patch (matches the component:server half of the Service selector that the controller adds for embedded outposts).
- ProxyProvider.remember_me_offset stays UI-managed via ignore_changes.
- The Authentik provider's resource schema does not expose the Outpost.managed field. We rely on TF's "write only fields it knows about" semantic: the server-set goauthentik.io/outposts/embedded value is preserved across applies because Terraform never writes managed. Don't change the resource provider schema expectations without verifying this assumption holds.
- The unauthenticated_age env var is injected via server.env/worker.env (not authentik.sessions.unauthenticated_age) because we set authentik.existingSecret.secretName: goauthentik, which makes the chart skip rendering its own AUTHENTIK_* Secret. The authentik.* value block is therefore inert in this stack; anything new under authentik.* must use the *.env arrays instead. The same applies to the existing authentik.cache.*, authentik.web.*, authentik.worker.* blocks (currently inert; live values come from the orphaned, helm-keep-policy goauthentik Secret created by chart 2025.10.3 before existingSecret was introduced).
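The selector workaround described in the notes can be checked mechanically. The sketch below is illustrative only: the function name classify_selector_name is hypothetical, and it assumes the value of the app.kubernetes.io/name selector key is passed in as a plain string (the commented kubectl call shows one way to obtain it using an escaped JSONPath key).

```shell
# Hypothetical helper: classify the app.kubernetes.io/name value of the outpost Service selector.
# Returns 0 when the selector targets the outpost pod, 1 when it targets the server pods,
# and 2 for anything else.
classify_selector_name() {
  case "$1" in
    authentik-outpost-proxy)
      echo "ok: selector targets the outpost pod" ;;
    authentik)
      echo "broken: selector targets the server pods"
      return 1 ;;
    *)
      echo "unknown: unexpected selector name '$1'"
      return 2 ;;
  esac
}

# Example against a live cluster (not run here):
#   name="$(kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost \
#     -o jsonpath='{.spec.selector.app\.kubernetes\.io/name}')"
#   classify_selector_name "$name"
classify_selector_name authentik-outpost-proxy
```

The non-zero exit codes make the helper usable as a gate in a post-upgrade script, where a "broken" result means the JSON patch was unset or the upstream behaviour returned.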
Upgrade Validation Checklist
Run after any of these:
- Authentik chart version bump in stacks/authentik/modules/authentik/main.tf (the version = "..." line on helm_release.authentik).
- goauthentik/authentik Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual rollout restart, scheduler move).
The fragile surfaces are the kubernetes_json_patches and the Outpost.managed field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
#    A file count above a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
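Step 6 can be turned into a pass/fail check by classifying the response headers. This is a rough sketch, not part of the existing tooling: classify_edge_response is a hypothetical name, and it assumes the headers arrive on stdin in the form `curl -D -` prints them. It only checks for a 302 with a Location header and does not verify where the redirect points.

```shell
# Hypothetical helper: read HTTP response headers on stdin and classify the edge auth result.
# Returns 0 for a 302 redirect, 1 for the Basic Auth fallback (401 + WWW-Authenticate), 2 otherwise.
classify_edge_response() {
  local headers status
  headers="$(tr -d '\r')"    # strip CRs so header matching is line-based
  # The status code is the second field of the first line starting with "HTTP".
  status="$(printf '%s\n' "$headers" | awk 'toupper($1) ~ /^HTTP/ { print $2; exit }')"
  if [ "$status" = "401" ] && printf '%s\n' "$headers" | grep -qi '^www-authenticate:'; then
    echo "fallback: 401 with WWW-Authenticate"
    return 1
  elif [ "$status" = "302" ] && printf '%s\n' "$headers" | grep -qi '^location:'; then
    echo "ok: 302 redirect"
    return 0
  fi
  echo "unexpected: status ${status:-none}"
  return 2
}

# Example against the live edge (not run here):
#   curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
#     | classify_edge_response
```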
Steps 1, 3, and 6 cover the failure modes the Prometheus alerts trigger on (AuthentikForwardAuthFallbackActive, AuthentikOutpostForwardAuth400Spike). Steps 4 and 5 cover the silent-regression case (filesystem fallback), where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
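Because the silent-regression case produces no alert, a scripted check on the startup logs from step 3 may be useful. The sketch below uses the log strings quoted in the checklist; the function name classify_session_backend is hypothetical, and the exact JSON layout of the log lines is an assumption.

```shell
# Hypothetical helper: read outpost log lines on stdin and report which session backend is active.
# Returns 0 for the PostgreSQL backend, 1 for a regression, 2 if no relevant line is found.
classify_session_backend() {
  local logs
  logs="$(cat)"
  if printf '%s\n' "$logs" | grep -q 'using filesystem session backend'; then
    echo "regression: filesystem session backend"
    return 1
  elif printf '%s\n' "$logs" | grep -q '"embedded":false'; then
    echo "regression: outpost not in embedded mode"
    return 1
  elif printf '%s\n' "$logs" | grep -q 'using PostgreSQL session backend'; then
    echo "ok: PostgreSQL session backend"
    return 0
  fi
  echo "unknown: no session backend line found"
  return 2
}

# Example against a live cluster (not run here):
#   kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost \
#     | classify_session_backend
```

An "unknown" result can also mean the startup lines have rotated out of the log, so it is worth re-running right after a pod restart before treating it as a failure.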
If step 2 shows the controller restored app.kubernetes.io/name=authentik, watch the goauthentik/authentik issue tracker for fixes around internal/outpost/controllers/k8s/service.py:52; an upstream patch might let us drop our kubernetes_json_patches.service workaround.