Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik Current State
Snapshot of applications, groups, users, and flows. Use the authentik skill for management tasks.
Applications (11)
| Application | Provider Type | Auth Flow |
|---|---|---|
| Cloudflare Access | OAuth2/OIDC | implicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | implicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | implicit consent |
| Immich | OAuth2/OIDC | implicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
| linkwarden | OAuth2/OIDC | implicit consent |
| Vault | OAuth2/OIDC | implicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
2026-06-10: every provider now uses implicit consent. Cloudflare Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8) and Vault (53) were switched from default-provider-authorization-explicit-consent via the API (these providers are UI-managed, not in TF). All are first-party apps; the expiring consent screen (re-shown every 4 weeks per app) only slowed first-time sign-in.
Kubernetes Dashboard (TF-managed in stacks/k8s-dashboard/authentik.tf): confidential client k8s-dashboard, built for seamless dashboard SSO via oauth2-proxy. Currently IDLE: the apiserver rejects all OIDC tokens (see docs/plans/2026-06-04-k8s-dashboard-sso-design.md §12), so the dashboard runs on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a future SSO retry once apiserver OIDC is fixed.
admin-services-restriction policy (TF-managed in stacks/authentik/admin-services-restriction.tf, adopted 2026-06-04): gates the 15 admin-only hostnames to Home Server Admins, with a carve-out admitting the kubernetes-* RBAC groups to k8s.viktorbarzin.me (dashboard login page).
Groups (9)
| Group | Parent | Superuser | Purpose |
|---|---|---|---|
| Allow Login Users | -- | No | Parent group for login-permitted users |
| authentik Admins | -- | Yes | Full admin access |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | -- | No | K8s cluster-admin RBAC |
| kubernetes-power-users | -- | No | K8s power-user RBAC |
| kubernetes-namespace-owners | -- | No | K8s namespace-owner RBAC |
| Task Submitters | -- | No | Task submission access |
Users (8 real)
| Username | Name | Type | Groups |
|---|---|---|---|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users, kubernetes-namespace-owners, sops-vabbit81 |
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
Login Sources
- Google (OAuth) -- user matching by identifier
- GitHub (OAuth) -- user matching by email_link
- Facebook (OAuth) -- user matching by email_link
- All sources use invitation-enrollment as enrollment flow (new users require invitation)
Authorization Flows
- Explicit consent (default-provider-authorization-explicit-consent): Shows consent screen; no provider uses it since 2026-06-10
- Implicit consent (default-provider-authorization-implicit-consent): Auto-redirects; used by ALL providers
Authentication Flow (single-screen login, 2026-06-10)
default-authentication-flow bindings: identification (order 10) →
mfa-validation (order 30) → user-login (order 100). The identification
stage (default-authentication-identification, pk
32aca5ab-106e-43f4-a4cc-4513d80e57f3) has password_stage set to
default-authentication-password, so username + password render on ONE
screen (one round trip instead of two). The previously separate
password-stage binding at order 20 (pk 0fc677db-a23f-4ee7-8648-da342e14573b)
was DELETED via the API — authentik requires removing it when the
identification stage embeds the password field. password_stage is pinned in
Terraform (authentik_stage_identification.default_identification in
stacks/authentik/authentik_provider.tf); all other stage fields stay
UI-managed via ignore_changes. Social-login buttons remain on the same
screen and bypass the password field, so Google/GitHub/Facebook users are
unaffected. If a future authentik upgrade/blueprint re-adds the order-20
binding, users would briefly see a second password prompt — delete the
binding again.
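A minimal sketch of how a re-added binding could be spotted, assuming the flow's bindings have already been fetched into a list of dicts. The `order` and `stage` keys and the helper name are illustrative and not taken from the authentik API schema; only the expected orders (10, 30, 100) come from the description above.

```python
# Expected layout of default-authentication-flow as described in this section.
EXPECTED_ORDERS = {
    10: "identification",
    30: "mfa-validation",
    100: "user-login",
}


def unexpected_bindings(bindings):
    """Return the bindings whose order is not part of the expected layout.

    `bindings` is a list of dicts with hypothetical keys "order" and "stage".
    """
    return [b for b in bindings if b["order"] not in EXPECTED_ORDERS]


# Illustrative data: an upgrade has re-added the separate password stage.
current = [
    {"order": 10, "stage": "identification"},
    {"order": 20, "stage": "default-authentication-password"},
    {"order": 30, "stage": "mfa-validation"},
    {"order": 100, "stage": "user-login"},
]

for binding in unexpected_bindings(current):
    print(f"delete binding at order {binding['order']}: {binding['stage']}")
```

The check only compares orders, so it would also flag any other stage bound at an unexpected position, which is usually what one wants after an upgrade.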
Invitation Enrollment Flow
Slug: invitation-enrollment | PK: 7d667321-2b02-4e16-8161-148078a8dac1
New users can only sign up via invitation link. Admins generate single-use invite links.
Stages (in order)
| Order | Stage | Type | Purpose |
|---|---|---|---|
| 10 | invitation-validation | Invitation | Validates ?itoken= parameter, blocks without valid token |
| 20 | enrollment-identification | Identification | Shows social login (Google/GitHub/Facebook) + passkey |
| 30 | enrollment-prompt | Prompt | Collects name and email (pre-filled from social login) |
| 40 | enrollment-user-write | User Write | Creates user in Allow Login Users group |
| 50 | enrollment-login | User Login | Auto-login after signup (policy: invitation-group-assignment adds user to target group from invitation fixed_data.group) |
Invitation Management
Script: .claude/scripts/authentik-invite.sh
# Create invitation (single-use, no expiry)
./authentik-invite.sh create "Headscale Users"
# Create invitation with expiry
./authentik-invite.sh create "Wrongmove Users" --days 7
# Add user to group after enrollment
./authentik-invite.sh assign <username> "Headscale Users"
# List pending invitations
./authentik-invite.sh list
Invited users sign up via social login (Google/GitHub/Facebook) or passkey. No username/password enrollment.
The target group (e.g. "Headscale Users") is auto-assigned on enrollment via the invitation-group-assignment expression policy. The assign command is available for manual post-enrollment group changes.
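For orientation, a hedged sketch of what an invitation link looks like. It assumes authentik's usual flow-executor path (`/if/flow/<slug>/`) and uses a placeholder host and token; whether authentik-invite.sh prints exactly this form is not confirmed here. Only the slug and the `itoken` parameter come from this document.

```python
from urllib.parse import urlencode

FLOW_SLUG = "invitation-enrollment"  # slug from this section


def invitation_url(base_url: str, itoken: str) -> str:
    """Build an enrollment link that carries the invitation token."""
    query = urlencode({"itoken": itoken})
    return f"{base_url.rstrip('/')}/if/flow/{FLOW_SLUG}/?{query}"


# Placeholder host and token, for illustration only.
print(invitation_url("https://authentik.example.com", "00000000-0000-0000-0000-000000000000"))
```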
Cleanup Log (2026-03-13)
Deleted Flows
- enrollment-inviation (typo) -- previous invitation attempt
- headscale-authentication -- not used by any provider
- headscale-authorization -- not used by any provider
- default-enrollment-flow -- password-based, unused
- oauth-enrollment -- replaced by invitation-enrollment
Deleted Stages
- enrollment-invitation, enrollment-invitation-write (from old invitation flow)
- invitation (unbound)
- default-enrollment-prompt-first, default-enrollment-prompt-second (from default enrollment)
- default-enrollment-user-write, default-enrollment-email-verification, default-enrollment-user-login
Deleted Groups
- authentik Read-only -- 0 users, unused role
Deleted Policies
- map github username to email -- unbound
- Map Google Attributes -- unbound
Deleted Roles
- authentik Read-only -- no group assignment
Policy Fix (2026-04-06)
Unbound brute-force-protection Policy
The brute-force-protection ReputationPolicy (PK: ac98cb11-31d3-46ab-8883-bf51e6b09a60, check_username=True, check_ip=True, threshold=-5) was bound to 3 authentication flows, causing "Flow does not apply to current user" for all unauthenticated users (no username to evaluate → failure_result=false → flow denied).
Removed bindings from:
- default-authentication-flow (PK: 34618cf3) -- username/password login
- webauthn (PK: 0b60c2a5) -- passkey login
- default-source-authentication (PK: via policybindingmodel 1a779f24) -- Google/GitHub/Facebook OAuth
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the password stage (not the flow level).
Session Duration (2026-05-01)
Pinned via Terraform in stacks/authentik/:
| Knob | Value | Surface | Effect |
|---|---|---|---|
| UserLoginStage.session_duration on default-authentication-login | weeks=4 | authentik_stage_user_login.default_login in authentik_provider.tf | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh; resets on each login. Used by password login (default-authentication-flow) AND passkey login (webauthn flow; both terminate on this stage). |
| UserLoginStage.session_duration on default-source-authentication-login | weeks=4 | authentik_stage_user_login.default_source_login in authentik_provider.tf (imported 2026-06-20, id 4c6977d2-…) | Social logins (Google/GitHub/Facebook, via default-source-authentication-flow). Was the provider default seconds=0, which fell back to UNAUTHENTICATED_AGE=hours=2, so social logins expired every 2h while password/passkey lasted 4 weeks. Pinned weeks=4 on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
| ProxyProvider.access_token_validity on Provider for Domain wide catch all | weeks=4 | authentik_provider_proxy.catchall.access_token_validity in authentik_provider.tf | Cookie Max-Age on authentik_proxy_* and expires on rows in authentik_providers_proxy_proxysession. Bumped 2026-05-10 from hours=168. Bumping requires kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost: the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs "reusing existing session store" and skips rebuild. |
| AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (server + worker) | hours=2 | server.env + worker.env in modules/authentik/values.yaml | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is no Brand.session_duration; UserLoginStage is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage: PostgreSQL table authentik_providers_proxy_proxysession in authentik 2025.10+ (PR #16628), but only when IsEmbedded() returns true (i.e. Outpost.managed == "goauthentik.io/outposts/embedded"). Our outpost record had managed=null until 2026-05-10, which silently kept it on the gorilla FilesystemStore at /dev/shm (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see authentik_outpost.embedded in authentik_provider.tf and post-mortem 2026-04-18-authentik-outpost-shm-full.md.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (internal/outpost/controllers/k8s/service.py:52): for embedded outposts the controller sets the Service selector to app.kubernetes.io/name=authentik (the server pods), not authentik-outpost-proxy. We work around it via a kubernetes_json_patches.service patch on the outpost record (replaces /spec/selector with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm Emergency Access.
- The standalone embedded-outpost deployment needs AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} env vars to reach the dbaas cluster, codified via kubernetes_json_patches.deployment envFrom the shared goauthentik Secret. The app.kubernetes.io/component=server pod label is also injected via JSON patch (matches the component:server half of the Service selector that the controller adds for embedded outposts).
- ProxyProvider.remember_me_offset stays UI-managed via ignore_changes.
- The Authentik provider's resource schema does not expose the Outpost.managed field. We rely on TF's "write only fields it knows about" semantic: the server-set goauthentik.io/outposts/embedded value is preserved across applies because Terraform never writes managed. Don't change the resource provider schema expectations without verifying this assumption holds.
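The fallback described in the social-login row of the table can be modelled in a few lines. This is a sketch of the behaviour as documented here, not authentik's own code; `parse_duration` and `effective_lifetime` are hypothetical helpers, and the `unit=value` strings follow the notation used in the table.

```python
from datetime import timedelta


def parse_duration(spec: str) -> timedelta:
    """Parse strings such as "weeks=4" or "hours=1;minutes=30" into a timedelta."""
    parts = dict(item.split("=", 1) for item in spec.split(";") if item)
    return timedelta(**{unit: float(value) for unit, value in parts.items()})


def effective_lifetime(session_duration: str, unauthenticated_age: str) -> timedelta:
    """A zero session_duration falls back to the anonymous-session age."""
    duration = parse_duration(session_duration)
    if duration > timedelta(0):
        return duration
    return parse_duration(unauthenticated_age)


print(effective_lifetime("weeks=4", "hours=2"))    # 28 days, 0:00:00
print(effective_lifetime("seconds=0", "hours=2"))  # 2:00:00
```

The second call reproduces the pre-2026-06-20 state of default-source-authentication-login: a provider default of seconds=0 combined with UNAUTHENTICATED_AGE=hours=2 yields a two-hour session.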
WebAuthn / Passkeys (2026-06-20)
- Passkey devices live in the DB, NOT Terraform (WebAuthnDevice model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
- 2026-06-18 wipe (root cause of the "WebAuthn broke" incident): all 6 of Viktor's passkeys were deleted (WebAuthnDevice.objects.count() → 0) at 19:27 by an ad-hoc tripit passkey E2E test run from the devvm (python-httpx/0.28.1, as akadmin). The test cleanup did GET /core/users/?search={demo} (a fuzzy search) then DELETE /api/v3/authenticators/admin/webauthn/{pk}/ for each device of users[0], but users[0] resolved to the real account, not the intended demo user. Lesson: any future passkey-test cleanup MUST exact-match the demo user (username == demo), never users[0] of a fuzzy ?search=. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes; re-enrollment is safe.
- Passkey login path itself is intact: the identification stage's passwordless_flow → webauthn flow (UI-managed, in ignore_changes); the break was purely the missing device records.
- Provider-schema gotcha: the pinned authentik TF provider's authentik_stage_identification resource exposes no webauthn_stage or enable_remember_me attribute (they exist on the app model, not in the provider schema). Do NOT add them to ignore_changes: tg plan errors with Unsupported attribute. They are purely UI/app-managed. (Commit 4e882989 removed them for exactly this reason; re-adding breaks every apply.)
- ALL tuned env vars are injected via server.env / worker.env (not the authentik.* values block) because we set authentik.existingSecret.secretName: goauthentik, which makes the chart skip rendering its own AUTHENTIK_* Secret. The authentik.* value block is therefore inert in this stack; anything new under authentik.* must use the *.env arrays instead. Live base values come from the orphaned, helm-keep-policy goauthentik Secret created by chart 2025.10.3 before existingSecret was introduced. 2026-06-10: the previously-inert tuning (AUTHENTIK_WEB__WORKERS=3, AUTHENTIK_WEB__THREADS=4, AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800, AUTHENTIK_CACHE__TIMEOUT_POLICIES=900, AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60, AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true, worker AUTHENTIK_WORKER__THREADS=4) was moved into the env arrays and is now actually live; before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
- Outpost (2026-06-10): log_level=info (was trace, which added per-request overhead on the forward-auth hot path) and kubernetes_replicas=2 (was 1, a single-pod hot path; safe since proxy sessions live in Postgres). Both in authentik_outpost.embedded config.
- Image tag is PINNED in values (global.image.tag), 2026-06-10: Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion, so an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md). Before touching this chart, check the live image tag and refresh the pin.
- Liveness budget (2026-06-10): server.livenessProbe = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts).
- PgBouncer (2026-06-10): idle_transaction_timeout=300 reaps ghost idle in transaction sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set AUTHENTIK_POSTGRESQL__CONN_MAX_AGE: session-mode PgBouncer pins persistent conns 1:1 (pool saturation).
- Static assets (2026-06-10): a second ingress_factory (module.ingress-static, path /static on the authentik host) attaches the authentik-static-cache-headers middleware → Cache-Control: public, max-age=31536000, immutable. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path).
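The exact-match lesson from the 2026-06-18 wipe can be sketched as a small guard that any test cleanup runs before deleting devices. It assumes the search response has been decoded into a list of user dicts carrying `username` and `pk`; the helper name, the sample usernames and the pk values are all hypothetical.

```python
def pick_demo_user(results, demo_username):
    """Return the one user whose username equals demo_username exactly.

    Raises LookupError instead of guessing when there is no exact match
    (or more than one), so a fuzzy search hit is never used for deletion.
    """
    exact = [user for user in results if user.get("username") == demo_username]
    if len(exact) != 1:
        raise LookupError(
            f"expected exactly one user named {demo_username!r}, found {len(exact)}"
        )
    return exact[0]


# Hypothetical fuzzy-search result: a real account sorts ahead of the demo user.
search_results = [
    {"pk": 7, "username": "real-user@example.com"},
    {"pk": 42, "username": "passkey-demo"},
]

demo = pick_demo_user(search_results, "passkey-demo")
print(demo["pk"])  # 42
```

Taking `search_results[0]` here would have selected pk 7, the real account. The guard selects by equality and fails loudly when the demo user is missing, rather than falling through to whatever the search happened to return first.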
Upgrade Validation Checklist
Run after any of these:
- Authentik chart version bump in stacks/authentik/modules/authentik/main.tf (the version = "..." line on helm_release.authentik).
- goauthentik/authentik Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual rollout restart, scheduler move).
The fragile surfaces are the kubernetes_json_patches and the Outpost.managed field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
# 1. Service routes to the outpost pods (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs
# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
# A file count above a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (AuthentikForwardAuthFallbackActive, AuthentikOutpostForwardAuth400Spike). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
If step 2 shows the controller restored app.kubernetes.io/name=authentik, watch the goauthentik/authentik issue tracker for fixes around internal/outpost/controllers/k8s/service.py:52; an upstream patch might let us drop our kubernetes_json_patches.service workaround.
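Step 2 can be turned into a pass/fail check with a few lines of Python. This is a sketch that assumes kubectl prints the selector map as JSON for the jsonpath expression used in step 2; the helper name and the two sample strings are illustrative.

```python
import json

NAME_LABEL = "app.kubernetes.io/name"
EXPECTED_NAME = "authentik-outpost-proxy"


def selector_ok(selector_json: str) -> bool:
    """True when the Service selector targets the outpost pods, not the server pods."""
    selector = json.loads(selector_json)
    return selector.get(NAME_LABEL) == EXPECTED_NAME


# Illustrative outputs of step 2: the patched selector and the regressed one.
good = '{"app.kubernetes.io/name": "authentik-outpost-proxy"}'
bad = '{"app.kubernetes.io/name": "authentik", "app.kubernetes.io/component": "server"}'

print(selector_ok(good), selector_ok(bad))  # True False
```

In practice the output of the step 2 command would be passed in place of the sample strings, and a False result would point at either the upstream controller bug or a missing JSON patch.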