The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.
Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
pattern) so applies stop stripping live Keel state
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache
timeouts, conn_max_age) were silently INERT because existingSecret
skips chart env rendering — pods ran defaults (2 workers, 300s
caches, no persistent DB conns). Moved all tuning into
server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
password_stage so username+password render on ONE screen (the
separate order-20 password binding is deleted via API — authentik
requires that when embedding). Outpost log_level trace->info and
1->2 replicas (it is on the hot path of every forward-auth request;
PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
Cache-Control (assets are version-fingerprinted but served with no
max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
opening a fresh TCP connection to the outpost per subrequest) +
config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
(it is a live CNPG primary-selector compatibility service).
Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.
Doc-only [ci skip].
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).
Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.
Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.
Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.
Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.
Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
Update `.claude/reference/authentik-state.md`:
- Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
Duration table with the gotcha that the gorilla session store binds
the value once at outpost startup (rollout restart needed).
- Replace the "session storage moved to Postgres in 2025.10" note that
falsely implied the migration was automatic — explain that the
`Outpost.managed` field gates the postgres path and our outpost
silently stayed on `FilesystemStore` until 2026-05-10.
- Document the goauthentik 2026.2.2 service-selector bug
(service.py:52) and the JSON-patch workaround.
- Document that the standalone embedded-outpost deployment needs
`AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
`app.kubernetes.io/component=server` pod label.
- Note the "Terraform doesn't expose `Outpost.managed`" assumption
that holds the `managed=embedded` value in place across applies.
Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
- P2 codify-in-Terraform: DONE.
- P3 access_token_validity reduce: DONE-alt (we did the opposite —
bumped to 4 weeks — because postgres backend mooted the storage
concern).
- P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
the loss-of-state class on the embedded outpost itself).
- Adopt UserLoginStage (default-authentication-login) into Terraform
and pin session_duration=weeks=4 so users stay logged in across
browser restarts. There is no Brand.session_duration in 2026.2.x;
UserLoginStage is the only correct lever.
- Cap anonymous Django sessions at 2h via
AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE on server + worker pods
(default is days=1). Bots, healthcheckers, and partial flows now
get reaped within 2h instead of accumulating for a day.
Implementation note: the env var is injected via server.env /
worker.env rather than authentik.sessions.unauthenticated_age,
because authentik.existingSecret.secretName is set, which makes the
chart skip rendering its own AUTHENTIK_* Secret. authentik.* values
are therefore inert in this stack -- this is documented in
.claude/reference/authentik-state.md so future edits use the right
surface.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
that was blocking all unauthenticated users from login flows
[ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)
Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill