Commit graph

15 commits

Author SHA1 Message Date
Viktor Barzin
4e88298976 authentik: incident hardening after the signin-speedup rollout storm
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.

Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
  so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
  pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
  pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
  auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
  pattern) so applies stop stripping live Keel state

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:26:52 +00:00
Viktor Barzin
97ccdbecb8 authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path)
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:

- values.yaml: the authentik.* Helm values (gunicorn workers, cache
  timeouts, conn_max_age) were silently INERT because existingSecret
  skips chart env rendering — pods ran defaults (2 workers, 300s
  caches, no persistent DB conns). Moved all tuning into
  server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
  password_stage so username+password render on ONE screen (the
  separate order-20 password binding is deleted via API — authentik
  requires that when embedding). Outpost log_level trace->info and
  1->2 replicas (it is on the hot path of every forward-auth request;
  PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
  Cache-Control (assets are version-fingerprinted but served with no
  max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
  opening a fresh TCP connection to the outpost per subrequest) +
  config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
  'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
  (it is a live CNPG primary-selector compatibility service).

Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:58:10 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
3d6c5b8bc7 matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel)
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.

Doc-only [ci skip].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:32:49 +00:00
Viktor Barzin
23602f393e matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated)
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).

Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.

Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 11:58:17 +00:00
Viktor Barzin
8e44ccaa65 docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked)
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
ad3432d685 docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver)
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
6c4e096688 authentik: zero-endpoints alert + upgrade-validation checklist
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.

Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
2026-05-10 16:54:48 +00:00
Viktor Barzin
117b99e28f docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items
Update `.claude/reference/authentik-state.md`:
  - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
    Duration table with the gotcha that the gorilla session store binds
    the value once at outpost startup (rollout restart needed).
  - Replace the "session storage moved to Postgres in 2025.10" note that
    falsely implied the migration was automatic — explain that the
    `Outpost.managed` field gates the postgres path and our outpost
    silently stayed on `FilesystemStore` until 2026-05-10.
  - Document the goauthentik 2026.2.2 service-selector bug
    (service.py:52) and the JSON-patch workaround.
  - Document that the standalone embedded-outpost deployment needs
    `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
    `app.kubernetes.io/component=server` pod label.
  - Note the "Terraform doesn't expose `Outpost.managed`" assumption
    that holds the `managed=embedded` value in place across applies.

Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
  - P2 codify-in-Terraform: DONE.
  - P3 access_token_validity reduce: DONE-alt (we did the opposite —
    bumped to 4 weeks — because postgres backend mooted the storage
    concern).
  - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
    the loss-of-state class on the embedded outpost itself).
2026-05-10 16:28:11 +00:00
Viktor Barzin
40a6cd067b authentik: long-lived authenticated sessions, short-lived anonymous ones
- Adopt UserLoginStage (default-authentication-login) into Terraform
  and pin session_duration=weeks=4 so users stay logged in across
  browser restarts. There is no Brand.session_duration in 2026.2.x;
  UserLoginStage is the only correct lever.
- Cap anonymous Django sessions at 2h via
  AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE on server + worker pods
  (default is days=1). Bots, healthcheckers, and partial flows now
  get reaped within 2h instead of accumulating for a day.

Implementation note: the env var is injected via server.env /
worker.env rather than authentik.sessions.unauthenticated_age,
because authentik.existingSecret.secretName is set, which makes the
chart skip rendering its own AUTHENTIK_* Secret. authentik.* values
are therefore inert in this stack -- this is documented in
.claude/reference/authentik-state.md so future edits use the right
surface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 19:03:50 +00:00
Viktor Barzin
fea8519f51 update VPN architecture docs and Authentik state reference
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
  interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
  names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
  that was blocking all unauthenticated users from login flows

[ci skip]
2026-04-06 16:26:21 +03:00
Viktor Barzin
d05ff57b11 authentik: auto-assign invitation group via expression policy [ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
2026-03-13 22:21:10 +00:00
Viktor Barzin
160fda882f authentik: cleanup unused resources + add invitation enrollment flow [ci skip]
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)

Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
  identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill
2026-03-13 22:21:10 +00:00
Viktor Barzin
abe89c926e [ci skip] Refactor knowledge: CLAUDE.md 881→190 lines, extract reference data
CLAUDE.md changes:
- Extract service catalog + Cloudflare domains → .claude/reference/service-catalog.md
- Extract Proxmox VMs, hardware, network → .claude/reference/proxmox-inventory.md
- Extract GitHub/Drone API patterns → .claude/reference/github-drone-api.md
- Extract Authentik state snapshot → .claude/reference/authentik-state.md
- Remove Init Container pattern (duplicates setup-project skill)
- Remove Poison Fountain service notes (duplicates Anti-AI section)
- Consolidate Authentik section (link to skills + reference)
- Remove resource limit tables (kept tier definitions inline)

Skill merges (37→32):
- helm-release-force-rerender + helm-stuck-release-recovery → helm-release-troubleshooting
- containerd-multi-registry-pull-through-cache + k8s-docker-registry-cache-bypass → k8s-container-image-caching
- (traefik merges in previous commits)
2026-02-22 22:11:31 +00:00