From 46166c63b2d217544edddc4986b79629e6b27b2a Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 20 Jun 2026 23:40:22 +0000 Subject: [PATCH] fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Viktor's passkeys all vanished and he was suddenly being asked to log in multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE each device of users[0] — but the fuzzy search returned the REAL account, so it wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and the social-login stage (default-source-authentication-login) had the provider default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h — hence the constant re-logins. (Password + passkey logins were already weeks=4.) Changes: - authentik: adopt default-source-authentication-login into Terraform (import) and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long as password/passkey. Immediate relief without re-enrolling. - authentik: document the provider-schema gotcha — authentik_stage_identification exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks every apply). The passkey break was purely the missing device records, not drift. - edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login — carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule, make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block), and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting). 
- docs: record the session-duration change, the edge enforcement + auth carve-out (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert). Passkey re-enrollment is a manual user action (devices are gone from the DB); nothing auto-re-deletes them. Co-Authored-By: Claude Opus 4.8 --- .claude/reference/authentik-state.md | 10 +++++- docs/architecture/security.md | 26 +++++++++++++++ stacks/authentik/authentik_provider.tf | 37 ++++++++++++++++++++++ stacks/authentik/modules/authentik/main.tf | 7 ++++ stacks/rybbit/crowdsec_edge.tf | 22 ++++++++++--- stacks/rybbit/lapi_kv_sync.py | 37 ++++++++++++++-------- 6 files changed, 119 insertions(+), 20 deletions(-) diff --git a/.claude/reference/authentik-state.md b/.claude/reference/authentik-state.md index 125d1a71..2ff86141 100644 --- a/.claude/reference/authentik-state.md +++ b/.claude/reference/authentik-state.md @@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`: | Knob | Value | Surface | Effect | |------|-------|---------|--------| -| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. | +| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). 
| +| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) | | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. | | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. | @@ -177,6 +178,13 @@ Notes: - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts). 
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`. - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds. + +## WebAuthn / Passkeys (2026-06-20) + +- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey). +- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe. +- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records. +- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). 
Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.) - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns). - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config. - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin. 
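The passkey-wipe lesson above (exact-match the demo user, never `users[0]` of a fuzzy `?search=`) can be sketched as a small guard that a test cleanup would call before deleting anything. This is a minimal sketch, not the original ad-hoc script (no copy of it was committed): the helper name `pick_exact_user` and the sample data are hypothetical, and it assumes each search result is a dict carrying `pk` and `username` keys.

```python
def pick_exact_user(results, username):
    """Return the single user whose username equals `username` exactly.

    `results` is the list of user dicts returned by a fuzzy ?search= query
    (assumed shape: each dict has "pk" and "username"). Raises LookupError
    unless exactly one entry matches, so the cleanup aborts instead of
    silently falling back to results[0].
    """
    matches = [u for u in results if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, got {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search result: the real account sorts first because its
# username merely contains the search term.
search_results = [
    {"pk": 7, "username": "real-account-demo-owner"},
    {"pk": 12, "username": "demo"},
]

target = pick_exact_user(search_results, "demo")
print(target["pk"])  # the demo user's pk, not search_results[0]["pk"]
```

Raising instead of returning `None` is deliberate: a cleanup that cannot identify its target unambiguously should stop before any `DELETE` is issued.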
diff --git a/docs/architecture/security.md b/docs/architecture/security.md index e12347d2..f8afd5e4 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -107,6 +107,32 @@ CrowdSec operates in a hub-and-agent model: configured, so `captcha` decisions silently degraded to a 403 ban** — users had no way to self-unblock; wiring Turnstile fixed that. +**Cloudflare Edge Enforcement for proxied hosts** (`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`): +- Proxied (orange-cloud) hosts terminate at the Cloudflare edge, so the in-cluster + bouncer above never decides on them. Edge enforcement instead syncs LAPI + decisions into **one Cloudflare account IP List (`crowdsec_ban`)** + a single + **zone-scoped WAF custom rule** blocking `(ip.src in $crowdsec_ban)` across every + proxied host. CronJob `crowdsec-cf-sync` (rybbit ns, every 2 min) reconciles it. +- **BAN-ONLY (2026-06-20):** only `type=ban` decisions sync to the edge. `captcha` + decisions are deliberately NOT pushed — the CF account allows only ONE Rules List + with a single block action, so folding captcha in would hard-block a soft + challenge on every proxied host. (Before 2026-06-20 captcha was downgraded to a + hard block at the edge.) +- **Auth carve-out (2026-06-20):** the WAF rule excludes `authentik.viktorbarzin.me` + + `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`), and the + Authentik UI ingress sets `exclude_crowdsec = true` for the in-cluster bouncer. A + CrowdSec hit must never wall a user out of the login / WebAuthn flow they + authenticate through; auth keeps `traefik-rate-limit` for brute-force protection. +- **⚠️ Currently NON-FUNCTIONAL (known issue, pre-existing since the 2026-06-20 + rollout):** `crowdsec-cf-sync` fails every run — `cf_list_items()` pagination + gets CF `HTTP 400 code 10027 "invalid or expired cursor"`, so the list never + populates (`num_items=0`) and the edge rule blocks nothing. 
LAPI also returns + ~31k ban IPs, likely exceeding CF IP-List capacity even once pagination is fixed. + **Edge enforcement for proxied hosts is therefore inert pending a fix** (the + in-cluster bouncer still protects direct apps; the auth carve-out is correct + regardless). Fix needs: (1) correct CF cursor pagination, (2) a capacity strategy + for the ban set. + **Metabase** (disabled by default): - Dashboard for CrowdSec analytics - CPU-intensive, only enable when investigating incidents diff --git a/stacks/authentik/authentik_provider.tf b/stacks/authentik/authentik_provider.tf index 573adcca..12b934a8 100644 --- a/stacks/authentik/authentik_provider.tf +++ b/stacks/authentik/authentik_provider.tf @@ -206,6 +206,36 @@ resource "authentik_stage_user_login" "default_login" { } } +# ----------------------------------------------------------------------------- +# Source (social-login) User Login stage — bound to default-source-authentication-flow. +# Adopted into Terraform 2026-06-20: its session_duration was the provider default +# "seconds=0", which falls back to AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (hours=2). +# So Google/GitHub/Facebook logins expired every 2h while password and passkey +# logins (default-authentication-login) lasted weeks=4. After the 2026-06-18 passkey +# wipe forced fallback to Google login, this 2h cap became the "re-login multiple +# times daily" symptom. Pinning weeks=4 makes every login path consistent. +# See docs/architecture/authentication.md. +# ----------------------------------------------------------------------------- +import { + to = authentik_stage_user_login.default_source_login + id = "4c6977d2-eaae-4033-b1db-21b48c6b47f0" +} + +resource "authentik_stage_user_login" "default_source_login" { + name = "default-source-authentication-login" + session_duration = "weeks=4" + lifecycle { + # Pin only session_duration; everything else stays UI-managed (same pattern + # as authentik_stage_user_login.default_login above). 
+ ignore_changes = [ + remember_me_offset, + terminate_other_sessions, + geoip_binding, + network_binding, + ] + } +} + # ----------------------------------------------------------------------------- # Default Identification stage — adopted 2026-06-10 to embed the password # field on the identification screen (single-screen login: one round trip and @@ -227,6 +257,13 @@ resource "authentik_stage_identification" "default_identification" { lifecycle { # Pin only password_stage; everything else stays UI-managed (same pattern # as authentik_stage_user_login.default_login above). + # NOTE: do NOT add webauthn_stage / enable_remember_me here — the pinned + # authentik TF provider's authentik_stage_identification resource exposes no + # such attributes (verified 2026-06-20: `tg plan` => "Unsupported attribute"). + # They exist on Authentik's IdentificationStage *model* but not in the + # provider schema, so Terraform never manages or nulls them; they are purely + # UI/app-managed and need no ignore_changes entry. Commit 4e882989 removed + # them for exactly this reason — re-adding them breaks every apply. ignore_changes = [ user_fields, case_insensitive_matching, diff --git a/stacks/authentik/modules/authentik/main.tf b/stacks/authentik/modules/authentik/main.tf index 262c5eb7..38584114 100644 --- a/stacks/authentik/modules/authentik/main.tf +++ b/stacks/authentik/modules/authentik/main.tf @@ -82,6 +82,13 @@ module "ingress" { service_name = "goauthentik-server" tls_secret_name = var.tls_secret_name anti_ai_scraping = false + # Never let the in-cluster CrowdSec bouncer serve a Turnstile/captcha + # interstitial or 403 on Authentik's own login + WebAuthn XHR endpoints — that + # walls users out of the very gate they authenticate through (a CrowdSec hit + # would break the passkey ceremony / session refresh mid-flow). Auth keeps + # Traefik rate-limiting; the Cloudflare edge WAF also carves out this host + # (stacks/rybbit/crowdsec_edge.tf). 2026-06-20. 
+ exclude_crowdsec = true extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Authentik" diff --git a/stacks/rybbit/crowdsec_edge.tf b/stacks/rybbit/crowdsec_edge.tf index cf1607ea..692c3711 100644 --- a/stacks/rybbit/crowdsec_edge.tf +++ b/stacks/rybbit/crowdsec_edge.tf @@ -109,16 +109,25 @@ resource "cloudflare_ruleset" "crowdsec" { # must exist before this ruleset is created/updated. depends_on = [cloudflare_list.crowdsec_ban] - # CrowdSec ban — block every IP in the single edge list. The sync writes BOTH - # ban and captcha decisions into crowdsec_ban (captcha downgraded to block at - # the edge) because the CF account allows only ONE Rules List. + # CrowdSec ban — block every IP in the single edge list, EXCEPT on the + # Authentik auth hosts. The sync (lapi_kv_sync.py) now writes ONLY "ban" + # decisions into crowdsec_ban — captcha is no longer downgraded to a hard + # block. The auth-host carve-out guarantees a CrowdSec hit can never wall a + # user out of the login / WebAuthn flow they authenticate through: without it, + # a false-positive ban would 403 the passkey ceremony + session-refresh XHRs + # on every proxied host, auth included. 2026-06-20. rules { action = "block" - expression = "(ip.src in $crowdsec_ban)" - description = "CrowdSec: block banned IPs" + expression = "(ip.src in $crowdsec_ban) and not (http.host in {\"authentik.viktorbarzin.me\" \"public-auth.viktorbarzin.me\"})" + description = "CrowdSec: block banned IPs (auth hosts carved out)" enabled = true } # Pre-existing rule, imported and preserved verbatim (currently disabled). + # NOTE: Cloudflare auto-attaches logging{enabled=true} to skip rules. It must + # be declared here to match live, otherwise editing the OTHER rule re-sends + # this one too and the v4 provider errors "Provider produced inconsistent + # result after apply: .rules[1].logging block count changed from 0 to 1" + # (hit 2026-06-20 when adding the auth-host carve-out above). 
rules { action = "skip" expression = "(http.host contains \"viktorbarzin.me\")" @@ -129,6 +138,9 @@ resource "cloudflare_ruleset" "crowdsec" { products = ["uaBlock", "bic", "hot", "securityLevel", "rateLimit", "waf", "zoneLockdown"] ruleset = "current" } + logging { + enabled = true + } } } diff --git a/stacks/rybbit/lapi_kv_sync.py b/stacks/rybbit/lapi_kv_sync.py index 8a0231d5..ad2547fc 100644 --- a/stacks/rybbit/lapi_kv_sync.py +++ b/stacks/rybbit/lapi_kv_sync.py @@ -4,12 +4,16 @@ Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec bouncer (which keys on the client IP Traefik sees) never decides on them. We push the decisions into the edge instead: a zone-scoped WAF custom rule blocks -`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is -the control plane that keeps that one IP List in sync with LAPI. +`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone (the Authentik +auth hosts are carved out in crowdsec_edge.tf so a ban can't break login). This +job is the control plane that keeps that one IP List in sync with LAPI. -The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY: -BOTH ban AND captcha (scope=="ip") decisions are folded into the single -crowdsec_ban list and captcha is downgraded to block at the proxied edge. +Enforcement is BAN-ONLY: only scope=="ip" decisions of type "ban" are synced. +"captcha" decisions are deliberately NOT pushed — the CF account allows only ONE +Rules List with a single block action, so folding captcha in would hard-block a +soft challenge across every proxied host. Captcha remediation stays at the +in-cluster Traefik bouncer (Turnstile) for non-proxied apps. (Changed 2026-06-20 +from the prior ban+captcha fold that downgraded captcha to a hard edge block.) (Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior Workers-KV design; it no longer touches KV — it reconciles a CF Rules List.) 
@@ -117,13 +121,17 @@ def _cf(url, *, method="GET", payload=None, timeout=20): # LAPI # --------------------------------------------------------------------------- # def fetch_decisions(): - """Return the single desired set of IPs to BLOCK at the edge. + """Return the desired set of IPs to BLOCK at the edge. - Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The - CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions - are folded into one block set (captcha is downgraded to block at the proxied - edge). Raises on transport/HTTP error so the caller can SKIP the run - (fail-safe). + Only scope=="ip" decisions of type "ban" are projected (the WAF rule keys on + ip.src). "captcha" decisions are deliberately NOT pushed to the edge: the CF + account allows only ONE Rules List with a single block action, so folding + captcha in would HARD-BLOCK a soft challenge across every proxied host (and, + before the auth-host carve-out in crowdsec_edge.tf, could lock a user out of + Authentik itself). Edge enforcement is therefore ban-only; captcha + remediation stays at the in-cluster Traefik bouncer (Turnstile) for + non-proxied apps. Raises on transport/HTTP error so the caller can SKIP the + run (fail-safe). 2026-06-20. """ data = _req( f"{LAPI_URL}/v1/decisions", @@ -137,9 +145,10 @@ def fetch_decisions(): if not ip: continue dtype = (d.get("type") or "").lower() - if dtype in ("ban", "captcha"): + if dtype == "ban": block.add(ip) - # other remediation types (e.g. throttle) are ignored + # captcha / throttle / other remediation types are ignored at the edge + # (ban-only enforcement — see the docstring above) return block @@ -298,7 +307,7 @@ def main(): push_metrics(0, ok=False) return 0 - print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)") + print(f"[info] LAPI desired: {len(block)} block (ban-only, ip-scope)") # 2. Reconcile the single block list. CF errors fail loud (non-zero exit). try: