fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout

Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly.

Root cause: on 2026-06-18 an ad-hoc tripit passkey E2E test (run from the
devvm as akadmin via python-httpx) cleaned up "the demo user's" passkeys
with GET /core/users/?search={demo} then DELETE each device of users[0].
The fuzzy search returned the REAL account, so it wiped all 6 real
passkeys. Losing passkeys forced fallback to Google login, and the
social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to
UNAUTHENTICATED_AGE=2h, hence the constant re-logins. (Password + passkey
logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform
  (import) and pin session_duration=weeks=4, so Google/GitHub/Facebook
  logins last as long as password/passkey. Immediate relief without
  re-enrolling.
- authentik: document the provider-schema gotcha:
  authentik_stage_identification exposes no webauthn_stage /
  enable_remember_me attribute, so they must NOT be in ignore_changes
  (commit 4e882989 removed them for this reason; re-adding breaks every
  apply). The passkey break was purely the missing device records, not
  drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out
  of login: carve authentik.viktorbarzin.me + public-auth out of the zone
  WAF block rule, make the LAPI->edge sync ban-only (stop downgrading
  captcha to a hard block), and set exclude_crowdsec on the Authentik UI
  ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth
  carve-out (previously undocumented), and the pre-existing broken
  crowdsec-cf-sync CronJob (CF cursor pagination 400 + ~31k IPs vs list
  capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the
DB); nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
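
The root cause above (acting on users[0] of a fuzzy search) can be guarded against by requiring an exact, unique username match before any destructive call. The following is a minimal sketch, not the test's actual code: `select_exact_user`, the sample usernames and the `pk` values are hypothetical, and it assumes each search result is a dict carrying a `username` field.

```python
def select_exact_user(results, username):
    """Return the single user whose username matches exactly, or raise.

    `results` is assumed to be a list of dicts with a "username" key, such
    as the result list of a user search. Fuzzy search hits that merely
    contain the search term are rejected.
    """
    matches = [u for u in results if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical search results: the fuzzy search ranks a real account first,
# so picking results[0] would target the wrong user.
search_results = [
    {"pk": 7, "username": "real-account-with-demo-in-name"},
    {"pk": 42, "username": "demo"},
]

target = select_exact_user(search_results, "demo")
print(target["pk"])  # only this user's devices would be eligible for cleanup
```

Raising on zero or multiple matches makes the cleanup fail closed: an ambiguous lookup aborts the run instead of deleting devices from whichever account happens to sort first.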
2026-06-20 23:40:22 +00:00
commit 46166c63b2
parent 600f1f933c
6 changed files with 119 additions and 20 deletions
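
As a reading aid for the diff that follows, the new edge rule `(ip.src in $crowdsec_ban) and not (http.host in {...})` can be modelled as a small predicate. This is only a sketch of the rule's logic under the assumption that host matching is an exact string comparison; `edge_blocks`, the example IPs and `app.viktorbarzin.me` are hypothetical, and Cloudflare evaluates the real expression, not this code.

```python
# Auth hosts carved out of the block rule (taken from the rule expression).
AUTH_HOSTS = frozenset({
    "authentik.viktorbarzin.me",
    "public-auth.viktorbarzin.me",
})


def edge_blocks(src_ip, host, banned_ips):
    """Model of the block rule: banned source IP AND host is not an auth host."""
    return src_ip in banned_ips and host not in AUTH_HOSTS


# Hypothetical ban list using documentation-range addresses.
banned = {"203.0.113.7"}

# A banned IP is blocked on an ordinary proxied host...
print(edge_blocks("203.0.113.7", "app.viktorbarzin.me", banned))
# ...but can still reach the login / WebAuthn flow.
print(edge_blocks("203.0.113.7", "authentik.viktorbarzin.me", banned))
# An IP that is not banned is never blocked by this rule.
print(edge_blocks("198.51.100.1", "app.viktorbarzin.me", banned))
```

The carve-out only exempts the auth hosts from the edge block; per the commit message, those hosts keep their own rate-limiting.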
--- a/stacks/rybbit/crowdsec_edge.tf
+++ b/stacks/rybbit/crowdsec_edge.tf
@@ -109,16 +109,25 @@ resource "cloudflare_ruleset" "crowdsec" {
  # must exist before this ruleset is created/updated.
  depends_on = [cloudflare_list.crowdsec_ban]

-  # CrowdSec ban — block every IP in the single edge list. The sync writes BOTH
-  # ban and captcha decisions into crowdsec_ban (captcha downgraded to block at
-  # the edge) because the CF account allows only ONE Rules List.
+  # CrowdSec ban — block every IP in the single edge list, EXCEPT on the
+  # Authentik auth hosts. The sync (lapi_kv_sync.py) now writes ONLY "ban"
+  # decisions into crowdsec_ban — captcha is no longer downgraded to a hard
+  # block. The auth-host carve-out guarantees a CrowdSec hit can never wall a
+  # user out of the login / WebAuthn flow they authenticate through: without it,
+  # a false-positive ban would 403 the passkey ceremony + session-refresh XHRs
+  # on every proxied host, auth included. 2026-06-20.
  rules {
    action      = "block"
-    expression  = "(ip.src in $crowdsec_ban)"
-    description = "CrowdSec: block banned IPs"
+    expression  = "(ip.src in $crowdsec_ban) and not (http.host in {\"authentik.viktorbarzin.me\" \"public-auth.viktorbarzin.me\"})"
+    description = "CrowdSec: block banned IPs (auth hosts carved out)"
    enabled     = true
  }
  # Pre-existing rule, imported and preserved verbatim (currently disabled).
+  # NOTE: Cloudflare auto-attaches logging{enabled=true} to skip rules. It must
+  # be declared here to match live, otherwise editing the OTHER rule re-sends
+  # this one too and the v4 provider errors "Provider produced inconsistent
+  # result after apply: .rules[1].logging block count changed from 0 to 1"
+  # (hit 2026-06-20 when adding the auth-host carve-out above).
  rules {
    action      = "skip"
    expression  = "(http.host contains \"viktorbarzin.me\")"
@@ -129,6 +138,9 @@ resource "cloudflare_ruleset" "crowdsec" {
      products = ["uaBlock", "bic", "hot", "securityLevel", "rateLimit", "waf", "zoneLockdown"]
      ruleset  = "current"
    }
+    logging {
+      enabled = true
+    }
  }
 }

--- a/stacks/rybbit/lapi_kv_sync.py
+++ b/stacks/rybbit/lapi_kv_sync.py
@@ -4,12 +4,16 @@
 Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec
 bouncer (which keys on the client IP Traefik sees) never decides on them. We
 push the decisions into the edge instead: a zone-scoped WAF custom rule blocks
-`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is
-the control plane that keeps that one IP List in sync with LAPI.
+`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone (the Authentik
+auth hosts are carved out in crowdsec_edge.tf so a ban can't break login). This
+job is the control plane that keeps that one IP List in sync with LAPI.

-The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY:
-BOTH ban AND captcha (scope=="ip") decisions are folded into the single
-crowdsec_ban list and captcha is downgraded to block at the proxied edge.
+Enforcement is BAN-ONLY: only scope=="ip" decisions of type "ban" are synced.
+"captcha" decisions are deliberately NOT pushed — the CF account allows only ONE
+Rules List with a single block action, so folding captcha in would hard-block a
+soft challenge across every proxied host. Captcha remediation stays at the
+in-cluster Traefik bouncer (Turnstile) for non-proxied apps. (Changed 2026-06-20
+from the prior ban+captcha fold that downgraded captcha to a hard edge block.)

 (Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior
 Workers-KV design; it no longer touches KV — it reconciles a CF Rules List.)
@@ -117,13 +121,17 @@ def _cf(url, *, method="GET", payload=None, timeout=20):
 # LAPI
 # --------------------------------------------------------------------------- #
 def fetch_decisions():
-    """Return the single desired set of IPs to BLOCK at the edge.
+    """Return the desired set of IPs to BLOCK at the edge.

-    Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The
-    CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions
-    are folded into one block set (captcha is downgraded to block at the proxied
-    edge). Raises on transport/HTTP error so the caller can SKIP the run
-    (fail-safe).
+    Only scope=="ip" decisions of type "ban" are projected (the WAF rule keys on
+    ip.src). "captcha" decisions are deliberately NOT pushed to the edge: the CF
+    account allows only ONE Rules List with a single block action, so folding
+    captcha in would HARD-BLOCK a soft challenge across every proxied host (and,
+    before the auth-host carve-out in crowdsec_edge.tf, could lock a user out of
+    Authentik itself). Edge enforcement is therefore ban-only; captcha
+    remediation stays at the in-cluster Traefik bouncer (Turnstile) for
+    non-proxied apps. Raises on transport/HTTP error so the caller can SKIP the
+    run (fail-safe). 2026-06-20.
    """
    data = _req(
        f"{LAPI_URL}/v1/decisions",
@@ -137,9 +145,10 @@ def fetch_decisions():
        if not ip:
            continue
        dtype = (d.get("type") or "").lower()
-        if dtype in ("ban", "captcha"):
+        if dtype == "ban":
            block.add(ip)
-        # other remediation types (e.g. throttle) are ignored
+        # captcha / throttle / other remediation types are ignored at the edge
+        # (ban-only enforcement — see the docstring above)
    return block


@@ -298,7 +307,7 @@ def main():
        push_metrics(0, ok=False)
        return 0

-    print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)")
+    print(f"[info] LAPI desired: {len(block)} block (ban-only, ip-scope)")

    # 2. Reconcile the single block list. CF errors fail loud (non-zero exit).
    try: