fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout

Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
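
A minimal sketch of a guard that would have avoided the wrong-account deletion, assuming the search endpoint returns a list of user dicts with a `username` key. The helper name `pick_exact_user` and both usernames are hypothetical and are not taken from the test script. The idea is to select the target by exact match and refuse to proceed unless exactly one user matches, instead of trusting `users[0]` from a fuzzy search.

```python
def pick_exact_user(users, username):
    """Return the single user whose username equals `username` exactly.

    Raises LookupError when zero or several users match, so a destructive
    cleanup step never runs against a guessed account.
    """
    matches = [u for u in users if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search result in which the real account sorts first.
results = [
    {"pk": 1, "username": "viktor"},
    {"pk": 7, "username": "demo-passkey-user"},
]
target = pick_exact_user(results, "demo-passkey-user")
```

With this guard, a search term that only partially matches raises before any DELETE is issued.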

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
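
The session-duration fallback described above can be illustrated with a small sketch. It is not authentik's code: the function names are hypothetical, and the `key=value;key=value` parsing is an assumption about the duration string format. Only the `seconds=0`, `weeks=4` and 2h fallback values come from this commit message.

```python
from datetime import timedelta

# Fallback named in the commit message (UNAUTHENTICATED_AGE=2h).
UNAUTHENTICATED_AGE = timedelta(hours=2)


def parse_duration(spec):
    """Parse a string such as 'weeks=4' or 'hours=1;minutes=30' into a timedelta."""
    kwargs = {}
    for part in spec.split(";"):
        key, _, value = part.partition("=")
        kwargs[key.strip()] = float(value)
    return timedelta(**kwargs)


def effective_session_age(spec):
    """Return the configured duration, or the fallback when it is zero."""
    duration = parse_duration(spec)
    return duration if duration > timedelta(0) else UNAUTHENTICATED_AGE
```

Under these assumptions `effective_session_age("seconds=0")` yields two hours, while `effective_session_age("weeks=4")` yields 28 days, which is the gap the pinned value closes.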

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-20 23:40:22 +00:00
parent 600f1f933c
commit 46166c63b2
6 changed files with 119 additions and 20 deletions


@ -109,16 +109,25 @@ resource "cloudflare_ruleset" "crowdsec" {
# must exist before this ruleset is created/updated.
depends_on = [cloudflare_list.crowdsec_ban]
# CrowdSec ban -> block every IP in the single edge list. The sync writes BOTH
# ban and captcha decisions into crowdsec_ban (captcha downgraded to block at
# the edge) because the CF account allows only ONE Rules List.
# CrowdSec ban -> block every IP in the single edge list, EXCEPT on the
# Authentik auth hosts. The sync (lapi_kv_sync.py) now writes ONLY "ban"
# decisions into crowdsec_ban; captcha is no longer downgraded to a hard
# block. The auth-host carve-out guarantees a CrowdSec hit can never wall a
# user out of the login / WebAuthn flow they authenticate through: without it,
# a false-positive ban would 403 the passkey ceremony + session-refresh XHRs
# on every proxied host, auth included. 2026-06-20.
rules {
action = "block"
expression = "(ip.src in $crowdsec_ban)"
description = "CrowdSec: block banned IPs"
expression = "(ip.src in $crowdsec_ban) and not (http.host in {\"authentik.viktorbarzin.me\" \"public-auth.viktorbarzin.me\"})"
description = "CrowdSec: block banned IPs (auth hosts carved out)"
enabled = true
}
# Pre-existing rule, imported and preserved verbatim (currently disabled).
# NOTE: Cloudflare auto-attaches logging{enabled=true} to skip rules. It must
# be declared here to match live, otherwise editing the OTHER rule re-sends
# this one too and the v4 provider errors "Provider produced inconsistent
# result after apply: .rules[1].logging block count changed from 0 to 1"
# (hit 2026-06-20 when adding the auth-host carve-out above).
rules {
action = "skip"
expression = "(http.host contains \"viktorbarzin.me\")"
@ -129,6 +138,9 @@ resource "cloudflare_ruleset" "crowdsec" {
products = ["uaBlock", "bic", "hot", "securityLevel", "rateLimit", "waf", "zoneLockdown"]
ruleset = "current"
}
logging {
enabled = true
}
}
}
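
As a rough model of the new block expression, the sketch below restates its logic in Python. It only illustrates the boolean structure and is not how Cloudflare evaluates rules. The function name `edge_blocks`, the sample IP and the host `app.viktorbarzin.me` are hypothetical; the two auth hosts are taken from the expression.

```python
# Hosts carved out of the block rule, as listed in the rule expression.
AUTH_HOSTS = {"authentik.viktorbarzin.me", "public-auth.viktorbarzin.me"}


def edge_blocks(ip, host, banned_ips):
    """Mirror: (ip.src in $crowdsec_ban) and not (http.host in {auth hosts})."""
    return ip in banned_ips and host not in AUTH_HOSTS


banned = {"203.0.113.7"}  # documentation-range address, illustrative only
blocked_on_app = edge_blocks("203.0.113.7", "app.viktorbarzin.me", banned)
blocked_on_auth = edge_blocks("203.0.113.7", "authentik.viktorbarzin.me", banned)
```

A banned IP is still blocked on ordinary proxied hosts, but the same IP reaches the auth hosts, where the commit message says rate-limiting remains in place.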


@ -4,12 +4,16 @@
Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec
bouncer (which keys on the client IP Traefik sees) never decides on them. We
push the decisions into the edge instead: a zone-scoped WAF custom rule blocks
`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is
the control plane that keeps that one IP List in sync with LAPI.
`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone (the Authentik
auth hosts are carved out in crowdsec_edge.tf so a ban can't break login). This
job is the control plane that keeps that one IP List in sync with LAPI.
The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY:
BOTH ban AND captcha (scope=="ip") decisions are folded into the single
crowdsec_ban list and captcha is downgraded to block at the proxied edge.
Enforcement is BAN-ONLY: only scope=="ip" decisions of type "ban" are synced.
"captcha" decisions are deliberately NOT pushed: the CF account allows only ONE
Rules List with a single block action, so folding captcha in would hard-block a
soft challenge across every proxied host. Captcha remediation stays at the
in-cluster Traefik bouncer (Turnstile) for non-proxied apps. (Changed 2026-06-20
from the prior ban+captcha fold that downgraded captcha to a hard edge block.)
(Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior
Workers-KV design; it no longer touches KV, it reconciles a CF Rules List.)
@ -117,13 +121,17 @@ def _cf(url, *, method="GET", payload=None, timeout=20):
# LAPI
# --------------------------------------------------------------------------- #
def fetch_decisions():
"""Return the single desired set of IPs to BLOCK at the edge.
"""Return the desired set of IPs to BLOCK at the edge.
Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The
CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions
are folded into one block set (captcha is downgraded to block at the proxied
edge). Raises on transport/HTTP error so the caller can SKIP the run
(fail-safe).
Only scope=="ip" decisions of type "ban" are projected (the WAF rule keys on
ip.src). "captcha" decisions are deliberately NOT pushed to the edge: the CF
account allows only ONE Rules List with a single block action, so folding
captcha in would HARD-BLOCK a soft challenge across every proxied host (and,
before the auth-host carve-out in crowdsec_edge.tf, could lock a user out of
Authentik itself). Edge enforcement is therefore ban-only; captcha
remediation stays at the in-cluster Traefik bouncer (Turnstile) for
non-proxied apps. Raises on transport/HTTP error so the caller can SKIP the
run (fail-safe). 2026-06-20.
"""
data = _req(
f"{LAPI_URL}/v1/decisions",
@ -137,9 +145,10 @@ def fetch_decisions():
if not ip:
continue
dtype = (d.get("type") or "").lower()
if dtype in ("ban", "captcha"):
if dtype == "ban":
block.add(ip)
# other remediation types (e.g. throttle) are ignored
# captcha / throttle / other remediation types are ignored at the edge
# (ban-only enforcement — see the docstring above)
return block
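
The ban-only projection can be exercised in isolation with a sketch like the one below. It is a standalone restatement, not the function from this file: `project_bans` is a hypothetical name, and the `scope` and `value` keys are assumptions about the decision payload (only `type` is visible in the hunk above).

```python
def project_bans(decisions):
    """Return the set of IPs from ip-scoped decisions whose type is 'ban'."""
    block = set()
    for d in decisions or []:  # tolerate an empty or null decision list
        if (d.get("scope") or "").lower() != "ip":
            continue
        ip = (d.get("value") or "").strip()
        if not ip:
            continue
        if (d.get("type") or "").lower() == "ban":
            block.add(ip)
        # captcha / throttle / other types are ignored at the edge
    return block


sample = [
    {"scope": "Ip", "value": "203.0.113.7", "type": "ban"},
    {"scope": "Ip", "value": "203.0.113.8", "type": "captcha"},
    {"scope": "Range", "value": "198.51.100.0/24", "type": "ban"},
]
edge_set = project_bans(sample)
```

Keeping the filter as a pure function makes it easy to check that a captcha decision never reaches the edge list.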
@ -298,7 +307,7 @@ def main():
push_metrics(0, ok=False)
return 0
print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)")
print(f"[info] LAPI desired: {len(block)} block (ban-only, ip-scope)")
# 2. Reconcile the single block list. CF errors fail loud (non-zero exit).
try: