fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
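The cleanup bug described above (taking `users[0]` from a fuzzy `search` result) can be guarded against by requiring exactly one exact username match before deleting anything. A minimal Python sketch, assuming the search response has already been parsed into a list of dicts with a `username` key; the helper name `pick_exact_user` and the sample usernames are hypothetical and not taken from the actual test script:

```python
def pick_exact_user(results, expected_username):
    """Return the one user whose username equals expected_username exactly.

    `results` is assumed to be the parsed list of user dicts from a fuzzy
    search; the field name "username" is an assumption for illustration.
    """
    matches = [u for u in results if u.get("username") == expected_username]
    if len(matches) != 1:
        # Refuse to guess: zero or several exact matches means "touch nothing".
        raise LookupError(
            f"expected exactly one user named {expected_username!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search result in which a non-demo account sorts first.
fuzzy_results = [
    {"pk": 1, "username": "demo-owner"},
    {"pk": 2, "username": "demo"},
]
target = pick_exact_user(fuzzy_results, "demo")
```

With this guard, the first element of a fuzzy result is never trusted; a missing or ambiguous match aborts the cleanup instead of deleting another account's devices.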
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha: authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login:
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 600f1f933c
commit 46166c63b2
6 changed files with 119 additions and 20 deletions
@@ -107,6 +107,32 @@ CrowdSec operates in a hub-and-agent model:
configured, so `captcha` decisions silently degraded to a 403 ban**: users
had no way to self-unblock; wiring Turnstile fixed that.

**Cloudflare Edge Enforcement for proxied hosts** (`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
- Proxied (orange-cloud) hosts terminate at the Cloudflare edge, so the in-cluster
  bouncer above never decides on them. Edge enforcement instead syncs LAPI
  decisions into **one Cloudflare account IP List (`crowdsec_ban`)** + a single
  **zone-scoped WAF custom rule** blocking `(ip.src in $crowdsec_ban)` across every
  proxied host. CronJob `crowdsec-cf-sync` (rybbit ns, every 2 min) reconciles it.
- **BAN-ONLY (2026-06-20):** only `type=ban` decisions sync to the edge. `captcha`
  decisions are deliberately NOT pushed: the CF account allows only ONE Rules List
  with a single block action, so folding captcha in would hard-block a soft
  challenge on every proxied host. (Before 2026-06-20 captcha was downgraded to a
  hard block at the edge.)
- **Auth carve-out (2026-06-20):** the WAF rule excludes `authentik.viktorbarzin.me` +
  `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`), and the
  Authentik UI ingress sets `exclude_crowdsec = true` for the in-cluster bouncer. A
  CrowdSec hit must never wall a user out of the login / WebAuthn flow they
  authenticate through; auth keeps `traefik-rate-limit` for brute-force protection.
- **⚠️ Currently NON-FUNCTIONAL (known issue, pre-existing since the 2026-06-20
  rollout):** `crowdsec-cf-sync` fails every run: `cf_list_items()` pagination
  gets CF `HTTP 400 code 10027 "invalid or expired cursor"`, so the list never
  populates (`num_items=0`) and the edge rule blocks nothing. LAPI also returns
  ~31k ban IPs, likely exceeding CF IP-List capacity even once pagination is fixed.
  **Edge enforcement for proxied hosts is therefore inert pending a fix** (the
  in-cluster bouncer still protects direct apps; the auth carve-out is correct
  regardless). Fix needs: (1) correct CF cursor pagination, (2) a capacity strategy
  for the ban set.
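
The ban-only sync and the auth carve-out described in the bullets above can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of `lapi_kv_sync.py` or `crowdsec_edge.tf`: the function names `ban_only_ips` and `waf_expression` are hypothetical, decisions are assumed to be dicts with `type` and `value` keys, and the exact rule expression used in the repository is not shown in the source.

```python
def ban_only_ips(decisions):
    """Return sorted unique IPs from decisions whose type is exactly "ban".

    captcha (and any other) decision types are dropped, so a soft challenge
    is never turned into a hard block at the edge.
    """
    return sorted({d["value"] for d in decisions if d.get("type") == "ban"})


def waf_expression(list_name, excluded_hosts):
    """Build a block-rule expression that skips the excluded (auth) hosts."""
    hosts = " ".join(f'"{h}"' for h in excluded_hosts)
    return f"(ip.src in ${list_name}) and not (http.host in {{{hosts}}})"


# Sample decisions using documentation-range IPs (illustrative data only).
decisions = [
    {"type": "ban", "value": "203.0.113.7"},
    {"type": "captcha", "value": "198.51.100.9"},
    {"type": "ban", "value": "203.0.113.7"},
]
ips = ban_only_ips(decisions)
expr = waf_expression(
    "crowdsec_ban",
    ["authentik.viktorbarzin.me", "public-auth.viktorbarzin.me"],
)
```

Filtering before the list is written keeps the single block action from ever applying to a captcha decision, and building the host exclusion from one list keeps the carve-out in a single place.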

**Metabase** (disabled by default):
- Dashboard for CrowdSec analytics
- CPU-intensive, only enable when investigating incidents