fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished, and he was suddenly asked to log in several
times a day instead of roughly monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} followed by a
DELETE of each device of users[0]. The fuzzy search returned the REAL account as
users[0], so the test wiped all 6 real passkeys. Losing the passkeys forced a
fallback to Google login, and the social-login stage
(default-source-authentication-login) had the provider default
session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h; hence
the constant re-logins. (Password + passkey logins were already weeks=4.)
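The failure above comes down to trusting users[0] from a fuzzy search. A minimal Python sketch of a guard that a cleanup step could apply before any DELETE follows. The function name, the exception class, the sample usernames, and the assumption that each search hit is a dict with "username" and "pk" keys are illustrative only; none of them is taken from the test that actually ran.

```python
from typing import Any


class AmbiguousUserError(RuntimeError):
    """Raised when search hits cannot be tied to exactly one user."""


def pick_exact_user(results: list[dict[str, Any]], username: str) -> dict[str, Any]:
    """Return the single hit whose username equals `username` exactly.

    Assumption: each hit is a dict carrying a "username" key. Any other
    outcome (zero or several exact matches) raises instead of guessing.
    """
    matches = [u for u in results if u.get("username") == username]
    if len(matches) != 1:
        raise AmbiguousUserError(
            f"expected exactly 1 user named {username!r}, got {len(matches)} "
            f"out of {len(results)} search hits; refusing to delete devices"
        )
    return matches[0]


# Hypothetical fuzzy-search hits: the real account sorts first.
hits = [
    {"pk": 7, "username": "viktor"},
    {"pk": 42, "username": "viktor-demo"},
]

target = pick_exact_user(hits, "viktor-demo")
print(target["pk"])  # 42, whereas hits[0]["pk"] would have been 7
```

Raising on anything other than one exact match means a renamed or missing demo user stops the cleanup instead of redirecting it to another account.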
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha: authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login:
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
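The session-duration pin in the first item can be illustrated with a small Python model of the fallback this commit message describes: a zero duration yields the unauthenticated age, anything else is used as given. This is a sketch of the described behaviour, not authentik's implementation; the function names are hypothetical, and the "unit=value;unit=value" string format is assumed from the values quoted above.

```python
from datetime import timedelta

ALLOWED_UNITS = {"weeks", "days", "hours", "minutes", "seconds"}


def parse_duration(expr: str) -> timedelta:
    """Parse a string such as "weeks=4" or "hours=1;minutes=30"."""
    kwargs: dict[str, float] = {}
    for part in expr.split(";"):
        part = part.strip()
        if not part:
            continue
        unit, _, value = part.partition("=")
        unit = unit.strip()
        if unit not in ALLOWED_UNITS:
            raise ValueError(f"unsupported duration unit: {unit!r}")
        kwargs[unit] = float(value)
    return timedelta(**kwargs)


def effective_session_age(session_duration: str,
                          unauthenticated_age: str = "hours=2") -> timedelta:
    """Model of the fallback: a zero duration uses the unauthenticated age."""
    duration = parse_duration(session_duration)
    if duration == timedelta(0):
        return parse_duration(unauthenticated_age)
    return duration


print(effective_session_age("seconds=0"))  # 2:00:00
print(effective_session_age("weeks=4"))    # 28 days, 0:00:00
```

Under this model the social-login stage went from a 2-hour session to the same 4 weeks as the password and passkey stages once the value was pinned.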
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 600f1f933c
commit 46166c63b2
6 changed files with 119 additions and 20 deletions

@@ -206,6 +206,36 @@ resource "authentik_stage_user_login" "default_login" {
   }
 }
 
+# -----------------------------------------------------------------------------
+# Source (social-login) User Login stage — bound to default-source-authentication-flow.
+# Adopted into Terraform 2026-06-20: its session_duration was the provider default
+# "seconds=0", which falls back to AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (hours=2).
+# So Google/GitHub/Facebook logins expired every 2h while password and passkey
+# logins (default-authentication-login) lasted weeks=4. After the 2026-06-18 passkey
+# wipe forced fallback to Google login, this 2h cap became the "re-login multiple
+# times daily" symptom. Pinning weeks=4 makes every login path consistent.
+# See docs/architecture/authentication.md.
+# -----------------------------------------------------------------------------
+import {
+  to = authentik_stage_user_login.default_source_login
+  id = "4c6977d2-eaae-4033-b1db-21b48c6b47f0"
+}
+
+resource "authentik_stage_user_login" "default_source_login" {
+  name = "default-source-authentication-login"
+  session_duration = "weeks=4"
+  lifecycle {
+    # Pin only session_duration; everything else stays UI-managed (same pattern
+    # as authentik_stage_user_login.default_login above).
+    ignore_changes = [
+      remember_me_offset,
+      terminate_other_sessions,
+      geoip_binding,
+      network_binding,
+    ]
+  }
+}
+
 # -----------------------------------------------------------------------------
 # Default Identification stage — adopted 2026-06-10 to embed the password
 # field on the identification screen (single-screen login: one round trip and
@@ -227,6 +257,13 @@ resource "authentik_stage_identification" "default_identification" {
   lifecycle {
     # Pin only password_stage; everything else stays UI-managed (same pattern
     # as authentik_stage_user_login.default_login above).
+    # NOTE: do NOT add webauthn_stage / enable_remember_me here — the pinned
+    # authentik TF provider's authentik_stage_identification resource exposes no
+    # such attributes (verified 2026-06-20: `tg plan` => "Unsupported attribute").
+    # They exist on Authentik's IdentificationStage *model* but not in the
+    # provider schema, so Terraform never manages or nulls them; they are purely
+    # UI/app-managed and need no ignore_changes entry. Commit 4e882989 removed
+    # them for exactly this reason — re-adding them breaks every apply.
     ignore_changes = [
       user_fields,
       case_insensitive_matching,

@@ -82,6 +82,13 @@ module "ingress" {
   service_name = "goauthentik-server"
   tls_secret_name = var.tls_secret_name
   anti_ai_scraping = false
+  # Never let the in-cluster CrowdSec bouncer serve a Turnstile/captcha
+  # interstitial or 403 on Authentik's own login + WebAuthn XHR endpoints — that
+  # walls users out of the very gate they authenticate through (a CrowdSec hit
+  # would break the passkey ceremony / session refresh mid-flow). Auth keeps
+  # Traefik rate-limiting; the Cloudflare edge WAF also carves out this host
+  # (stacks/rybbit/crowdsec_edge.tf). 2026-06-20.
+  exclude_crowdsec = true
   extra_annotations = {
     "gethomepage.dev/enabled" = "true"
     "gethomepage.dev/name" = "Authentik"

@@ -109,16 +109,25 @@ resource "cloudflare_ruleset" "crowdsec" {
   # must exist before this ruleset is created/updated.
   depends_on = [cloudflare_list.crowdsec_ban]
 
-  # CrowdSec ban — block every IP in the single edge list. The sync writes BOTH
-  # ban and captcha decisions into crowdsec_ban (captcha downgraded to block at
-  # the edge) because the CF account allows only ONE Rules List.
+  # CrowdSec ban — block every IP in the single edge list, EXCEPT on the
+  # Authentik auth hosts. The sync (lapi_kv_sync.py) now writes ONLY "ban"
+  # decisions into crowdsec_ban — captcha is no longer downgraded to a hard
+  # block. The auth-host carve-out guarantees a CrowdSec hit can never wall a
+  # user out of the login / WebAuthn flow they authenticate through: without it,
+  # a false-positive ban would 403 the passkey ceremony + session-refresh XHRs
+  # on every proxied host, auth included. 2026-06-20.
   rules {
     action = "block"
-    expression = "(ip.src in $crowdsec_ban)"
-    description = "CrowdSec: block banned IPs"
+    expression = "(ip.src in $crowdsec_ban) and not (http.host in {\"authentik.viktorbarzin.me\" \"public-auth.viktorbarzin.me\"})"
+    description = "CrowdSec: block banned IPs (auth hosts carved out)"
     enabled = true
   }
   # Pre-existing rule, imported and preserved verbatim (currently disabled).
+  # NOTE: Cloudflare auto-attaches logging{enabled=true} to skip rules. It must
+  # be declared here to match live, otherwise editing the OTHER rule re-sends
+  # this one too and the v4 provider errors "Provider produced inconsistent
+  # result after apply: .rules[1].logging block count changed from 0 to 1"
+  # (hit 2026-06-20 when adding the auth-host carve-out above).
   rules {
     action = "skip"
     expression = "(http.host contains \"viktorbarzin.me\")"
@@ -129,6 +138,9 @@ resource "cloudflare_ruleset" "crowdsec" {
       products = ["uaBlock", "bic", "hot", "securityLevel", "rateLimit", "waf", "zoneLockdown"]
       ruleset = "current"
     }
+    logging {
+      enabled = true
+    }
   }
 }
 

@@ -4,12 +4,16 @@
 Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec
 bouncer (which keys on the client IP Traefik sees) never decides on them. We
 push the decisions into the edge instead: a zone-scoped WAF custom rule blocks
-`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is
-the control plane that keeps that one IP List in sync with LAPI.
+`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone (the Authentik
+auth hosts are carved out in crowdsec_edge.tf so a ban can't break login). This
+job is the control plane that keeps that one IP List in sync with LAPI.
 
-The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY:
-BOTH ban AND captcha (scope=="ip") decisions are folded into the single
-crowdsec_ban list and captcha is downgraded to block at the proxied edge.
+Enforcement is BAN-ONLY: only scope=="ip" decisions of type "ban" are synced.
+"captcha" decisions are deliberately NOT pushed — the CF account allows only ONE
+Rules List with a single block action, so folding captcha in would hard-block a
+soft challenge across every proxied host. Captcha remediation stays at the
+in-cluster Traefik bouncer (Turnstile) for non-proxied apps. (Changed 2026-06-20
+from the prior ban+captcha fold that downgraded captcha to a hard edge block.)
 
 (Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior
 Workers-KV design; it no longer touches KV — it reconciles a CF Rules List.)
@@ -117,13 +121,17 @@ def _cf(url, *, method="GET", payload=None, timeout=20):
 # LAPI
 # --------------------------------------------------------------------------- #
 def fetch_decisions():
-    """Return the single desired set of IPs to BLOCK at the edge.
+    """Return the desired set of IPs to BLOCK at the edge.
 
-    Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The
-    CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions
-    are folded into one block set (captcha is downgraded to block at the proxied
-    edge). Raises on transport/HTTP error so the caller can SKIP the run
-    (fail-safe).
+    Only scope=="ip" decisions of type "ban" are projected (the WAF rule keys on
+    ip.src). "captcha" decisions are deliberately NOT pushed to the edge: the CF
+    account allows only ONE Rules List with a single block action, so folding
+    captcha in would HARD-BLOCK a soft challenge across every proxied host (and,
+    before the auth-host carve-out in crowdsec_edge.tf, could lock a user out of
+    Authentik itself). Edge enforcement is therefore ban-only; captcha
+    remediation stays at the in-cluster Traefik bouncer (Turnstile) for
+    non-proxied apps. Raises on transport/HTTP error so the caller can SKIP the
+    run (fail-safe). 2026-06-20.
     """
     data = _req(
         f"{LAPI_URL}/v1/decisions",
@@ -137,9 +145,10 @@ def fetch_decisions():
         if not ip:
             continue
         dtype = (d.get("type") or "").lower()
-        if dtype in ("ban", "captcha"):
+        if dtype == "ban":
             block.add(ip)
-        # other remediation types (e.g. throttle) are ignored
+        # captcha / throttle / other remediation types are ignored at the edge
+        # (ban-only enforcement — see the docstring above)
     return block
 
 
@@ -298,7 +307,7 @@ def main():
         push_metrics(0, ok=False)
         return 0
 
-    print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)")
+    print(f"[info] LAPI desired: {len(block)} block (ban-only, ip-scope)")
 
     # 2. Reconcile the single block list. CF errors fail loud (non-zero exit).
     try:
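
The ban-only projection and the auth-host carve-out in the diff above interact, so a small Python model of both may help when reasoning about who gets blocked where. It is a sketch under stated assumptions: the function names are hypothetical, decisions are assumed to be dicts with "scope", "value", and "type" keys, the sample IPs and the non-auth hostname are made up, and the second function only mimics the WAF expression; it is not Cloudflare's evaluator.

```python
AUTH_HOSTS = frozenset({
    "authentik.viktorbarzin.me",
    "public-auth.viktorbarzin.me",
})


def ips_to_block(decisions: list[dict]) -> set[str]:
    """Project decisions to the edge list: ip-scoped, type "ban" only."""
    block: set[str] = set()
    for d in decisions:
        if (d.get("scope") or "").lower() != "ip":
            continue  # ranges and other scopes are not pushed
        ip = d.get("value")  # assumed field name for the decision's IP
        if not ip:
            continue
        if (d.get("type") or "").lower() == "ban":
            block.add(ip)
        # captcha and other remediation types stay off the edge list
    return block


def edge_blocks(ip: str, host: str, ban_list: set[str]) -> bool:
    """Mimic: (ip.src in $crowdsec_ban) and not (http.host in {auth hosts})."""
    return ip in ban_list and host not in AUTH_HOSTS


decisions = [
    {"scope": "Ip", "value": "203.0.113.7", "type": "ban"},
    {"scope": "Ip", "value": "198.51.100.9", "type": "captcha"},
    {"scope": "Range", "value": "192.0.2.0/24", "type": "ban"},
]
ban_list = ips_to_block(decisions)

print(sorted(ban_list))                                               # ['203.0.113.7']
print(edge_blocks("203.0.113.7", "app.viktorbarzin.me", ban_list))        # True
print(edge_blocks("203.0.113.7", "authentik.viktorbarzin.me", ban_list))  # False
print(edge_blocks("198.51.100.9", "app.viktorbarzin.me", ban_list))       # False
```

In this model a banned IP is still blocked on ordinary proxied hosts but can always reach the two auth hosts, and a captcha decision never reaches the edge list at all.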