fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 600f1f933c
commit 46166c63b2
6 changed files with 119 additions and 20 deletions
@@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:

 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
+| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |

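The fallback described in the table (a `seconds=0` stage duration ending up with the 2h unauthenticated age) can be modelled in a few lines. This is only an illustrative sketch of the behaviour as this document describes it, not Authentik's own code; `parse_duration` and `effective_session_lifetime` are hypothetical helper names, and the `unit=value` (optionally `;`-separated) string format is assumed from the values shown above.

```python
from datetime import timedelta


def parse_duration(spec: str) -> timedelta:
    """Parse a duration string such as "weeks=4" or "hours=1;minutes=30".

    The ';'-separated form is an assumption; the table above only shows
    single "unit=value" pairs.
    """
    parts = {}
    for item in spec.split(";"):
        if not item.strip():
            continue
        unit, _, amount = item.partition("=")
        parts[unit.strip()] = float(amount)
    return timedelta(**parts)


def effective_session_lifetime(session_duration: str, unauthenticated_age: str) -> timedelta:
    """Model the documented fallback: a zero stage duration yields the
    unauthenticated-session age instead of a long-lived session."""
    duration = parse_duration(session_duration)
    if duration == timedelta(0):
        return parse_duration(unauthenticated_age)
    return duration


# Before the change: the social-login stage used the provider default.
before = effective_session_lifetime("seconds=0", "hours=2")
# After the change: the stage is pinned to four weeks.
after = effective_session_lifetime("weeks=4", "hours=2")
print(before, after)
```

Under this model the social-login path goes from a 2-hour lifetime to 28 days, which matches the symptom and the fix recorded in the table.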
@@ -177,6 +178,13 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
+
+## WebAuthn / Passkeys (2026-06-20)
+
+- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
+- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
+- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
+- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
 - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
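The lesson from the 2026-06-18 wipe (exact-match the demo user, never take `users[0]` of a fuzzy search) could be enforced with a small guard like the one below. It is a sketch, not the original test script: `pick_exact_user` is a hypothetical helper, the `username`/`pk` keys are an assumed shape for the user objects returned by the search, and the sample usernames are made up.

```python
def pick_exact_user(users, username):
    """Return the single user whose username equals `username` exactly.

    `users` is the list returned by a fuzzy `?search=` query. Anything other
    than exactly one exact match raises, so cleanup code can never fall
    through to an unrelated account.
    """
    matches = [u for u in users if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical fuzzy-search result: the real account sorts first.
search_results = [
    {"pk": 1, "username": "real-account-with-demo-in-profile"},
    {"pk": 7, "username": "demo"},
]

target = pick_exact_user(search_results, "demo")
print(target["pk"])  # the demo user's pk, not users[0]
```

Raising when the exact match is missing (rather than returning `None`) is a deliberate choice here: a destructive cleanup step should stop, not guess.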
@@ -107,6 +107,32 @@ CrowdSec operates in a hub-and-agent model:
   configured, so `captcha` decisions silently degraded to a 403 ban** — users
   had no way to self-unblock; wiring Turnstile fixed that.

+**Cloudflare Edge Enforcement for proxied hosts** (`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
+- Proxied (orange-cloud) hosts terminate at the Cloudflare edge, so the in-cluster
+  bouncer above never decides on them. Edge enforcement instead syncs LAPI
+  decisions into **one Cloudflare account IP List (`crowdsec_ban`)** + a single
+  **zone-scoped WAF custom rule** blocking `(ip.src in $crowdsec_ban)` across every
+  proxied host. CronJob `crowdsec-cf-sync` (rybbit ns, every 2 min) reconciles it.
+- **BAN-ONLY (2026-06-20):** only `type=ban` decisions sync to the edge. `captcha`
+  decisions are deliberately NOT pushed — the CF account allows only ONE Rules List
+  with a single block action, so folding captcha in would hard-block a soft
+  challenge on every proxied host. (Before 2026-06-20 captcha was downgraded to a
+  hard block at the edge.)
+- **Auth carve-out (2026-06-20):** the WAF rule excludes `authentik.viktorbarzin.me`
+  + `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`), and the
+  Authentik UI ingress sets `exclude_crowdsec = true` for the in-cluster bouncer. A
+  CrowdSec hit must never wall a user out of the login / WebAuthn flow they
+  authenticate through; auth keeps `traefik-rate-limit` for brute-force protection.
+- **⚠️ Currently NON-FUNCTIONAL (known issue, pre-existing since the 2026-06-20
+  rollout):** `crowdsec-cf-sync` fails every run — `cf_list_items()` pagination
+  gets CF `HTTP 400 code 10027 "invalid or expired cursor"`, so the list never
+  populates (`num_items=0`) and the edge rule blocks nothing. LAPI also returns
+  ~31k ban IPs, likely exceeding CF IP-List capacity even once pagination is fixed.
+  **Edge enforcement for proxied hosts is therefore inert pending a fix** (the
+  in-cluster bouncer still protects direct apps; the auth carve-out is correct
+  regardless). Fix needs: (1) correct CF cursor pagination, (2) a capacity strategy
+  for the ban set.
+
 **Metabase** (disabled by default):
 - Dashboard for CrowdSec analytics
 - CPU-intensive, only enable when investigating incidents
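For fix item (1) in the edge-enforcement notes above, cursor-following pagination generally looks like the sketch below. It is not the job's actual `cf_list_items()`: `fetch_page` is a hypothetical callable wrapping the HTTP GET, and the response shape (`result` for items, `result_info.cursors.after` for the next cursor) is an assumption that should be checked against the Cloudflare Lists API before use.

```python
def iter_list_items(fetch_page):
    """Yield every item of a cursor-paginated list.

    `fetch_page(cursor)` must return one decoded JSON page; `cursor` is None
    for the first request. Each cursor is used exactly once and always comes
    from the page that was just fetched, never from an earlier run.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("result") or []
        cursors = (page.get("result_info") or {}).get("cursors") or {}
        cursor = cursors.get("after")
        if not cursor:
            break


# Fake two-page backend standing in for the HTTP call (illustration only).
_pages = {
    None: {"result": [{"ip": "192.0.2.1"}], "result_info": {"cursors": {"after": "c1"}}},
    "c1": {"result": [{"ip": "192.0.2.2"}], "result_info": {"cursors": {}}},
}

items = list(iter_list_items(lambda cursor: _pages[cursor]))
print([item["ip"] for item in items])
```

Passing the fetcher in as a callable keeps the loop testable without network access; the capacity problem (item 2) is a separate concern this sketch does not address.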
@@ -206,6 +206,36 @@ resource "authentik_stage_user_login" "default_login" {
   }
 }

+# -----------------------------------------------------------------------------
+# Source (social-login) User Login stage — bound to default-source-authentication-flow.
+# Adopted into Terraform 2026-06-20: its session_duration was the provider default
+# "seconds=0", which falls back to AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE (hours=2).
+# So Google/GitHub/Facebook logins expired every 2h while password and passkey
+# logins (default-authentication-login) lasted weeks=4. After the 2026-06-18 passkey
+# wipe forced fallback to Google login, this 2h cap became the "re-login multiple
+# times daily" symptom. Pinning weeks=4 makes every login path consistent.
+# See docs/architecture/authentication.md.
+# -----------------------------------------------------------------------------
+import {
+  to = authentik_stage_user_login.default_source_login
+  id = "4c6977d2-eaae-4033-b1db-21b48c6b47f0"
+}
+
+resource "authentik_stage_user_login" "default_source_login" {
+  name = "default-source-authentication-login"
+  session_duration = "weeks=4"
+  lifecycle {
+    # Pin only session_duration; everything else stays UI-managed (same pattern
+    # as authentik_stage_user_login.default_login above).
+    ignore_changes = [
+      remember_me_offset,
+      terminate_other_sessions,
+      geoip_binding,
+      network_binding,
+    ]
+  }
+}
+
 # -----------------------------------------------------------------------------
 # Default Identification stage — adopted 2026-06-10 to embed the password
 # field on the identification screen (single-screen login: one round trip and
@@ -227,6 +257,13 @@ resource "authentik_stage_identification" "default_identification" {
   lifecycle {
     # Pin only password_stage; everything else stays UI-managed (same pattern
     # as authentik_stage_user_login.default_login above).
+    # NOTE: do NOT add webauthn_stage / enable_remember_me here — the pinned
+    # authentik TF provider's authentik_stage_identification resource exposes no
+    # such attributes (verified 2026-06-20: `tg plan` => "Unsupported attribute").
+    # They exist on Authentik's IdentificationStage *model* but not in the
+    # provider schema, so Terraform never manages or nulls them; they are purely
+    # UI/app-managed and need no ignore_changes entry. Commit 4e882989 removed
+    # them for exactly this reason — re-adding them breaks every apply.
     ignore_changes = [
       user_fields,
       case_insensitive_matching,
@@ -82,6 +82,13 @@ module "ingress" {
   service_name = "goauthentik-server"
   tls_secret_name = var.tls_secret_name
   anti_ai_scraping = false
+  # Never let the in-cluster CrowdSec bouncer serve a Turnstile/captcha
+  # interstitial or 403 on Authentik's own login + WebAuthn XHR endpoints — that
+  # walls users out of the very gate they authenticate through (a CrowdSec hit
+  # would break the passkey ceremony / session refresh mid-flow). Auth keeps
+  # Traefik rate-limiting; the Cloudflare edge WAF also carves out this host
+  # (stacks/rybbit/crowdsec_edge.tf). 2026-06-20.
+  exclude_crowdsec = true
   extra_annotations = {
     "gethomepage.dev/enabled" = "true"
     "gethomepage.dev/name" = "Authentik"
@@ -109,16 +109,25 @@ resource "cloudflare_ruleset" "crowdsec" {
   # must exist before this ruleset is created/updated.
   depends_on = [cloudflare_list.crowdsec_ban]

-  # CrowdSec ban — block every IP in the single edge list. The sync writes BOTH
-  # ban and captcha decisions into crowdsec_ban (captcha downgraded to block at
-  # the edge) because the CF account allows only ONE Rules List.
+  # CrowdSec ban — block every IP in the single edge list, EXCEPT on the
+  # Authentik auth hosts. The sync (lapi_kv_sync.py) now writes ONLY "ban"
+  # decisions into crowdsec_ban — captcha is no longer downgraded to a hard
+  # block. The auth-host carve-out guarantees a CrowdSec hit can never wall a
+  # user out of the login / WebAuthn flow they authenticate through: without it,
+  # a false-positive ban would 403 the passkey ceremony + session-refresh XHRs
+  # on every proxied host, auth included. 2026-06-20.
   rules {
     action = "block"
-    expression = "(ip.src in $crowdsec_ban)"
-    description = "CrowdSec: block banned IPs"
+    expression = "(ip.src in $crowdsec_ban) and not (http.host in {\"authentik.viktorbarzin.me\" \"public-auth.viktorbarzin.me\"})"
+    description = "CrowdSec: block banned IPs (auth hosts carved out)"
     enabled = true
   }
   # Pre-existing rule, imported and preserved verbatim (currently disabled).
+  # NOTE: Cloudflare auto-attaches logging{enabled=true} to skip rules. It must
+  # be declared here to match live, otherwise editing the OTHER rule re-sends
+  # this one too and the v4 provider errors "Provider produced inconsistent
+  # result after apply: .rules[1].logging block count changed from 0 to 1"
+  # (hit 2026-06-20 when adding the auth-host carve-out above).
   rules {
     action = "skip"
     expression = "(http.host contains \"viktorbarzin.me\")"
@@ -129,6 +138,9 @@ resource "cloudflare_ruleset" "crowdsec" {
       products = ["uaBlock", "bic", "hot", "securityLevel", "rateLimit", "waf", "zoneLockdown"]
       ruleset = "current"
     }
+    logging {
+      enabled = true
+    }
   }
 }

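The carve-out expression in the block rule above is a plain string: the list match followed by a negated host-set match. The sketch below shows how such an expression is composed from a host list, which can help when checking by hand that a new auth host was added correctly. `edge_block_expression` is a hypothetical helper for illustration; the actual rule is the literal string in the Terraform resource.

```python
def edge_block_expression(list_name, excluded_hosts):
    """Build a WAF expression that blocks IPs in `list_name` except on
    the given hosts. With no excluded hosts it is the bare list match."""
    expr = f"(ip.src in ${list_name})"
    if excluded_hosts:
        # Host sets are written as space-separated quoted strings in braces.
        hosts = " ".join(f'"{host}"' for host in excluded_hosts)
        expr += f" and not (http.host in {{{hosts}}})"
    return expr


expression = edge_block_expression(
    "crowdsec_ban",
    ["authentik.viktorbarzin.me", "public-auth.viktorbarzin.me"],
)
print(expression)
```

With the two auth hosts from this commit, the result is the same expression as in the rule above once the HCL string escapes are removed.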
@@ -4,12 +4,16 @@
 Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec
 bouncer (which keys on the client IP Traefik sees) never decides on them. We
 push the decisions into the edge instead: a zone-scoped WAF custom rule blocks
-`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is
-the control plane that keeps that one IP List in sync with LAPI.
+`(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone (the Authentik
+auth hosts are carved out in crowdsec_edge.tf so a ban can't break login). This
+job is the control plane that keeps that one IP List in sync with LAPI.

-The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY:
-BOTH ban AND captcha (scope=="ip") decisions are folded into the single
-crowdsec_ban list and captcha is downgraded to block at the proxied edge.
+Enforcement is BAN-ONLY: only scope=="ip" decisions of type "ban" are synced.
+"captcha" decisions are deliberately NOT pushed — the CF account allows only ONE
+Rules List with a single block action, so folding captcha in would hard-block a
+soft challenge across every proxied host. Captcha remediation stays at the
+in-cluster Traefik bouncer (Turnstile) for non-proxied apps. (Changed 2026-06-20
+from the prior ban+captcha fold that downgraded captcha to a hard edge block.)

 (Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior
 Workers-KV design; it no longer touches KV — it reconciles a CF Rules List.)
@@ -117,13 +121,17 @@ def _cf(url, *, method="GET", payload=None, timeout=20):
 # LAPI
 # --------------------------------------------------------------------------- #
 def fetch_decisions():
-    """Return the single desired set of IPs to BLOCK at the edge.
+    """Return the desired set of IPs to BLOCK at the edge.

-    Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The
-    CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions
-    are folded into one block set (captcha is downgraded to block at the proxied
-    edge). Raises on transport/HTTP error so the caller can SKIP the run
-    (fail-safe).
+    Only scope=="ip" decisions of type "ban" are projected (the WAF rule keys on
+    ip.src). "captcha" decisions are deliberately NOT pushed to the edge: the CF
+    account allows only ONE Rules List with a single block action, so folding
+    captcha in would HARD-BLOCK a soft challenge across every proxied host (and,
+    before the auth-host carve-out in crowdsec_edge.tf, could lock a user out of
+    Authentik itself). Edge enforcement is therefore ban-only; captcha
+    remediation stays at the in-cluster Traefik bouncer (Turnstile) for
+    non-proxied apps. Raises on transport/HTTP error so the caller can SKIP the
+    run (fail-safe). 2026-06-20.
     """
     data = _req(
         f"{LAPI_URL}/v1/decisions",
@@ -137,9 +145,10 @@ def fetch_decisions():
         if not ip:
             continue
         dtype = (d.get("type") or "").lower()
-        if dtype in ("ban", "captcha"):
+        if dtype == "ban":
             block.add(ip)
-        # other remediation types (e.g. throttle) are ignored
+        # captcha / throttle / other remediation types are ignored at the edge
+        # (ban-only enforcement — see the docstring above)
     return block


@@ -298,7 +307,7 @@ def main():
         push_metrics(0, ok=False)
         return 0

-    print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)")
+    print(f"[info] LAPI desired: {len(block)} block (ban-only, ip-scope)")

     # 2. Reconcile the single block list. CF errors fail loud (non-zero exit).
     try: