authentik: dedicated rate-limit carve-out + per-router 5xx observability

Unauthenticated users were getting a blank login screen, and the page would
sometimes hang. Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load gets 429s on the
tail of the chunk requests, and a single failed ES-module import aborts SPA
bootstrap -> permanent blank screen. authentik was the only first-party SPA
still on the default limiter (8 siblings already have a carve-out). Clients
behind a shared NAT trip it especially easily because they share one per-IP
bucket.
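
The failure mode above can be illustrated with a simplified token-bucket model. This is only a sketch, not Traefik's actual limiter: it assumes "10/50" means average 10 req/s with burst 50, and that the ~70 chunk requests arrive effectively at once (both assumptions; `rejected_requests` is a hypothetical helper).

```python
def rejected_requests(average, burst, arrivals):
    """Count requests rejected by a simple token bucket.

    average:  tokens refilled per second
    burst:    bucket capacity (the bucket starts full)
    arrivals: sorted request timestamps, in seconds
    """
    tokens = float(burst)
    last = arrivals[0] if arrivals else 0.0
    rejected = 0
    for t in arrivals:
        # Refill for the time elapsed since the previous request, capped at burst.
        tokens = min(burst, tokens + (t - last) * average)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0      # request admitted
        else:
            rejected += 1      # request would receive a 429
    return rejected


# Assumed cold load: 70 chunk requests arriving at the same instant.
arrivals = [0.0] * 70
shared = rejected_requests(10, 50, arrivals)        # shared default limiter
dedicated = rejected_requests(100, 1000, arrivals)  # authentik-rate-limit
print(shared, dedicated)
```

Under these assumptions the shared limiter rejects the last 20 of the 70 requests, while the 100/1000 carve-out admits all of them. Other clients behind the same NAT draining the bucket first would only make the shared case worse.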

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
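
As a quick sanity check of the narrowed drop-regex, the sketch below evaluates the old and new patterns from this commit against a handful of Traefik metric names. It assumes Python's `re.fullmatch` is a fair stand-in for Prometheus's fully anchored RE2 matching, which holds for patterns this simple; `is_dropped` is a hypothetical helper.

```python
import re

OLD = r"traefik_(router|service|entrypoint)_request_duration_seconds_bucket|traefik_router_.*"
NEW = (
    r"traefik_(router|service|entrypoint)_request_duration_seconds_bucket"
    r"|traefik_router_request(s_bytes_total|s_tls_total|_duration_seconds_(sum|count))"
    r"|traefik_router_responses_bytes_total"
)


def is_dropped(regex, metric_name):
    # Prometheus anchors relabel regexes at both ends, so fullmatch is the
    # closest Python equivalent of `action: drop` on __name__.
    return re.fullmatch(regex, metric_name) is not None


metrics = [
    "traefik_router_requests_total",
    "traefik_router_request_duration_seconds_bucket",
    "traefik_router_request_duration_seconds_sum",
    "traefik_router_requests_bytes_total",
    "traefik_router_responses_bytes_total",
    "traefik_service_requests_total",
]

kept_old = [m for m in metrics if not is_dropped(OLD, m)]
kept_new = [m for m in metrics if not is_dropped(NEW, m)]
print("kept by old regex:", kept_old)
print("kept by new regex:", kept_new)
```

With the old pattern only the service-level counter survives; with the new one `traefik_router_requests_total` is kept as well, while the router histogram buckets, sums, and byte counters are still dropped.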

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 09:10:34 +00:00
parent 65a09dcbc4
commit b84b0021c2
4 changed files with 57 additions and 5 deletions

@@ -2385,6 +2385,24 @@ serverFiles:
     annotations:
       summary: "Traefik replica config divergence: max/min reload rate = {{ $value | printf \"%.1f\" }}x"
       description: "One Traefik replica is reloading config much less than its peers — likely a stale K8s informer cache returning 404 for ingresses. Identify the stale pod by comparing `traefik_config_reloads_total` across pods and delete it: `kubectl -n traefik delete pod <pod-name>`."
+  # Authentik '/' router 5xx — the episodic blank-login-screen + 30s-hang
+  # signature: all 3 goauthentik-server pods go NotReady together during a
+  # PG/pgbouncer transient, so Traefik has no healthy backend → 502/503/504.
+  # Needs traefik_router_requests_total (kept by the narrowed drop-regex in
+  # extraScrapeConfigs). Router name is verified live; it changes only if the
+  # authentik '/' ingress is renamed. 5% over 5m with a 0.1 req/s traffic
+  # floor so it doesn't fire on idle/just-restarted Prometheus.
+  - alert: AuthentikRootRouter5xxHigh
+    expr: |
+      sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes",code=~"50[234]"}[5m]))
+      / sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes"}[5m])) * 100 > 5
+      and sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes"}[5m])) > 0.1
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      summary: "authentik '/' router 5xx rate {{ $value | printf \"%.1f\" }}% — episodic backend-unavailable (blank login / 30s hang)"
+      description: "The authentik UI '/' router is returning 502/503/504 — all 3 goauthentik-server pods are likely NotReady together during a PG/pgbouncer transient. Check `kubectl -n authentik get endpointslices` (should never be empty) and CNPG health."
   - alert: HighServiceErrorRate
     expr: |
       (
@@ -3642,8 +3660,15 @@ extraScrapeConfigs: |
       - source_labels: [__meta_kubernetes_pod_name]
         target_label: instance
     metric_relabel_configs:
+      # Drop the high-cardinality duration HISTOGRAM (router/service/entrypoint
+      # *_bucket, ~5k series/pod — the real cardinality driver, commit 06490b06)
+      # plus the bulky bytes/tls/sum/count router series, but KEEP
+      # traefik_router_requests_total: the only per-router counter carrying both
+      # `router` and `code` labels, needed to see 4xx/5xx (incl. 429/503) on the
+      # authentik routers. ~448 series/pod. (The earlier blanket `traefik_router_.*`
+      # dropped it, so per-router error rates were un-queryable + un-alertable.)
       - source_labels: [__name__]
-        regex: 'traefik_(router|service|entrypoint)_request_duration_seconds_bucket|traefik_router_.*'
+        regex: 'traefik_(router|service|entrypoint)_request_duration_seconds_bucket|traefik_router_request(s_bytes_total|s_tls_total|_duration_seconds_(sum|count))|traefik_router_responses_bytes_total'
         action: drop
   - job_name: 'realestate-crawler-api'
     kubernetes_sd_configs:
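
The instantaneous condition of the `AuthentikRootRouter5xxHigh` rule in this diff can be restated outside PromQL as a small sketch. `alert_condition` and the sample rates are hypothetical, and the `for: 5m` hold time is deliberately not modelled.

```python
def alert_condition(rate_50x, rate_total, pct_threshold=5.0, traffic_floor=0.1):
    """Mirror the rule's expression for one evaluation.

    rate_50x:   502/503/504 responses per second on the router (5m rate)
    rate_total: all responses per second on the router (5m rate)
    """
    # Traffic floor: below 0.1 req/s the `and` clause filters the result out,
    # which also avoids dividing by zero on an idle router.
    if rate_total <= traffic_floor:
        return False
    return rate_50x / rate_total * 100 > pct_threshold


print(alert_condition(0.5, 2.0))    # 25% errors at 2 req/s
print(alert_condition(0.05, 2.0))   # 2.5% errors at 2 req/s
print(alert_condition(0.05, 0.08))  # high ratio, but below the traffic floor
```

Only the first case satisfies both the 5% ratio and the 0.1 req/s floor. In Prometheus the condition would additionally have to hold for five minutes before the alert fires.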