authentik: dedicated rate-limit carve-out + per-router 5xx observability

Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).
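
As a rough worked example of why the tail of a cold load gets 429'd, the arithmetic can be sketched in Terraform locals. This assumes "10/50" means average 10 / burst 50, treats the ~70 requests as arriving at once, and ignores token refill during the load; all local names are hypothetical.

```hcl
# Illustrative arithmetic only; not part of this commit.
locals {
  cold_load_requests = 70   # approx. parallel requests on a cold login (from the message above)
  default_burst      = 50   # assumed burst of the shared 10/50 limiter
  authentik_burst    = 1000 # burst of the new authentik-rate-limit middleware

  # Requests beyond the burst allowance are the ones that receive a 429.
  rejected_on_default = max(0, local.cold_load_requests - local.default_burst)

  # How many full cold loads fit into the new burst from one source IP.
  cold_loads_absorbed = floor(local.authentik_burst / local.cold_load_requests)
}

output "rejected_on_default" {
  value = local.rejected_on_default
}

output "cold_loads_absorbed" {
  value = local.cold_loads_absorbed
}
```

Under these assumptions roughly 20 of the 70 requests are rejected on the default limiter. Any one of them being an ES-module import is enough to break SPA bootstrap.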

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
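
A minimal sketch of the narrowed drop rule described in the monitoring bullet, written as Terraform locals rendered with `yamlencode`. The local names, the exact regex, and the `authentik.*` router matcher are assumptions for illustration; the commit's actual values are not shown here.

```hcl
# Hypothetical sketch; the regex and names are assumptions, not the commit's contents.
locals {
  traefik_scrape_job = {
    job_name = "traefik"
    metric_relabel_configs = [
      {
        # Before: regex = "traefik_router_.*" dropped every per-router series,
        # including the traefik_router_requests_total counter.
        # After: only the per-router duration histogram buckets are dropped.
        # Prometheus anchors relabel regexes, so the counter no longer matches.
        source_labels = ["__name__"]
        regex         = "traefik_router_.*_duration_seconds_bucket"
        action        = "drop"
      }
    ]
  }

  # With the counter kept, a per-router 5xx rate becomes queryable.
  # The router matcher is a placeholder.
  example_5xx_query = "sum(rate(traefik_router_requests_total{router=~\"authentik.*\",code=~\"5..\"}[5m]))"
}

output "traefik_scrape_job_yaml" {
  value = yamlencode(local.traefik_scrape_job)
}
```

A query of this shape is what an alert such as `AuthentikRootRouter5xxHigh` could be built on. Its real expression and threshold are not stated in this message.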

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-28 09:10:34 +00:00
parent 65a09dcbc4
commit b84b0021c2
4 changed files with 57 additions and 5 deletions


@@ -341,6 +341,33 @@ resource "kubernetes_manifest" "middleware_health_rate_limit" {
  depends_on = [helm_release.traefik]
}
# Authentik-specific rate limit. The login SPA cold-loads its flow-executor
# JS/CSS chunks from /static (app-served, not a CDN) plus an API burst on /:
# ~70 parallel requests on a fresh/empty-cache login. The default 10/50 limiter
# 429s the tail, and a 429'd ES-module import aborts SPA bootstrap, leaving a
# blank login screen for cold/incognito/cache-cleared clients and any clients
# sharing a NAT egress IP (sixth instance of the burst pattern, after ha-sofia,
# ActualBudget, noVNC, tripit and health). authentik was the only first-party
# SPA still on the default limiter. Burst absorbs a couple of full cold loads
# back-to-back.
resource "kubernetes_manifest" "middleware_authentik_rate_limit" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "authentik-rate-limit"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      rateLimit = {
        average = 100
        burst   = 1000
      }
    }
  }
  depends_on = [helm_release.traefik]
}
# Compress responses to clients at the entrypoint level (outermost).
# Applied at websecure entrypoint so all responses get compressed.
# Uses includedContentTypes (whitelist) instead of excludedContentTypes: