authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would sometimes just hang). Root-caused via a read-only fan-out + adversarial verify: the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was the only first-party SPA still on the default limiter (8 siblings already have a carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket). - traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000, mirroring the existing health/tripit carve-outs). The authentik / and /static ingresses switch to it in the authentik-stack commit. - monitoring: the `traefik` scrape job's drop-regex was a blanket `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable. Narrowed it to keep the counter while still dropping the high-cardinality `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh` for the episodic all-3-server-pods-NotReady 502/503/504 cascade. Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
65a09dcbc4
commit
b84b0021c2
4 changed files with 57 additions and 5 deletions
|
|
@ -204,7 +204,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
- **PDBs**: minAvailable=2 on Traefik and Authentik.
|
||||
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
|
||||
- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
|
||||
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
|
||||
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
|
||||
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
|
||||
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
|
||||
- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
|
||||
|
|
@ -232,7 +232,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|
|||
- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
|
||||
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
|
||||
|
||||
|
|
|
|||
|
|
@ -261,7 +261,7 @@ Traefik chain:
|
|||
|
||||
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
|
||||
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
|
||||
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
|
||||
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
|
||||
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
|
||||
|
||||
Additional middleware:
|
||||
|
|
@ -550,7 +550,7 @@ chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Che
|
|||
|
||||
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
|
||||
|
||||
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
|
||||
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
|
||||
|
||||
### Large Downloads or Uploads Truncate / Fail Partway
|
||||
|
||||
|
|
|
|||
|
|
@ -2385,6 +2385,24 @@ serverFiles:
|
|||
annotations:
|
||||
summary: "Traefik replica config divergence: max/min reload rate = {{ $value | printf \"%.1f\" }}x"
|
||||
description: "One Traefik replica is reloading config much less than its peers — likely a stale K8s informer cache returning 404 for ingresses. Identify the stale pod by comparing `traefik_config_reloads_total` across pods and delete it: `kubectl -n traefik delete pod <pod-name>`."
|
||||
# Authentik '/' router 5xx — the episodic blank-login-screen + 30s-hang
|
||||
# signature: all 3 goauthentik-server pods go NotReady together during a
|
||||
# PG/pgbouncer transient, so Traefik has no healthy backend → 502/503/504.
|
||||
# Needs traefik_router_requests_total (kept by the narrowed drop-regex in
|
||||
# extraScrapeConfigs). Router name is verified live; it changes only if the
|
||||
# authentik '/' ingress is renamed. 5% over 5m with a 0.1 req/s traffic
|
||||
# floor so it doesn't fire on idle/just-restarted Prometheus.
|
||||
- alert: AuthentikRootRouter5xxHigh
|
||||
expr: |
|
||||
sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes",code=~"50[234]"}[5m]))
|
||||
/ sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes"}[5m])) * 100 > 5
|
||||
and sum(rate(traefik_router_requests_total{router="authentik-authentik-authentik-viktorbarzin-me@kubernetes"}[5m])) > 0.1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "authentik '/' router 5xx rate {{ $value | printf \"%.1f\" }}% — episodic backend-unavailable (blank login / 30s hang)"
|
||||
description: "The authentik UI '/' router is returning 502/503/504 — all 3 goauthentik-server pods are likely NotReady together during a PG/pgbouncer transient. Check `kubectl -n authentik get endpointslices` (should never be empty) and CNPG health."
|
||||
- alert: HighServiceErrorRate
|
||||
expr: |
|
||||
(
|
||||
|
|
@ -3642,8 +3660,15 @@ extraScrapeConfigs: |
|
|||
- source_labels: [__meta_kubernetes_pod_name]
|
||||
target_label: instance
|
||||
metric_relabel_configs:
|
||||
# Drop the high-cardinality duration HISTOGRAM (router/service/entrypoint
|
||||
# *_bucket, ~5k series/pod — the real cardinality driver, commit 06490b06)
|
||||
# plus the bulky bytes/tls/sum/count router series, but KEEP
|
||||
# traefik_router_requests_total: the only per-router counter carrying both
|
||||
# `router` and `code` labels, needed to see 4xx/5xx (incl. 429/503) on the
|
||||
# authentik routers. ~448 series/pod. (The earlier blanket `traefik_router_.*`
|
||||
# dropped it, so per-router error rates were un-queryable + un-alertable.)
|
||||
- source_labels: [__name__]
|
||||
regex: 'traefik_(router|service|entrypoint)_request_duration_seconds_bucket|traefik_router_.*'
|
||||
regex: 'traefik_(router|service|entrypoint)_request_duration_seconds_bucket|traefik_router_request(s_bytes_total|s_tls_total|_duration_seconds_(sum|count))|traefik_router_responses_bytes_total'
|
||||
action: drop
|
||||
- job_name: 'realestate-crawler-api'
|
||||
kubernetes_sd_configs:
|
||||
|
|
|
|||
|
|
@ -341,6 +341,33 @@ resource "kubernetes_manifest" "middleware_health_rate_limit" {
|
|||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
|
||||
# Authentik-specific rate limit. The login SPA cold-loads its flow-executor
|
||||
# JS/CSS chunks from /static (app-served, not a CDN) plus an API burst on / —
|
||||
# ~70 parallel requests on a fresh/empty-cache login. The default 10/50 limiter
|
||||
# 429s the tail, and a 429'd ES-module import aborts SPA bootstrap → blank login
|
||||
# screen for cold/incognito/cache-cleared clients and any clients sharing a NAT
|
||||
# egress IP (sixth instance of the burst pattern, after ha-sofia, ActualBudget,
|
||||
# noVNC, tripit and health). authentik was the only first-party SPA still on the
|
||||
# default limiter. Burst absorbs a couple of full cold loads back-to-back.
|
||||
resource "kubernetes_manifest" "middleware_authentik_rate_limit" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "authentik-rate-limit"
|
||||
namespace = kubernetes_namespace.traefik.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
rateLimit = {
|
||||
average = 100
|
||||
burst = 1000
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
|
||||
# Compress responses to clients at the entrypoint level (outermost).
|
||||
# Applied at websecure entrypoint so all responses get compressed.
|
||||
# Uses includedContentTypes (whitelist) instead of excludedContentTypes:
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue