From 454a48c6ac38d0daa7cf65640c0db3a777cfdb11 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Sun, 1 Mar 2026 13:50:00 +0000
Subject: [PATCH] [ci skip] add Traefik resilience hardening design doc

---
 .../2026-03-01-traefik-resilience-design.md   | 237 ++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 docs/plans/2026-03-01-traefik-resilience-design.md

diff --git a/docs/plans/2026-03-01-traefik-resilience-design.md b/docs/plans/2026-03-01-traefik-resilience-design.md
new file mode 100644
index 00000000..e1fa45f1
--- /dev/null
+++ b/docs/plans/2026-03-01-traefik-resilience-design.md
@@ -0,0 +1,237 @@
+# Traefik Resilience Hardening Design
+
+**Date**: 2026-03-01
+**Status**: Approved
+
+## Problem Statement
+
+Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.
+
+Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.
+
+## Current State
+
+### Dependency Map (Request Path)
+
+```
+Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
+  → rate-limit .................... IN-PROCESS
+  → csp-headers ................... IN-PROCESS
+  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
+  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
+  → anti-ai-headers ............... IN-PROCESS
+  → strip-accept-encoding ......... IN-PROCESS
+  → anti-ai-trap-links (plugin) ... IN-PROCESS
+  → [if protected=true]:
+    → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
+  → Backend Service
+```
+
+### Risk Assessment
+
+| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
+|---|---|---|---|---|
+| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
+| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
+| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
+| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
+| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
+| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |
+
+## Design
+
+### 1. ForwardAuth Resilience (Nginx Resilience Proxies)
+
+#### 1a. AI Bot Block → Fail-Open
+
+Deploy a small nginx reverse proxy in front of Poison Fountain:
+- Normal operation: proxies request to `poison-fountain:8080/auth`, returns its response
+- Poison Fountain down: nginx catches 502/503/504, returns **200** (allow all traffic)
+- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work
+
+Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.
+
+**Nginx config sketch:**
+```nginx
+upstream poison_fountain {
+    server poison-fountain.poison-fountain.svc.cluster.local:8080;
+}
+server {
+    listen 8080;
+    location /auth {
+        proxy_pass http://poison_fountain;
+        proxy_connect_timeout 3s;
+        proxy_read_timeout 5s;
+        proxy_intercept_errors on;
+        error_page 502 503 504 =200 /fallback-allow;
+    }
+    location = /fallback-allow {
+        return 200;
+    }
+    location /healthz {
+        return 200 "ok";
+    }
+}
+```
+
+**Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).
+
+#### 1b. Authentik → BasicAuth Fallback
+
+Deploy a similar nginx proxy in front of Authentik's outpost:
+- Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO)
+- Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret
+- Protected services remain accessible to admins via basicAuth during Authentik outages
+
+Update the `authentik-forward-auth` middleware to point at the nginx proxy.
+
+**Nginx config sketch:**
+```nginx
+upstream authentik {
+    server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
+}
+server {
+    listen 9000;
+    location /outpost.goauthentik.io/auth/traefik {
+        proxy_pass http://authentik;
+        proxy_connect_timeout 3s;
+        proxy_read_timeout 5s;
+        proxy_intercept_errors on;
+        error_page 502 503 504 = @fallback_auth;
+    }
+    location @fallback_auth {
+        auth_basic "Emergency Access";
+        auth_basic_user_file /etc/nginx/htpasswd;
+        # `return` runs before auth_basic is evaluated and would skip it; try_files replies only after auth passes
+        add_header X-authentik-username $remote_user;
+        try_files /nonexistent =204;
+    }
+    location /healthz {
+        return 200 "ok";
+    }
+}
+```
+
+**htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.
+
+### 2. Pod Scheduling & Disruption Protection
+
+#### 2a. Traefik Topology Spread + PDB
+
+Add to Traefik Helm values:
+```yaml
+topologySpreadConstraints:
+  - maxSkew: 1
+    topologyKey: kubernetes.io/hostname
+    whenUnsatisfiable: DoNotSchedule
+    labelSelector:
+      matchLabels:
+        app.kubernetes.io/name: traefik
+
+podDisruptionBudget:
+  enabled: true
+  minAvailable: 2
+```
+
+#### 2b. Authentik PDB
+
+Add to Authentik Helm values:
+```yaml
+server:
+  pdb:
+    enabled: true
+    minAvailable: 2
+```
+
+#### 2c. Poison Fountain Tier Bump
+
+Change Poison Fountain namespace tier from `4-aux` to `1-cluster`:
+- File: `stacks/poison-fountain/main.tf`
+- Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster`
+- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)
+
+### 3. Timeout & Backend Protection
+
+#### 3a. Response Header Timeout
+
+Change from unlimited to 30s:
+```
+--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
+```
+
+Prevents hung backends from holding Traefik goroutines indefinitely.
+
+#### 3b. ForwardAuth Proxy Timeouts
+
+The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream doesn't respond within 5s, the fallback activates. This is much faster than waiting for the backend to eventually time out.
+
+#### 3c. Retry Middleware
+
+Add a `retry` middleware to the default chain in ingress_factory:
+```yaml
+retry:
+  attempts: 2
+  initialInterval: 100ms
+```
+
+Handles transient connection failures to backends that are restarting. Traefik retries only when the backend does not reply (network errors); once any response arrives, including a 5xx, it is returned as-is.
+
+### 4. Monitoring & Alerting
+
+#### 4a. PoisonFountainDown Alert
+
+```yaml
+- alert: PoisonFountainDown
+  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
+  for: 2m
+  labels:
+    severity: critical
+  annotations:
+    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
+```
+
+#### 4b. Alert Inhibition
+
+When `TraefikDown` fires, suppress `PoisonFountainDown`.
+
+#### 4c. ForwardAuthFailing Alert
+
+Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):
+
+```yaml
+- alert: ForwardAuthFailing
+  expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
+```
+
+(Exact metric depends on nginx exporter configuration; it may need a custom approach like logging fallback hits and counting with promtail.)
+
+## Files to Modify
+
+| File | Change |
+|---|---|
+| `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout |
+| `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
+| `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment |
+| `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment |
+| `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain |
+| `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |
+
+## Out of Scope
+
+- Circuit breakers (per-service complexity not worth it for homelab)
+- Plugin pre-baking into Docker image (accepted risk)
+- Active health checks on backends (K8s readiness probes sufficient)
+
+## Rollback Plan
+
+Each change is independent and can be reverted individually:
+- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
+- PDBs: remove from Helm values
+- Timeouts: revert to `0s`
+- Retry middleware: remove from ingress_factory chain
+- Alerts: remove from Prometheus config