# Traefik Resilience Hardening Design **Date**: 2026-03-01 **Status**: Approved ## Problem Statement Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely. Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures. ## Current State ### Dependency Map (Request Path) ``` Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas) → rate-limit .................... IN-PROCESS → csp-headers ................... IN-PROCESS → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient) → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain) → anti-ai-headers ............... IN-PROCESS → strip-accept-encoding ......... IN-PROCESS → anti-ai-trap-links (plugin) ... IN-PROCESS → [if protected=true]: → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost) → Backend Service ``` ### Risk Assessment | Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation | |---|---|---|---|---| | Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE | | Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only | | CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured | | Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE | | Pod scheduling | All on same node possible | ALL services | Medium | NONE | | Node drain | Can evict all replicas | ALL services | During maintenance | NONE | ## Design ### 1. ForwardAuth Resilience (Nginx Resilience Proxies) #### 1a. AI Bot Block → Fail-Open Deploy a small nginx reverse proxy in front of Poison Fountain: - Normal operation: proxies request to `poison-fountain:8080/auth`, returns its response - Poison Fountain down: nginx catches 502/503/504, returns **200** (allow all traffic) - The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain. **Nginx config sketch:** ```nginx upstream poison_fountain { server poison-fountain.poison-fountain.svc.cluster.local:8080; } server { listen 8080; location /auth { proxy_pass http://poison_fountain; proxy_connect_timeout 3s; proxy_read_timeout 5s; proxy_intercept_errors on; error_page 502 503 504 =200 /fallback-allow; } location = /fallback-allow { return 200; } location /healthz { return 200 "ok"; } } ``` **Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory). #### 1b. Authentik → BasicAuth Fallback Deploy a similar nginx proxy in front of Authentik's outpost: - Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO) - Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret - Protected services remain accessible to admins via basicAuth during Authentik outages Update the `authentik-forward-auth` middleware to point at the nginx proxy. **Nginx config sketch:** ```nginx upstream authentik { server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000; } server { listen 9000; location /outpost.goauthentik.io/auth/traefik { proxy_pass http://authentik; proxy_connect_timeout 3s; proxy_read_timeout 5s; proxy_intercept_errors on; error_page 502 503 504 = @fallback_auth; } location @fallback_auth { auth_basic "Emergency Access"; auth_basic_user_file /etc/nginx/htpasswd; # Return 200 with required headers if basicAuth passes add_header X-authentik-username $remote_user; return 200; } location /healthz { return 200 "ok"; } } ``` **htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod. ### 2. Pod Scheduling & Disruption Protection #### 2a. Traefik Topology Spread + PDB Add to Traefik Helm values: ```yaml topologySpreadConstraints: - maxSkew: 1 topologyKey: kubernetes.io/hostname whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: app.kubernetes.io/name: traefik podDisruptionBudget: enabled: true minAvailable: 2 ``` #### 2b. Authentik PDB Add to Authentik Helm values: ```yaml server: pdb: enabled: true minAvailable: 2 ``` #### 2c. Poison Fountain Tier Bump Change Poison Fountain namespace tier from `4-aux` to `1-cluster`: - File: `stacks/poison-fountain/main.tf` - Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster` - Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi) ### 3. Timeout & Backend Protection #### 3a. Response Header Timeout Change from unlimited to 30s: ``` --serversTransport.forwardingTimeouts.responseHeaderTimeout=30s ``` Prevents hung backends from holding Traefik goroutines indefinitely. #### 3b. ForwardAuth Proxy Timeouts The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream doesn't respond within 5s, the fallback activates. This is much faster than waiting for the backend to eventually time out. #### 3c. Retry Middleware Add a `retry` middleware to the default chain in ingress_factory: ```yaml retry: attempts: 2 initialInterval: 100ms ``` Handles transient 502/503 from backends that are restarting. Only retries on network errors and 5xx. ### 4. Monitoring & Alerting #### 4a. PoisonFountainDown Alert ```yaml - alert: PoisonFountainDown expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0 for: 2m labels: severity: critical annotations: summary: "Poison Fountain is down - AI bot blocking degraded to fail-open" ``` #### 4b. Alert Inhibition When `TraefikDown` fires, suppress `PoisonFountainDown`. #### 4c. ForwardAuthFailing Alert Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down): ```yaml - alert: ForwardAuthFailing expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0 for: 5m labels: severity: warning annotations: summary: "ForwardAuth fallback active - check Authentik/Poison Fountain" ``` (Exact metric depends on nginx exporter configuration — may need a custom approach like logging fallback hits and counting with promtail.) ## Files to Modify | File | Change | |---|---| | `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout | | `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware | | `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment | | `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment | | `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain | | `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition | ## Out of Scope - Circuit breakers (per-service complexity not worth it for homelab) - Plugin pre-baking into Docker image (accepted risk) - Active health checks on backends (K8s readiness probes sufficient) ## Rollback Plan Each change is independent and can be reverted individually: - Resilience proxies: revert ForwardAuth addresses back to direct service URLs - PDBs: remove from Helm values - Timeouts: revert to `0s` - Retry middleware: remove from ingress_factory chain - Alerts: remove from Prometheus config