Traefik Resilience Hardening Design
Date: 2026-03-01 Status: Approved
Problem Statement
Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are fail-closed with unlimited timeouts. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.
Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.
Current State
Dependency Map (Request Path)
Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
  → rate-limit .................... IN-PROCESS
  → csp-headers ................... IN-PROCESS
  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
  → anti-ai-headers ............... IN-PROCESS
  → strip-accept-encoding ......... IN-PROCESS
  → anti-ai-trap-links (plugin) ... IN-PROCESS
  → [if protected=true]:
      → authentik-forward-auth .... FAIL-CLOSED ✗ (Authentik outpost)
  → Backend Service
Risk Assessment
| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|---|---|---|---|---|
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |
Design
1. ForwardAuth Resilience (Nginx Resilience Proxies)
1a. AI Bot Block → Fail-Open
Deploy a small nginx reverse proxy in front of Poison Fountain:
- Normal operation: proxies request to poison-fountain:8080/auth, returns its response
- Poison Fountain down: nginx catches 502/503/504, returns 200 (allow all traffic)
- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work
Update the ai-bot-block ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.
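A minimal sketch of what the updated middleware could look like if it is expressed as a Traefik Middleware resource. The proxy Service name is hypothetical, and the `traefik.io/v1alpha1` API group assumes a recent Traefik release (older releases use `traefik.containo.us/v1alpha1`); the real definition lives in middleware.tf and may be shaped differently.

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ai-bot-block
spec:
  forwardAuth:
    # Hypothetical Service name for the nginx resilience proxy.
    # Previously this pointed straight at poison-fountain:8080/auth.
    address: http://poison-fountain-resilience-proxy.poison-fountain.svc.cluster.local:8080/auth
```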
Nginx config sketch:
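upstream poison_fountain {
    server poison-fountain.poison-fountain.svc.cluster.local:8080;
}

server {
    listen 8080;

    location /auth {
        proxy_pass http://poison_fountain;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        # Also treat 502/503/504 responses sent by Poison Fountain itself as failures.
        proxy_intercept_errors on;
        # Fail open: upstream unreachable, timed out, or erroring → allow the request.
        error_page 502 503 504 =200 /fallback-allow;
    }

    location = /fallback-allow {
        return 200;
    }

    # Probe endpoint for the proxy itself; does not touch the upstream.
    location /healthz {
        return 200 "ok";
    }
}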
Deployment: 2 replicas, tier 0-core, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).
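The deployment parameters above could translate into a manifest along these lines. This is a sketch only: the resource name, labels, image, and ConfigMap name are assumptions, the 10m/16Mi figures are treated as requests, and the tier 0-core priority settings are left out because their concrete names are not given here.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: poison-fountain-resilience-proxy   # hypothetical name
  namespace: poison-fountain
spec:
  replicas: 2
  selector:
    matchLabels:
      app: poison-fountain-resilience-proxy
  template:
    metadata:
      labels:
        app: poison-fountain-resilience-proxy
    spec:
      # Keep the two replicas on different nodes.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: poison-fountain-resilience-proxy
      containers:
        - name: nginx
          image: nginx:stable-alpine   # assumed image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: config
          configMap:
            name: poison-fountain-resilience-proxy   # assumed ConfigMap holding the server block
```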
1b. Authentik → BasicAuth Fallback
Deploy a similar nginx proxy in front of Authentik's outpost:
- Normal operation: proxies to ak-outpost-...:9000, returns Authentik's response (SSO)
- Authentik down: falls back to nginx auth_basic with htpasswd credentials from a Kubernetes secret
- Protected services remain accessible to admins via basicAuth during Authentik outages
Update the authentik-forward-auth middleware to point at the nginx proxy.
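As a sketch, the updated middleware might look like the following when written as a Traefik Middleware resource. The proxy Service name is hypothetical and the API group assumes a recent Traefik release. `authResponseHeaders` is listed because the fallback path only helps backends if the X-authentik-username header is copied onto the forwarded request.

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
spec:
  forwardAuth:
    # Hypothetical Service name for the nginx resilience proxy.
    address: http://authentik-resilience-proxy.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
```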
Nginx config sketch:
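upstream authentik {
    server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}

server {
    listen 9000;

    location /outpost.goauthentik.io/auth/traefik {
        proxy_pass http://authentik;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        proxy_intercept_errors on;
        error_page 502 503 504 = @fallback_auth;
    }

    location @fallback_auth {
        auth_basic "Emergency Access";
        auth_basic_user_file /etc/nginx/htpasswd;
        # Set the header Authentik would normally provide once basicAuth passes.
        add_header X-authentik-username $remote_user;
        # Do not use "return 200" here: return runs in the rewrite phase, before
        # auth_basic is checked, and would let unauthenticated requests through.
        # Proxying to the loopback helper below produces the 200 in the content phase.
        proxy_pass http://127.0.0.1:9001;
    }

    location /healthz {
        return 200 "ok";
    }
}

# Loopback-only helper that answers 200 for requests that already passed basicAuth.
server {
    listen 127.0.0.1:9001;

    location / {
        return 200;
    }
}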
htpasswd secret: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.
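One possible shape for the secret and its mount, shown as plain manifests for illustration. The secret name, volume name, and placeholder entry are assumptions; only the /etc/nginx/htpasswd path comes from the config sketch above.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: authentik-fallback-htpasswd   # hypothetical name
  namespace: authentik
type: Opaque
stringData:
  # One "user:hash" entry per line, generated with an htpasswd-compatible tool.
  htpasswd: |
    admin:<password-hash>
---
# Fragment of the proxy pod spec (not a complete manifest):
# mounts the single key as the file nginx reads.
volumes:
  - name: htpasswd
    secret:
      secretName: authentik-fallback-htpasswd
containers:
  - name: nginx
    volumeMounts:
      - name: htpasswd
        mountPath: /etc/nginx/htpasswd
        subPath: htpasswd
        readOnly: true
```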
2. Pod Scheduling & Disruption Protection
2a. Traefik Topology Spread + PDB
Add to Traefik Helm values:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: traefik
podDisruptionBudget:
  enabled: true
  minAvailable: 2
2b. Authentik PDB
Add to Authentik Helm values:
server:
  pdb:
    enabled: true
    minAvailable: 2
2c. Poison Fountain Tier Bump
Change Poison Fountain namespace tier from 4-aux to 1-cluster:
- File: stacks/poison-fountain/main.tf
- Change: tier = local.tiers.aux → tier = local.tiers.cluster
- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)
3. Timeout & Backend Protection
3a. Response Header Timeout
Change from unlimited to 30s:
--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
Prevents hung backends from holding Traefik goroutines indefinitely.
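If the flag is passed through the Traefik Helm chart, it would typically go under `additionalArguments`. This is a sketch that assumes the chart values are set directly; the Terraform module may template them differently.

```yaml
additionalArguments:
  # Static-configuration flag; changing it restarts the Traefik pods.
  - "--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s"
```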
3b. ForwardAuth Proxy Timeouts
The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream cannot be reached within 3s, or sends no data for 5s, nginx raises a 504 and the fallback activates. This is much faster than waiting for the backend to eventually time out.
3c. Retry Middleware
Add a retry middleware to the default chain in ingress_factory:
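retry:
  attempts: 2
  initialInterval: 100ms
Handles transient connection failures to backends that are restarting. Traefik's retry middleware only reissues a request when the backend cannot be reached; once the backend answers, the response is passed through regardless of status, so 5xx responses are not retried.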
4. Monitoring & Alerting
4a. PoisonFountainDown Alert
- alert: PoisonFountainDown
  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
4b. Alert Inhibition
When TraefikDown fires, suppress PoisonFountainDown.
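In Alertmanager terms, this could be expressed as an inhibit rule like the sketch below. It assumes an Alertmanager version that supports `source_matchers`/`target_matchers` and that both alerts are identified by their `alertname` label alone.

```yaml
inhibit_rules:
  # While TraefikDown is firing, mute PoisonFountainDown notifications.
  - source_matchers:
      - 'alertname = "TraefikDown"'
    target_matchers:
      - 'alertname = "PoisonFountainDown"'
```

No `equal` list is set, so the rule applies regardless of any other labels the two alerts carry.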
4c. ForwardAuthFailing Alert
Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):
- alert: ForwardAuthFailing
  expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
(The exact metric depends on the nginx exporter configuration; a custom approach, such as logging fallback hits and counting them with promtail, may be needed.)
Files to Modify
| File | Change |
|---|---|
| stacks/platform/modules/traefik/main.tf | Add topology spread, PDB, response header timeout |
| stacks/platform/modules/traefik/middleware.tf | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
| stacks/poison-fountain/main.tf | Change tier to 1-cluster, add resilience proxy deployment |
| stacks/platform/modules/authentik/main.tf | Add PDB, add auth resilience proxy deployment |
| modules/kubernetes/ingress_factory/main.tf | Add retry middleware to default chain |
| stacks/platform/modules/monitoring/prometheus_chart_values.tpl | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |
Out of Scope
- Circuit breakers (per-service complexity not worth it for homelab)
- Plugin pre-baking into Docker image (accepted risk)
- Active health checks on backends (K8s readiness probes sufficient)
Rollback Plan
Each change is independent and can be reverted individually:
- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
- PDBs: remove from Helm values
- Timeouts: revert to 0s
- Retry middleware: remove from ingress_factory chain
- Alerts: remove from Prometheus config