[ci skip] add Traefik resilience hardening design doc

Viktor Barzin 2026-03-01 13:50:00 +00:00
parent e728f4c106
commit 454a48c6ac

@ -0,0 +1,237 @@
# Traefik Resilience Hardening Design
**Date**: 2026-03-01
**Status**: Approved
## Problem Statement
Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.
Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.
## Current State
### Dependency Map (Request Path)
```
Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
  → rate-limit .................... IN-PROCESS
  → csp-headers ................... IN-PROCESS
  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
  → anti-ai-headers ............... IN-PROCESS
  → strip-accept-encoding ......... IN-PROCESS
  → anti-ai-trap-links (plugin) ... IN-PROCESS
  → [if protected=true]:
      → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
  → Backend Service
```
### Risk Assessment
| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|---|---|---|---|---|
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |
## Design
### 1. ForwardAuth Resilience (Nginx Resilience Proxies)
#### 1a. AI Bot Block → Fail-Open
Deploy a small nginx reverse proxy in front of Poison Fountain:
- Normal operation: proxies request to `poison-fountain:8080/auth`, returns its response
- Poison Fountain down: nginx catches 502/503/504, returns **200** (allow all traffic)
- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work
Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.
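As a rough sketch of that middleware change, the ForwardAuth address could be switched as below. The Service name `poison-fountain-resilience-proxy`, the `traefik` namespace, and the CRD API version are assumptions for illustration. The repository defines middlewares in Terraform, so the real change would be the equivalent edit in `middleware.tf`.

```yaml
# Hypothetical manifest: only the forwardAuth address changes.
apiVersion: traefik.io/v1alpha1   # older Traefik v2 releases use traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: ai-bot-block
  namespace: traefik              # assumption: namespace where the middleware lives
spec:
  forwardAuth:
    # Previously pointed directly at poison-fountain:8080/auth.
    # The Service name below is a placeholder for the nginx resilience proxy.
    address: http://poison-fountain-resilience-proxy.poison-fountain.svc.cluster.local:8080/auth
```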
**Nginx config sketch:**
```nginx
upstream poison_fountain {
    server poison-fountain.poison-fountain.svc.cluster.local:8080;
}
server {
    listen 8080;
    location /auth {
        proxy_pass http://poison_fountain;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        # Upstream failures and timeouts (502/503/504) are rewritten to 200 (fail-open)
        proxy_intercept_errors on;
        error_page 502 503 504 =200 /fallback-allow;
    }
    location = /fallback-allow {
        return 200;
    }
    # Probe endpoint for the proxy itself; does not touch the upstream
    location /healthz {
        return 200 "ok";
    }
}
```
**Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).
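A minimal Deployment sketch matching those parameters might look like the following. The resource names, the image tag, the ConfigMap carrying the nginx config, and the choice to express 10m/16Mi as requests are all assumptions. Tier assignment is omitted because the mechanism for it is not described in this document.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: poison-fountain-resilience-proxy        # hypothetical name
  namespace: poison-fountain
spec:
  replicas: 2
  selector:
    matchLabels:
      app: poison-fountain-resilience-proxy
  template:
    metadata:
      labels:
        app: poison-fountain-resilience-proxy
    spec:
      # Keep the two replicas on different nodes
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: poison-fountain-resilience-proxy
      containers:
        - name: nginx
          image: nginx:stable-alpine             # assumption: any recent nginx image
          ports:
            - containerPort: 8080
          resources:
            requests:                            # assumption: values applied as requests
              cpu: 10m
              memory: 16Mi
          readinessProbe:
            httpGet:
              path: /healthz                     # served by nginx itself, not the upstream
              port: 8080
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: config
          configMap:
            name: poison-fountain-resilience-proxy   # hypothetical ConfigMap with the server block
```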
#### 1b. Authentik → BasicAuth Fallback
Deploy a similar nginx proxy in front of Authentik's outpost:
- Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO)
- Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret
- Protected services remain accessible to admins via basicAuth during Authentik outages
Update the `authentik-forward-auth` middleware to point at the nginx proxy.
**Nginx config sketch:**
```nginx
upstream authentik {
    server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}
server {
    listen 9000;
    location /outpost.goauthentik.io/auth/traefik {
        proxy_pass http://authentik;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        proxy_intercept_errors on;
        error_page 502 503 504 = @fallback_auth;
    }
    location @fallback_auth {
        auth_basic "Emergency Access";
        auth_basic_user_file /etc/nginx/htpasswd;
        # `return` runs in the rewrite phase, before auth_basic is checked,
        # so returning 200 directly here would skip the credential check.
        # try_files runs after the access phase; the path is assumed not to exist.
        try_files /__nonexistent__ @fallback_ok;
    }
    location @fallback_ok {
        # Reached only after basic auth has passed
        add_header X-authentik-username $remote_user;
        return 200;
    }
    location /healthz {
        return 200 "ok";
    }
}
```
**htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.
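One possible shape for that secret and its mount is sketched below. The secret name, the key name, and the hash placeholder are illustrative assumptions. The second document is a pod spec fragment, not a complete manifest.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: auth-proxy-htpasswd        # hypothetical name
  namespace: authentik
type: Opaque
stringData:
  # Placeholder entry; generate the real line with the htpasswd tool
  htpasswd: |
    admin:<password-hash>
---
# Fragment of the nginx pod spec (assumed container and volume names)
containers:
  - name: nginx
    volumeMounts:
      - name: htpasswd
        mountPath: /etc/nginx/htpasswd   # path referenced by auth_basic_user_file
        subPath: htpasswd
        readOnly: true
volumes:
  - name: htpasswd
    secret:
      secretName: auth-proxy-htpasswd
```

Files mounted with `subPath` are not refreshed when the secret changes, so rotating the credentials would require restarting the proxy pods.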
### 2. Pod Scheduling & Disruption Protection
#### 2a. Traefik Topology Spread + PDB
Add to Traefik Helm values:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: traefik
podDisruptionBudget:
  enabled: true
  minAvailable: 2
```
#### 2b. Authentik PDB
Add to Authentik Helm values:
```yaml
server:
  pdb:
    enabled: true
    minAvailable: 2
```
#### 2c. Poison Fountain Tier Bump
Change Poison Fountain namespace tier from `4-aux` to `1-cluster`:
- File: `stacks/poison-fountain/main.tf`
- Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster`
- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)
### 3. Timeout & Backend Protection
#### 3a. Response Header Timeout
Change from unlimited to 30s:
```
--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
```
Prevents hung backends from holding Traefik goroutines indefinitely.
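If the flag is passed through the Traefik Helm chart, it could be expressed in the values as below. This assumes the chart's `additionalArguments` list is how extra static-configuration flags are set in this repository.

```yaml
# Traefik Helm values fragment (assumed mechanism for passing CLI flags)
additionalArguments:
  - "--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s"
```

The timeout covers only the wait for response headers, so responses that start promptly and then stream for a long time should be unaffected.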
#### 3b. ForwardAuth Proxy Timeouts
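The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream does not accept a connection within 3s, or stalls for more than 5s while nginx waits to read its response, nginx treats the request as failed (504) and the fallback activates. This is much faster than waiting for the backend to eventually time out.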
#### 3c. Retry Middleware
Add a `retry` middleware to the default chain in ingress_factory:
```yaml
retry:
  attempts: 2
  initialInterval: 100ms
```
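Handles transient connection failures to backends that are restarting. Traefik's retry middleware reissues a request only when the backend does not reply (a network error). As soon as the backend answers, it stops retrying regardless of the status code, so 5xx responses returned by a backend are not retried.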
### 4. Monitoring & Alerting
#### 4a. PoisonFountainDown Alert
```yaml
- alert: PoisonFountainDown
  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
```
#### 4b. Alert Inhibition
When `TraefikDown` fires, suppress `PoisonFountainDown`.
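A sketch of the corresponding Alertmanager rule, assuming both alerts are identified by their `alertname` label and that Alertmanager is recent enough to support the `source_matchers` / `target_matchers` syntax:

```yaml
inhibit_rules:
  # While TraefikDown is firing, mute PoisonFountainDown notifications
  - source_matchers:
      - 'alertname = "TraefikDown"'
    target_matchers:
      - 'alertname = "PoisonFountainDown"'
```

No `equal` list is given, so the inhibition applies whenever the source alert fires, regardless of other labels.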
#### 4c. ForwardAuthFailing Alert
Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):
```yaml
- alert: ForwardAuthFailing
  expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
```
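(The exact metric depends on the nginx exporter configuration and may need a custom approach, such as logging fallback hits and counting them with promtail. Upstream timeouts surface as 504 rather than 502, so the final expression should cover 502, 503 and 504.)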
## Files to Modify
| File | Change |
|---|---|
| `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout |
| `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
| `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment |
| `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment |
| `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain |
| `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |
## Out of Scope
- Circuit breakers (per-service complexity not worth it for homelab)
- Plugin pre-baking into Docker image (accepted risk)
- Active health checks on backends (K8s readiness probes sufficient)
## Rollback Plan
Each change is independent and can be reverted individually:
- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
- PDBs: remove from Helm values
- Timeouts: revert to `0s`
- Retry middleware: remove from ingress_factory chain
- Alerts: remove from Prometheus config