[ci skip] add Traefik resilience hardening design doc
This commit is contained in:
parent
e728f4c106
commit
454a48c6ac
1 changed files with 237 additions and 0 deletions
237
docs/plans/2026-03-01-traefik-resilience-design.md
Normal file
237
docs/plans/2026-03-01-traefik-resilience-design.md
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
# Traefik Resilience Hardening Design
|
||||
|
||||
**Date**: 2026-03-01
|
||||
**Status**: Approved
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.
|
||||
|
||||
Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.
|
||||
|
||||
## Current State
|
||||
|
||||
### Dependency Map (Request Path)
|
||||
|
||||
```
|
||||
Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
|
||||
→ rate-limit .................... IN-PROCESS
|
||||
→ csp-headers ................... IN-PROCESS
|
||||
→ crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
|
||||
→ ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
|
||||
→ anti-ai-headers ............... IN-PROCESS
|
||||
→ strip-accept-encoding ......... IN-PROCESS
|
||||
→ anti-ai-trap-links (plugin) ... IN-PROCESS
|
||||
→ [if protected=true]:
|
||||
→ authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
|
||||
→ Backend Service
|
||||
```
|
||||
|
||||
### Risk Assessment
|
||||
|
||||
| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|
||||
|---|---|---|---|---|
|
||||
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
|
||||
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
|
||||
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
|
||||
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
|
||||
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
|
||||
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |
|
||||
|
||||
## Design
|
||||
|
||||
### 1. ForwardAuth Resilience (Nginx Resilience Proxies)
|
||||
|
||||
#### 1a. AI Bot Block → Fail-Open
|
||||
|
||||
Deploy a small nginx reverse proxy in front of Poison Fountain:
|
||||
- Normal operation: proxies request to `poison-fountain:8080/auth`, returns its response
|
||||
- Poison Fountain down: nginx catches 502/503/504, returns **200** (allow all traffic)
|
||||
- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work
|
||||
|
||||
Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.
|
||||
|
||||
**Nginx config sketch:**
|
||||
```nginx
|
||||
upstream poison_fountain {
|
||||
server poison-fountain.poison-fountain.svc.cluster.local:8080;
|
||||
}
|
||||
server {
|
||||
listen 8080;
|
||||
location /auth {
|
||||
proxy_pass http://poison_fountain;
|
||||
proxy_connect_timeout 3s;
|
||||
proxy_read_timeout 5s;
|
||||
proxy_intercept_errors on;
|
||||
error_page 502 503 504 =200 /fallback-allow;
|
||||
}
|
||||
location = /fallback-allow {
|
||||
return 200;
|
||||
}
|
||||
location /healthz {
|
||||
return 200 "ok";
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).
|
||||
|
||||
#### 1b. Authentik → BasicAuth Fallback
|
||||
|
||||
Deploy a similar nginx proxy in front of Authentik's outpost:
|
||||
- Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO)
|
||||
- Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret
|
||||
- Protected services remain accessible to admins via basicAuth during Authentik outages
|
||||
|
||||
Update the `authentik-forward-auth` middleware to point at the nginx proxy.
|
||||
|
||||
**Nginx config sketch:**
|
||||
```nginx
|
||||
upstream authentik {
|
||||
server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
|
||||
}
|
||||
server {
|
||||
listen 9000;
|
||||
location /outpost.goauthentik.io/auth/traefik {
|
||||
proxy_pass http://authentik;
|
||||
proxy_connect_timeout 3s;
|
||||
proxy_read_timeout 5s;
|
||||
proxy_intercept_errors on;
|
||||
error_page 502 503 504 = @fallback_auth;
|
||||
}
|
||||
location @fallback_auth {
|
||||
auth_basic "Emergency Access";
|
||||
auth_basic_user_file /etc/nginx/htpasswd;
|
||||
# Return 200 with required headers if basicAuth passes
|
||||
add_header X-authentik-username $remote_user;
|
||||
return 200;
|
||||
}
|
||||
location /healthz {
|
||||
return 200 "ok";
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.
|
||||
|
||||
### 2. Pod Scheduling & Disruption Protection
|
||||
|
||||
#### 2a. Traefik Topology Spread + PDB
|
||||
|
||||
Add to Traefik Helm values:
|
||||
```yaml
|
||||
topologySpreadConstraints:
|
||||
- maxSkew: 1
|
||||
topologyKey: kubernetes.io/hostname
|
||||
whenUnsatisfiable: DoNotSchedule
|
||||
labelSelector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: traefik
|
||||
|
||||
podDisruptionBudget:
|
||||
enabled: true
|
||||
minAvailable: 2
|
||||
```
|
||||
|
||||
#### 2b. Authentik PDB
|
||||
|
||||
Add to Authentik Helm values:
|
||||
```yaml
|
||||
server:
|
||||
pdb:
|
||||
enabled: true
|
||||
minAvailable: 2
|
||||
```
|
||||
|
||||
#### 2c. Poison Fountain Tier Bump
|
||||
|
||||
Change Poison Fountain namespace tier from `4-aux` to `1-cluster`:
|
||||
- File: `stacks/poison-fountain/main.tf`
|
||||
- Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster`
|
||||
- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)
|
||||
|
||||
### 3. Timeout & Backend Protection
|
||||
|
||||
#### 3a. Response Header Timeout
|
||||
|
||||
Change from unlimited to 30s:
|
||||
```
|
||||
--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
|
||||
```
|
||||
|
||||
Prevents hung backends from holding Traefik goroutines indefinitely.
|
||||
|
||||
#### 3b. ForwardAuth Proxy Timeouts
|
||||
|
||||
The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream doesn't respond within 5s, the fallback activates. This is much faster than waiting for the backend to eventually time out.
|
||||
|
||||
#### 3c. Retry Middleware
|
||||
|
||||
Add a `retry` middleware to the default chain in ingress_factory:
|
||||
```yaml
|
||||
retry:
|
||||
attempts: 2
|
||||
initialInterval: 100ms
|
||||
```
|
||||
|
||||
Handles transient 502/503 from backends that are restarting. Only retries on network errors and 5xx.
|
||||
|
||||
### 4. Monitoring & Alerting
|
||||
|
||||
#### 4a. PoisonFountainDown Alert
|
||||
|
||||
```yaml
|
||||
- alert: PoisonFountainDown
|
||||
expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
|
||||
```
|
||||
|
||||
#### 4b. Alert Inhibition
|
||||
|
||||
When `TraefikDown` fires, suppress `PoisonFountainDown`.
|
||||
|
||||
#### 4c. ForwardAuthFailing Alert
|
||||
|
||||
Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):
|
||||
|
||||
```yaml
|
||||
- alert: ForwardAuthFailing
|
||||
expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
|
||||
```
|
||||
|
||||
(Exact metric depends on nginx exporter configuration — may need a custom approach like logging fallback hits and counting with promtail.)
|
||||
|
||||
## Files to Modify
|
||||
|
||||
| File | Change |
|
||||
|---|---|
|
||||
| `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout |
|
||||
| `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
|
||||
| `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment |
|
||||
| `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment |
|
||||
| `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain |
|
||||
| `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |
|
||||
|
||||
## Out of Scope
|
||||
|
||||
- Circuit breakers (per-service complexity not worth it for homelab)
|
||||
- Plugin pre-baking into Docker image (accepted risk)
|
||||
- Active health checks on backends (K8s readiness probes sufficient)
|
||||
|
||||
## Rollback Plan
|
||||
|
||||
Each change is independent and can be reverted individually:
|
||||
- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
|
||||
- PDBs: remove from Helm values
|
||||
- Timeouts: revert to `0s`
|
||||
- Retry middleware: remove from ingress_factory chain
|
||||
- Alerts: remove from Prometheus config
|
||||
Loading…
Add table
Add a link
Reference in a new issue