# Traefik Resilience Hardening Design

**Date**: 2026-03-01
**Status**: Approved

## Problem Statement

Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.

Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.

## Current State

### Dependency Map (Request Path)

```
Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
  → rate-limit .................... IN-PROCESS
  → csp-headers ................... IN-PROCESS
  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
  → anti-ai-headers ............... IN-PROCESS
  → strip-accept-encoding ......... IN-PROCESS
  → anti-ai-trap-links (plugin) ... IN-PROCESS
  → [if protected=true]:
    → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
  → Backend Service
```

### Risk Assessment

| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|---|---|---|---|---|
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |

## Design

### 1. ForwardAuth Resilience (Nginx Resilience Proxies)

#### 1a. AI Bot Block → Fail-Open

Deploy a small nginx reverse proxy in front of Poison Fountain:
- Normal operation: proxies the request to `poison-fountain:8080/auth` and returns its response
- Poison Fountain down: nginx catches 502/503/504 and returns **200** (allow all traffic)
- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work

Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.

**Nginx config sketch:**
```nginx
upstream poison_fountain {
    server poison-fountain.poison-fountain.svc.cluster.local:8080;
}
server {
    listen 8080;
    location /auth {
        proxy_pass http://poison_fountain;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        # Also intercept 502/503/504 sent by the upstream itself,
        # not only nginx's own connect and timeout errors
        proxy_intercept_errors on;
        # Fail open: answer 200 from the fallback location instead
        error_page 502 503 504 =200 /fallback-allow;
    }
    location = /fallback-allow {
        return 200;
    }
    location /healthz {
        return 200 "ok";
    }
}
```

**Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).

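
The deployment parameters and the middleware change described above could translate into manifests along these lines. This is a minimal sketch, not the final Terraform: the names `ai-bot-block-proxy` and `ai-bot-block-proxy-config`, the `nginx:stable-alpine` image, the `traefik` namespace, the `traefik.io/v1alpha1` API group (older Traefik releases use a different group), and the matching Service for the proxy are all assumptions. The `0-core` tier is left out because it is assumed to come from the namespace tooling rather than the pod spec.

```yaml
# Sketch only: names, image, and namespaces are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-bot-block-proxy            # hypothetical name
  namespace: poison-fountain
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-bot-block-proxy
  template:
    metadata:
      labels:
        app: ai-bot-block-proxy
    spec:
      # Keep the two replicas on different nodes
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: ai-bot-block-proxy
      containers:
        - name: nginx
          image: nginx:stable-alpine   # assumed image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 10m
              memory: 16Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: config
          configMap:
            name: ai-bot-block-proxy-config   # hypothetical ConfigMap holding the nginx sketch
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ai-bot-block
  namespace: traefik                  # assumed namespace
spec:
  forwardAuth:
    # Address of the proxy's Service (not shown, assumed to share the Deployment name)
    address: http://ai-bot-block-proxy.poison-fountain.svc.cluster.local:8080/auth
```

Because only the `address` changes, rolling back means restoring the direct Poison Fountain URL.
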
#### 1b. Authentik → BasicAuth Fallback

Deploy a similar nginx proxy in front of Authentik's outpost:
- Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO)
- Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret
- Protected services remain accessible to admins via basicAuth during Authentik outages

Update the `authentik-forward-auth` middleware to point at the nginx proxy.

**Nginx config sketch:**
```nginx
upstream authentik {
    server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}
server {
    listen 9000;
    location /outpost.goauthentik.io/auth/traefik {
        proxy_pass http://authentik;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        proxy_intercept_errors on;
        # Hand upstream failures to the basicAuth fallback; its status is passed through
        error_page 502 503 504 = @fallback_auth;
    }
    location @fallback_auth {
        auth_basic "Emergency Access";
        auth_basic_user_file /etc/nginx/htpasswd;
        # Set the identity header once basicAuth passes
        add_header X-authentik-username $remote_user;
        # `return` runs in the rewrite phase, before auth_basic is checked, so
        # `return 200` here would allow requests without credentials.
        # try_files runs after the access phase; /nonexistent is a placeholder
        # path that never matches, so an authenticated request ends with an
        # empty 204 (still a 2xx for ForwardAuth). Verify before rollout.
        try_files /nonexistent =204;
    }
    location /healthz {
        return 200 "ok";
    }
}
```

**htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.

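
One way the htpasswd secret and its mount might look is sketched below. The secret name `auth-proxy-htpasswd`, the `authentik` namespace, and the `admin` user are assumptions, and the hash is a placeholder to be generated with the `htpasswd` tool in a format nginx supports.

```yaml
# Sketch only: secret name, namespace, and user are assumptions; the hash is a placeholder.
apiVersion: v1
kind: Secret
metadata:
  name: auth-proxy-htpasswd          # hypothetical name
  namespace: authentik
type: Opaque
stringData:
  # One "user:hash" line per admin
  htpasswd: |
    admin:<password-hash-placeholder>
---
# Fragment of the proxy pod spec (not a complete manifest)
spec:
  containers:
    - name: nginx
      volumeMounts:
        - name: htpasswd
          mountPath: /etc/nginx/htpasswd   # path referenced by auth_basic_user_file
          subPath: htpasswd
          readOnly: true
  volumes:
    - name: htpasswd
      secret:
        secretName: auth-proxy-htpasswd
```

A `subPath` mount is not refreshed when the secret changes, so rotating the emergency credentials would require restarting the proxy pods.
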
### 2. Pod Scheduling & Disruption Protection

#### 2a. Traefik Topology Spread + PDB

Add to Traefik Helm values:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: traefik

podDisruptionBudget:
  enabled: true
  minAvailable: 2
```

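
For reference, the PDB values above should correspond to an object roughly like the following. This is a sketch of the expected result, not output captured from the chart; the metadata name and namespace are assumptions.

```yaml
# Approximate PodDisruptionBudget for Traefik; name and namespace are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: traefik
  namespace: traefik
spec:
  # 3 replicas with minAvailable 2 allows at most 1 voluntary eviction at a time
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik
```

With `whenUnsatisfiable: DoNotSchedule`, an evicted pod may stay Pending until a node that satisfies the spread constraint is available, and the PDB then blocks further evictions until it is running again.
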
#### 2b. Authentik PDB

Add to Authentik Helm values:
```yaml
server:
  pdb:
    enabled: true
    minAvailable: 2
```

#### 2c. Poison Fountain Tier Bump

Change Poison Fountain namespace tier from `4-aux` to `1-cluster`:
- File: `stacks/poison-fountain/main.tf`
- Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster`
- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)

### 3. Timeout & Backend Protection

#### 3a. Response Header Timeout

Change the response header timeout from unlimited (`0s`) to 30s:
```
--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
```

This prevents hung backends from holding Traefik goroutines indefinitely.

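#### 3b. ForwardAuth Proxy Timeouts

The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream cannot be reached within 3s, or accepts the connection but sends nothing for 5s, the proxied request fails with a 502 or 504 and the fallback activates. This bounds the delay added during an outage to a few seconds, where a direct ForwardAuth call with no timeout could hang indefinitely.
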
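#### 3c. Retry Middleware

Add a `retry` middleware to the default chain in `ingress_factory`:
```yaml
retry:
  attempts: 2
  initialInterval: 100ms
```

Handles the transient 502/503 responses that clients would otherwise see when Traefik cannot connect to a backend that is restarting. Traefik's retry middleware reissues a request only when the backend does not reply (a network error). As soon as the backend answers, retrying stops regardless of the status code, so 5xx responses produced by the backend itself are not retried.
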
### 4. Monitoring & Alerting

#### 4a. PoisonFountainDown Alert

```yaml
- alert: PoisonFountainDown
  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
```

#### 4b. Alert Inhibition

When `TraefikDown` fires, suppress `PoisonFountainDown`.

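
In Alertmanager terms, the inhibition could be expressed as a single rule like the one below. This is a sketch: where the fragment lives inside `prometheus_chart_values.tpl` depends on the chart in use and is not shown, and it assumes both alerts reach the same Alertmanager.

```yaml
# Alertmanager configuration fragment (placement within the chart values is an assumption)
inhibit_rules:
  - source_matchers:
      - 'alertname = "TraefikDown"'
    target_matchers:
      - 'alertname = "PoisonFountainDown"'
```

No `equal` labels are listed, so any firing `TraefikDown` alert mutes `PoisonFountainDown` regardless of its other labels.
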
#### 4c. ForwardAuthFailing Alert

Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):

```yaml
- alert: ForwardAuthFailing
  # Match every status that triggers the fallback (502, 503, 504)
  expr: rate(nginx_upstream_responses_total{status_code=~"502|503|504"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
```

(The exact metric depends on the nginx exporter configuration. A custom approach may be needed, such as logging fallback hits and counting them with promtail.)

## Files to Modify

| File | Change |
|---|---|
| `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout |
| `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
| `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment |
| `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment |
| `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain |
| `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |

## Out of Scope

- Circuit breakers (per-service complexity not worth it for homelab)
- Plugin pre-baking into Docker image (accepted risk)
- Active health checks on backends (K8s readiness probes sufficient)

## Rollback Plan

Each change is independent and can be reverted individually:
- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
- PDBs: remove from Helm values
- Timeouts: revert to `0s`
- Retry middleware: remove from ingress_factory chain
- Alerts: remove from Prometheus config