Traefik Resilience Hardening Design

Date: 2026-03-01
Status: Approved

Problem Statement

Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are fail-closed with unlimited timeouts. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.

Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.

Current State

Dependency Map (Request Path)

Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
  → rate-limit .................... IN-PROCESS
  → csp-headers ................... IN-PROCESS
  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
  → anti-ai-headers ............... IN-PROCESS
  → strip-accept-encoding ......... IN-PROCESS
  → anti-ai-trap-links (plugin) ... IN-PROCESS
  → [if protected=true]:
    → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
  → Backend Service

Risk Assessment

| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|---|---|---|---|---|
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |

Design

1. ForwardAuth Resilience (Nginx Resilience Proxies)

1a. AI Bot Block → Fail-Open

Deploy a small nginx reverse proxy in front of Poison Fountain:

  • Normal operation: proxies request to poison-fountain:8080/auth, returns its response
  • Poison Fountain down: nginx catches 502/503/504, returns 200 (allow all traffic)
  • The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work

Update the ai-bot-block ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.

Nginx config sketch:

upstream poison_fountain {
    server poison-fountain.poison-fountain.svc.cluster.local:8080;
}
server {
    listen 8080;
    location /auth {
        proxy_pass http://poison_fountain;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        proxy_intercept_errors on;
        error_page 502 503 504 =200 /fallback-allow;
    }
    location = /fallback-allow {
        return 200;
    }
    location /healthz {
        return 200 "ok";
    }
}

Deployment: 2 replicas, tier 0-core, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).
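
As a rough illustration of the deployment parameters and the middleware change described above, the manifests might look like the following. This is a sketch, not the project's actual Terraform. The names `ai-bot-block-proxy` and `ai-bot-block-proxy-config`, the `traefik` namespace for the middleware, the image tag, and the `ScheduleAnyway` setting are all assumptions. A matching Service is assumed but not shown, and the `traefik.io/v1alpha1` API group assumes a recent Traefik release (older installs use `traefik.containo.us/v1alpha1`).

```yaml
# Hypothetical manifests; names, namespaces, and image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-bot-block-proxy
  namespace: poison-fountain
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-bot-block-proxy
  template:
    metadata:
      labels:
        app: ai-bot-block-proxy
    spec:
      # Prefer placing the two replicas on different nodes.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: ai-bot-block-proxy
      containers:
        - name: nginx
          image: nginx:stable-alpine
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: config
          configMap:
            name: ai-bot-block-proxy-config  # holds the nginx config sketched above
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ai-bot-block
  namespace: traefik
spec:
  forwardAuth:
    # Points at the proxy Service instead of Poison Fountain directly.
    address: http://ai-bot-block-proxy.poison-fountain.svc.cluster.local:8080/auth
```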

1b. Authentik → BasicAuth Fallback

Deploy a similar nginx proxy in front of Authentik's outpost:

  • Normal operation: proxies to ak-outpost-...:9000, returns Authentik's response (SSO)
  • Authentik down: falls back to nginx auth_basic with htpasswd credentials from a Kubernetes secret
  • Protected services remain accessible to admins via basicAuth during Authentik outages

Update the authentik-forward-auth middleware to point at the nginx proxy.

Nginx config sketch:

upstream authentik {
    server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}
server {
    listen 9000;
    location /outpost.goauthentik.io/auth/traefik {
        proxy_pass http://authentik;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
        proxy_intercept_errors on;
        error_page 502 503 504 = @fallback_auth;
    }
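    location @fallback_auth {
        auth_basic "Emergency Access";
        auth_basic_user_file /etc/nginx/htpasswd;
        # Return 200 with required headers if basicAuth passes.
        # Do not use `return 200` here: `return` runs in the rewrite phase,
        # before auth_basic is checked in the access phase, so it would
        # allow every request. empty_gif is a content-phase handler that
        # answers GET/HEAD with 200 only after auth_basic has passed.
        add_header X-authentik-username $remote_user;
        empty_gif;
    }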
    location /healthz {
        return 200 "ok";
    }
}

htpasswd secret: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.
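
The secret, its mount, and the updated middleware could be sketched roughly as follows. Every name here (`auth-proxy-htpasswd`, `authentik-auth-proxy`, the `traefik` namespace) is hypothetical, the hash is a placeholder, and the `traefik.io/v1alpha1` API group assumes a recent Traefik release. The `authResponseHeaders` list matters for the fallback: Traefik only copies the headers listed there from the auth response onto the forwarded request, so `X-authentik-username` has to appear in it for backends to see the fallback user.

```yaml
# Hypothetical names throughout; the htpasswd value is a placeholder.
apiVersion: v1
kind: Secret
metadata:
  name: auth-proxy-htpasswd
  namespace: authentik
type: Opaque
stringData:
  htpasswd: "admin:<hash generated with the htpasswd tool>"
---
# Pod template fragment for the nginx proxy Deployment (not a full manifest).
spec:
  containers:
    - name: nginx
      volumeMounts:
        - name: htpasswd
          mountPath: /etc/nginx/htpasswd  # path used by auth_basic_user_file
          subPath: htpasswd
          readOnly: true
  volumes:
    - name: htpasswd
      secret:
        secretName: auth-proxy-htpasswd
---
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: authentik-forward-auth
  namespace: traefik
spec:
  forwardAuth:
    # Points at the proxy Service instead of the outpost directly.
    address: http://authentik-auth-proxy.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik
    trustForwardHeader: true
    authResponseHeaders:
      - X-authentik-username
```

One trade-off of the `subPath` mount: Kubernetes does not refresh it when the secret changes, so rotating the emergency credentials would require restarting the proxy pods.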

2. Pod Scheduling & Disruption Protection

2a. Traefik Topology Spread + PDB

Add to Traefik Helm values:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: traefik

podDisruptionBudget:
  enabled: true
  minAvailable: 2

2b. Authentik PDB

Add to Authentik Helm values:

server:
  pdb:
    enabled: true
    minAvailable: 2

2c. Poison Fountain Tier Bump

Change Poison Fountain namespace tier from 4-aux to 1-cluster:

  • File: stacks/poison-fountain/main.tf
  • Change: tier = local.tiers.aux → tier = local.tiers.cluster
  • Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)

3. Timeout & Backend Protection

3a. Response Header Timeout

Change from unlimited to 30s:

--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s

Prevents hung backends from holding Traefik goroutines indefinitely.
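
Depending on how the Traefik module passes static configuration, the flag could be supplied in either of the two forms below. This is a sketch that assumes the official Traefik Helm chart (which accepts extra CLI flags through `additionalArguments`) or a static configuration file; only one of the two would be used.

```yaml
# Option A: Traefik Helm values, passing the flag as an extra argument.
additionalArguments:
  - "--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s"
---
# Option B: equivalent setting in a static configuration file (traefik.yml).
serversTransport:
  forwardingTimeouts:
    responseHeaderTimeout: 30s
```

The timeout covers only the wait for response headers, so a backend that legitimately needs more than 30s before sending headers would start failing under this setting and may need its own transport configuration.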

3b. ForwardAuth Proxy Timeouts

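The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream does not accept a connection within 3s, or accepts it but sends nothing for 5s, nginx treats the request as failed and the fallback activates. This bounds the auth check at roughly 8s in the worst case, instead of letting every request hang for as long as the auth service does.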

3c. Retry Middleware

Add a retry middleware to the default chain in ingress_factory:

retry:
  attempts: 2
  initialInterval: 100ms

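Handles transient connection failures to backends that are restarting. Traefik's retry middleware only reissues a request when the backend does not reply at all (network errors). Once the backend returns any response, including a 502 or 503, the middleware stops retrying, so 5xx responses generated by the backend itself are not covered.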
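
For illustration, the retry settings above might be declared as a Middleware resource and attached to a route as shown below. The resource name `retry`, the `traefik` namespace, and the use of Ingress annotations by ingress_factory are assumptions. In practice the annotation would list the existing default middlewares as well, comma-separated.

```yaml
# Hypothetical Middleware resource; name and namespace are assumptions.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: retry
  namespace: traefik
spec:
  retry:
    attempts: 2
    initialInterval: 100ms
---
# Ingress fragment. Kubernetes CRD middlewares are referenced as
# <namespace>-<name>@kubernetescrd.
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: traefik-retry@kubernetescrd
```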

4. Monitoring & Alerting

4a. PoisonFountainDown Alert

- alert: PoisonFountainDown
  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"

4b. Alert Inhibition

When TraefikDown fires, suppress PoisonFountainDown.
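
In Alertmanager terms, this could be expressed as an inhibit rule along these lines. It is a sketch that assumes Alertmanager 0.22 or later (for the `source_matchers` / `target_matchers` syntax) and that both alerts carry exactly these `alertname` values. Where the fragment sits inside the chart values depends on the Prometheus chart in use.

```yaml
# Alertmanager configuration fragment.
inhibit_rules:
  - source_matchers:
      - alertname = TraefikDown
    target_matchers:
      - alertname = PoisonFountainDown
    # No `equal` labels: any firing TraefikDown suppresses PoisonFountainDown.
```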

4c. ForwardAuthFailing Alert

Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):

- alert: ForwardAuthFailing
  expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"

(Exact metric depends on nginx exporter configuration — may need a custom approach like logging fallback hits and counting with promtail.)
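
One way the log-based approach could look is a Promtail metrics stage that counts fallback log lines. This sketch makes several assumptions: the fallback locations are configured to write an access log line containing the marker `forwardauth_fallback` (hypothetical), the proxy pods carry an `app="ai-bot-block-proxy"` label (hypothetical), and Promtail is the log shipper.

```yaml
# Promtail scrape config fragment; label and log marker are assumptions.
pipeline_stages:
  - match:
      selector: '{app="ai-bot-block-proxy"} |= "forwardauth_fallback"'
      stages:
        - metrics:
            forwardauth_fallback_total:
              type: Counter
              description: "ForwardAuth requests answered by the nginx fallback"
              config:
                match_all: true
                action: inc
```

Promtail exposes custom metrics with a `promtail_custom_` prefix by default, so under these assumptions the alert expression would become something like `rate(promtail_custom_forwardauth_fallback_total[5m]) > 0`.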

Files to Modify

| File | Change |
|---|---|
| stacks/platform/modules/traefik/main.tf | Add topology spread, PDB, response header timeout |
| stacks/platform/modules/traefik/middleware.tf | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
| stacks/poison-fountain/main.tf | Change tier to 1-cluster, add resilience proxy deployment |
| stacks/platform/modules/authentik/main.tf | Add PDB, add auth resilience proxy deployment |
| modules/kubernetes/ingress_factory/main.tf | Add retry middleware to default chain |
| stacks/platform/modules/monitoring/prometheus_chart_values.tpl | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |

Out of Scope

  • Circuit breakers (per-service complexity not worth it for homelab)
  • Plugin pre-baking into Docker image (accepted risk)
  • Active health checks on backends (K8s readiness probes sufficient)

Rollback Plan

Each change is independent and can be reverted individually:

  • Resilience proxies: revert ForwardAuth addresses back to direct service URLs
  • PDBs: remove from Helm values
  • Timeouts: revert to 0s
  • Retry middleware: remove from ingress_factory chain
  • Alerts: remove from Prometheus config