Traefik Resilience Hardening Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Make Traefik resilient against downstream dependency failures (ForwardAuth services, hung backends) while preventing pod scheduling and disruption issues.
Architecture: Deploy nginx resilience proxies in front of fail-closed ForwardAuth services (Poison Fountain, Authentik), add PodDisruptionBudgets, topology spread constraints, response timeouts, retry middleware, and monitoring alerts.
Tech Stack: Terraform/Terragrunt, Kubernetes, Nginx, Traefik CRDs, Prometheus
Task 1: Bump Poison Fountain tier from aux to cluster
This is the simplest change and has no dependencies. Moving Poison Fountain to the cluster tier gives it that tier's priority class and LimitRange defaults, which makes it far less likely to be evicted under memory pressure.
Files:
- Modify: stacks/poison-fountain/main.tf:10 (namespace tier label)
- Modify: stacks/poison-fountain/main.tf:52 (deployment tier label)
Step 1: Change namespace tier
In stacks/poison-fountain/main.tf, line 10, change:
tier = local.tiers.aux
to:
tier = local.tiers.cluster
Step 2: Change deployment tier label
In stacks/poison-fountain/main.tf, line 52, change:
tier = local.tiers.aux
to:
tier = local.tiers.cluster
Step 3: Verify the plan
Run:
cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30
Expected: Plan shows namespace and deployment label changes from 4-aux to 1-cluster. No resource destruction.
Step 4: Apply
Run:
cd stacks/poison-fountain && terragrunt apply --non-interactive
Step 5: Verify the new LimitRange and PriorityClass
Run:
kubectl --kubeconfig $(pwd)/config describe limitrange tier-defaults -n poison-fountain
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -o jsonpath='{.items[*].spec.priorityClassName}'
Expected: LimitRange shows 1-cluster defaults (512Mi default memory, max 4Gi). Priority class is tier-1-cluster.
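The priority class check in Step 5 can also be scripted. The following is a minimal sketch, assuming the pod list comes from `kubectl get pods -n poison-fountain -o json`; the function name `pods_without_priority` is illustrative and not part of the repository.

```python
def pods_without_priority(pod_list: dict, expected: str) -> list:
    """Return the names of pods whose priorityClassName differs from `expected`.

    `pod_list` is the parsed JSON document printed by `kubectl get pods -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("spec", {}).get("priorityClassName") != expected
    ]
```

An empty result for `expected="tier-1-cluster"` means every pod has picked up the new tier. Pods created before the apply keep their old priority class until they are recreated, so a non-empty result right after Step 4 may only indicate that a rollout has not happened yet.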
Step 6: Commit
git add stacks/poison-fountain/main.tf
git commit -m "[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress)"
Task 2: Deploy bot-block resilience proxy (nginx fail-open in front of Poison Fountain)
Deploy an nginx reverse proxy in the traefik namespace that proxies to Poison Fountain's /auth endpoint and returns 200 (allow) if Poison Fountain is unreachable.
Files:
- Modify: stacks/platform/modules/traefik/main.tf (add nginx deployment, service, configmap)
- Modify: stacks/platform/modules/traefik/middleware.tf:287 (update ai-bot-block ForwardAuth address)
Step 1: Add nginx configmap for bot-block proxy
Append to the end of stacks/platform/modules/traefik/main.tf:
# Resilience proxy for ai-bot-block ForwardAuth
# Returns 200 (allow all) when Poison Fountain is unreachable
resource "kubernetes_config_map" "bot_block_proxy_config" {
metadata {
name = "bot-block-proxy-config"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
data = {
"default.conf" = <<-EOT
upstream poison_fountain {
server poison-fountain.poison-fountain.svc.cluster.local:8080;
}
server {
listen 8080;
location /auth {
proxy_pass http://poison_fountain;
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
proxy_intercept_errors on;
error_page 502 503 504 =200 /fallback-allow;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
location = /fallback-allow {
internal;
return 200 "allowed";
}
location /healthz {
access_log off;
return 200 "ok";
}
}
EOT
}
}
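To make the fail-open rule in this config concrete, here is a small Python model of which status the proxy hands back to Traefik. It is only a sketch of the `error_page 502 503 504 =200` behaviour, not nginx itself; `proxy_status` and its parameters are hypothetical names.

```python
from typing import Optional

# Statuses rewritten by `error_page 502 503 504 =200 /fallback-allow`.
INTERCEPTED = {502, 503, 504}


def proxy_status(upstream_status: Optional[int], timed_out: bool = False) -> int:
    """Model the status the bot-block proxy returns to Traefik.

    `upstream_status` is None when Poison Fountain sent no response at all.
    """
    if upstream_status is None:
        # nginx generates 504 on a timeout and 502 on a failed connection.
        upstream_status = 504 if timed_out else 502
    # Intercepted errors become 200 (allow); every other status passes through.
    return 200 if upstream_status in INTERCEPTED else upstream_status
```

Under this model an unreachable or timed-out Poison Fountain yields 200, while a 403 (block) or a 500 from a running Poison Fountain is passed through unchanged. A 500 therefore still denies the request, because only 502, 503 and 504 are listed in `error_page`.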
Step 2: Add nginx deployment for bot-block proxy
Add after the configmap:
resource "kubernetes_deployment" "bot_block_proxy" {
metadata {
name = "bot-block-proxy"
namespace = kubernetes_namespace.traefik.metadata[0].name
labels = {
app = "bot-block-proxy"
}
}
spec {
replicas = 2
strategy {
type = "RollingUpdate"
rolling_update {
max_unavailable = 0
max_surge = 1
}
}
selector {
match_labels = {
app = "bot-block-proxy"
}
}
template {
metadata {
labels = {
app = "bot-block-proxy"
}
}
spec {
topology_spread_constraint {
max_skew = 1
topology_key = "kubernetes.io/hostname"
when_unsatisfiable = "DoNotSchedule"
label_selector {
match_labels = {
app = "bot-block-proxy"
}
}
}
container {
name = "nginx"
image = "nginx:1-alpine"
port {
container_port = 8080
}
volume_mount {
name = "config"
mount_path = "/etc/nginx/conf.d"
read_only = true
}
liveness_probe {
http_get {
path = "/healthz"
port = 8080
}
initial_delay_seconds = 3
period_seconds = 10
}
readiness_probe {
http_get {
path = "/healthz"
port = 8080
}
initial_delay_seconds = 2
period_seconds = 5
}
resources {
requests = {
cpu = "5m"
memory = "16Mi"
}
limits = {
cpu = "50m"
memory = "32Mi"
}
}
}
volume {
name = "config"
config_map {
name = kubernetes_config_map.bot_block_proxy_config.metadata[0].name
}
}
}
}
}
}
resource "kubernetes_service" "bot_block_proxy" {
metadata {
name = "bot-block-proxy"
namespace = kubernetes_namespace.traefik.metadata[0].name
labels = {
app = "bot-block-proxy"
}
}
spec {
selector = {
app = "bot-block-proxy"
}
port {
name = "http"
port = 8080
target_port = 8080
}
}
}
Step 3: Update ai-bot-block ForwardAuth address
In stacks/platform/modules/traefik/middleware.tf, line 287, change:
address = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"
to:
address = "http://bot-block-proxy.traefik.svc.cluster.local:8080/auth"
Step 4: Plan and verify
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"
Expected: 3 resources created (configmap, deployment, service), 1 resource updated (ai-bot-block middleware).
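The expected counts can be checked mechanically against the summary line that Terraform prints at the end of a plan (`Plan: 3 to add, 1 to change, 0 to destroy.`). The helper below is a sketch; `plan_counts` is a hypothetical name, and it assumes the output has no colour escape codes inside the summary line.

```python
import re
from typing import Optional, Tuple

# Matches the counts in Terraform's plan summary line.
SUMMARY_RE = re.compile(r"(\d+) to add, (\d+) to change, (\d+) to destroy")


def plan_counts(plan_output: str) -> Optional[Tuple[int, int, int]]:
    """Return (add, change, destroy) from plan output, or None if no summary is found."""
    match = SUMMARY_RE.search(plan_output)
    if match is None:
        return None
    add, change, destroy = (int(group) for group in match.groups())
    return add, change, destroy
```

For this task the result should be `(3, 1, 0)`. A non-zero third element means the plan would destroy something and should be reviewed before applying.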
Step 5: Apply
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 6: Verify the proxy is running and forwarding correctly
Run:
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=bot-block-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- http://localhost:8080/healthz
Expected: 2 pods Running. Health check returns "ok".
Step 7: Test fail-open behavior
Temporarily scale Poison Fountain to 0, verify that the proxy returns 200, then restore the replica count (the commands assume the deployment normally runs 2 replicas):
kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=0
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- --timeout=10 http://localhost:8080/auth 2>&1
kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=2
Expected: With Poison Fountain at 0 replicas, the proxy returns 200 (fallback). After scaling back, normal forwarding resumes.
Step 8: Commit
git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down"
Task 3: Deploy auth resilience proxy (nginx basicAuth fallback in front of Authentik)
Deploy an nginx proxy that forwards to Authentik's outpost and falls back to basicAuth when Authentik is unreachable.
Files:
- Modify: stacks/platform/modules/traefik/main.tf (add nginx deployment, service, configmap, htpasswd secret)
- Modify: stacks/platform/modules/traefik/middleware.tf:36 (update authentik ForwardAuth address)
- Modify: stacks/platform/modules/traefik/main.tf:1 (add variable for htpasswd)
Step 1: Add htpasswd variable
Add to top of stacks/platform/modules/traefik/main.tf (after existing variables):
variable "auth_fallback_htpasswd" {
type = string
description = "htpasswd-format string for emergency basicAuth fallback when Authentik is down"
sensitive = true
}
Step 2: Generate htpasswd and add to terraform.tfvars
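Run (to generate a random password and its bcrypt htpasswd entry; record the printed password, because only the hash is stored):
PASS="$(openssl rand -base64 16)"; echo "password: $PASS"
htpasswd -nbB admin "$PASS"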
Add the output to terraform.tfvars:
auth_fallback_htpasswd = "admin:$2y$05$..." # Generated value
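Before pasting the value into terraform.tfvars, it can help to confirm that the entry is a complete bcrypt line rather than a truncated or placeholder string. The check below is a sketch based on the standard bcrypt layout (a `$2a$`/`$2b$`/`$2y$` prefix, a two-digit cost, then 53 characters of salt and hash); `is_bcrypt_htpasswd` is a hypothetical helper.

```python
import re

# user:$2y$<cost>$<22-char salt + 31-char hash>
BCRYPT_LINE = re.compile(r"^[^:\s]+:\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}$")


def is_bcrypt_htpasswd(line: str) -> bool:
    """Return True if `line` looks like a complete `user:bcrypt-hash` entry."""
    return BCRYPT_LINE.match(line.strip()) is not None
```

A placeholder such as `admin:$2y$05$...` fails this check, which is the intended outcome: the elided value has to be replaced with the real htpasswd output.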
Step 3: Pass variable through platform module
In stacks/platform/main.tf, find the traefik module block and add:
auth_fallback_htpasswd = var.auth_fallback_htpasswd
Add to stacks/platform/main.tf variables (if not already present):
variable "auth_fallback_htpasswd" {
type = string
sensitive = true
default = ""
}
Step 4: Add nginx configmap, secret, deployment, and service for auth proxy
Add to end of stacks/platform/modules/traefik/main.tf:
# Resilience proxy for Authentik ForwardAuth
# Falls back to basicAuth when Authentik is unreachable
resource "kubernetes_secret" "auth_proxy_htpasswd" {
metadata {
name = "auth-proxy-htpasswd"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
data = {
"htpasswd" = var.auth_fallback_htpasswd
}
}
resource "kubernetes_config_map" "auth_proxy_config" {
metadata {
name = "auth-proxy-config"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
data = {
"default.conf" = <<-EOT
upstream authentik {
server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}
server {
listen 9000;
# Main auth endpoint - proxy to Authentik, fallback to basicAuth
location /outpost.goauthentik.io/auth/traefik {
proxy_pass http://authentik;
proxy_connect_timeout 3s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
proxy_intercept_errors on;
error_page 502 503 504 = @fallback_auth;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
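# Traefik already sends X-Forwarded-Proto/Host/Uri for the original request; keep them
# instead of overwriting them with this proxy's own plain-HTTP scheme and host.
proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
proxy_set_header X-Original-URL $http_x_forwarded_proto://$http_x_forwarded_host$http_x_forwarded_uri;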
}
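location @fallback_auth {
# auth_basic is enforced in nginx's access phase, but "return" runs earlier,
# in the rewrite phase, and would answer 200 without checking credentials.
# Proxying to the loopback listener below lets the password check run first.
auth_basic "Emergency Access";
auth_basic_user_file /etc/nginx/htpasswd;
add_header X-authentik-username $remote_user always;
add_header X-Auth-Fallback "true" always;
proxy_pass http://127.0.0.1:9001;
}
# Pass through other outpost paths (for OAuth flows when Authentik IS up)
location /outpost.goauthentik.io/ {
proxy_pass http://authentik;
proxy_connect_timeout 3s;
proxy_read_timeout 10s;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
location /healthz {
access_log off;
return 200 "ok";
}
}
# Loopback-only listener that supplies the 200 for authenticated fallback requests
server {
listen 127.0.0.1:9001;
location / {
access_log off;
return 200;
}
}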
EOT
}
}
resource "kubernetes_deployment" "auth_proxy" {
metadata {
name = "auth-proxy"
namespace = kubernetes_namespace.traefik.metadata[0].name
labels = {
app = "auth-proxy"
}
}
spec {
replicas = 2
strategy {
type = "RollingUpdate"
rolling_update {
max_unavailable = 0
max_surge = 1
}
}
selector {
match_labels = {
app = "auth-proxy"
}
}
template {
metadata {
labels = {
app = "auth-proxy"
}
}
spec {
topology_spread_constraint {
max_skew = 1
topology_key = "kubernetes.io/hostname"
when_unsatisfiable = "DoNotSchedule"
label_selector {
match_labels = {
app = "auth-proxy"
}
}
}
container {
name = "nginx"
image = "nginx:1-alpine"
port {
container_port = 9000
}
volume_mount {
name = "config"
mount_path = "/etc/nginx/conf.d"
read_only = true
}
volume_mount {
name = "htpasswd"
mount_path = "/etc/nginx/htpasswd"
sub_path = "htpasswd"
read_only = true
}
liveness_probe {
http_get {
path = "/healthz"
port = 9000
}
initial_delay_seconds = 3
period_seconds = 10
}
readiness_probe {
http_get {
path = "/healthz"
port = 9000
}
initial_delay_seconds = 2
period_seconds = 5
}
resources {
requests = {
cpu = "5m"
memory = "16Mi"
}
limits = {
cpu = "50m"
memory = "32Mi"
}
}
}
volume {
name = "config"
config_map {
name = kubernetes_config_map.auth_proxy_config.metadata[0].name
}
}
volume {
name = "htpasswd"
secret {
secret_name = kubernetes_secret.auth_proxy_htpasswd.metadata[0].name
}
}
}
}
}
}
resource "kubernetes_service" "auth_proxy" {
metadata {
name = "auth-proxy"
namespace = kubernetes_namespace.traefik.metadata[0].name
labels = {
app = "auth-proxy"
}
}
spec {
selector = {
app = "auth-proxy"
}
port {
name = "http"
port = 9000
target_port = 9000
}
}
}
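The intended behaviour of the emergency fallback can be summarised as a small model: when Authentik is unreachable, a request is admitted only if it carries valid basic-auth credentials, and otherwise it is challenged with a 401. The Python sketch below illustrates that contract under simplifying assumptions; `fallback_auth` and the `check_password` callback are hypothetical and stand in for nginx's htpasswd lookup.

```python
import base64
from typing import Callable, Dict, Optional, Tuple


def fallback_auth(
    authorization: Optional[str],
    check_password: Callable[[str, str], bool],
) -> Tuple[int, Dict[str, str]]:
    """Model the (status, headers) the fallback returns for an Authorization header."""
    challenge = (401, {"WWW-Authenticate": 'Basic realm="Emergency Access"'})
    if not authorization or not authorization.startswith("Basic "):
        return challenge
    try:
        raw = base64.b64decode(authorization[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return challenge
    user, separator, password = decoded.partition(":")
    if not separator or not check_password(user, password):
        return challenge
    # Valid credentials: allow and expose the user to the backend.
    return 200, {"X-authentik-username": user, "X-Auth-Fallback": "true"}
```

Traefik treats a 2xx answer from ForwardAuth as "allow" and relays any other answer to the client, so the 401 challenge is what makes the browser prompt for the emergency credentials.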
Step 5: Update authentik ForwardAuth address
In stacks/platform/modules/traefik/middleware.tf, line 36, change:
address = "http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"
to:
address = "http://auth-proxy.traefik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"
Step 6: Plan and verify
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"
Expected: 4 resources created (secret, configmap, deployment, service), 1 resource updated (authentik-forward-auth middleware).
Step 7: Apply
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 8: Verify proxy is running
Run:
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=auth-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/auth-proxy -- wget -qO- http://localhost:9000/healthz
Expected: 2 pods Running. Health check returns "ok".
Step 9: Commit
git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf stacks/platform/main.tf
git commit -m "[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down"
Note: Do NOT add terraform.tfvars to this commit. It contains the htpasswd secret and is git-crypt encrypted; it will be included in the next push automatically.
Task 4: Add Traefik topology spread, PDB, and response timeout
Files:
- Modify: stacks/platform/modules/traefik/main.tf:26-205 (Helm values)
Step 1: Add topology spread constraints to Traefik Helm values
In stacks/platform/modules/traefik/main.tf, after the tolerations = [] line (line 204), add:
topologySpreadConstraints = [{
maxSkew = 1
topologyKey = "kubernetes.io/hostname"
whenUnsatisfiable = "DoNotSchedule"
labelSelector = {
matchLabels = {
"app.kubernetes.io/name" = "traefik"
}
}
}]
podDisruptionBudget = {
enabled = true
minAvailable = 2
}
Step 2: Change response header timeout
In stacks/platform/modules/traefik/main.tf, line 184, change:
"--serversTransport.forwardingTimeouts.responseHeaderTimeout=0s",
to:
"--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s",
Step 3: Plan and verify
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
Expected: Helm release will be updated in-place.
Step 4: Apply
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 5: Verify topology spread
Run:
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide
Expected: 3 pods on 3 different nodes.
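The spread can be expressed as a number: the skew is the difference between the most and least loaded nodes, and the constraint above allows at most 1. The sketch below assumes the node names were collected separately (for example with `-o jsonpath='{.items[*].spec.nodeName}'` for the pods and `kubectl get nodes` for the schedulable nodes); `hostname_skew` is a hypothetical helper.

```python
from collections import Counter


def hostname_skew(pod_nodes: list, schedulable_nodes: list) -> int:
    """Return max minus min pod count across the given nodes.

    `pod_nodes` has one entry per pod (the node it runs on);
    `schedulable_nodes` lists every node that could host a pod.
    """
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in schedulable_nodes]
    return max(per_node) - min(per_node) if per_node else 0
```

Three pods on three distinct nodes of a three-node cluster give a skew of 0, while two pods sharing one node give a skew of 2, which `DoNotSchedule` would have prevented.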
Step 6: Verify PDB
Run:
kubectl --kubeconfig $(pwd)/config get pdb -n traefik
Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.
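The expected ALLOWED DISRUPTIONS value follows from simple arithmetic: healthy pods minus the required minimum, never below zero. A one-function sketch (the name `allowed_disruptions` is illustrative):

```python
def allowed_disruptions(current_healthy: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily without violating the PDB."""
    return max(0, current_healthy - min_available)
```

With 3 healthy Traefik pods and `minAvailable = 2` this gives 1, so a node drain can evict one pod at a time. If one pod is already unhealthy the value drops to 0 and further evictions are blocked until it recovers.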
Step 7: Commit
git add stacks/platform/modules/traefik/main.tf
git commit -m "[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout"
Task 5: Add Authentik PDB
Files:
- Modify: stacks/platform/modules/authentik/values.yaml
Step 1: Add PDB configuration to Authentik Helm values
In stacks/platform/modules/authentik/values.yaml, add at the end of the server: section (after line 33, before global:), indented as a child of server:
  pdb:
    enabled: true
    minAvailable: 2
So the server section becomes:
server:
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  ingress:
    enabled: false
  podAnnotations:
    diun.enable: true
    diun.include_tags: "^202[0-9].[0-9]+.*$"
  pdb:
    enabled: true
    minAvailable: 2
Step 2: Plan and verify
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
Expected: Helm release will be updated.
Step 3: Apply
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 4: Verify PDB
Run:
kubectl --kubeconfig $(pwd)/config get pdb -n authentik
Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.
Step 5: Commit
git add stacks/platform/modules/authentik/values.yaml
git commit -m "[ci skip] add Authentik PDB (minAvailable=2)"
Task 6: Add retry middleware to ingress factory
Files:
- Modify: stacks/platform/modules/traefik/middleware.tf (add retry middleware)
- Modify: modules/kubernetes/ingress_factory/main.tf:112-113 (add to default chain)
Step 1: Add retry middleware CRD
Add to end of stacks/platform/modules/traefik/middleware.tf:
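# Retry middleware for transient backend failures (e.g. connection refused during restarts).
# Traefik retries only when the backend does not reply; a 502/503 it actually returns is passed through.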
resource "kubernetes_manifest" "middleware_retry" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "retry"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
retry = {
attempts = 2
initialInterval = "100ms"
}
}
}
depends_on = [helm_release.traefik]
}
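The retry semantics are easy to misread, so here is a generic sketch of the pattern: retry only when no reply arrives, and stop as soon as the backend answers with any status. This is an illustration, not Traefik's implementation. Treating `attempts` as the total number of tries and doubling the wait between tries are both assumptions of this sketch, and `send_with_retry` is a hypothetical name.

```python
import time
from typing import Callable


def send_with_retry(
    send: Callable[[], int],
    attempts: int = 2,
    initial_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a status code or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_interval
    for attempt in range(1, attempts + 1):
        try:
            # Any status code, including 502 or 503, counts as a reply and ends the loop.
            return send()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of tries: surface the connection failure
            sleep(delay)
            delay *= 2  # assumption: simple doubling backoff
    raise AssertionError("unreachable")
```

The practical consequence is that this middleware smooths over pods that are briefly not accepting connections, but it does not mask a backend that is up and returning errors.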
Step 2: Add retry middleware to ingress factory default chain
In modules/kubernetes/ingress_factory/main.tf, line 112, the middleware chain starts with rate-limit. Add retry as the first middleware (retries should wrap the entire chain):
Change lines 112-113 from:
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
to:
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
"traefik-retry@kubernetescrd",
var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
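For readers less familiar with the `join(",", compact(concat(...)))` idiom, the following Python sketch mirrors how the annotation value is assembled: null or empty entries are dropped and the rest are joined in order. The `extra` parameter stands in for the remaining list entries, which are not shown in this plan, and `middleware_annotation` is a hypothetical name.

```python
def middleware_annotation(skip_default_rate_limit: bool, extra: list) -> str:
    """Build the router.middlewares annotation value, retry first."""
    chain = [
        "traefik-retry@kubernetescrd",
        None if skip_default_rate_limit else "traefik-rate-limit@kubernetescrd",
        *extra,  # placeholder for the rest of the chain (not shown in the plan)
    ]
    # compact(): drop null and empty entries; join(): comma-separate the rest.
    return ",".join(name for name in chain if name)
```

Because Traefik applies middlewares in the order listed, putting retry first keeps it outermost regardless of whether the rate limit is skipped.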
Step 3: Plan the platform stack
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
Expected: 1 resource created (retry middleware).
Note: The ingress_factory change will take effect the next time any service stack is applied (it's a module used by all stacks). The middleware CRD must exist first.
Step 4: Apply platform stack
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 5: Verify retry middleware exists
Run:
kubectl --kubeconfig $(pwd)/config get middleware -n traefik retry
Expected: Middleware retry exists.
Step 6: Commit
git add stacks/platform/modules/traefik/middleware.tf modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain"
Task 7: Add Prometheus alerts and inhibition rules
Files:
- Modify: stacks/platform/modules/monitoring/prometheus_chart_values.tpl
Step 1: Add PoisonFountainDown alert
In stacks/platform/modules/monitoring/prometheus_chart_values.tpl, in the "Critical Services" alert group (after the AuthentikDown alert, around line 435), add:
- alert: PoisonFountainDown
  expr: (kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
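The `or on() vector(0)` part of the expression matters because a deleted deployment produces no series at all, and a comparison against nothing never fires. A small Python model of the condition (with the hypothetical name `poison_fountain_down`) shows the intended logic:

```python
from typing import Optional


def poison_fountain_down(available_replicas: Optional[int]) -> bool:
    """Model the alert condition; None means the metric series is absent."""
    # `or on() vector(0)`: fall back to 0 when the series does not exist.
    value = 0 if available_replicas is None else available_replicas
    return value < 1
```

Both an absent series and a reported value of 0 satisfy the condition. The alert itself only fires once the condition has held for the full `for: 2m` window.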
Step 2: Add ForwardAuthFallbackActive alert
In the "Traefik Ingress" alert group (after the TraefikHighOpenConnections alert, around line 587), add:
- alert: ForwardAuthFallbackActive
  expr: |
    (kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
    or (kube_deployment_status_replicas_available{namespace="authentik", deployment="goauthentik-server"} or on() vector(0)) < 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth resilience proxy is serving fallback responses - check Poison Fountain and Authentik"
Step 3: Add alert inhibition rule
In the inhibit_rules section (around line 63), add after the existing TraefikDown inhibition:
# Traefik down makes Poison Fountain alerts redundant
- source_matchers:
    - alertname = TraefikDown
  target_matchers:
    - alertname =~ "PoisonFountainDown|ForwardAuthFallbackActive"
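The effect of this rule can be modelled in a few lines: while TraefikDown is firing, notifications for the two listed alerts are suppressed, and everything else is unaffected. The sketch below assumes the rule has no `equal` labels, as written above, and relies on Alertmanager regex matchers being fully anchored; `is_inhibited` is a hypothetical name.

```python
import re

# Same pattern as the target matcher; fullmatch mirrors the anchored matching.
TARGET_RE = re.compile(r"PoisonFountainDown|ForwardAuthFallbackActive")


def is_inhibited(alertname: str, firing: set) -> bool:
    """Return True if `alertname` is muted because TraefikDown is firing."""
    return "TraefikDown" in firing and TARGET_RE.fullmatch(alertname) is not None
```

Inhibition only mutes notifications. The suppressed alerts still appear as firing in Prometheus, so they remain visible for diagnosis.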
Step 4: Plan and verify
Run:
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
Expected: Helm release updated (Prometheus config changes).
Step 5: Apply
Run:
cd stacks/platform && terragrunt apply --non-interactive
Step 6: Verify alerts are loaded
Run:
kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/api/v1/rules 2>/dev/null | python3 -c "import sys,json; rules=[r['name'] for g in json.load(sys.stdin)['data']['groups'] for r in g['rules']]; print('PoisonFountainDown:', 'PoisonFountainDown' in rules); print('ForwardAuthFallbackActive:', 'ForwardAuthFallbackActive' in rules)"
Expected: Both alerts show True.
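The same check reads more easily as a function, which is convenient if it is reused for later alerts. This sketch assumes the payload is the JSON returned by Prometheus's `/api/v1/rules` endpoint, as in the command above; `loaded_alerts` is a hypothetical name.

```python
def loaded_alerts(rules_response: dict, wanted: list) -> dict:
    """Map each wanted rule name to whether it appears in the rules API response."""
    names = {
        rule["name"]
        for group in rules_response["data"]["groups"]
        for rule in group["rules"]
    }
    return {name: name in names for name in wanted}
```

Calling it with `["PoisonFountainDown", "ForwardAuthFallbackActive"]` should return `True` for both names once the new configuration has been loaded.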
Step 7: Commit
git add stacks/platform/modules/monitoring/prometheus_chart_values.tpl
git commit -m "[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition"
Task 8: Final verification and push
Step 1: Run cluster health check
Run:
bash scripts/cluster_healthcheck.sh --quiet
Expected: No new WARN/FAIL related to our changes.
Step 2: Verify all resilience proxies are running
Run:
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l "app in (bot-block-proxy,auth-proxy)" -o wide
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide
kubectl --kubeconfig $(pwd)/config get pdb -A
Expected: All proxy pods running on different nodes, Traefik pods spread across nodes, PDBs for Traefik and Authentik.
Step 3: Test a public service is still accessible
Run:
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me
Expected: 200 (or 301/302 redirect). Not 502.
Step 4: Push all commits
Ask user for confirmation, then:
git push origin master