infra/docs/plans/2026-03-01-traefik-resilience-plan.md

25 KiB

Traefik Resilience Hardening Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Make Traefik resilient against downstream dependency failures (ForwardAuth services, hung backends) while preventing pod scheduling and disruption issues.

Architecture: Deploy nginx resilience proxies in front of fail-closed ForwardAuth services (Poison Fountain, Authentik), add PodDisruptionBudgets, topology spread constraints, response timeouts, retry middleware, and monitoring alerts.

Tech Stack: Terraform/Terragrunt, Kubernetes, Nginx, Traefik CRDs, Prometheus


Task 1: Bump Poison Fountain tier from aux to cluster

This is the simplest change and has no dependencies. Bumping the tier ensures Poison Fountain isn't evicted under memory pressure.

Files:

  • Modify: stacks/poison-fountain/main.tf:10 (namespace tier label)
  • Modify: stacks/poison-fountain/main.tf:52 (deployment tier label)

Step 1: Change namespace tier

In stacks/poison-fountain/main.tf, line 10, change:

tier = local.tiers.aux

to:

tier = local.tiers.cluster

Step 2: Change deployment tier label

In stacks/poison-fountain/main.tf, line 52, change:

tier = local.tiers.aux

to:

tier = local.tiers.cluster

Step 3: Verify the plan

Run:

cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30

Expected: Plan shows namespace and deployment label changes from 4-aux to 1-cluster. No resource destruction.

Step 4: Apply

Run:

cd stacks/poison-fountain && terragrunt apply --non-interactive

Step 5: Verify the new LimitRange and PriorityClass

Run:

kubectl --kubeconfig $(pwd)/config describe limitrange tier-defaults -n poison-fountain
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -o jsonpath='{.items[*].spec.priorityClassName}'

Expected: LimitRange shows 1-cluster defaults (512Mi default memory, max 4Gi). Priority class is tier-1-cluster.

Step 6: Commit

git add stacks/poison-fountain/main.tf
git commit -m "[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress)"

Task 2: Deploy bot-block resilience proxy (nginx fail-open in front of Poison Fountain)

Deploy an nginx reverse proxy in the traefik namespace that proxies to Poison Fountain's /auth endpoint and returns 200 (allow) if Poison Fountain is unreachable.

Files:

  • Modify: stacks/platform/modules/traefik/main.tf (add nginx deployment, service, configmap)
  • Modify: stacks/platform/modules/traefik/middleware.tf:287 (update ai-bot-block ForwardAuth address)

Step 1: Add nginx configmap for bot-block proxy

Add to end of stacks/platform/modules/traefik/main.tf (before the closing of the file):

# Resilience proxy for ai-bot-block ForwardAuth
# Returns 200 (allow all) when Poison Fountain is unreachable
resource "kubernetes_config_map" "bot_block_proxy_config" {
  metadata {
    name      = "bot-block-proxy-config"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "default.conf" = <<-EOT
      upstream poison_fountain {
          server poison-fountain.poison-fountain.svc.cluster.local:8080;
      }
      server {
          listen 8080;
          location /auth {
              proxy_pass http://poison_fountain;
              proxy_connect_timeout 3s;
              proxy_read_timeout 5s;
              proxy_send_timeout 5s;
              proxy_intercept_errors on;
              error_page 502 503 504 =200 /fallback-allow;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_set_header X-Forwarded-Proto $scheme;
          }
          location = /fallback-allow {
              internal;
              return 200 "allowed";
          }
          location /healthz {
              access_log off;
              return 200 "ok";
          }
      }
    EOT
  }
}

Step 2: Add nginx deployment for bot-block proxy

Add after the configmap:

resource "kubernetes_deployment" "bot_block_proxy" {
  metadata {
    name      = "bot-block-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "bot-block-proxy"
    }
  }

  spec {
    replicas = 2
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_unavailable = 0
        max_surge       = 1
      }
    }
    selector {
      match_labels = {
        app = "bot-block-proxy"
      }
    }
    template {
      metadata {
        labels = {
          app = "bot-block-proxy"
        }
      }
      spec {
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "kubernetes.io/hostname"
          when_unsatisfiable = "DoNotSchedule"
          label_selector {
            match_labels = {
              app = "bot-block-proxy"
            }
          }
        }
        container {
          name  = "nginx"
          image = "nginx:1-alpine"

          port {
            container_port = 8080
          }

          volume_mount {
            name       = "config"
            mount_path = "/etc/nginx/conf.d"
            read_only  = true
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 3
            period_seconds        = 10
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 2
            period_seconds        = 5
          }

          resources {
            requests = {
              cpu    = "5m"
              memory = "16Mi"
            }
            limits = {
              cpu    = "50m"
              memory = "32Mi"
            }
          }
        }

        volume {
          name = "config"
          config_map {
            name = kubernetes_config_map.bot_block_proxy_config.metadata[0].name
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "bot_block_proxy" {
  metadata {
    name      = "bot-block-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "bot-block-proxy"
    }
  }

  spec {
    selector = {
      app = "bot-block-proxy"
    }
    port {
      name        = "http"
      port        = 8080
      target_port = 8080
    }
  }
}

Step 3: Update ai-bot-block ForwardAuth address

In stacks/platform/modules/traefik/middleware.tf, line 287, change:

address            = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"

to:

address            = "http://bot-block-proxy.traefik.svc.cluster.local:8080/auth"

Step 4: Plan and verify

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"

Expected: 3 resources created (configmap, deployment, service), 1 resource updated (ai-bot-block middleware).

Step 5: Apply

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 6: Verify the proxy is running and forwarding correctly

Run:

kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=bot-block-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- http://localhost:8080/healthz

Expected: 2 pods Running. Health check returns "ok".

Step 7: Test fail-open behavior

Temporarily scale Poison Fountain to 0, verify the proxy returns 200:

kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=0
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- --timeout=10 http://localhost:8080/auth 2>&1
kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=2

Expected: With Poison Fountain at 0 replicas, the proxy returns 200 (fallback). After scaling back, normal forwarding resumes.

Step 8: Commit

git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down"

Task 3: Deploy auth resilience proxy (nginx basicAuth fallback in front of Authentik)

Deploy an nginx proxy that forwards to Authentik's outpost and falls back to basicAuth when Authentik is unreachable.

Files:

  • Modify: stacks/platform/modules/traefik/main.tf (add nginx deployment, service, configmap, htpasswd secret)
  • Modify: stacks/platform/modules/traefik/middleware.tf:36 (update authentik ForwardAuth address)
  • Modify: stacks/platform/modules/traefik/main.tf:1 (add variable for htpasswd)

Step 1: Add htpasswd variable

Add to top of stacks/platform/modules/traefik/main.tf (after existing variables):

variable "auth_fallback_htpasswd" {
  type        = string
  description = "htpasswd-format string for emergency basicAuth fallback when Authentik is down"
  sensitive   = true
}

Step 2: Generate htpasswd and add to terraform.tfvars

Run (to generate a bcrypt htpasswd entry):

htpasswd -nbB admin "$(openssl rand -base64 16)"

Add the output to terraform.tfvars:

auth_fallback_htpasswd = "admin:$2y$05$..."  # Generated value

Step 3: Pass variable through platform module

In stacks/platform/main.tf, find the traefik module block and add:

auth_fallback_htpasswd = var.auth_fallback_htpasswd

Add to stacks/platform/main.tf variables (if not already present):

variable "auth_fallback_htpasswd" {
  type      = string
  sensitive = true
  default   = ""
}

Step 4: Add nginx configmap, secret, deployment, and service for auth proxy

Add to end of stacks/platform/modules/traefik/main.tf:

# Resilience proxy for Authentik ForwardAuth
# Falls back to basicAuth when Authentik is unreachable
resource "kubernetes_secret" "auth_proxy_htpasswd" {
  metadata {
    name      = "auth-proxy-htpasswd"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "htpasswd" = var.auth_fallback_htpasswd
  }
}

resource "kubernetes_config_map" "auth_proxy_config" {
  metadata {
    name      = "auth-proxy-config"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "default.conf" = <<-EOT
      upstream authentik {
          server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
      }
      server {
          listen 9000;

          # Main auth endpoint - proxy to Authentik, fallback to basicAuth
          location /outpost.goauthentik.io/auth/traefik {
              proxy_pass http://authentik;
              proxy_connect_timeout 3s;
              proxy_read_timeout 5s;
              proxy_send_timeout 5s;
              proxy_intercept_errors on;
              error_page 502 503 504 = @fallback_auth;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_set_header X-Forwarded-Proto $scheme;
              proxy_set_header X-Original-URL $scheme://$http_host$request_uri;
          }

          location @fallback_auth {
              auth_basic "Emergency Access";
              auth_basic_user_file /etc/nginx/htpasswd;
              add_header X-authentik-username $remote_user always;
              add_header X-Auth-Fallback "true" always;
              return 200;
          }

          # Pass through other outpost paths (for OAuth flows when Authentik IS up)
          location /outpost.goauthentik.io/ {
              proxy_pass http://authentik;
              proxy_connect_timeout 3s;
              proxy_read_timeout 10s;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_set_header X-Forwarded-Proto $scheme;
          }

          location /healthz {
              access_log off;
              return 200 "ok";
          }
      }
    EOT
  }
}

resource "kubernetes_deployment" "auth_proxy" {
  metadata {
    name      = "auth-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "auth-proxy"
    }
  }

  spec {
    replicas = 2
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_unavailable = 0
        max_surge       = 1
      }
    }
    selector {
      match_labels = {
        app = "auth-proxy"
      }
    }
    template {
      metadata {
        labels = {
          app = "auth-proxy"
        }
      }
      spec {
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "kubernetes.io/hostname"
          when_unsatisfiable = "DoNotSchedule"
          label_selector {
            match_labels = {
              app = "auth-proxy"
            }
          }
        }
        container {
          name  = "nginx"
          image = "nginx:1-alpine"

          port {
            container_port = 9000
          }

          volume_mount {
            name       = "config"
            mount_path = "/etc/nginx/conf.d"
            read_only  = true
          }
          volume_mount {
            name       = "htpasswd"
            mount_path = "/etc/nginx/htpasswd"
            sub_path   = "htpasswd"
            read_only  = true
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 9000
            }
            initial_delay_seconds = 3
            period_seconds        = 10
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 9000
            }
            initial_delay_seconds = 2
            period_seconds        = 5
          }

          resources {
            requests = {
              cpu    = "5m"
              memory = "16Mi"
            }
            limits = {
              cpu    = "50m"
              memory = "32Mi"
            }
          }
        }

        volume {
          name = "config"
          config_map {
            name = kubernetes_config_map.auth_proxy_config.metadata[0].name
          }
        }
        volume {
          name = "htpasswd"
          secret {
            secret_name = kubernetes_secret.auth_proxy_htpasswd.metadata[0].name
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "auth_proxy" {
  metadata {
    name      = "auth-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "auth-proxy"
    }
  }

  spec {
    selector = {
      app = "auth-proxy"
    }
    port {
      name        = "http"
      port        = 9000
      target_port = 9000
    }
  }
}

Step 5: Update authentik ForwardAuth address

In stacks/platform/modules/traefik/middleware.tf, line 36, change:

address            = "http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"

to:

address            = "http://auth-proxy.traefik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"

Step 6: Plan and verify

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"

Expected: 4 resources created (secret, configmap, deployment, service), 1 resource updated (authentik-forward-auth middleware).

Step 7: Apply

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 8: Verify proxy is running

Run:

kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=auth-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/auth-proxy -- wget -qO- http://localhost:9000/healthz

Expected: 2 pods Running. Health check returns "ok".

Step 9: Commit

git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf stacks/platform/main.tf
git commit -m "[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down"

Note: Do NOT commit terraform.tfvars (it contains the htpasswd secret and is git-crypt encrypted — it will be included in the next push automatically).


Task 4: Add Traefik topology spread, PDB, and response timeout

Files:

  • Modify: stacks/platform/modules/traefik/main.tf:26-205 (Helm values)

Step 1: Add topology spread constraints to Traefik Helm values

In stacks/platform/modules/traefik/main.tf, after the tolerations = [] line (line 204), add:

    topologySpreadConstraints = [{
      maxSkew           = 1
      topologyKey       = "kubernetes.io/hostname"
      whenUnsatisfiable = "DoNotSchedule"
      labelSelector = {
        matchLabels = {
          "app.kubernetes.io/name" = "traefik"
        }
      }
    }]

    podDisruptionBudget = {
      enabled      = true
      minAvailable = 2
    }

Step 2: Change response header timeout

In stacks/platform/modules/traefik/main.tf, line 184, change:

"--serversTransport.forwardingTimeouts.responseHeaderTimeout=0s",

to:

"--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s",

Step 3: Plan and verify

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"

Expected: Helm release will be updated in-place.

Step 4: Apply

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 5: Verify topology spread

Run:

kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide

Expected: 3 pods on 3 different nodes.

Step 6: Verify PDB

Run:

kubectl --kubeconfig $(pwd)/config get pdb -n traefik

Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.

Step 7: Commit

git add stacks/platform/modules/traefik/main.tf
git commit -m "[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout"

Task 5: Add Authentik PDB

Files:

  • Modify: stacks/platform/modules/authentik/values.yaml

Step 1: Add PDB configuration to Authentik Helm values

In stacks/platform/modules/authentik/values.yaml, add after the server: section (after line 33, before global:):

  pdb:
    enabled: true
    minAvailable: 2

So the server section becomes:

server:
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  ingress:
    enabled: false
  podAnnotations:
    diun.enable: true
    diun.include_tags: "^202[0-9].[0-9]+.*$"
  pdb:
    enabled: true
    minAvailable: 2

Step 2: Plan and verify

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"

Expected: Helm release will be updated.

Step 3: Apply

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 4: Verify PDB

Run:

kubectl --kubeconfig $(pwd)/config get pdb -n authentik

Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.

Step 5: Commit

git add stacks/platform/modules/authentik/values.yaml
git commit -m "[ci skip] add Authentik PDB (minAvailable=2)"

Task 6: Add retry middleware to ingress factory

Files:

  • Modify: stacks/platform/modules/traefik/middleware.tf (add retry middleware)
  • Modify: modules/kubernetes/ingress_factory/main.tf:112-113 (add to default chain)

Step 1: Add retry middleware CRD

Add to end of stacks/platform/modules/traefik/middleware.tf:

# Retry middleware for transient backend failures (502/503 during restarts)
resource "kubernetes_manifest" "middleware_retry" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "retry"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      retry = {
        attempts        = 2
        initialInterval = "100ms"
      }
    }
  }

  depends_on = [helm_release.traefik]
}

Step 2: Add retry middleware to ingress factory default chain

In modules/kubernetes/ingress_factory/main.tf, line 112, the middleware chain starts with rate-limit. Add retry as the first middleware (retries should wrap the entire chain):

Change line 112-113 from:

      "traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
        var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",

to:

      "traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
        "traefik-retry@kubernetescrd",
        var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",

Step 3: Plan both stacks

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"

Expected: 1 resource created (retry middleware).

Note: The ingress_factory change will take effect the next time any service stack is applied (it's a module used by all stacks). The middleware CRD must exist first.

Step 4: Apply platform stack

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 5: Verify retry middleware exists

Run:

kubectl --kubeconfig $(pwd)/config get middleware -n traefik retry

Expected: Middleware retry exists.

Step 6: Commit

git add stacks/platform/modules/traefik/middleware.tf modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain"

Task 7: Add Prometheus alerts and inhibition rules

Files:

  • Modify: stacks/platform/modules/monitoring/prometheus_chart_values.tpl

Step 1: Add PoisonFountainDown alert

In stacks/platform/modules/monitoring/prometheus_chart_values.tpl, in the "Critical Services" alert group (after the AuthentikDown alert, around line 435), add:

          - alert: PoisonFountainDown
            expr: (kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"

Step 2: Add ForwardAuthFallbackActive alert

In the "Traefik Ingress" alert group (after the TraefikHighOpenConnections alert, around line 587), add:

          - alert: ForwardAuthFallbackActive
            expr: |
              (kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
              or (kube_deployment_status_replicas_available{namespace="authentik", deployment="goauthentik-server"} or on() vector(0)) < 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "ForwardAuth resilience proxy is serving fallback responses - check Poison Fountain and Authentik"

Step 3: Add alert inhibition rule

In the inhibit_rules section (around line 63), add after the existing TraefikDown inhibition:

      # Traefik down makes Poison Fountain alerts redundant
      - source_matchers:
          - alertname = TraefikDown
        target_matchers:
          - alertname =~ "PoisonFountainDown|ForwardAuthFallbackActive"

Step 4: Plan and verify

Run:

cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"

Expected: Helm release updated (Prometheus config changes).

Step 5: Apply

Run:

cd stacks/platform && terragrunt apply --non-interactive

Step 6: Verify alerts are loaded

Run:

kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/api/v1/rules 2>&1 | python3 -c "import sys,json; rules=[r['name'] for g in json.load(sys.stdin)['data']['groups'] for r in g['rules']]; print('PoisonFountainDown:', 'PoisonFountainDown' in rules); print('ForwardAuthFallbackActive:', 'ForwardAuthFallbackActive' in rules)"

Expected: Both alerts show True.

Step 7: Commit

git add stacks/platform/modules/monitoring/prometheus_chart_values.tpl
git commit -m "[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition"

Task 8: Final verification and push

Step 1: Run cluster health check

Run:

bash scripts/cluster_healthcheck.sh --quiet

Expected: No new WARN/FAIL related to our changes.

Step 2: Verify all resilience proxies are running

Run:

kubectl --kubeconfig $(pwd)/config get pods -n traefik -l "app in (bot-block-proxy,auth-proxy)" -o wide
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide
kubectl --kubeconfig $(pwd)/config get pdb -A

Expected: All proxy pods running on different nodes, Traefik pods spread across nodes, PDBs for Traefik and Authentik.

Step 3: Test a public service is still accessible

Run:

curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me

Expected: 200 (or 301/302 redirect). Not 502.

Step 4: Push all commits

Ask user for confirmation, then:

git push origin master