infra/stacks/terminal
Viktor Barzin 45c8e88e89 terminal: probe + alerts after Traefik replica routing-table skew
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.

This commit adds the long-term mitigation:

stacks/terminal/main.tf
  - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
    /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
    4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
    last_success_timestamp). Verified the probe end-to-end:
      token=302 ws=302 ttyd=200 ok=1

stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
  - Webterminal group: WebterminalTokenDegraded (warning, 10m),
    WebterminalWebsocketDegraded (critical, 10m),
    WebterminalTtydUnreachable (critical, 10m),
    WebterminalProbeStale (warning, 15m).
  - Traefik Router Parity group: TraefikRouterCountSkew fires when any
    Traefik replica's router count diverges from siblings for >10m —
    catches the same class of issue cluster-wide, not just for terminal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
..
backend.tf Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
main.tf terminal: probe + alerts after Traefik replica routing-table skew 2026-05-22 14:16:56 +00:00
providers.tf Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
secrets Add broker-sync Terraform stack (#7) 2026-04-17 21:17:45 +01:00
terragrunt.hcl Add terminal stack - reverse proxy to ttyd behind authentik 2026-03-10 23:46:01 +00:00