infra/stacks
Viktor Barzin 165bb7258e monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-22 14:16:45 +00:00
..
_template ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
actualbudget infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
affine infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
authentik infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
beads-server infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
blog ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
broker-sync fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
calico [infra] Partial Calico adoption: namespaces only (Wave 5b) 2026-04-18 22:52:56 +00:00
changedetection fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
chrome-service fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
city-guesser ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
claude-agent-service claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent 2026-05-22 14:16:45 +00:00
claude-memory infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
cloudflared cloudflare: disable AI bot edge-block so x402 can issue payment offers 2026-05-22 14:16:42 +00:00
cnpg [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
coturn [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
crowdsec ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
cyberchef ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
dashy ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
dawarich infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
dbaas dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts 2026-05-22 14:16:44 +00:00
descheduler [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
diun fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
ebook2audiobook ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
ebooks infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
echo state(dbaas): update encrypted state 2026-05-22 14:16:43 +00:00
excalidraw fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
external-secrets [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
f1-stream fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
fire-planner infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
foolery ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
forgejo infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
freedify ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
freshrss infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
frigate infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
grampsweb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
hackmd fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
headscale infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
health fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
hermes-agent fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
homepage ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
immich infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
infra [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
infra-maintenance [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
insta2spotify infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
instagram-poster infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
isponsorblocktv fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
job-hunter grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards 2026-05-10 11:12:39 +00:00
jsoncrack ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
k8s-dashboard ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
k8s-portal infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
k8s-version-upgrade k8s-version-upgrade: decompose into Job chain to fix self-preemption 2026-05-22 14:16:45 +00:00
kms ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
kured kured(sentinel-gate): fix auth + write-perm so safety checks actually run 2026-05-22 14:16:41 +00:00
kyverno [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
linkwarden infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
llama-cpp infra/llama-cpp: benchmark report + -fa flag fix 2026-05-22 14:16:41 +00:00
local-path [infra] Adopt local-path-provisioner into Terraform (Wave 5c) 2026-04-18 22:39:55 +00:00
mailserver fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
matrix infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
meshcentral fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
metallb [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
metrics-server [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
monitoring monitoring: detect stale Traefik replicas + reduce alert-storm cascading 2026-05-22 14:16:45 +00:00
n8n infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
navidrome infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
netbox ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
networking-toolbox ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
nextcloud infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
nfs-csi [infra] TrueNAS decommission — remove active references from Terraform + configs 2026-04-19 16:57:05 +00:00
nodelocal-dns [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) 2026-04-19 15:46:41 +00:00
novelapp infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
ntfy infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
nvidia infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
onlyoffice infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
openclaw fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
osm_routing [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
owntracks infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
paperless-ngx Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
payslip-ingest grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards 2026-05-10 11:12:39 +00:00
phpipam ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
poison-fountain ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
postiz infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
priority-pass fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
privatebin infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
proxmox-csi proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) 2026-05-22 14:16:41 +00:00
pvc-autoresizer [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed) 2026-05-22 14:16:44 +00:00
redis fix: pvc-autoresizer threshold should be 10%, not 80% 2026-05-22 14:16:43 +00:00
reloader [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
resume Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
reverse-proxy chore: remove decommissioned registry.viktorbarzin.me ingress 2026-05-10 11:12:37 +00:00
rybbit infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
sealed-secrets [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
send infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
servarr fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
shadowsocks [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
speedtest fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
tandoor infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
technitium fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
terminal ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
tor-proxy fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
trading-bot ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
traefik ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
travel_blog ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
tuya-bridge infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
uptime-kuma fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
url ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
vault vault: move audit-PVC autoresizer annotations to kubernetes_annotations 2026-05-22 14:16:44 +00:00
vaultwarden infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
vpa ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
wealthfolio wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs 2026-05-22 14:16:45 +00:00
webhook_handler infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
whisper fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
wireguard [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
woodpecker infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
xray infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
ytdlp ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00