infra/stacks/monitoring
Viktor Barzin cdbb418f45 monitoring: alert when cluster can't tolerate losing a non-GPU worker
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved
non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it
than the rest of the workers (incl. node1 GPU node) currently have free.
If that node went down, its pods would not fit elsewhere and would stay
Pending — exactly what happened today (2026-05-26) with node4 NotReady:
4 kyverno pods + woodpecker PVCs + several deployments stuck Pending
because node2/node3 were at 99% memory-request saturation.

Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0))
over Ready workers. node1 included on the right because its taint is
PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.

Currently fires with a 33.96 GiB shortage. Remediation: right-size top
reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus
4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:34:13 +00:00
..
modules/monitoring monitoring: alert when cluster can't tolerate losing a non-GPU worker 2026-05-26 02:34:13 +00:00
main.tf [forgejo] Tolerate missing Vault keys during Phase 0 bootstrap 2026-05-07 15:53:08 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00