infra

Viktor Barzin cdbb418f45 monitoring: alert when cluster can't tolerate losing a non-GPU worker		2026-05-26 02:34:13 +00:00

ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it than the rest of the workers (incl. node1 GPU node) currently have free. If that node went down, its pods would not fit elsewhere and would stay Pending, which is exactly what happened today (2026-05-26) with node4 NotReady: 4 kyverno pods + woodpecker PVCs + several deployments stuck Pending because node2/node3 were at 99% memory-request saturation.

Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0)) over Ready workers. node1 included on the right because its taint is PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.

Currently fires with a 33.96 GiB shortage.

Remediation: right-size top reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus 4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on k8s-node2/k8s-node3 from 32GB → 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dashboards	fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard	2026-05-22 14:15:38 +00:00
server-power-cycle	Add broker-sync Terraform stack (#7)	2026-04-17 21:17:45 +01:00
alloy.yaml	alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm	2026-05-26 02:08:35 +00:00
Dockerfile	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00
goflow2.tf	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]	2026-04-18 21:19:48 +00:00
grafana.tf	fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard	2026-05-22 14:15:38 +00:00
grafana_chart_values.yaml	monitoring: protect grafana ingress with authentik + disable anonymous	2026-05-10 17:01:50 +00:00
idrac.tf	infra: document auth = "app|none" tier on every legacy ingress	2026-05-11 19:25:48 +00:00
k8s-monitoring-values.yaml	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]	2026-03-25 23:56:07 +02:00
loki.tf	alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm	2026-05-26 02:08:35 +00:00
loki.yaml	monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit)	2026-05-24 01:10:55 +00:00
main.tf	keel: enroll 15 critical-path namespaces for digest-only auto-update	2026-05-17 12:13:22 +00:00
prometheus.tf	fix: HA Sofia REST sensors + PVC drift safety	2026-05-10 21:48:29 +00:00
prometheus_chart_values.tpl	monitoring: alert when cluster can't tolerate losing a non-GPU worker	2026-05-26 02:34:13 +00:00
prometheus_snmp_chart_values.yaml	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00
pve_exporter.tf	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]	2026-04-18 21:19:48 +00:00
snmp_exporter.tf	infra: document auth = "app|none" tier on every legacy ingress	2026-05-11 19:25:48 +00:00
ups_snmp_values.yaml	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]	2026-03-17 21:34:11 +00:00