infra


Viktor Barzin	a66a8d0de2	Reduce downtime during platform stack applies	2026-03-18 08:03:59 +00:00

CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4: pods were at 302% of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s: a 1-hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking

Prometheus startup guard:
- Add a startup guard to 8 rate/increase-based alerts that false-fire after Prometheus restarts (rate() needs 2 scrapes to work): PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900 suppresses alerts for 15m after Prometheus startup
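The startup guard described in the commit message can be illustrated with a minimal rule sketch. Only the guard clause comes from the commit message. The group name, the base expression, its threshold, the `for` duration, the severity label, and the `job="prometheus"` selector are assumptions for illustration and may not match what prometheus_chart_values.tpl actually contains.

```yaml
# Hypothetical rule group; only the guard clause is taken from the commit message.
groups:
  - name: startup-guard-example        # hypothetical group name
    rules:
      - alert: PodCrashLooping
        # The base expression and the threshold of 5 are assumptions.
        # The second line is the guard: it yields a result only once the
        # selected process has been up for more than 900s (15m), so the
        # alert cannot fire during that window after a restart.
        # The job="prometheus" selector is an assumption, added so the
        # guard tracks the start time of Prometheus itself.
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
          and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
        for: 10m                       # assumed duration
        labels:
          severity: warning            # assumed label
```

Because `and on()` matches on an empty label set, every series on the left-hand side is kept whenever the guard side returns at least one sample, and dropped otherwise. The same clause could presumably be appended to each of the other seven alerts without changing their base expressions.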
dashboards	Add node hang instrumentation and scale down chromium services	2026-03-18 08:03:58 +00:00
server-power-cycle	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
alloy.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-18 08:03:58 +00:00
caretta.tf	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards	2026-03-18 08:03:59 +00:00
Dockerfile	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
goflow2.tf	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-18 08:03:58 +00:00
grafana.tf	Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards	2026-03-18 08:03:59 +00:00
grafana_chart_values.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-18 08:03:58 +00:00
idrac.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
k8s-monitoring-values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
loki.tf	feat(monitoring): Disable Loki centralized logging while preserving configuration	2026-03-17 16:51:02 +00:00
loki.yaml	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-18 08:03:58 +00:00
main.tf	Remove all CPU limits cluster-wide to eliminate CFS throttling	2026-03-18 08:03:58 +00:00
prometheus.tf	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history	2026-03-06 23:16:32 +00:00
prometheus_chart_values.tpl	Reduce downtime during platform stack applies	2026-03-18 08:03:59 +00:00
prometheus_snmp_chart_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00
pve_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
snmp_exporter.tf	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs	2026-02-23 22:43:05 +00:00
ups_snmp_values.yaml	[ci skip] Move Terraform modules into stack directories	2026-02-22 14:38:14 +00:00