CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302%
of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking
Prometheus startup guard:
- Add startup guard to 8 rate/increase-based alerts that false-fire
after Prometheus restarts (needs 2 scrapes for rate() to work):
PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
suppresses alerts for 15m after Prometheus startup