Reduce downtime during platform stack applies

CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4: pods were at 302%
  of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s: a 1-hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking

Prometheus startup guard:
- Add a startup guard to 8 rate()/increase()-based alerts that false-fire
  after Prometheus restarts (rate() needs 2 scrapes to work):
  PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
  HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
  SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
  suppresses alerts for 15m after Prometheus startup
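
For illustration, a guarded rule might look like the sketch below. It is
not taken from this commit: the alert expression, threshold, `for`
duration, group name, and the job="prometheus" selector are all
assumptions. The selector is added because process_start_time_seconds is
exported by many targets, and without it `and on()` would be satisfied by
any scraped process that has been up for more than 15 minutes.

```yaml
groups:
  - name: example-startup-guard        # hypothetical group name
    rules:
      - alert: PodCrashLooping
        # Left side: the usual alert condition (threshold is illustrative).
        # Right side: the guard, which yields a series only once the
        # Prometheus process has been running for more than 900 seconds.
        # `and on()` matches on an empty label set, so the guard gates
        # every series produced by the left side.
        expr: |
          (increase(kube_pod_container_status_restarts_total[15m]) > 3)
          and on()
          (time() - process_start_time_seconds{job="prometheus"}) > 900
        for: 5m                        # illustrative
        labels:
          severity: warning            # illustrative
```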
Viktor Barzin 2026-03-14 12:09:09 +00:00 committed by Viktor Barzin
parent 44f6614bf9
commit a66a8d0de2
2 changed files with 12 additions and 7 deletions

@@ -107,7 +107,9 @@ resource "helm_release" "crowdsec" {
   chart = "crowdsec"
   values = [templatefile("${path.module}/values.yaml", { homepage_username = var.homepage_username, homepage_password = var.homepage_password, DB_PASSWORD = var.db_password, ENROLL_KEY = var.enroll_key, SLACK_WEBHOOK_URL = var.slack_webhook_url, mysql_host = var.mysql_host })]
-  timeout = 3600
+  timeout = 600
+  wait = true
+  wait_for_jobs = true
 }
@@ -365,7 +367,7 @@ resource "kubernetes_resource_quota" "crowdsec" {
   }
   spec {
     hard = {
-      "requests.cpu" = "1"
+      "requests.cpu" = "4"
       "requests.memory" = "8Gi"
       "limits.memory" = "16Gi"
       pods = "30"