Reduce downtime during platform stack applies

CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4. Pods were at 302% of quota, which prevented scheduling during rolling upgrades.
- Reduce Helm timeout from 3600s to 600s; a 1 hour hang is excessive.
- Add wait=true and wait_for_jobs=true for proper readiness checking.

Prometheus startup guard:
- Add a startup guard to 8 rate/increase-based alerts that false-fire after Prometheus restarts (rate() needs 2 scrapes to work): PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate.
- Guard: and on() (time() - process_start_time_seconds) > 900 suppresses these alerts for 15m after Prometheus startup.
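
For illustration, the guard described above could be attached to one of the listed alerts roughly as follows. This is a minimal sketch, not the rule from this commit: the rule-file layout, the group name, the left-hand condition and its threshold, the `for` duration, and the `job="prometheus"` selector are all assumptions. The selector is added here because process_start_time_seconds is typically exported by many targets, and with `and on()` the guard passes as soon as any matching series is older than 900 seconds; adjust or drop it to fit the actual label scheme.

```yaml
# Hypothetical rule file showing the startup guard on a single alert.
groups:
  - name: startup-guard-example        # assumed group name
    rules:
      - alert: PodCrashLooping
        # First line: an assumed increase()-based condition (threshold is illustrative).
        # Second line: the guard from the commit message, narrowed to the
        # Prometheus server's own start time through an assumed job label.
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
          and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
        for: 5m                        # assumed hold duration
```

Because `and on()` matches on an empty label set, the left-hand series keep their own labels and are returned only while the right-hand side yields at least one sample, that is, once the selected process has been up for more than 15 minutes.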
parent 44f6614bf9
commit a66a8d0de2
2 changed files with 12 additions and 7 deletions
@@ -107,7 +107,9 @@ resource "helm_release" "crowdsec" {
   chart = "crowdsec"
 
   values = [templatefile("${path.module}/values.yaml", { homepage_username = var.homepage_username, homepage_password = var.homepage_password, DB_PASSWORD = var.db_password, ENROLL_KEY = var.enroll_key, SLACK_WEBHOOK_URL = var.slack_webhook_url, mysql_host = var.mysql_host })]
-  timeout = 3600
+  timeout = 600
+  wait = true
+  wait_for_jobs = true
 }
@@ -365,7 +367,7 @@ resource "kubernetes_resource_quota" "crowdsec" {
   }
   spec {
     hard = {
-      "requests.cpu" = "1"
+      "requests.cpu" = "4"
       "requests.memory" = "8Gi"
       "limits.memory" = "16Gi"
       pods = "30"