Reduce downtime during platform stack applies
CrowdSec fixes: - Increase ResourceQuota requests.cpu 1→4 (was at 302%, blocking upgrades) - Add LAPI startupProbe: 30 attempts × 10s = 5min startup window (LAPI pods were failing default startup probe during rolling upgrades) - Reduce Helm timeout 3600s→900s with wait=true, wait_for_jobs=true Prometheus startup guard on 8 rate-based alerts: - PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate - Suppresses false positives for 15m after Prometheus restart
This commit is contained in:
parent
a66a8d0de2
commit
240feda408
2 changed files with 7 additions and 1 deletions
|
|
@ -107,7 +107,7 @@ resource "helm_release" "crowdsec" {
|
|||
chart = "crowdsec"
|
||||
|
||||
values = [templatefile("${path.module}/values.yaml", { homepage_username = var.homepage_username, homepage_password = var.homepage_password, DB_PASSWORD = var.db_password, ENROLL_KEY = var.enroll_key, SLACK_WEBHOOK_URL = var.slack_webhook_url, mysql_host = var.mysql_host })]
|
||||
timeout = 600
|
||||
timeout = 900
|
||||
wait = true
|
||||
wait_for_jobs = true
|
||||
}
|
||||
|
|
|
|||
|
|
@ -56,6 +56,12 @@ lapi:
|
|||
memory: 128Mi
|
||||
limits:
|
||||
memory: 1Gi
|
||||
startupProbe:
|
||||
httpGet:
|
||||
path: /health
|
||||
port: 8080
|
||||
failureThreshold: 30
|
||||
periodSeconds: 10
|
||||
priorityClassName: "tier-1-cluster"
|
||||
replicas: 3
|
||||
topologySpreadConstraints:
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue