Reduce downtime during platform stack applies

CrowdSec fixes:
- Increase ResourceQuota requests.cpu 1→4 (was at 302%, blocking upgrades)
- Add LAPI startupProbe: 30 attempts × 10s = 5min startup window
  (LAPI pods were failing default startup probe during rolling upgrades)
- Reduce Helm timeout 3600s→900s with wait=true, wait_for_jobs=true

Prometheus startup guard on 8 rate-based alerts:
- PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
  HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
  SSDHighWriteRate, HDDHighWriteRate
- Suppresses false positives for 15m after Prometheus restart
This commit is contained in:
Viktor Barzin 2026-03-14 12:47:56 +00:00 committed by Viktor Barzin
parent a66a8d0de2
commit 240feda408
2 changed files with 7 additions and 1 deletions

View file

@ -56,6 +56,12 @@ lapi:
memory: 128Mi
limits:
memory: 1Gi
startupProbe:
httpGet:
path: /health
port: 8080
failureThreshold: 30
periodSeconds: 10
priorityClassName: "tier-1-cluster"
replicas: 3
topologySpreadConstraints: