Reduce downtime during platform stack applies
CrowdSec fixes: - Increase ResourceQuota requests.cpu 1→4 (was at 302%, blocking upgrades) - Add LAPI startupProbe: 30 attempts × 10s = 5min startup window (LAPI pods were failing default startup probe during rolling upgrades) - Reduce Helm timeout 3600s→900s with wait=true, wait_for_jobs=true Prometheus startup guard on 8 rate-based alerts: - PodCrashLooping, ContainerOOMKilled, CoreDNSErrors, HighServiceErrorRate, HighService4xxRate, HighServiceLatency, SSDHighWriteRate, HDDHighWriteRate - Suppresses false positives for 15m after Prometheus restart
This commit is contained in:
parent
a66a8d0de2
commit
240feda408
2 changed files with 7 additions and 1 deletions
|
|
@ -56,6 +56,12 @@ lapi:
|
|||
memory: 128Mi
|
||||
limits:
|
||||
memory: 1Gi
|
||||
startupProbe:
|
||||
httpGet:
|
||||
path: /health
|
||||
port: 8080
|
||||
failureThreshold: 30
|
||||
periodSeconds: 10
|
||||
priorityClassName: "tier-1-cluster"
|
||||
replicas: 3
|
||||
topologySpreadConstraints:
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue