rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429 (rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's per-60s Lists-API write limit. That kept the throttle perpetually tripped (it stopped clearing even after minutes of quiet): a self-inflicted DoS.

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s retry. Any other CF error still fails loud.

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
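The 429 soft-skip described above could look roughly like the sketch below. This is not the actual code in lapi_kv_sync.py, which is not shown in this commit view; the function name `exit_code_for_cf_status` and the constants are hypothetical, and the sketch assumes the script maps the Cloudflare response status to its process exit code.

```python
import sys

EXIT_OK = 0    # k8s sees the job as succeeded; the next */2 cycle tries again
EXIT_FAIL = 1  # k8s sees the job as failed (fail loud)


def exit_code_for_cf_status(status: int) -> int:
    """Map a Cloudflare HTTP status to a process exit code (hypothetical helper)."""
    if 200 <= status < 300:
        return EXIT_OK
    if status == 429:
        # Rate-limited: soft-skip so nothing retries inside the same window.
        print("Cloudflare 429 (rate-limited); skipping this cycle", file=sys.stderr)
        return EXIT_OK
    # Any other Cloudflare error still fails loud.
    print(f"Cloudflare error: HTTP {status}", file=sys.stderr)
    return EXIT_FAIL
```

A caller would end with something like `sys.exit(exit_code_for_cf_status(resp.status_code))`, where `resp` is whatever response object the script's HTTP client returns.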
parent 7c72368243
commit 5b49634fe0
2 changed files with 26 additions and 4 deletions
@@ -234,7 +234,12 @@ resource "kubernetes_cron_job_v1" "crowdsec_cf_sync" {
     job_template {
       metadata {}
       spec {
-        backoff_limit = 2
+        # 0 retries: the */2 schedule IS the retry cadence. backoff_limit=2 made
+        # k8s re-run a failing pod up to 3x within seconds, hammering Cloudflare's
+        # Lists-API write limit inside one 60s window and escalating the throttle
+        # until it stopped clearing (2026-06-27 outage). One attempt per cycle +
+        # the 429-soft-skip in lapi_kv_sync.py keeps the sync gentle/self-healing.
+        backoff_limit = 0
         ttl_seconds_after_finished = 3600
         template {
           metadata {
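The retry arithmetic behind the backoff_limit change can be checked with a small sketch. It assumes, as the commit message states, that retries of a failing pod land within seconds of each other and so fall in the same 60s rate-limit window; the helper name `max_attempts_per_cycle` is hypothetical.

```python
def max_attempts_per_cycle(backoff_limit: int) -> int:
    """Upper bound on pod runs per CronJob cycle: the first run plus its retries."""
    if backoff_limit < 0:
        raise ValueError("backoff_limit must be non-negative")
    return backoff_limit + 1


# Before the change: up to three runs per */2 cycle.
before = max_attempts_per_cycle(2)
# After the change: exactly one run per cycle; the schedule is the retry cadence.
after = max_attempts_per_cycle(0)
print(before, after)
```

Each run can issue several POSTs to the Lists API, so the number of writes per window shrinks by the same factor as the number of runs.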