rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)

The edge-ban sync was failing every 2 min on Cloudflare HTTP 429 (rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's per-60s Lists-API write limit. That kept the throttle perpetually tripped (it stopped clearing even after minutes of quiet): a self-inflicted DoS.

Two changes make the sync gentle and self-healing:

- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s retry. Any other CF error still fails loud.

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
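The soft-skip described in the message could look roughly like the sketch below. It is not the actual lapi_kv_sync.py code: CloudflareAPIError, exit_code_for_sync and the push_bans callable are hypothetical names, and the only behavior taken from the commit is "429 exits 0, any other CF error fails loud".

```python
import sys
from typing import Callable


class CloudflareAPIError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed CF call."""

    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(f"Cloudflare API returned HTTP {status} {message}".strip())
        self.status = status


def exit_code_for_sync(push_bans: Callable[[], None]) -> int:
    """Run one sync attempt and map its outcome to a process exit code.

    push_bans is an assumed callable that performs the Lists-API writes and
    raises CloudflareAPIError when Cloudflare answers with an error status.
    """
    try:
        push_bans()
    except CloudflareAPIError as err:
        if err.status == 429:
            # Rate-limited: exit 0 so k8s records success and the next
            # scheduled cycle acts as the retry.
            print(f"soft-skip, retrying next cycle: {err}", file=sys.stderr)
            return 0
        # Any other Cloudflare error still fails loud (non-zero exit).
        print(f"sync failed: {err}", file=sys.stderr)
        return 1
    return 0
```

A script entry point would then call sys.exit(exit_code_for_sync(push_bans)). With backoff_limit = 0, a non-zero exit marks the job failed without an immediate re-run, so the */2 schedule remains the only source of retries.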
2026-06-27 15:23:42 +00:00
commit 5b49634fe0
parent 7c72368243
2 changed files with 26 additions and 4 deletions
--- a/stacks/rybbit/crowdsec_edge.tf
+++ b/stacks/rybbit/crowdsec_edge.tf
@@ -234,7 +234,12 @@ resource "kubernetes_cron_job_v1" "crowdsec_cf_sync" {
    job_template {
      metadata {}
      spec {
-        backoff_limit              = 2
+        # 0 retries: the */2 schedule IS the retry cadence. backoff_limit=2 made
+        # k8s re-run a failing pod up to 3x within seconds, hammering Cloudflare's
+        # Lists-API write limit inside one 60s window and escalating the throttle
+        # until it stopped clearing (2026-06-27 outage). One attempt per cycle +
+        # the 429-soft-skip in lapi_kv_sync.py keeps the sync gentle/self-healing.
+        backoff_limit              = 0
        ttl_seconds_after_finished = 3600
        template {
          metadata {