rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)

The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 let k8s re-run a failing pod up to 3x (one
attempt plus two retries) within seconds, so each */2 cycle fired a burst
of POSTs into Cloudflare's per-60s Lists-API write limit. That kept the
throttle perpetually tripped (it stopped clearing even after minutes of
quiet): a self-inflicted DoS.
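
The burst size follows from how Kubernetes counts Job retries: a Job may
run up to backoff_limit + 1 pods before it is marked failed. A rough
sketch of the worst-case attempt count, assuming every attempt fails and
the */2 schedule yields 30 cycles per hour (the function names are
illustrative, not taken from the repo):

```python
def max_attempts_per_cycle(backoff_limit: int) -> int:
    # One initial pod run plus up to backoff_limit retries.
    return backoff_limit + 1


def max_attempts_per_hour(backoff_limit: int, schedule_minutes: int = 2) -> int:
    # Worst case: every scheduled cycle exhausts all of its attempts.
    cycles_per_hour = 60 // schedule_minutes
    return cycles_per_hour * max_attempts_per_cycle(backoff_limit)


if __name__ == "__main__":
    for limit in (2, 0):  # old setting, new setting
        print(limit, max_attempts_per_cycle(limit), max_attempts_per_hour(limit))
```

With backoff_limit=0 the worst case drops from three sync runs per cycle
to one, and those runs are spaced by the schedule rather than clustered.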

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
  retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
  cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
  retry. Any other CF error still fails loud.
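
The soft-skip can be pictured as a mapping from the Cloudflare HTTP
status to the process exit code. This is a minimal sketch, not the code
in lapi_kv_sync.py (which this page does not show); the function and
constant names are hypothetical:

```python
import sys

# Hypothetical names; the real lapi_kv_sync.py is not shown here.
EXIT_OK = 0
EXIT_FAIL = 1


def exit_code_for_cf_status(status: int) -> int:
    """Map a Cloudflare HTTP status to the exit code the sync should use."""
    if 200 <= status < 300:
        return EXIT_OK
    if status == 429:
        # Soft-skip: exit 0 so k8s does not count a failure; the next
        # scheduled cycle is the retry.
        print("Cloudflare 429 (rate-limited); skipping this cycle", file=sys.stderr)
        return EXIT_OK
    # Any other Cloudflare error still fails loud.
    print(f"Cloudflare error: HTTP {status}", file=sys.stderr)
    return EXIT_FAIL


if __name__ == "__main__":
    # Demo with a hard-coded status; a real sync would pass the status of
    # its Lists-API response here.
    sys.exit(exit_code_for_cf_status(429))
```

Exiting 0 on a 429 means the Job is recorded as succeeded, so an alert
on failed Jobs would not fire for rate limiting alone.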

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-27 15:23:42 +00:00
parent 7c72368243
commit 5b49634fe0
2 changed files with 26 additions and 4 deletions

@@ -234,7 +234,12 @@ resource "kubernetes_cron_job_v1" "crowdsec_cf_sync" {
 job_template {
 metadata {}
 spec {
-backoff_limit = 2
+# 0 retries: the */2 schedule IS the retry cadence. backoff_limit=2 made
+# k8s re-run a failing pod up to 3x within seconds, hammering Cloudflare's
+# Lists-API write limit inside one 60s window and escalating the throttle
+# until it stopped clearing (2026-06-27 outage). One attempt per cycle +
+# the 429-soft-skip in lapi_kv_sync.py keeps the sync gentle/self-healing.
+backoff_limit = 0
 ttl_seconds_after_finished = 3600
 template {
 metadata {