crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply

CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-20 19:29:43 +00:00
parent 1406d8a391
commit 7cf93a0587
3 changed files with 69 additions and 90 deletions

View file

@ -33,6 +33,7 @@
# nodeSelector pins this to ONE node (k8s-node2, which runs a Traefik pod) for first validation. # nodeSelector pins this to ONE node (k8s-node2, which runs a Traefik pod) for first validation.
# !!! REMOVING THE nodeSelector ROLLS THIS DAEMONSET CLUSTER-WIDE !!! # !!! REMOVING THE nodeSelector ROLLS THIS DAEMONSET CLUSTER-WIDE !!!
# Do that ONLY after the one-node validation checklist passes (see commit/PR). # Do that ONLY after the one-node validation checklist passes (see commit/PR).
# Validating on k8s-node2 (single node) before removing the nodeSelector to roll cluster-wide.
locals { locals {
# Pin a specific stable release. Bump deliberately (re-validate on one node first). # Pin a specific stable release. Bump deliberately (re-validate on one node first).

View file

@ -4,28 +4,29 @@
# Proxied hosts terminate at the Cloudflare edge, so the in-cluster CrowdSec # Proxied hosts terminate at the Cloudflare edge, so the in-cluster CrowdSec
# bouncer (which keys on the real client IP seen by Traefik) never gets to # bouncer (which keys on the real client IP seen by Traefik) never gets to
# decide on them. To enforce CrowdSec bans/captchas on proxied traffic we push # decide on them. To enforce CrowdSec bans/captchas on proxied traffic we push
# the decision INTO the Cloudflare edge as account-level IP Lists + a single # the decision INTO the Cloudflare edge as a SINGLE account-level IP List + one
# zone-scoped WAF custom rule: # zone-scoped WAF custom rule:
# #
# * Two account IP Lists `crowdsec_ban` and `crowdsec_captcha` hold the # * ONE account IP List `crowdsec_ban` holds BOTH the banned AND captcha'd
# banned / captcha'd source IPs (empty in TF; populated at runtime). # source IPs (empty in TF; populated at runtime). The CF account hard-limits
# to ONE Rules List, so captcha decisions are downgraded to block at the
# edge and folded into this same list (block-only enforcement).
# * A zone-scoped WAF ruleset in the http_request_firewall_custom phase # * A zone-scoped WAF ruleset in the http_request_firewall_custom phase
# blocks `(ip.src in $crowdsec_ban)` and managed-challenges # blocks `(ip.src in $crowdsec_ban)`. Because it's a ZONE rule it enforces
# `(ip.src in $crowdsec_captcha)`. Because it's a ZONE rule it enforces
# across ALL proxied hosts in the zone (~135), not just the handful a # across ALL proxied hosts in the zone (~135), not just the handful a
# Worker would route. (The previous Worker+KV design only covered the ~27 # Worker would route. (The previous Worker+KV design only covered the ~27
# hosts the rybbit Worker routed; the analytics Worker in worker/ is # hosts the rybbit Worker routed; the analytics Worker in worker/ is
# unrelated and stays.) # unrelated and stays.)
# #
# This file is the CONTROL PLANE that keeps those lists in sync with LAPI: # This file is the CONTROL PLANE that keeps that list in sync with LAPI:
# 1. the two empty IP Lists (list ITEMS are owned by the CronJob at runtime, # 1. the single empty IP List (list ITEMS are owned by the CronJob at runtime,
# NOT by Terraform see the lifecycle ignore_changes on `item`), # NOT by Terraform see the lifecycle ignore_changes on `item`),
# 2. a LEAST-PRIVILEGE Cloudflare API token (account Filter-Lists edit only, # 2. a LEAST-PRIVILEGE Cloudflare API token (account Filter-Lists edit only,
# scoped to this account) the sync job authenticates with, # scoped to this account) the sync job authenticates with,
# 3. a CronJob running lapi_kv_sync.py every 2 min to full-reconcile LAPI # 3. a CronJob running lapi_kv_sync.py every 2 min to full-reconcile LAPI
# decisions into the two lists (mirrors monitoring/alert_digest.tf: stock # decisions (ban + captcha) into the one list (mirrors
# python:3.12-alpine + pure-stdlib script from a ConfigMap, no pip/apk at # monitoring/alert_digest.tf: stock python:3.12-alpine + pure-stdlib script
# runtime). # from a ConfigMap, no pip/apk at runtime).
# #
# Cloudflare provider is pinned v4.52.7 (~> 4) v4 schema is used throughout # Cloudflare provider is pinned v4.52.7 (~> 4) v4 schema is used throughout
# (v5 differs greatly: policy is a block here not a `policies = [...]` list; # (v5 differs greatly: policy is a block here not a `policies = [...]` list;
@ -44,7 +45,7 @@ locals {
} }
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# IP Lists empty shells. The CronJob owns the items at runtime via the CF # IP List empty shell. The CronJob owns the items at runtime via the CF
# Rules-Lists API; TF must NOT manage items or every 2-min sync would fight the # Rules-Lists API; TF must NOT manage items or every 2-min sync would fight the
# next `terragrunt apply` (apply would try to delete the runtime items). # next `terragrunt apply` (apply would try to delete the runtime items).
# #
@ -54,7 +55,7 @@ locals {
# comment=... }`. We declare NO `item` blocks (empty list) and # comment=... }`. We declare NO `item` blocks (empty list) and
# ignore_changes=[item] so runtime items don't show as drift. # ignore_changes=[item] so runtime items don't show as drift.
# NOTE: list `name` must match /^[a-zA-Z0-9_]+$/ (underscores ok, no dashes) # NOTE: list `name` must match /^[a-zA-Z0-9_]+$/ (underscores ok, no dashes)
# hence crowdsec_ban / crowdsec_captcha (underscore, not dash). # hence crowdsec_ban (underscore, not dash).
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
resource "cloudflare_list" "crowdsec_ban" { resource "cloudflare_list" "crowdsec_ban" {
account_id = local.cf_account_id account_id = local.cf_account_id
@ -69,29 +70,17 @@ resource "cloudflare_list" "crowdsec_ban" {
} }
} }
resource "cloudflare_list" "crowdsec_captcha" {
account_id = local.cf_account_id
name = "crowdsec_captcha"
kind = "ip"
description = "CrowdSec captcha decisions (synced from LAPI)"
lifecycle {
ignore_changes = [item]
}
}
# ----------------------------------------------------------------------------- # -----------------------------------------------------------------------------
# Zone-scoped WAF custom ruleset the actual enforcement. One ruleset, two # Zone-scoped WAF custom ruleset the actual enforcement. One ruleset, one
# rules, applied to EVERY proxied host in the zone. # block rule, applied to EVERY proxied host in the zone.
# #
# ### VERIFY (v4.52.7): cloudflare_ruleset with zone_id + kind="zone" + # ### VERIFY (v4.52.7): cloudflare_ruleset with zone_id + kind="zone" +
# phase="http_request_firewall_custom"; `rules` is a repeatable block with # phase="http_request_firewall_custom"; `rules` is a repeatable block with
# action/expression/description/enabled. actions "block" and # action/expression/description/enabled. action "block" is valid. List
# "managed_challenge" are both valid. List references in WAF expressions use # references in WAF expressions use the list NAME with a `$` prefix (NOT the
# the list NAME with a `$` prefix (NOT the list id): ($crowdsec_ban). # list id): ($crowdsec_ban). Both ban and captcha decisions land in this one
# Rule order matters ban (block) is evaluated before captcha so a # list (the CF account allows only one Rules List), so a single block rule
# double-listed IP is blocked outright (the sync script also enforces # covers everything captcha is enforced as block at the edge.
# ban-wins, so an IP is never in both lists, but order is belt-and-braces).
# #
# zone_id is the viktorbarzin.me zone the single zone id used repo-wide # zone_id is the viktorbarzin.me zone the single zone id used repo-wide
# (default of var.cloudflare_zone_id in modules/kubernetes/ingress_factory and # (default of var.cloudflare_zone_id in modules/kubernetes/ingress_factory and
@ -116,24 +105,19 @@ resource "cloudflare_ruleset" "crowdsec" {
kind = "zone" kind = "zone"
phase = "http_request_firewall_custom" phase = "http_request_firewall_custom"
# The WAF rules reference the IP lists by name ($crowdsec_ban / $crowdsec_captcha), # The WAF rule references the IP list by name ($crowdsec_ban), so the list
# so the lists must exist before this ruleset is created/updated. # must exist before this ruleset is created/updated.
depends_on = [cloudflare_list.crowdsec_ban, cloudflare_list.crowdsec_captcha] depends_on = [cloudflare_list.crowdsec_ban]
# CrowdSec ban evaluated FIRST so a banned IP is blocked before anything else. # CrowdSec ban block every IP in the single edge list. The sync writes BOTH
# ban and captcha decisions into crowdsec_ban (captcha downgraded to block at
# the edge) because the CF account allows only ONE Rules List.
rules { rules {
action = "block" action = "block"
expression = "(ip.src in $crowdsec_ban)" expression = "(ip.src in $crowdsec_ban)"
description = "CrowdSec: block banned IPs" description = "CrowdSec: block banned IPs"
enabled = true enabled = true
} }
# CrowdSec captcha managed challenge for flagged IPs.
rules {
action = "managed_challenge"
expression = "(ip.src in $crowdsec_captcha)"
description = "CrowdSec: challenge flagged IPs"
enabled = true
}
# Pre-existing rule, imported and preserved verbatim (currently disabled). # Pre-existing rule, imported and preserved verbatim (currently disabled).
rules { rules {
action = "skip" action = "skip"
@ -279,10 +263,6 @@ resource "kubernetes_cron_job_v1" "crowdsec_cf_sync" {
name = "CF_BAN_LIST_ID" name = "CF_BAN_LIST_ID"
value = cloudflare_list.crowdsec_ban.id value = cloudflare_list.crowdsec_ban.id
} }
env {
name = "CF_CAPTCHA_LIST_ID"
value = cloudflare_list.crowdsec_captcha.id
}
env { env {
name = "PUSHGATEWAY_URL" name = "PUSHGATEWAY_URL"
value = "http://prometheus-prometheus-pushgateway.monitoring:9091" value = "http://prometheus-prometheus-pushgateway.monitoring:9091"

View file

@ -1,30 +1,33 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
"""Sync CrowdSec LAPI decisions -> two Cloudflare account IP Lists. """Sync CrowdSec LAPI decisions -> ONE Cloudflare account IP List (block-only).
Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec Cloudflare-PROXIED hosts terminate at the CF edge, so the in-cluster CrowdSec
bouncer (which keys on the client IP Traefik sees) never decides on them. We bouncer (which keys on the client IP Traefik sees) never decides on them. We
push the decisions into the edge instead: a zone-scoped WAF custom rule blocks push the decisions into the edge instead: a zone-scoped WAF custom rule blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)` `(ip.src in $crowdsec_ban)` across EVERY proxied host in the zone. This job is
across EVERY proxied host in the zone. This job is the control plane that keeps the control plane that keeps that one IP List in sync with LAPI.
those two IP Lists in sync with LAPI.
The CF account hard-limits to ONE Rules List, so enforcement is BLOCK-ONLY:
BOTH ban AND captcha (scope=="ip") decisions are folded into the single
crowdsec_ban list and captcha is downgraded to block at the proxied edge.
(Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior (Filename kept as lapi_kv_sync.py for path/ConfigMap continuity with the prior
Workers-KV design; it no longer touches KV it reconciles CF Rules Lists.) Workers-KV design; it no longer touches KV it reconciles a CF Rules List.)
Design notes: Design notes:
* Pure Python stdlib (no pip/apk at runtime) runs on stock python:3.12-alpine * Pure Python stdlib (no pip/apk at runtime) runs on stock python:3.12-alpine
mounted from a ConfigMap, the alert_digest pattern. mounted from a ConfigMap, the alert_digest pattern.
* FULL RECONCILE each run: read the complete decision set from LAPI, partition * FULL RECONCILE each run: read the complete decision set from LAPI, take the
into ban / captcha desired sets, then for each list compute add (desired - UNION of ban + captcha (scope=="ip") as the single desired set, then compute
existing) and remove (existing - desired) and apply both. An IP listed for add (desired - existing) and remove (existing - desired) against the one
BOTH ban and captcha is placed in BAN ONLY (ban wins; the WAF rule order crowdsec_ban list and apply both. A `cscli decisions delete` therefore
also blocks-before-challenges as belt-and-braces). A `cscli decisions clears from the edge within one interval (<=2 min).
delete` therefore clears from the edge within one interval (<=2 min). * FAIL-SAFE on LAPI: if LAPI can't be read we SKIP the run (list untouched,
* FAIL-SAFE on LAPI: if LAPI can't be read we SKIP the run (lists untouched, exit 0). A LAPI outage thus freezes the edge state rather than wiping the
exit 0). A LAPI outage thus freezes the edge state rather than wiping every block list degrade toward the last-known-good block set, never toward
ban degrade toward the last-known-good block set, never toward all-block all-block or a thundering un-ban. (Decisions linger only until the next
or a thundering un-ban. (Decisions linger only until the next successful successful sync, not their TTL we reconcile to LAPI truth, we don't
sync, not their TTL we reconcile to LAPI truth, we don't expire entries.) expire entries.)
* FAIL-LOUD on Cloudflare: any CF API error is logged and the job exits * FAIL-LOUD on Cloudflare: any CF API error is logged and the job exits
non-zero so the failure is visible (CronJob backoff + missing success non-zero so the failure is visible (CronJob backoff + missing success
metric + the next run retries). metric + the next run retries).
@ -42,9 +45,9 @@ official API reference (developers.cloudflare.com, 2026):
* GET /accounts/{acct}/rules/lists/bulk_operations/{op_id} -> status in * GET /accounts/{acct}/rules/lists/bulk_operations/{op_id} -> status in
{pending,running,completed,failed} (failed carries `error`). {pending,running,completed,failed} (failed carries `error`).
ASYNC HANDLING: Cloudflare allows only ONE pending bulk operation per ACCOUNT. ASYNC HANDLING: Cloudflare allows only ONE pending bulk operation per ACCOUNT.
So we must NOT fire add+delete (or both lists) concurrently we serialize and So we must NOT fire add+delete concurrently we serialize and poll each
poll each operation_id to a terminal state (short bounded timeout) before the operation_id to a terminal state (short bounded timeout) before the next
next mutation. If a poll times out we stop mutating for this run and report mutation. If a poll times out we stop mutating for this run and report
partial success (the next 2-min run reconciles the rest); we never abandon an partial success (the next 2-min run reconciles the rest); we never abandon an
in-flight op and immediately issue another (that would 409/reject). in-flight op and immediately issue another (that would 409/reject).
""" """
@ -63,7 +66,6 @@ LAPI_KEY = os.environ["LAPI_KEY"] # kvsync bouncer key, registered in LAPI
CF_ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"] CF_ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]
CF_API_TOKEN = os.environ["CF_API_TOKEN"] # scoped: Account Filter Lists Edit CF_API_TOKEN = os.environ["CF_API_TOKEN"] # scoped: Account Filter Lists Edit
CF_BAN_LIST_ID = os.environ["CF_BAN_LIST_ID"] CF_BAN_LIST_ID = os.environ["CF_BAN_LIST_ID"]
CF_CAPTCHA_LIST_ID = os.environ["CF_CAPTCHA_LIST_ID"]
PUSHGATEWAY = os.environ.get("PUSHGATEWAY_URL", "").rstrip("/") # optional PUSHGATEWAY = os.environ.get("PUSHGATEWAY_URL", "").rstrip("/") # optional
CF_API = "https://api.cloudflare.com/client/v4" CF_API = "https://api.cloudflare.com/client/v4"
@ -115,17 +117,19 @@ def _cf(url, *, method="GET", payload=None, timeout=20):
# LAPI # LAPI
# --------------------------------------------------------------------------- # # --------------------------------------------------------------------------- #
def fetch_decisions(): def fetch_decisions():
"""Return (ban_set, captcha_set) of IPs from LAPI. """Return the single desired set of IPs to BLOCK at the edge.
Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). An Only scope=="ip" decisions are projected (the WAF rule keys on ip.src). The
IP appearing in both ban and captcha is placed in BAN only. Raises on CF account allows only ONE Rules List, so BOTH "ban" AND "captcha" decisions
transport/HTTP error so the caller can SKIP the run (fail-safe). are folded into one block set (captcha is downgraded to block at the proxied
edge). Raises on transport/HTTP error so the caller can SKIP the run
(fail-safe).
""" """
data = _req( data = _req(
f"{LAPI_URL}/v1/decisions", f"{LAPI_URL}/v1/decisions",
headers={"X-Api-Key": LAPI_KEY, "Accept": "application/json"}, headers={"X-Api-Key": LAPI_KEY, "Accept": "application/json"},
) )
ban, captcha = set(), set() block = set()
for d in data or []: for d in data or []:
if (d.get("scope") or "").lower() != "ip": if (d.get("scope") or "").lower() != "ip":
continue continue
@ -133,13 +137,10 @@ def fetch_decisions():
if not ip: if not ip:
continue continue
dtype = (d.get("type") or "").lower() dtype = (d.get("type") or "").lower()
if dtype == "ban": if dtype in ("ban", "captcha"):
ban.add(ip) block.add(ip)
elif dtype == "captcha":
captcha.add(ip)
# other remediation types (e.g. throttle) are ignored # other remediation types (e.g. throttle) are ignored
captcha -= ban # ban wins: never list the same IP in both return block
return ban, captcha
# --------------------------------------------------------------------------- # # --------------------------------------------------------------------------- #
@ -260,14 +261,12 @@ def reconcile(label, list_id, desired):
# --------------------------------------------------------------------------- # # --------------------------------------------------------------------------- #
# Metrics (best-effort) # Metrics (best-effort)
# --------------------------------------------------------------------------- # # --------------------------------------------------------------------------- #
def push_metrics(ban_n, captcha_n, ok): def push_metrics(block_n, ok):
if not PUSHGATEWAY: if not PUSHGATEWAY:
return return
payload = ( payload = (
"# TYPE crowdsec_cf_list_ban_count gauge\n" "# TYPE crowdsec_cf_list_ban_count gauge\n"
f"crowdsec_cf_list_ban_count {ban_n}\n" f"crowdsec_cf_list_ban_count {block_n}\n"
"# TYPE crowdsec_cf_list_captcha_count gauge\n"
f"crowdsec_cf_list_captcha_count {captcha_n}\n"
"# TYPE crowdsec_cf_list_sync_success gauge\n" "# TYPE crowdsec_cf_list_sync_success gauge\n"
f"crowdsec_cf_list_sync_success {1 if ok else 0}\n" f"crowdsec_cf_list_sync_success {1 if ok else 0}\n"
"# TYPE crowdsec_cf_list_sync_last_run_seconds gauge\n" "# TYPE crowdsec_cf_list_sync_last_run_seconds gauge\n"
@ -289,28 +288,27 @@ def push_metrics(ban_n, captcha_n, ok):
def main(): def main():
# 1. Desired state from LAPI. Any failure here = SKIP (fail-safe). # 1. Desired state from LAPI. Any failure here = SKIP (fail-safe).
try: try:
ban, captcha = fetch_decisions() block = fetch_decisions()
except Exception as e: except Exception as e:
print( print(
f"[skip] LAPI unreadable ({e}); leaving CF lists untouched " f"[skip] LAPI unreadable ({e}); leaving the CF list untouched "
f"(fail-safe: freeze last-known edge state).", f"(fail-safe: freeze last-known edge state).",
file=sys.stderr, file=sys.stderr,
) )
push_metrics(0, 0, ok=False) push_metrics(0, ok=False)
return 0 return 0
print(f"[info] LAPI desired: {len(ban)} ban / {len(captcha)} captcha (ip-scope)") print(f"[info] LAPI desired: {len(block)} block (ban+captcha, ip-scope)")
# 2. Reconcile both lists. CF errors fail loud (non-zero exit). # 2. Reconcile the single block list. CF errors fail loud (non-zero exit).
try: try:
reconcile("ban", CF_BAN_LIST_ID, ban) reconcile("block", CF_BAN_LIST_ID, block)
reconcile("captcha", CF_CAPTCHA_LIST_ID, captcha)
except CFError as e: except CFError as e:
print(f"[error] Cloudflare API failure: {e}", file=sys.stderr) print(f"[error] Cloudflare API failure: {e}", file=sys.stderr)
push_metrics(len(ban), len(captcha), ok=False) push_metrics(len(block), ok=False)
return 1 return 1
push_metrics(len(ban), len(captcha), ok=True) push_metrics(len(block), ok=True)
return 0 return 0