docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)

The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-21 13:39:26 +00:00
parent 4df741f6de
commit ceae4d5f06
3 changed files with 176 additions and 111 deletions


@@ -202,7 +202,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared. - **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
- **PDBs**: minAvailable=2 on Traefik and Authentik. - **PDBs**: minAvailable=2 on Traefik and Authentik.
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down. - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
- **CrowdSec bouncer**: graceful degradation mode (fail-open on error). - **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations). - **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain. - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts". - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
@@ -216,7 +216,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|---------|--------------------------| |---------|--------------------------|
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). 
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. 
**Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). 
**Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~1013.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). 
See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob | | CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. | | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version | | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |


@@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS
## Overview ## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram ## Architecture Diagram
@@ -16,12 +16,14 @@ graph TB
Traefik[Traefik Ingress<br/>3 replicas + PDB] Traefik[Traefik Ingress<br/>3 replicas + PDB]
subgraph "Middleware Chain" subgraph "Middleware Chain"
CS[CrowdSec Bouncer<br/>fail-open] AntiAI[Anti-AI bot-block<br/>fail-open]
Auth[Authentik Forward-Auth<br/>3 replicas + PDB] Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
RL[Rate Limiter<br/>429 response] RL[Rate Limiter<br/>429 response]
Retry[Retry<br/>2 attempts, 100ms] Retry[Retry<br/>2 attempts, 100ms]
end end
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1)" subgraph "Proxmox Host (eno1)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24] vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware] vmbr1[vmbr1 Internal<br/>VLAN-aware]
@@ -53,8 +55,9 @@ graph TB
Internet -->|DNS query| CF Internet -->|DNS query| CF
CF -->|CNAME to tunnel| CFD CF -->|CNAME to tunnel| CFD
CFD --> Traefik CFD --> Traefik
Traefik --> CS CSdrop -.->|banned IPs dropped before Traefik| Traefik
CS --> Auth Traefik --> AntiAI
AntiAI --> Auth
Auth --> RL Auth --> RL
RL --> Retry RL --> Retry
Retry --> Service Retry --> Service
@@ -82,7 +85,7 @@ graph TB
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled | | Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer | | CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open |
| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware | | Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 | | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
@@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
### Ingress Flow ### Ingress Flow
CrowdSec is **not** a step in this chain — banned IPs are dropped before the
request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host
nftables on direct hosts). The flow below is for a request that survives that
out-of-band gate.
```mermaid ```mermaid
sequenceDiagram sequenceDiagram
participant Client participant Client
participant Cloudflare participant CFedge as Cloudflare (edge WAF: crowdsec_ban block)
participant Cloudflared participant Cloudflared
participant Traefik participant Traefik
participant CrowdSec participant AntiAI
participant Authentik participant Authentik
participant RateLimit participant RateLimit
participant Retry participant Retry
participant Service participant Service
participant Pod participant Pod
Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me Client->>CFedge: HTTPS request to blog.viktorbarzin.me
Cloudflare->>Cloudflared: Forward via tunnel (QUIC) Note over CFedge: banned IP → blocked here (proxied hosts)
CFedge->>Cloudflared: Forward via tunnel (QUIC)
Cloudflared->>Traefik: HTTP to LoadBalancer IP Cloudflared->>Traefik: HTTP to LoadBalancer IP
Traefik->>CrowdSec: Apply bouncer middleware Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook)
CrowdSec->>Authentik: If allowed, check auth (protected=true) Traefik->>AntiAI: anti-AI bot-block (fail-open)
AntiAI->>Authentik: If allowed, check auth (protected=true)
Authentik->>RateLimit: If authenticated, check rate limit Authentik->>RateLimit: If authenticated, check rate limit
RateLimit->>Retry: If within limit, continue RateLimit->>Retry: If within limit, continue
Retry->>Service: Forward to Service Retry->>Service: Forward to Service
@@ -234,24 +244,27 @@ sequenceDiagram
Service-->>Retry: Response Service-->>Retry: Response
Retry-->>RateLimit: Response Retry-->>RateLimit: Response
RateLimit-->>Authentik: Response (strip auth headers) RateLimit-->>Authentik: Response (strip auth headers)
Authentik-->>CrowdSec: Response Authentik-->>AntiAI: Response
CrowdSec-->>Traefik: Response AntiAI-->>Traefik: Response
Traefik-->>Cloudflared: Response Traefik-->>Cloudflared: Response
Cloudflared-->>Cloudflare: Response via tunnel Cloudflared-->>CFedge: Response via tunnel
Cloudflare-->>Client: HTTPS response CFedge-->>Client: HTTPS response
``` ```
### Middleware Chain ### Middleware Chain
Every ingress created by the `ingress_factory` module follows this chain: CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
proxied hosts), so banned IPs never reach the chain and there is no per-request
CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
Traefik chain:
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages. 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load). 3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware: Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
- **HTTP/3 (QUIC)**: Enabled globally on Traefik. - **HTTP/3 (QUIC)**: Enabled globally on Traefik.
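The rate-limit step above (average 10 req/s, burst 50, 429 on overflow) can be illustrated with a minimal token-bucket sketch. This is not Traefik's code; `TokenBucket` and `status_for` are hypothetical names, and the only values taken from the docs are the average/burst pairs and the ~70 parallel requests the Actual web app fires at boot.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical token bucket: `rate` tokens/second refill, `burst` capacity."""
    rate: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # A fresh bucket starts full, so a client may burst immediately.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket: TokenBucket, now: float) -> int:
    # Over-limit requests get 429 (not 503), matching the documented behaviour.
    return 200 if bucket.allow(now) else 429


if __name__ == "__main__":
    default = TokenBucket(rate=10, burst=50)        # default `rate-limit`
    actual = TokenBucket(rate=50, burst=300)        # `actualbudget-rate-limit`
    boot = [status_for(default, 0.0) for _ in range(70)]
    print("default:", boot.count(200), "ok,", boot.count(429), "throttled")
    boot = [status_for(actual, 0.0) for _ in range(70)]
    print("actualbudget:", boot.count(200), "ok,", boot.count(429), "throttled")
```

Under these assumptions the default bucket admits the first 50 of 70 simultaneous requests and throttles the remaining 20, which is the "tail" the ActualBudget note describes; the 50/300 bucket admits all 70.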
### Entrypoint Transport Timeouts ### Entrypoint Transport Timeouts
@@ -348,7 +361,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
| pfSense | `stacks/pfsense/` | VM + cloud-init config | | pfSense | `stacks/pfsense/` | VM + cloud-init config |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC | | Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs | | Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer | | CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) |
| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs | | Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool | | MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) | | Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
@@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare. **Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
### Why Fail-Open on CrowdSec Bouncer? ### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)
**Alternatives considered**: CrowdSec used to enforce inline as a Traefik middleware (the
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic. `crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages. never invoked, so it enforced nothing; the plugin was removed and enforcement
moved off the request path entirely (full history in
`docs/architecture/security.md`). It now runs on two surfaces:
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on. - **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
the load-bearing one: with Traefik on a dedicated LB IP at
`externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
and transit the node's `forward` chain (not `input`) — which is exactly why the
ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
client IP the firewall-bouncer (and the CF edge rule) would have nothing to
match on.
- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
by the `crowdsec-cf-sync` CronJob.
Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
receiving new decisions (existing drops persist) and the CF sync skips a run —
neither ever blocks legitimate traffic. Availability > strict bot blocking, and
out-of-band enforcement adds **zero per-request latency** (no Traefik hop).
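The fail-open behaviour described above (LAPI unreachable means the sync skips a run and existing blocks persist) can be sketched as follows. This is an illustration only, not the `crowdsec-cf-sync` script: `sync_once`, `fetch_banned` and `push_list` are hypothetical names standing in for "read decisions from LAPI" and "replace the Cloudflare list contents".

```python
from typing import Callable, Iterable, Set


def sync_once(
    fetch_banned: Callable[[], Iterable[str]],
    push_list: Callable[[Set[str]], None],
    last_pushed: Set[str],
) -> Set[str]:
    """One sync run. Returns the set of IPs believed to be on the edge list."""
    try:
        banned = set(fetch_banned())
    except Exception:
        # Fail open: LAPI is unreachable, so skip this run entirely.
        # The edge list is left untouched (existing blocks persist, no new
        # bans) and nothing on the request path is affected.
        return last_pushed
    if banned != last_pushed:
        push_list(banned)
    return banned


if __name__ == "__main__":
    pushed = []

    def lapi_down() -> Iterable[str]:
        raise ConnectionError("LAPI unreachable")

    state = sync_once(lambda: ["203.0.113.7"], pushed.append, set())
    state = sync_once(lapi_down, pushed.append, state)
    print(state, len(pushed))
```

The design point is that an error path never clears or widens the list: the worst case of a LAPI outage is stale bans, not blocked legitimate traffic.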
### Why HTTP/3 (QUIC)? ### Why HTTP/3 (QUIC)?
@@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available. **Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.
**Diagnosis**: Middleware chain is blocking traffic. Check: **Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the
1. Authentik status: `kubectl get pod -n authentik` chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check:
2. CrowdSec LAPI status: `kubectl get pod -n crowdsec` 1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable)
2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down)
3. Traefik logs: `kubectl logs -n kube-system deploy/traefik` 3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware. **Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.


@@ -2,40 +2,50 @@
## Overview ## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation. The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram ## Architecture Diagram
CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The
Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry;
CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that
chain entirely.
 ```mermaid
-graph LR
+graph TB
 Internet[Internet]
-CF[Cloudflare WAF]
+subgraph "Proxied hosts (orange-cloud)"
+CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block]
+end
+subgraph "Direct hosts (grey-cloud / internal)"
+NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward]
+end
 Tunnel[Cloudflared Tunnel]
-CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
-AntiAI[Anti-AI Check<br/>poison-fountain]
-ForwardAuth[Authentik ForwardAuth]
-RateLimit[Rate Limit Middleware]
-Retry[Retry Middleware<br/>2 attempts, 100ms]
+Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry]
 Backend[Backend Service]
 LAPI[CrowdSec LAPI<br/>3 replicas]
-Agent[CrowdSec Agent]
+Agent[CrowdSec Agent<br/>parses Traefik logs]
+FWB[cs-firewall-bouncer<br/>DaemonSet, every node]
+CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min]
-Internet -->|1| CF
-CF -->|2| Tunnel
-Tunnel -->|3| CrowdSec
-CrowdSec -.->|Query| LAPI
-Agent -.->|Report| LAPI
-CrowdSec -->|4. Pass/Block| AntiAI
-AntiAI -->|5. Human/Bot| ForwardAuth
-ForwardAuth -->|6. Authenticated| RateLimit
-RateLimit -->|7. Under Limit| Retry
-Retry -->|8. Success/Retry| Backend
-style CrowdSec fill:#f9f,stroke:#333
-style AntiAI fill:#ff9,stroke:#333
-style ForwardAuth fill:#9f9,stroke:#333
-style RateLimit fill:#99f,stroke:#333
+Internet -->|proxied| CFedge
+Internet -->|direct| NFT
+CFedge -->|allowed| Tunnel
+Tunnel --> Traefik
+NFT -->|allowed| Traefik
+Traefik --> Backend
+Agent -.->|report| LAPI
+LAPI -.->|all decisions incl. CAPI| FWB
+FWB -.->|program drop rules| NFT
+LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync
+CFsync -.->|push IP list| CFedge
+style CFedge fill:#f9f,stroke:#333
+style NFT fill:#f9f,stroke:#333
 ```
 ## Components
@@ -44,7 +54,8 @@ graph LR
 |-----------|---------|----------|---------|
 | CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
 | CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
-| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
+| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` |
+| crowdsec-cf-sync | - | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` |
 | Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
 | poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
 | cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
@@ -54,11 +65,15 @@ graph LR
 ### Request Security Layers
-Every incoming request passes through 6 security layers:
+CrowdSec IP-reputation enforcement happens **before** a request reaches the
+Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at
+the Cloudflare edge on proxied hosts; see CrowdSec Threat Intelligence below).
+A request that survives that out-of-band gate then passes through the Traefik
+middleware chain:
-1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
-2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
-3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
+1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only)
+2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts)
+3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency)
 4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
 5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
 6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
@@ -80,58 +95,71 @@ CrowdSec operates in a hub-and-agent model:
 - Reports malicious IPs to LAPI
 - Shares threat intel with CrowdSec community (anonymized)
-**Traefik Bouncer Plugin** (`crowdsec-bouncer-traefik-plugin`, `stacks/traefik/modules/traefik/middleware.tf`):
-- Integrated as Traefik middleware (in the default ingress chain)
-- Queries LAPI for IP reputation on each request
-- **Registered with LAPI** via `BOUNCER_KEY_traefik` env on the LAPI container
-  (`stacks/crowdsec/modules/crowdsec/values.yaml`), seeded from the same Vault key
-  the middleware presents (`ingress_crowdsec_api_key`). **Before 2026-06-19 the
-  bouncer was never registered → LAPI returned 403 → the plugin failed open and
-  enforced nothing (no bans, no captcha).** The seed re-registers automatically on
-  every LAPI start, so a DB wipe (e.g. the MySQL→PostgreSQL migration that lost the
-  original registration) can't silently disable enforcement again.
-- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
-- **Only sees non-proxied (direct) apps' real client IPs** (ETP=Local). Proxied
-  apps arrive from cloudflared's pod IP (in `clientTrustedIPs`) and are bypassed;
-  extending enforcement to proxied apps needs `forwardedHeadersTrustedIPs` (future).
-- Honours two LAPI remediation types (profiles in `stacks/crowdsec/modules/crowdsec/values.yaml`):
-  - **`ban`** → HTTP 403 (serious attacks: CVE exploits, scanners, brute force)
-  - **`captcha`** → **Cloudflare Turnstile challenge** so the flagged user can
-    self-unblock (lower-severity abuse: `http-429-abuse`, `http-403-abuse`,
-    `http-crawl-non_statics`, `http-sensitive-files`). The plugin is configured
-    with `captchaProvider=turnstile` + the widget keys; the `captcha.html`
-    template is mounted into the Traefik pod at `/captcha`. The widget is
-    Terraform-managed in `stacks/traefik/main.tf`
-    (`cloudflare_turnstile_widget.crowdsec_captcha`, scoped to `viktorbarzin.me`
-    so it covers every subdomain). **Before 2026-06-19 no captcha provider was
-    configured, so `captcha` decisions silently degraded to a 403 ban**: users
-    had no way to self-unblock; wiring Turnstile fixed that.
+Enforcement is split across **two out-of-band surfaces**, neither of which adds
+any per-request latency. (See "Why the Traefik bouncer plugin was removed" below
+for the supersession history; there is no longer an inline Traefik bouncer.)
 
-**Cloudflare Edge Enforcement for proxied hosts** (`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
-- Proxied (orange-cloud) hosts terminate at the Cloudflare edge, so the in-cluster
-  bouncer above never decides on them. Edge enforcement instead syncs LAPI
-  decisions into **one Cloudflare account IP List (`crowdsec_ban`)** + a single
-  **zone-scoped WAF custom rule** blocking `(ip.src in $crowdsec_ban)` across every
-  proxied host. CronJob `crowdsec-cf-sync` (rybbit ns, every 2 min) reconciles it.
-- **BAN-ONLY (2026-06-20):** only `type=ban` decisions sync to the edge. `captcha`
-  decisions are deliberately NOT pushed: the CF account allows only ONE Rules List
-  with a single block action, so folding captcha in would hard-block a soft
-  challenge on every proxied host. (Before 2026-06-20 captcha was downgraded to a
-  hard block at the edge.)
-- **Auth carve-out (2026-06-20):** the WAF rule excludes `authentik.viktorbarzin.me`
-  + `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`), and the
-  Authentik UI ingress sets `exclude_crowdsec = true` for the in-cluster bouncer. A
-  CrowdSec hit must never wall a user out of the login / WebAuthn flow they
-  authenticate through; auth keeps `traefik-rate-limit` for brute-force protection.
-- **⚠️ Currently NON-FUNCTIONAL (known issue, pre-existing since the 2026-06-20
-  rollout):** `crowdsec-cf-sync` fails every run: `cf_list_items()` pagination
-  gets CF `HTTP 400 code 10027 "invalid or expired cursor"`, so the list never
-  populates (`num_items=0`) and the edge rule blocks nothing. LAPI also returns
-  ~31k ban IPs, likely exceeding CF IP-List capacity even once pagination is fixed.
-  **Edge enforcement for proxied hosts is therefore inert pending a fix** (the
-  in-cluster bouncer still protects direct apps; the auth carve-out is correct
-  regardless). Fix needs: (1) correct CF cursor pagination, (2) a capacity strategy
-  for the ban set.
+**Surface 1: DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop**
+(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`):
+- Runs on **every node** (no nodeSelector). Programs the HOST nftables
+  (`table ip crowdsec` / `table ip6 crowdsec6`) with drop rules in **both the
+  `input` AND the `forward` hooks**. The `forward` hook is required because Traefik is a
+  LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the
+  Traefik **pod** and transits the node's `forward` hook (not `input`) with the
+  real client IP preserved. Chains use `policy accept` (only set members drop, so
+  it can never blackhole normal traffic).
+- Pulls **all** decisions from LAPI, **including the CAPI community blocklist
+  (~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching
+  Traefik** → zero per-request hops, no Traefik involvement at all.
+- **Packaging**: cs-firewall-bouncer publishes no container image, so the
+  **v0.0.34** static binary is fetched at runtime by an initContainer onto a
+  `debian:bookworm-slim` runtime container. Needs `hostNetwork` +
+  `NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key:
+  **`firewall`**.
+- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions
+  (existing drop rules persist); it never blocks legitimate traffic.
+
+**Surface 2: PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block**
+(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
+- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop
+  would never see them. Enforcement is instead a single Cloudflare Rules List
+  **`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)`
+  with a **block** action, which covers every proxied host in the zone.
+- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min,
+  pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped**
+  decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI
+  community blocklist**: that set is far too large for a CF Rules List (the CF
+  account hard-limits to **one** list), and CAPI is already covered in-kernel on
+  direct hosts and by Cloudflare's own managed protections on proxied hosts.
+  Registered bouncer key: **`kvsync`**.
+- **Block-only**: the single-list limit precludes a separate
+  captcha/managed-challenge list, so both ban and captcha decisions are enforced
+  as a plain block at the edge.
+- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` +
+  `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit
+  must never wall a user out of the login / WebAuthn flow they authenticate
+  through; auth keeps `traefik-rate-limit` for brute-force protection.
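
The selection and reconcile step described for `crowdsec-cf-sync` can be sketched in pure-stdlib Python. This is an illustration under stated assumptions, not the contents of `lapi_kv_sync.py`: the function names `select_edge_ips` and `plan_sync` are hypothetical, the decision fields (`origin`, `scope`, `type`, `value`) are assumed to match the JSON objects LAPI returns for decisions, and `"CAPI"` is assumed to be the origin label of the community blocklist.

```python
from __future__ import annotations

from typing import Iterable

# Both remediation types end up as a plain block at the edge (single-list limit).
EDGE_TYPES = {"ban", "captcha"}
# Assumption: community-blocklist decisions carry this origin label.
EXCLUDED_ORIGINS = {"CAPI"}


def select_edge_ips(decisions: Iterable[dict]) -> set[str]:
    """Keep ip-scoped ban/captcha decisions that did not come from CAPI."""
    selected: set[str] = set()
    for decision in decisions:
        if str(decision.get("scope", "")).lower() != "ip":
            continue  # ranges, countries, etc. are not pushed in this sketch
        if decision.get("type") not in EDGE_TYPES:
            continue
        if decision.get("origin") in EXCLUDED_ORIGINS:
            continue
        selected.add(decision["value"])
    return selected


def plan_sync(desired: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (to_add, to_remove) so the edge list converges on `desired`."""
    return desired - current, current - desired
```

Computing a set difference instead of rewriting the whole list keeps each 2-minute run small, and an unbanned IP falls out of the list on the next run because it is no longer in `desired`.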
+**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers
+RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so
+enforcement never applies to internal users. Internal access uses split-horizon DNS
+straight to Traefik, and direct internal clients are RFC1918; both are whitelisted.
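
To make the whitelist coverage concrete, here is a minimal sketch of a CIDR membership check using Python's `ipaddress` module. It does not reproduce `whitelist.yaml`: the three RFC1918 ranges are standard, but `100.64.0.0/10` is only an assumption for the tailnet (the CGNAT range Tailscale normally allocates from), and the internal CIDRs and the single external IP are omitted because the text does not state them. `is_whitelisted` is a hypothetical helper name.

```python
import ipaddress

RFC1918 = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
# Assumption: the tailnet uses the CGNAT range; the real file may differ.
ASSUMED_TAILNET = ("100.64.0.0/10",)

WHITELIST = [ipaddress.ip_network(cidr) for cidr in RFC1918 + ASSUMED_TAILNET]


def is_whitelisted(ip: str) -> bool:
    """Return True if `ip` falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)
```

A client at `192.168.1.10` matches the RFC1918 ranges, while a public address such as `8.8.8.8` does not and stays subject to enforcement.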
+#### Why the Traefik bouncer plugin was removed
+Enforcement used to run as an inline Traefik middleware: the
+`crowdsec-bouncer-traefik-plugin` (Go source interpreted by Yaegi), which queried
+LAPI on every request and could serve a Cloudflare Turnstile captcha for soft remediations.
+On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was
+registered but enforced **nothing** despite appearing healthy. Rather than chase
+the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin
+static config + initContainer download, the `crowdsec` Middleware CRD, the
+`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare
+Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was
+replaced by the two out-of-band surfaces above, which add zero per-request
+latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination /
+IP-List-capacity issues are also moot now that CAPI is excluded from the edge
+list and dropped in-kernel instead.)
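
The fail-open property that both replacement surfaces share can be shown with a small sketch. It models the behaviour only; it is not how `cs-firewall-bouncer` or the sync job is implemented, and `next_enforced_set` and its `fetch` callback are hypothetical names.

```python
from typing import Callable, Set


def next_enforced_set(fetch: Callable[[], Set[str]], last_known: Set[str]) -> Set[str]:
    """Return the ban set to enforce after one poll of the decision source.

    If the source is unreachable, keep enforcing the last known set: no new
    bans arrive, existing drops persist, and nothing outside the set is blocked.
    """
    try:
        return set(fetch())
    except OSError:  # connection failures; urllib.error.URLError is a subclass
        return last_known
```

Because the enforced set can only ever contain addresses that the decision source previously returned, an outage never widens the block to legitimate traffic.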
 **Metabase** (disabled by default):
 - Dashboard for CrowdSec analytics
@@ -377,10 +405,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
 | Path | Purpose |
 |------|---------|
-| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
+| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` |
+| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) |
+| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) |
 | `stacks/kyverno/` | Kyverno deployment + policies |
 | `stacks/poison-fountain/` | Anti-AI service + CronJob |
-| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
+| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) |
 | `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
 ### Vault Paths
@@ -490,7 +520,11 @@ spec:
 **Fix**:
 1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
 2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
-3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
+   The in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct
+   hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the
+   `crowdsec_ban` CF list within ~2 min.
+3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet
+   + internal CIDRs are already whitelisted, so internal clients are never banned).
### Kyverno Policy Blocking Deployment ### Kyverno Policy Blocking Deployment