From 9529eedfe0b6e7dd90a0c60b562bf797bdde2ece Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Fri, 5 Jun 2026 22:07:02 +0000
Subject: [PATCH] docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]

Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain.

- security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8
---
 .claude/CLAUDE.md             |  2 +-
 docs/architecture/security.md | 36 +++++++++++++++++++++++------------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index ca56a127..3451fe78 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -66,7 +66,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
-- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
+- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium, matrix. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: `) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr). Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
 - **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
 - **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
 - **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
diff --git a/docs/architecture/security.md b/docs/architecture/security.md
index 6a6286ae..6b3e794b 100644
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@@ -143,10 +143,22 @@ Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Rob
 
 #### Layer 1: Bot Blocking (ForwardAuth)
 
-- Middleware calls `poison-fountain` service before backend
-- Analyzes User-Agent, request patterns, timing
-- Blocks known AI scrapers (GPTBot, CCBot, etc.)
-- **Fail-open**: If poison-fountain down, allows traffic
+- `ai-bot-block` middleware forward-auths to the `bot-block-proxy` openresty
+ service (`stacks/traefik/modules/traefik/main.tf`) — the bot-check hop before
+ the backend.
+- **Currently a no-op (allow-all).** `poison-fountain` is intentionally scaled
+ to 0 (clears the ExternalAccessDivergence alert), so `bot-block-proxy`
+ short-circuits `/auth` to `return 200 "allowed"` instead of proxying to an
+ absent upstream. Same effective behaviour as the previous `proxy_pass`
+ `error_page 5xx=200` fail-open, minus the ~51k/hr upstream-connect error logs
+ and per-request connect latency it generated (cleaned up 2026-06-05, found via
+ Loki). The Deployment carries `configmap.reloader.stakater.com/reload` so
+ config changes actually reload openresty (it does not hot-reload on its own).
+- **To re-enable real bot-blocking**: restore the `upstream poison_fountain`
+ `proxy_pass http://poison_fountain;` block in the `bot-block-proxy-config`
+ ConfigMap (git history) and scale `poison-fountain` up. It then forward-auths
+ bot checks (User-Agent / patterns) and tarpits known AI scrapers, fail-open if
+ poison-fountain is down.
 
 #### Layer 2: X-Robots-Tag Header
 
@@ -160,12 +172,12 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l
 
 #### Layer 3 (formerly 4): Tarpit / Poison Content
 
-- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
-- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
+- `poison-fountain` exists as a standalone service at `poison.viktorbarzin.me` but the serving Deployment is **scaled to 0** (replicas=0); only its 6-hourly content-fetch CronJob runs. The tarpit is therefore dormant until re-enabled.
+- When running: serves AI bots extremely slowly (~50 bytes / 0.5s tarpit drip)
 - CronJob every 6 hours generates fake content
-- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
+- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly would get tarpitted and poisoned
 
-**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
+**Implementation**: See `stacks/poison-fountain/` and `stacks/traefik/modules/traefik/{middleware.tf,main.tf}` (traefik moved from the platform stack to its own `traefik` stack)
 
 ### Audit Logging & Anomaly Detection (Wave 1)
 
@@ -441,12 +453,12 @@ spec:
 
 ### Anti-AI Service Down, Traffic Blocked
 
-**Problem**: `poison-fountain` service unhealthy, all traffic blocked.
+**Problem**: anti-AI ForwardAuth (`ai-bot-block`) blocks traffic. With `bot-block-proxy` as a no-op `return 200` (poison-fountain scaled to 0) this should not happen; if it does, `bot-block-proxy` itself is unreachable (Traefik ForwardAuth fails **closed** when the auth server is down).
 
 **Fix**:
-1. Verify fail-open config: Check `stacks/platform/modules/traefik/middleware.tf` for `failurePolicy: allow`
-2. Restart service: `kubectl rollout restart deployment/poison-fountain -n poison-fountain`
-3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services
+1. Check `bot-block-proxy` pods are Ready: `kubectl get pods -n traefik -l app=bot-block-proxy` (2 replicas; critical-path forward-auth target).
+2. Inspect/restart: `kubectl rollout restart deployment/bot-block-proxy -n traefik`. Config lives in the `bot-block-proxy-config` ConfigMap (`stacks/traefik/modules/traefik/main.tf`); changes auto-reload via the `configmap.reloader.stakater.com/reload` annotation.
+3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services.
 
 ### Rate Limit Too Aggressive
 