docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip]
Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain.

- security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step).
- .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 9ad7756a94
commit 9529eedfe0
2 changed files with 25 additions and 13 deletions
@@ -66,7 +66,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
-- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
+- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium, matrix. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: <secret>`) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr). Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
 - **K8s credentials**: Vault K8s secrets engine. Roles: `dashboard-admin`, `ci-deployer`, `openclaw`, `local-admin`. Use `vault write kubernetes/creds/ROLE kubernetes_namespace=NS`. Helper: `scripts/vault-kubeconfig`.
 - **CI/CD (GHA + Woodpecker)**: Docker builds run on **GitHub Actions** (free on public repos). Woodpecker is **deploy-only** — receives image tag via API POST, runs `kubectl set image`. Woodpecker authenticates via K8s SA JWT → Vault K8s auth. Sync CronJob pushes `secret/ci/global` → Woodpecker API every 6h. Shell scripts in HCL heredocs: escape `$` → `$$`, `%{}` → `%%{}`.
 - **Platform cannot depend on vault** (circular). Apply order: vault first, then platform. Platform has 48 vault refs, all in module inputs — no ESO migration possible.
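
For illustration, the Reloader requirement described in the `.claude/CLAUDE.md` change above might look like the following on a Deployment whose container reads the DB password only at startup. This is a minimal sketch, not the repository's actual manifest: the Deployment name `synapse`, the namespace `matrix`, and the Secret name `matrix-db-creds` are hypothetical placeholders, and in this repo the annotation would presumably be set through the owning Terraform stack rather than a raw manifest.

```yaml
# Fragment only: names are hypothetical and the spec is omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: synapse                 # hypothetical
  namespace: matrix             # hypothetical
  annotations:
    # Reloader watches the named Secret and rolls the Deployment when its
    # data changes, so startup-only readers pick up the rotated password.
    secret.reloader.stakater.com/reload: "matrix-db-creds"   # hypothetical Secret name
```

The annotation sits on the Deployment's own metadata, not on the pod template, which is where Reloader looks for it.
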
@@ -143,10 +143,22 @@ Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Rob

 #### Layer 1: Bot Blocking (ForwardAuth)

-- Middleware calls `poison-fountain` service before backend
-- Analyzes User-Agent, request patterns, timing
-- Blocks known AI scrapers (GPTBot, CCBot, etc.)
-- **Fail-open**: If poison-fountain down, allows traffic
+- `ai-bot-block` middleware forward-auths to the `bot-block-proxy` openresty
+service (`stacks/traefik/modules/traefik/main.tf`) — the bot-check hop before
+the backend.
+- **Currently a no-op (allow-all).** `poison-fountain` is intentionally scaled
+to 0 (clears the ExternalAccessDivergence alert), so `bot-block-proxy`
+short-circuits `/auth` to `return 200 "allowed"` instead of proxying to an
+absent upstream. Same effective behaviour as the previous `proxy_pass` +
+`error_page 5xx=200` fail-open, minus the ~51k/hr upstream-connect error logs
+and per-request connect latency it generated (cleaned up 2026-06-05, found via
+Loki). The Deployment carries `configmap.reloader.stakater.com/reload` so
+config changes actually reload openresty (it does not hot-reload on its own).
+- **To re-enable real bot-blocking**: restore the `upstream poison_fountain` +
+`proxy_pass http://poison_fountain;` block in the `bot-block-proxy-config`
+ConfigMap (git history) and scale `poison-fountain` up. It then forward-auths
+bot checks (User-Agent / patterns) and tarpits known AI scrapers, fail-open if
+poison-fountain is down.

 #### Layer 2: X-Robots-Tag Header

@@ -160,12 +172,12 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l

 #### Layer 3 (formerly 4): Tarpit / Poison Content

-- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
-- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
+- `poison-fountain` exists as a standalone service at `poison.viktorbarzin.me` but the serving Deployment is **scaled to 0** (replicas=0); only its 6-hourly content-fetch CronJob runs. The tarpit is therefore dormant until re-enabled.
+- When running: serves AI bots extremely slowly (~50 bytes / 0.5s tarpit drip)
 - CronJob every 6 hours generates fake content
-- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
+- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly would get tarpitted and poisoned

-**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
+**Implementation**: See `stacks/poison-fountain/` and `stacks/traefik/modules/traefik/{middleware.tf,main.tf}` (traefik moved from the platform stack to its own `traefik` stack)

 ### Audit Logging & Anomaly Detection (Wave 1)

@@ -441,12 +453,12 @@ spec:

 ### Anti-AI Service Down, Traffic Blocked

-**Problem**: `poison-fountain` service unhealthy, all traffic blocked.
+**Problem**: anti-AI ForwardAuth (`ai-bot-block`) blocks traffic. With `bot-block-proxy` as a no-op `return 200` (poison-fountain scaled to 0) this should not happen; if it does, `bot-block-proxy` itself is unreachable (Traefik ForwardAuth fails **closed** when the auth server is down).

 **Fix**:
-1. Verify fail-open config: Check `stacks/platform/modules/traefik/middleware.tf` for `failurePolicy: allow`
-2. Restart service: `kubectl rollout restart deployment/poison-fountain -n poison-fountain`
-3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services
+1. Check `bot-block-proxy` pods are Ready: `kubectl get pods -n traefik -l app=bot-block-proxy` (2 replicas; critical-path forward-auth target).
+2. Inspect/restart: `kubectl rollout restart deployment/bot-block-proxy -n traefik`. Config lives in the `bot-block-proxy-config` ConfigMap (`stacks/traefik/modules/traefik/main.tf`); changes auto-reload via the `configmap.reloader.stakater.com/reload` annotation.
+3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services.

 ### Rate Limit Too Aggressive

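
As a rough sketch of the no-op described in the security.md hunks above, the rendered `bot-block-proxy-config` ConfigMap could look something like this. Only the ConfigMap name, the `traefik` namespace, the `/auth` location, and the `return 200 "allowed"` response come from the change itself. The `nginx.conf` key, the listen port, and the surrounding `events`/`http`/`server` scaffolding are assumptions, and the real config is defined in `stacks/traefik/modules/traefik/main.tf`.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bot-block-proxy-config
  namespace: traefik
data:
  # Key name is an assumption; use whatever key the Deployment mounts.
  nginx.conf: |
    events {}
    http {
      server {
        listen 8080;  # assumed port
        location /auth {
          # Answer every ForwardAuth check locally instead of proxying
          # to the scaled-to-0 poison-fountain upstream.
          return 200 "allowed";
        }
      }
    }
```

Traefik's ForwardAuth lets a request through on any 2xx response from the auth server, so a config of this shape allows every request while keeping the middleware chain in place. This matches the "no-op (allow-all)" behaviour the diff documents.
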