[ci skip] update claude knowledge: add anti-AI scraping & poison-fountain docs

Viktor Barzin 2026-02-22 21:36:40 +00:00
parent 32a25e5779
commit 27bbfdc050
No known key found for this signature in database
GPG key ID: 0EB088298288D958


@@ -158,6 +158,43 @@ When configuring services to use the mailserver:
- **Credentials**: Use existing accounts from `mailserver_accounts` in tfvars
- **Common email**: `info@viktorbarzin.me` for service notifications
### Anti-AI Scraping (5-Layer Defense)
All services have anti-AI scraping enabled by default via `anti_ai_scraping = true` in `ingress_factory`. The 5 layers are:
1. **Bot blocking** (`traefik-ai-bot-block`): ForwardAuth middleware → poison-fountain `/auth` endpoint. Checks `User-Agent` against known AI crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, etc.). Returns 403 for bots, 200 for normal users (sketched, together with the tarpit, after this list).
2. **X-Robots-Tag header** (`traefik-anti-ai-headers`): Adds `noai, noimageai` to all responses.
3. **Trap links** (`traefik-anti-ai-trap-links`): rewrite-body plugin injects 5 hidden `<a>` tags before `</body>` linking to `poison.viktorbarzin.me/article/*`. Only injected when the request `Accept` header contains `text/html` (browsers/scrapers, not API calls); a Python model of the injection follows the key-files list.
4. **Tarpit** (poison-fountain service): `/article/*` endpoints drip-feed responses at ~100 bytes/sec via chunked transfer encoding, wasting scraper time.
5. **Poison content**: Cached documents from rnsaffn.com/poison2/ (50 docs, refreshed every 6h via CronJob) served through the tarpit to pollute AI training data.
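A minimal sketch of the server side of layers 1 and 4, assuming the behaviour described above. The real implementation is `stacks/poison-fountain/app/server.py`; the names here (`AI_BOTS`, `CACHE_DIR`, `Handler`) are illustrative, not the repo's actual identifiers:
```python
# Sketch of the poison-fountain server's /auth and /article/* behaviour.
# Illustrative only; the real code is stacks/poison-fountain/app/server.py.
import os
import random
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

AI_BOTS = ("gptbot", "claudebot", "ccbot", "google-extended")  # partial list
CACHE_DIR = os.environ.get("CACHE_DIR", "/cache")   # poison docs from the CronJob
DRIP_BYTES = int(os.environ.get("DRIP_BYTES", "50"))
DRIP_DELAY = float(os.environ.get("DRIP_DELAY", "0.5"))  # 50 B / 0.5 s ~ 100 B/s

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # required for chunked transfer encoding

    def do_GET(self):
        if self.path == "/healthz":
            self._respond(200, b"ok")
        elif self.path == "/auth":
            # Traefik ForwardAuth forwards the original request headers here.
            ua = self.headers.get("User-Agent", "").lower()
            self._respond(403 if any(b in ua for b in AI_BOTS) else 200, b"")
        elif self.path.startswith("/article/"):
            self._tarpit()
        else:
            self._respond(404, b"not found")

    def _respond(self, code, body):
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def _tarpit(self):
        # Serve a random cached poison doc, dripped out in DRIP_BYTES chunks.
        try:
            docs = os.listdir(CACHE_DIR)
        except FileNotFoundError:
            docs = []
        body = b"<html><body>nothing here</body></html>"
        if docs:
            with open(os.path.join(CACHE_DIR, random.choice(docs)), "rb") as f:
                body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        try:
            for i in range(0, len(body), DRIP_BYTES):
                chunk = body[i:i + DRIP_BYTES]
                self.wfile.write(b"%x\r\n%b\r\n" % (len(chunk), chunk))
                self.wfile.flush()
                time.sleep(DRIP_DELAY)
            self.wfile.write(b"0\r\n\r\n")  # terminating chunk
        except BrokenPipeError:
            pass  # scraper gave up: mission accomplished

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8080), Handler).serve_forever()
```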
**Key files:**
- `stacks/poison-fountain/` — Terraform stack (deployment, service, ingress, CronJob)
- `stacks/poison-fountain/app/server.py` — Python HTTP server (ForwardAuth + tarpit)
- `stacks/poison-fountain/app/fetch-poison.sh` — CronJob fetcher (uses `--http1.1`, upstream hangs on HTTP/2)
- `stacks/platform/modules/traefik/middleware.tf` — 3 Traefik middleware CRDs
- `modules/kubernetes/ingress_factory/main.tf` — `anti_ai_scraping` variable (default: true)
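As a rough model of the layer-3 injection (the real work is done by the Traefik rewrite-body plugin, not Python); the `/article/{i}` URL scheme and `display:none` styling below are assumptions:
```python
# Illustrative model of what the rewrite-body middleware does to HTML responses.
import re

# Five hidden trap links; real URLs follow poison.viktorbarzin.me/article/*.
TRAP = "".join(
    f'<a href="https://poison.viktorbarzin.me/article/{i}" style="display:none">.</a>'
    for i in range(5)
)

def inject_trap_links(html: str) -> str:
    # Insert the links just before </body>: invisible to users,
    # followed by link-crawling scrapers.
    return re.sub(r"</body>", TRAP + "</body>", html, count=1)

print(inject_trap_links("<html><body>hi</body></html>"))
```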
**Testing:**
```bash
# Trap links (need Accept: text/html for rewrite-body plugin to process)
curl -s -H "Accept: text/html,application/xhtml+xml" https://vaultwarden.viktorbarzin.me/ | grep -oE 'href="https://poison[^"]*"'
# X-Robots-Tag header
curl -sI -H "Accept: text/html" https://vaultwarden.viktorbarzin.me/ | grep -i x-robots
# Bot blocking (403 for AI bots, 200 for normal users)
curl -s -o /dev/null -w "%{http_code}" -A "GPTBot/1.0" https://vaultwarden.viktorbarzin.me/
# Tarpit slow-drip (~100 bytes/sec; -N shows the drip live, --max-time caps the wait)
curl -sN --max-time 10 -H "Accept: text/html" https://poison.viktorbarzin.me/article/test
```
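Assuming the middlewares are wired as described: the first command should print five `poison.viktorbarzin.me/article/*` URLs, the second should show `X-Robots-Tag: noai, noimageai`, the third should print `403` (and `200` without `-A`), and the last should visibly drip output until `--max-time` cuts it off.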
**Gotchas:**
- rewrite-body plugin only processes responses when the `Accept` header contains `text/html` — `curl`'s default `Accept: */*` does NOT match. Use `-H "Accept: text/html"` for testing.
- rnsaffn.com/poison2/ hangs on HTTP/2 — fetcher must use `--http1.1` (an illustrative fetch loop follows this list)
- NFS cache dir (`/mnt/main/poison-fountain/cache`) must be world-writable (chmod 777) because `curlimages/curl` runs as uid 101
- To disable for a specific service: set `anti_ai_scraping = false` in its `ingress_factory` call
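For illustration only (the real fetcher is the `curl --http1.1` shell script at `stacks/poison-fountain/app/fetch-poison.sh`), a Python equivalent of the refresh loop might look like this. The filename scheme and the assumption that each request returns a fresh document are guesses; `urllib` speaks only HTTP/1.1, which sidesteps the upstream's HTTP/2 hang:
```python
# Illustrative refresh loop; the actual fetcher is fetch-poison.sh (curl --http1.1).
import os
import urllib.request

BASE_URL = "https://rnsaffn.com/poison2/"  # upstream poison source
CACHE_DIR = "/cache"                       # NFS-backed cache dir (chmod 777, see gotchas)
DOC_COUNT = 50                             # matches the CronJob's 50-doc refresh

os.makedirs(CACHE_DIR, exist_ok=True)
for i in range(DOC_COUNT):
    # urllib uses HTTP/1.1 only, so the upstream's HTTP/2 hang never triggers.
    with urllib.request.urlopen(BASE_URL, timeout=60) as resp:
        data = resp.read()
    with open(os.path.join(CACHE_DIR, f"doc-{i:02d}.html"), "wb") as f:
        f.write(data)
```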
### Terragrunt Architecture
- Root `terragrunt.hcl` provides DRY provider, backend, and variable loading for all stacks
- Each stack contains its resources directly: `stacks/<service>/main.tf` has variable declarations, locals, and all Terraform resources inline
@@ -357,6 +394,7 @@ Each stack's `terragrunt.hcl` includes the root `terragrunt.hcl` which provides:
| whisper | Wyoming Faster Whisper STT (CPU on GPU node) | whisper |
| grampsweb | Genealogy web app (Gramps Web) | grampsweb |
| openclaw | AI agent gateway (OpenClaw) | openclaw |
| poison-fountain | Anti-AI scraping (tarpit + poison) | poison-fountain |
---
@@ -821,3 +859,23 @@ Set `protected = true` in the service's `ingress_factory` call in Terraform.
- **Variables**: `openclaw_ssh_key`, `openclaw_skill_secrets` in `terraform.tfvars`
- **Skill secrets**: Home Assistant tokens (london + sofia), Uptime Kuma password — passed as env vars
- **Model providers**: Gemini (gemini-2.5-flash), Ollama (qwen2.5-coder:14b, deepseek-r1:14b), Llama API (Llama-3.3-70B, Llama-4-Scout/Maverick)
### Poison Fountain (Anti-AI Scraping Service)
- **Image**: `python:3.12-slim` (runs custom `server.py` from ConfigMap)
- **Port**: 8080
- **URL**: `https://poison.viktorbarzin.me` (public, no auth)
- **Namespace**: `poison-fountain` (tier: aux)
- **Stack**: `stacks/poison-fountain/`
- **Architecture**: 1 Deployment (Python HTTP server) + 1 CronJob (fetcher, every 6h)
- **Storage**: NFS at `/mnt/main/poison-fountain` — `cache/` subdir for poison docs (chmod 777 for curl uid 101)
- **Endpoints**:
- `/auth` — ForwardAuth: checks User-Agent, returns 200 (allow) or 403 (block AI bots)
- `/article/*` — Tarpit: drip-feeds poison content at ~100 bytes/sec (DRIP_BYTES=50 bytes every DRIP_DELAY=0.5s ≈ 100 B/s; a client-side rate check follows at the end of this section)
- `/healthz` — Health check
- **CronJob**: Fetches 50 documents from `rnsaffn.com/poison2/` using `--http1.1` (HTTP/2 hangs)
- **Ingress**: Uses `anti_ai_scraping = false` (doesn't protect itself), `skip_default_rate_limit = true`, `exclude_crowdsec = true`
- **DNS**: `poison.viktorbarzin.me` in `cloudflare_non_proxied_names`
- **Traefik middlewares** (in `stacks/platform/modules/traefik/middleware.tf`):
- `ai-bot-block` — ForwardAuth to poison-fountain `/auth`
- `anti-ai-headers` — X-Robots-Tag: noai, noimageai
- `anti-ai-trap-links` — rewrite-body plugin injecting 5 hidden links before `</body>`
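To sanity-check the advertised drip rate from the client side, a small hypothetical script (the URL comes from the testing section above; the 1 KB sample size is arbitrary):
```python
# Rough client-side check of the tarpit's ~100 B/s drip rate (hypothetical helper).
import time
import urllib.request

req = urllib.request.Request(
    "https://poison.viktorbarzin.me/article/test",
    headers={"Accept": "text/html"},
)
start = time.monotonic()
total = 0
with urllib.request.urlopen(req, timeout=30) as resp:
    while total < 1000:  # sample ~1 KB, i.e. ~10 s at 100 B/s
        chunk = resp.read(64)
        if not chunk:
            break
        total += len(chunk)
rate = total / (time.monotonic() - start)
print(f"{total} bytes at ~{rate:.0f} B/s")  # expect roughly 100 B/s
```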