Anti-AI Scraping System Design
Status (Updated 2026-04-17): Partially superseded. Layer 3 (trap links via the rewrite-body plugin) was removed due to Traefik v3.6.12 Yaegi plugin incompatibility. The strip-accept-encoding and anti-ai-trap-links middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (infra/stacks/rybbit/worker/). Active layers: 1 (bot-block), 2 (headers), 4 (poison content), 5 (tarpit).
Problem
AI scrapers crawl public web services to harvest training data. We want to:
- Block known AI crawlers outright
- Poison the data that unknown scrapers collect
- Waste scraper resources with slow responses and infinite crawl loops
Architecture
Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):
Internet -> Cloudflare -> Traefik
|
+-- Layer 1: ForwardAuth -> block known AI User-Agents (403)
|
+-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
|
+-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026 — Yaegi bugs in Traefik v3.6.12)
|
+-- Layer 4: Poison service -> serve cached Poison Fountain data
|
+-- Layer 5: Tarpit -> slow-drip responses + infinite crawl loop
Components
1. poison-fountain service (new Kubernetes deployment)
A Python service with three responsibilities:
ForwardAuth endpoint (GET /auth):
- Reads X-Forwarded-For and User-Agent from request headers
- Checks User-Agent against a list of known AI bot strings
- Returns 403 for matches, 200 for legitimate users
- Blocked bots: GPTBot, ChatGPT-User, ClaudeBot, Claude-Web, CCBot, Bytespider, Google-Extended, Applebot-Extended, anthropic-ai, cohere-ai, Diffbot, FacebookBot, PerplexityBot, YouBot, Meta-ExternalAgent, PetalBot, Amazonbot, AI2Bot, Omgilibot, img2dataset
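The User-Agent check above can be sketched as a pure function. This is a minimal illustration, not the service's actual code: the names BLOCKED_AGENTS and auth_status are hypothetical, and case-insensitive substring matching is an assumption (the design only says the User-Agent is checked against a list of strings).

```python
# Substrings taken from the blocked-bot list above.
BLOCKED_AGENTS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "CCBot",
    "Bytespider", "Google-Extended", "Applebot-Extended", "anthropic-ai",
    "cohere-ai", "Diffbot", "FacebookBot", "PerplexityBot", "YouBot",
    "Meta-ExternalAgent", "PetalBot", "Amazonbot", "AI2Bot", "Omgilibot",
    "img2dataset",
)


def auth_status(user_agent: str) -> int:
    """Return the HTTP status a ForwardAuth check would answer with.

    Assumption: matching is a case-insensitive substring test.
    """
    ua = user_agent.lower()
    if any(bot.lower() in ua for bot in BLOCKED_AGENTS):
        return 403  # known AI crawler: Traefik rejects the request
    return 200  # anything else is passed through to the real service
```

Because Traefik forwards the original request headers to the ForwardAuth target, a check like this only needs the User-Agent string and no request body.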
Poison content endpoint (GET /article/<slug>):
- Serves cached poisoned content from NFS
- Wraps raw Poison Fountain data in realistic HTML templates (title, headings, paragraphs)
- Each response includes 10+ links to other poison pages (infinite crawl loop)
- Uses chunked transfer encoding to drip-feed content at ~100 bytes/second (tarpit)
- Response size: 50-100KB per page
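A rough sketch of the page wrapping and slow-drip behaviour described above. The function names, the HTML layout, and the 50-byte chunk size are assumptions for illustration; only the /article/<slug> path, the 10+ links, and the ~100 bytes/second target come from the design.

```python
import time
from html import escape
from typing import Callable, Iterator


def render_page(title: str, paragraphs: list[str], slugs: list[str]) -> bytes:
    """Wrap raw poison text in a plausible article with links to more poison pages."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    links = "".join(
        f'<li><a href="/article/{escape(s)}">{escape(s)}</a></li>' for s in slugs
    )
    html = (
        f"<!doctype html><html><head><title>{escape(title)}</title></head>"
        f"<body><h1>{escape(title)}</h1>{body}<ul>{links}</ul></body></html>"
    )
    return html.encode("utf-8")


def drip(
    payload: bytes,
    bytes_per_second: int = 100,
    chunk_size: int = 50,  # hypothetical chunk size
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield the payload in small chunks, pausing so throughput stays near the target."""
    delay = chunk_size / bytes_per_second
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]
        sleep(delay)
```

At roughly 100 bytes/second, a 50-100KB page keeps a single scraper connection open for about 8 to 17 minutes. A generator like drip is typically handed to the web framework as a streaming response body, which is what produces chunked transfer encoding; the exact mechanism depends on the framework the service uses.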
Health endpoint (GET /healthz):
- Returns 200 OK for Kubernetes probes
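To show how the three responsibilities could sit in one process, here is a minimal WSGI routing sketch using only the standard library. The design does not name a web framework, so WSGI is an assumption, as are the response bodies and the shortened BLOCKED sample.

```python
from typing import Callable, Iterable

BLOCKED = ("gptbot", "claudebot", "ccbot")  # shortened sample of the full list


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Minimal WSGI app routing the three endpoints described above."""
    path = environ.get("PATH_INFO", "/")
    if path == "/healthz":
        status, body = "200 OK", b"ok"
    elif path == "/auth":
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        blocked = any(bot in ua for bot in BLOCKED)
        status, body = ("403 Forbidden", b"blocked") if blocked else ("200 OK", b"ok")
    elif path.startswith("/article/"):
        # Placeholder: the real service would stream cached poison content here.
        status, body = "200 OK", b"<html><body>placeholder</body></html>"
    else:
        status, body = "404 Not Found", b"not found"
    start_response(status, [("Content-Length", str(len(body)))])
    return [body]
```

For local experiments, an app like this can be served with wsgiref.simple_server.make_server("", 8080, app).serve_forever().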
2. poison-fountain-fetcher CronJob
- Runs every 6 hours
- Fetches gzip content from https://rnsaffn.com/poison2/
- Decompresses and stores to NFS at /mnt/main/poison-fountain/cache/
- Maintains a pool of ~50 cached poison documents
- Falls back to locally generated Markov-chain nonsense if Poison Fountain is unreachable
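The fetch-with-fallback flow could look roughly like the sketch below. The function names, the hash-based file naming, the seed text, and the "keep the newest files" trimming policy are all assumptions; the source URL and the pool size of ~50 come from the design. The Markov chain shown is a simple first-order word model, which may differ from what the real fallback does.

```python
import gzip
import hashlib
import random
import urllib.request
import zlib
from pathlib import Path
from typing import Callable, Optional

SOURCE_URL = "https://rnsaffn.com/poison2/"


def fetch_poison(url: str = SOURCE_URL, timeout: float = 30.0) -> bytes:
    """Download one gzip payload and return the decompressed bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return gzip.decompress(resp.read())


def markov_nonsense(seed_text: str, words: int = 200,
                    rng: Optional[random.Random] = None) -> bytes:
    """Fallback: first-order Markov chain over the words of seed_text."""
    rng = rng or random.Random()
    tokens = seed_text.split()
    chain: dict = {}
    for current, following in zip(tokens, tokens[1:]):
        chain.setdefault(current, []).append(following)
    word = rng.choice(tokens)
    out = [word]
    for _ in range(words - 1):
        word = rng.choice(chain.get(word) or tokens)  # restart at dead ends
        out.append(word)
    return " ".join(out).encode("utf-8")


def refresh_cache(
    cache_dir: Path,
    fetch: Callable[[], bytes] = fetch_poison,
    pool_size: int = 50,
    seed_text: str = "the quick brown fox jumps over the lazy dog",
) -> Path:
    """Store one new document, then trim the pool to pool_size files."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    try:
        data = fetch()
    except (OSError, EOFError, zlib.error):  # network, truncated or invalid gzip
        data = markov_nonsense(seed_text)
    target = cache_dir / f"{hashlib.sha256(data).hexdigest()[:16]}.txt"
    target.write_bytes(data)
    # Keep the new file plus the newest (pool_size - 1) older ones.
    others = sorted(
        (p for p in cache_dir.glob("*.txt") if p != target),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for stale in others[pool_size - 1:]:
        stale.unlink()
    return target
```

Passing the fetch function as a parameter keeps the fallback path easy to exercise without network access, for example refresh_cache(Path("/tmp/cache"), fetch=lambda: b"sample").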
3. Traefik middleware additions
All defined in stacks/platform/modules/traefik/middleware.tf:
ai-bot-block (ForwardAuth):
- ForwardAuth to http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth
- Trust forwarded headers from Traefik
- Added to all public services via ingress_factory
anti-ai-headers (Headers):
- Sets X-Robots-Tag: noai, noimageai on all responses
- Added to all public services via ingress_factory
anti-ai-trap-links (rewrite-body plugin) — REMOVED (Updated 2026-04-17):
- Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
- The companion strip-accept-encoding middleware was also removed (it only existed for rewrite-body)
- Trap link injection is no longer active; poison-fountain still serves tarpit content standalone
4. Trap subdomain: poison.viktorbarzin.me
- Cloudflare DNS record (non-proxied, direct to cluster)
- IngressRoute routing all paths to poison-fountain service
- NO rate limiting on this route (let scrapers consume all they want)
- NO CrowdSec on this route (don't block scrapers here)
- Serves poisoned content with tarpit slow-drip
5. ingress_factory changes
New variables:
- anti_ai_scraping (bool, default: true): enables all anti-AI layers
- When true, adds to the middleware chain: ai-bot-block, anti-ai-headers
- Services can opt out with anti_ai_scraping = false
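The toggle's effect on the middleware chain can be modelled in a few lines. This is a Python model of the logic only, not the Terraform in ingress_factory; the function name, the example base middleware "rate-limit", and the choice to prepend rather than append are assumptions.

```python
ANTI_AI_MIDDLEWARES = ("ai-bot-block", "anti-ai-headers")


def middleware_chain(base: list[str], anti_ai_scraping: bool = True) -> list[str]:
    """Return the middleware names an ingress would use, in order."""
    chain = list(base)
    if anti_ai_scraping:
        # Assumption: anti-AI middlewares go first so bots are rejected early.
        chain = [*ANTI_AI_MIDDLEWARES, *chain]
    return chain
```

With the default, middleware_chain(["rate-limit"]) yields the two anti-AI middlewares followed by rate-limit; passing anti_ai_scraping=False leaves the base chain untouched.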
Human User Protection
| Concern | Protection |
|---|---|
| Hidden links visible | No longer applicable: trap links (Layer 3) were removed in April 2026. Previously hidden with CSS position:absolute;left:-9999px;height:0;overflow:hidden + aria-hidden="true" |
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |
File Locations
| Component | Path |
|---|---|
| Poison service stack | stacks/poison-fountain/main.tf |
| Poison service code | stacks/poison-fountain/app/ |
| Middleware definitions | stacks/platform/modules/traefik/middleware.tf |
| ingress_factory changes | modules/kubernetes/ingress_factory/main.tf |
| Cloudflare DNS | terraform.tfvars (cloudflare_non_proxied_names) |
| NFS cache | /mnt/main/poison-fountain/cache/ |
Deployment Order
- Add Cloudflare DNS record for poison.viktorbarzin.me
- Create NFS export for /mnt/main/poison-fountain
- Add Traefik middlewares (ai-bot-block, anti-ai-headers, anti-ai-trap-links; anti-ai-trap-links has since been removed)
- Update ingress_factory with anti_ai_scraping variable
- Deploy poison-fountain service + CronJob
- Apply platform stack (Traefik + Cloudflare changes)
- Apply poison-fountain stack
- Apply all other stacks to pick up new ingress_factory defaults