Viktor Barzin 65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal

- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-17 21:43:13 +00:00

5.7 KiB


Anti-AI Scraping System Design

Status (Updated 2026-04-17): Partially superseded. Layer 3 (trap links via the rewrite-body plugin) was removed because the plugin is incompatible with the Yaegi runtime in Traefik v3.6.12. The strip-accept-encoding and anti-ai-trap-links middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (infra/stacks/rybbit/worker/). Active layers: 1 (bot-block), 2 (headers), 4 (poison content), 5 (tarpit).

Problem

AI scrapers crawl public web services to harvest training data. We want to:

Block known AI crawlers outright
Poison the data that unknown scrapers collect
Waste scraper resources with slow responses and infinite crawl loops

Architecture

Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):

Internet -> Cloudflare -> Traefik
                           |
                           +-- Layer 1: ForwardAuth -> block known AI User-Agents (403)
                           |
                           +-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
                           |
                           +-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026 — Yaegi bugs in Traefik v3.6.12)
                           |
                           +-- Layer 4: Poison service -> serve cached Poison Fountain data
                           |
                           +-- Layer 5: Tarpit -> slow-drip responses + infinite crawl loop

Components

1. poison-fountain service (new Kubernetes deployment)

A Python service with three responsibilities:

ForwardAuth endpoint (GET /auth):

Reads X-Forwarded-For and User-Agent from request headers
Checks User-Agent against a list of known AI bot strings
Returns 403 for matches, 200 for legitimate users
Blocked bots: GPTBot, ChatGPT-User, ClaudeBot, Claude-Web, CCBot, Bytespider, Google-Extended, Applebot-Extended, anthropic-ai, cohere-ai, Diffbot, FacebookBot, PerplexityBot, YouBot, Meta-ExternalAgent, PetalBot, Amazonbot, AI2Bot, Omgilibot, img2dataset
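
The matching step of the ForwardAuth endpoint can be sketched as below. This is a minimal illustration, not the deployed code: the function name `auth_status`, the case-insensitive substring match, and the choice to let a request with no User-Agent through are all assumptions. The sketch also ignores X-Forwarded-For, which the real endpoint reads.

```python
from __future__ import annotations

# Bot strings copied from the "Blocked bots" list above.
BLOCKED_AGENTS = (
    "GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web", "CCBot",
    "Bytespider", "Google-Extended", "Applebot-Extended", "anthropic-ai",
    "cohere-ai", "Diffbot", "FacebookBot", "PerplexityBot", "YouBot",
    "Meta-ExternalAgent", "PetalBot", "Amazonbot", "AI2Bot", "Omgilibot",
    "img2dataset",
)
_BLOCKED_LOWER = tuple(agent.lower() for agent in BLOCKED_AGENTS)


def auth_status(headers: dict[str, str]) -> int:
    """Return the HTTP status a ForwardAuth check would answer with.

    Hypothetical helper: 403 tells Traefik to reject the request,
    200 lets it continue to the real service.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "").lower()
    # Assumption: case-insensitive substring match against the bot strings.
    if any(bot in user_agent for bot in _BLOCKED_LOWER):
        return 403
    # Assumption: a missing or unknown User-Agent is treated as legitimate.
    return 200
```

A substring match keeps the check cheap, which fits the "<1ms" figure quoted under Human User Protection, but it only catches scrapers that identify themselves honestly.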

Poison content endpoint (GET /article/<slug>):

Serves cached poisoned content from NFS
Wraps raw Poison Fountain data in realistic HTML templates (title, headings, paragraphs)
Each response includes 10+ links to other poison pages (infinite crawl loop)
Uses chunked transfer encoding to drip-feed content at ~100 bytes/second (tarpit)
Response size: 50-100KB per page
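
The page wrapping and slow-drip behaviour described above could look roughly like this. It is a sketch under stated assumptions: `render_page` and `drip` are hypothetical names, the random hexadecimal slugs are invented for illustration, and the injectable `sleep` parameter exists only so the generator can be exercised without waiting.

```python
from __future__ import annotations

import html
import random
import time
from typing import Callable, Iterator


def render_page(title: str, paragraphs: list[str], rng: random.Random,
                n_links: int = 10) -> str:
    """Wrap raw text in a simple HTML page that links to more poison pages."""
    # Assumption: slugs are random hex strings, so every link is a fresh page.
    links = "\n".join(
        f'<li><a href="/article/{rng.getrandbits(48):012x}">Related article {i + 1}</a></li>'
        for i in range(n_links)
    )
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!doctype html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n{body}\n"
        f"<ul>\n{links}\n</ul></body></html>\n"
    )


def drip(payload: bytes, chunk_size: int = 100, delay: float = 1.0,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield the payload in small chunks, pausing after each one.

    With the defaults this emits about 100 bytes per second.
    """
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]
        sleep(delay)
```

Many Python web frameworks accept a generator like `drip(...)` as a streaming response body, and the server then applies chunked transfer encoding. At about 100 bytes/second, a 50-100KB page keeps a single scraper connection open for roughly 8 to 17 minutes.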

Health endpoint (GET /healthz):

Returns 200 OK for Kubernetes probes

2. poison-fountain-fetcher CronJob

Runs every 6 hours
Fetches gzip content from https://rnsaffn.com/poison2/
Decompresses and stores to NFS at /mnt/main/poison-fountain/cache/
Maintains a pool of ~50 cached poison documents
Falls back to locally generated Markov-chain nonsense if Poison Fountain is unreachable
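
The decompress, fall back, and prune steps of the fetcher might be sketched as follows. All names here (`markov_text`, `decode_payload`, `store_document`) are hypothetical, and so are the first-order word-level chain, the 200-word fallback length, the hash-based file names, and pruning by modification time. The HTTP fetch itself is left out: `raw` stands for the downloaded bytes, or `None` when Poison Fountain is unreachable.

```python
from __future__ import annotations

import gzip
import hashlib
import random
import zlib
from pathlib import Path

POOL_SIZE = 50  # "~50 cached poison documents" from the list above


def markov_text(seed_text: str, n_words: int, rng: random.Random) -> str:
    """Generate nonsense from a first-order, word-level Markov chain.

    Assumes seed_text contains at least one word.
    """
    words = seed_text.split()
    chain: dict[str, list[str]] = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    word = rng.choice(words)
    out = [word]
    while len(out) < n_words:
        # A word with no recorded successor restarts from any seed word.
        word = rng.choice(chain.get(word) or words)
        out.append(word)
    return " ".join(out)


def decode_payload(raw: bytes | None, seed_text: str, rng: random.Random) -> str:
    """Decompress a fetched gzip payload, or fall back to Markov nonsense."""
    if raw is not None:
        try:
            return gzip.decompress(raw).decode("utf-8", errors="replace")
        except (OSError, EOFError, zlib.error):
            pass  # Corrupt or non-gzip payload: use the fallback below.
    return markov_text(seed_text, 200, rng)


def store_document(cache_dir: Path, text: str, pool_size: int = POOL_SIZE) -> Path:
    """Write one document, then prune the pool down to pool_size files."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = cache_dir / name
    path.write_text(text, encoding="utf-8")
    # Oldest files (by modification time) are removed first.
    docs = sorted(cache_dir.glob("*.txt"), key=lambda p: p.stat().st_mtime)
    for stale in docs[:max(0, len(docs) - pool_size)]:
        stale.unlink()
    return path
```

In the real CronJob the cache directory would be /mnt/main/poison-fountain/cache/; the sketch takes it as a parameter so it can be tried against a temporary directory.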

3. Traefik middleware additions

All defined in stacks/platform/modules/traefik/middleware.tf:

ai-bot-block (ForwardAuth):

ForwardAuth to http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth
Trust forwarded headers from Traefik
Added to all public services via ingress_factory

anti-ai-headers (Headers):

Sets X-Robots-Tag: noai, noimageai on all responses
Added to all public services via ingress_factory

anti-ai-trap-links (rewrite-body plugin) — REMOVED (Updated 2026-04-17):

Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
The companion strip-accept-encoding middleware was also removed (only existed for rewrite-body)
Trap link injection is no longer active; poison-fountain still serves tarpit content standalone

4. Trap subdomain: poison.viktorbarzin.me

Cloudflare DNS record (non-proxied, direct to cluster)
IngressRoute routing all paths to poison-fountain service
NO rate limiting on this route (let scrapers consume all they want)
NO CrowdSec on this route (don't block scrapers here)
Serves poisoned content with tarpit slow-drip

5. ingress_factory changes

New variables:

anti_ai_scraping (bool, default: true) - enable all anti-AI layers
When true, adds to middleware chain: ai-bot-block, anti-ai-headers
Services can opt out with anti_ai_scraping = false

Human User Protection

| Concern | Protection |
| --- | --- |
| Hidden links visible | CSS `position:absolute;left:-9999px;height:0;overflow:hidden` + `aria-hidden="true"` (applied to the Layer 3 trap links, removed April 2026) |
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |

File Locations

| Component | Path |
| --- | --- |
| Poison service stack | `stacks/poison-fountain/main.tf` |
| Poison service code | `stacks/poison-fountain/app/` |
| Middleware definitions | `stacks/platform/modules/traefik/middleware.tf` |
| ingress_factory changes | `modules/kubernetes/ingress_factory/main.tf` |
| Cloudflare DNS | `terraform.tfvars` (cloudflare_non_proxied_names) |
| NFS cache | `/mnt/main/poison-fountain/cache/` |

Deployment Order

Add Cloudflare DNS record for poison.viktorbarzin.me
Create NFS export for /mnt/main/poison-fountain
Add Traefik middlewares (ai-bot-block, anti-ai-headers; anti-ai-trap-links was part of the original rollout and has since been removed)
Update ingress_factory with anti_ai_scraping variable
Deploy poison-fountain service + CronJob
Apply platform stack (Traefik + Cloudflare changes)
Apply poison-fountain stack
Apply all other stacks to pick up new ingress_factory defaults
