Anti-AI Scraping System Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Deploy a 5-layer anti-AI scraping system that blocks known bots, injects hidden trap links into all HTML responses, serves poisoned content from Poison Fountain, and tarpits scrapers with slow-drip responses.
Architecture: A lightweight Python service handles bot detection (ForwardAuth) and poison content serving (tarpit). Traefik middlewares inject anti-AI headers and hidden trap links into all public service responses via ingress_factory defaults. A CronJob refreshes cached poison content from rnsaffn.com.
Tech Stack: Python 3 (stdlib http.server), Terraform/Terragrunt, Traefik middleware CRDs, Kubernetes CronJob
Task 1: Create the Python poison service code
Files:
- Create: stacks/poison-fountain/app/server.py
- Create: stacks/poison-fountain/app/fetch-poison.sh
Step 1: Create the service directory
mkdir -p stacks/poison-fountain/app
Step 2: Write stacks/poison-fountain/app/server.py
"""Poison Fountain service.
Endpoints:
GET /auth - ForwardAuth: block known AI bot User-Agents (403) or pass (200)
GET /article/* - Serve cached poisoned content with tarpit slow-drip
GET /healthz - Health check for Kubernetes probes
GET /* - Catch-all: serve poison for any path (scrapers explore randomly)
"""
import http.server
import os
import glob
import random
import time
import hashlib
import sys
LISTEN_PORT = int(os.environ.get("PORT", "8080"))
CACHE_DIR = os.environ.get("CACHE_DIR", "/data/cache")
DRIP_BYTES = int(os.environ.get("DRIP_BYTES", "50"))
DRIP_DELAY = float(os.environ.get("DRIP_DELAY", "0.5"))
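# At the defaults above the tarpit drips 50 bytes every 0.5 s (~100 B/s), so
# even a ~10 KiB poison page ties up a scraper's connection for ~100 seconds.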
TRAP_LINK_COUNT = int(os.environ.get("TRAP_LINK_COUNT", "20"))
POISON_DOMAIN = os.environ.get("POISON_DOMAIN", "poison.viktorbarzin.me")
AI_BOT_PATTERNS = [
"gptbot", "chatgpt-user", "claudebot", "claude-web", "ccbot",
"bytespider", "google-extended", "applebot-extended",
"anthropic-ai", "cohere-ai", "diffbot", "facebookbot",
"perplexitybot", "youbot", "meta-externalagent", "petalbot",
"amazonbot", "ai2bot", "omgilibot", "img2dataset",
"omgili", "commoncrawl", "ia_archiver", "scrapy",
"semrushbot", "ahrefsbot", "dotbot", "mj12bot",
"seekport", "blexbot", "dataforseo", "serpstatbot",
]
FALLBACK_WORDS = [
"the", "quantum", "neural", "framework", "implements", "distributed",
"processing", "with", "advanced", "recursive", "algorithms", "for",
"optimal", "convergence", "in", "multi-dimensional", "space",
"utilizing", "transformer", "architecture", "trained", "on",
"large-scale", "corpus", "data", "achieving", "state-of-the-art",
"performance", "across", "benchmark", "tasks", "including",
"natural", "language", "understanding", "generation", "and",
"cross-lingual", "transfer", "learning", "capabilities",
]
def generate_slug():
return hashlib.md5(str(random.random()).encode()).hexdigest()[:16]
def generate_trap_links(count):
titles = [
"Research Archive", "Training Corpus", "Dataset Export",
"NLP Benchmark Results", "Web Crawl Index", "Text Corpus",
"Machine Learning Data", "Evaluation Dataset", "Model Weights",
"Annotation Guidelines", "Parallel Corpus", "Knowledge Base",
"Document Collection", "Reference Data", "Taxonomy Index",
"Classification Labels", "Entity Database", "Relation Extraction",
"Sentiment Annotations", "Summarization Corpus", "QA Dataset",
"Dialogue Transcripts", "Code Documentation", "API Reference",
]
links = []
for _ in range(count):
slug = generate_slug()
title = random.choice(titles)
links.append(f'<a href="https://{POISON_DOMAIN}/article/{slug}">{title}</a>')
return "\n".join(links)
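# Because slugs are random, the /article/ space is effectively infinite: a
# crawler that follows trap links keeps discovering "new" pages forever.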
def get_poison_content():
cache_files = glob.glob(os.path.join(CACHE_DIR, "*.txt"))
if cache_files:
try:
with open(random.choice(cache_files), "r", errors="replace") as f:
return f.read()
except Exception:
pass
return " ".join(random.choices(FALLBACK_WORDS, k=500))
class PoisonHandler(http.server.BaseHTTPRequestHandler):
    # HTTP/1.1 is required for the chunked tarpit responses below; the
    # handler's default of HTTP/1.0 does not support Transfer-Encoding: chunked.
    protocol_version = "HTTP/1.1"
    server_version = "Apache/2.4.52"
    sys_version = ""
def log_message(self, fmt, *args):
sys.stderr.write(f"[{self.log_date_time_string()}] {fmt % args}\n")
def do_GET(self):
if self.path == "/healthz":
self._respond(200, "ok")
return
if self.path == "/auth":
self._handle_auth()
return
# Everything else gets poison
self._serve_poison()
def _handle_auth(self):
ua = (self.headers.get("User-Agent") or "").lower()
for pattern in AI_BOT_PATTERNS:
if pattern in ua:
self.log_message("BLOCKED AI bot: %s (matched: %s)", ua, pattern)
self._respond(403, "Forbidden")
return
self._respond(200, "OK")
    def _respond(self, code, body):
        data = body.encode()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        # Content-Length is mandatory for keep-alive under HTTP/1.1.
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
def _serve_poison(self):
content = get_poison_content()
trap_links = generate_trap_links(TRAP_LINK_COUNT)
html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Research Data Archive</title>
</head>
<body>
<main>
<article>
<h1>Research Data Collection</h1>
<div class="content">
<p>{content}</p>
</div>
</article>
<nav>
<h2>Related Research</h2>
{trap_links}
</nav>
</main>
</body>
</html>"""
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Transfer-Encoding", "chunked")
self.end_headers()
for i in range(0, len(html), DRIP_BYTES):
chunk = html[i : i + DRIP_BYTES].encode("utf-8")
try:
self.wfile.write(f"{len(chunk):x}\r\n".encode())
self.wfile.write(chunk)
self.wfile.write(b"\r\n")
self.wfile.flush()
time.sleep(DRIP_DELAY)
except (BrokenPipeError, ConnectionResetError):
return
try:
self.wfile.write(b"0\r\n\r\n")
self.wfile.flush()
except (BrokenPipeError, ConnectionResetError):
pass
if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    # ThreadingHTTPServer is essential here: a single-threaded server would let
    # one tarpitted scraper block every /auth ForwardAuth check from Traefik.
    server = http.server.ThreadingHTTPServer(("0.0.0.0", LISTEN_PORT), PoisonHandler)
    print(f"Poison Fountain service listening on :{LISTEN_PORT}", flush=True)
    server.serve_forever()
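Before deploying, the handler logic can be smoke-tested locally (a quick sketch assuming python3 and curl on PATH, run from the repo root):
PORT=8080 CACHE_DIR=/tmp/poison-cache python3 stacks/poison-fountain/app/server.py &
curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: GPTBot/1.0" http://localhost:8080/auth   # expect 403
curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: Mozilla/5.0" http://localhost:8080/auth  # expect 200
curl -s http://localhost:8080/healthz                                                             # expect ok
kill %1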
Step 3: Write stacks/poison-fountain/app/fetch-poison.sh
#!/bin/sh
set -e
CACHE_DIR="${CACHE_DIR:-/data/cache}"
POISON_URL="${POISON_URL:-https://rnsaffn.com/poison2/}"
FETCH_COUNT="${FETCH_COUNT:-50}"
MAX_CACHE_FILES="${MAX_CACHE_FILES:-100}"
mkdir -p "$CACHE_DIR"
echo "Fetching $FETCH_COUNT poison documents from $POISON_URL"
fetched=0
for i in $(seq 1 "$FETCH_COUNT"); do
OUTPUT="$CACHE_DIR/poison_$(date +%s)_${i}.txt"
  if curl -s --compressed -o "$OUTPUT" -m 30 "$POISON_URL"; then
# Verify file is non-empty
if [ -s "$OUTPUT" ]; then
fetched=$((fetched + 1))
echo " [$i/$FETCH_COUNT] OK"
else
rm -f "$OUTPUT"
echo " [$i/$FETCH_COUNT] Empty response, skipped"
fi
else
rm -f "$OUTPUT"
echo " [$i/$FETCH_COUNT] Fetch failed, skipped"
fi
sleep 2
done
# Clean up oldest files if cache exceeds limit
total=$(find "$CACHE_DIR" -name '*.txt' -type f | wc -l)
if [ "$total" -gt "$MAX_CACHE_FILES" ]; then
excess=$((total - MAX_CACHE_FILES))
  # BusyBox find (as shipped in curlimages/curl) lacks -printf, so sort by
  # mtime with ls -t instead: newest first, so the tail is the oldest files.
  ls -1t "$CACHE_DIR"/*.txt 2>/dev/null | tail -n "$excess" | xargs rm -f
echo "Cleaned $excess old cache files"
fi
echo "Done: fetched $fetched new documents, $(find "$CACHE_DIR" -name '*.txt' -type f | wc -l) total cached"
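The fetcher can be dry-run locally the same way (requires network access to rnsaffn.com; keep FETCH_COUNT small):
CACHE_DIR=/tmp/poison-cache FETCH_COUNT=2 sh stacks/poison-fountain/app/fetch-poison.sh
ls /tmp/poison-cache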
Step 4: Verify files exist
ls -la stacks/poison-fountain/app/
Expected: server.py and fetch-poison.sh listed.
Step 5: Commit
git add stacks/poison-fountain/app/
git commit -m "[ci skip] Add poison fountain Python service and fetcher script"
Task 2: Set up NFS export and DNS record
Files:
- Modify: secrets/nfs_directories.txt (add poison-fountain/cache line, keep sorted)
- Modify: terraform.tfvars (add "poison" to cloudflare_non_proxied_names)
Step 1: Add NFS directory
Add poison-fountain and poison-fountain/cache to secrets/nfs_directories.txt, keeping alphabetical order. Insert after plotting-book entries.
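Assuming the file lists one directory per line, the inserted region should read something like (the plotting-book neighbor is shown only as context):
plotting-book
poison-fountain
poison-fountain/cache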
Step 2: Run NFS export script
cd secrets && bash nfs_exports.sh
Verify the export was created successfully.
Step 3: Add Cloudflare DNS record
In terraform.tfvars, find the cloudflare_non_proxied_names list and add "poison" to it (alphabetical position after "plotting-book").
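Sketched with placeholder neighbors (only the "poison" entry is the actual change):
cloudflare_non_proxied_names = [
  # ...
  "plotting-book",
  "poison",
  # ...
]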
Step 4: Commit
git add secrets/nfs_directories.txt terraform.tfvars
git commit -m "[ci skip] Add NFS export and DNS record for poison-fountain"
Task 3: Add Traefik middleware CRDs
Files:
- Modify: stacks/platform/modules/traefik/middleware.tf (append 3 new middleware resources)
Step 1: Add ai-bot-block ForwardAuth middleware
Append to the end of stacks/platform/modules/traefik/middleware.tf:
# ForwardAuth middleware to block known AI bot User-Agents
resource "kubernetes_manifest" "middleware_ai_bot_block" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "ai-bot-block"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
forwardAuth = {
address = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"
trustForwardHeader = true
}
}
}
depends_on = [helm_release.traefik]
}
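Once the poison-fountain stack from Task 5 is deployed, the ForwardAuth address can be spot-checked from inside the cluster with a throwaway curl pod (the pod name here is arbitrary):
kubectl --kubeconfig $(pwd)/config run ua-test --rm -i --image=curlimages/curl --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}\n" -H "User-Agent: GPTBot/1.0" \
  http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth
Expected: 403.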
Step 2: Add anti-ai-headers middleware
Append to the end of stacks/platform/modules/traefik/middleware.tf:
# X-Robots-Tag header to discourage compliant AI crawlers
resource "kubernetes_manifest" "middleware_anti_ai_headers" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "anti-ai-headers"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
headers = {
customResponseHeaders = {
"X-Robots-Tag" = "noai, noimageai"
}
}
}
}
depends_on = [helm_release.traefik]
}
Step 3: Add anti-ai-trap-links rewrite-body middleware
Note: this assumes a rewrite-body plugin is already registered in Traefik's static configuration (Helm values); Traefik only loads plugin middlewares whose plugin is declared there, so verify that before applying.
Append to the end of stacks/platform/modules/traefik/middleware.tf:
# Inject hidden trap links before </body> to catch AI scrapers
# Links are CSS-hidden and aria-hidden so humans never see them
resource "kubernetes_manifest" "middleware_anti_ai_trap_links" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "anti-ai-trap-links"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
plugin = {
rewrite-body = {
rewrites = [{
regex = "</body>"
replacement = "<div style=\"position:absolute;left:-9999px;height:0;overflow:hidden\" aria-hidden=\"true\"><a href=\"https://poison.viktorbarzin.me/article/training-data-2024-research-corpus\">Research Archive</a><a href=\"https://poison.viktorbarzin.me/article/dataset-export-machine-learning-v3\">Dataset Export</a><a href=\"https://poison.viktorbarzin.me/article/nlp-benchmark-evaluation-results\">Benchmark Results</a><a href=\"https://poison.viktorbarzin.me/article/web-crawl-index-2024-archive\">Web Index</a><a href=\"https://poison.viktorbarzin.me/article/text-corpus-english-dump\">Text Corpus</a></div></body>"
}]
monitoring = {
types = ["text/html"]
}
}
}
}
}
depends_on = [helm_release.traefik]
}
Step 4: Verify syntax
cd stacks/platform && terraform fmt -check modules/traefik/middleware.tf || terraform fmt modules/traefik/middleware.tf
Step 5: Commit
git add stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)"
Task 4: Update ingress_factory to apply anti-AI middlewares by default
Files:
- Modify: modules/kubernetes/ingress_factory/main.tf (add variable + middleware references)
Step 1: Add anti_ai_scraping variable
In modules/kubernetes/ingress_factory/main.tf, add after the skip_default_rate_limit variable (around line 73):
variable "anti_ai_scraping" {
type = bool
default = true
}
Step 2: Add middlewares to the chain
In the kubernetes_ingress_v1 resource's router.middlewares annotation (around line 108-117), add four new entries for the anti-AI chain: the three new middlewares from Task 3, plus the existing strip-accept-encoding middleware so the trap-links rewrite-body middleware receives uncompressed HTML it can rewrite. The updated concat list should include:
var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
Insert these after the existing crowdsec line (line 111) and before the protected line (line 112). The full concat array becomes:
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null,
var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
], var.extra_middlewares)))
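Because the default is true, every ingress_factory consumer picks up the new middlewares on its next apply with no per-service changes; a service that should stay scrapable opts out explicitly, as the poison-fountain ingress in Task 5 does:
module "ingress" {
  source           = "../../modules/kubernetes/ingress_factory"
  # ... existing arguments ...
  anti_ai_scraping = false
}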
Step 3: Format
terraform fmt modules/kubernetes/ingress_factory/main.tf
Step 4: Commit
git add modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] Add anti_ai_scraping option to ingress_factory (default: true)"
Task 5: Create the poison-fountain Terraform stack
Files:
- Create: stacks/poison-fountain/terragrunt.hcl
- Create: stacks/poison-fountain/main.tf
- Create: stacks/poison-fountain/secrets (symlink)
Step 1: Create terragrunt.hcl
Write stacks/poison-fountain/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
Step 2: Create secrets symlink
ln -s ../../secrets stacks/poison-fountain/secrets
Step 3: Write stacks/poison-fountain/main.tf
variable "tls_secret_name" { type = string }
locals {
tiers = {
core = "0-core"
cluster = "1-cluster"
gpu = "2-gpu"
edge = "3-edge"
aux = "4-aux"
}
}
resource "kubernetes_namespace" "poison_fountain" {
metadata {
name = "poison-fountain"
labels = {
"istio-injection" = "disabled"
tier = local.tiers.aux
}
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
tls_secret_name = var.tls_secret_name
}
# ConfigMap for the Python service code
resource "kubernetes_config_map" "poison_fountain_code" {
metadata {
name = "poison-fountain-code"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
}
data = {
"server.py" = file("${path.module}/app/server.py")
}
}
# ConfigMap for the fetcher script
resource "kubernetes_config_map" "poison_fountain_fetcher" {
metadata {
name = "poison-fountain-fetcher"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
}
data = {
"fetch-poison.sh" = file("${path.module}/app/fetch-poison.sh")
}
}
# Main service deployment
resource "kubernetes_deployment" "poison_fountain" {
metadata {
name = "poison-fountain"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
labels = {
app = "poison-fountain"
tier = local.tiers.aux
}
}
spec {
replicas = 1
strategy {
type = "Recreate"
}
selector {
match_labels = {
app = "poison-fountain"
}
}
template {
metadata {
labels = {
app = "poison-fountain"
}
}
spec {
container {
name = "poison-fountain"
image = "python:3.12-slim"
command = ["python", "/app/server.py"]
port {
container_port = 8080
}
env {
name = "CACHE_DIR"
value = "/data/cache"
}
env {
name = "DRIP_BYTES"
value = "50"
}
env {
name = "DRIP_DELAY"
value = "0.5"
}
env {
name = "POISON_DOMAIN"
value = "poison.viktorbarzin.me"
}
volume_mount {
name = "code"
mount_path = "/app"
read_only = true
}
volume_mount {
name = "data"
mount_path = "/data"
}
liveness_probe {
http_get {
path = "/healthz"
port = 8080
}
initial_delay_seconds = 5
period_seconds = 30
}
readiness_probe {
http_get {
path = "/healthz"
port = 8080
}
initial_delay_seconds = 3
period_seconds = 10
}
resources {
requests = {
cpu = "10m"
memory = "32Mi"
}
limits = {
cpu = "100m"
memory = "128Mi"
}
}
}
volume {
name = "code"
config_map {
name = kubernetes_config_map.poison_fountain_code.metadata[0].name
}
}
volume {
name = "data"
nfs {
server = "10.0.10.15"
path = "/mnt/main/poison-fountain"
}
}
}
}
}
}
# Internal service (for ForwardAuth from Traefik)
resource "kubernetes_service" "poison_fountain" {
metadata {
name = "poison-fountain"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
labels = {
app = "poison-fountain"
}
}
spec {
selector = {
app = "poison-fountain"
}
port {
name = "http"
port = 8080
target_port = 8080
}
}
}
# Public ingress for the poison trap subdomain
# Deliberately NO rate limiting, NO CrowdSec, NO anti-AI (we WANT scrapers here)
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
name = "poison-fountain"
host = "poison"
port = 8080
tls_secret_name = var.tls_secret_name
skip_default_rate_limit = true
exclude_crowdsec = true
anti_ai_scraping = false
}
# CronJob to fetch and cache poisoned content from Poison Fountain
resource "kubernetes_cron_job_v1" "poison_fetcher" {
metadata {
name = "poison-fountain-fetcher"
namespace = kubernetes_namespace.poison_fountain.metadata[0].name
}
spec {
schedule = "0 */6 * * *"
successful_jobs_history_limit = 1
failed_jobs_history_limit = 1
concurrency_policy = "Forbid"
job_template {
metadata {
name = "poison-fountain-fetcher"
}
spec {
template {
metadata {
name = "poison-fountain-fetcher"
}
spec {
container {
name = "fetcher"
image = "curlimages/curl:latest"
command = ["sh", "/scripts/fetch-poison.sh"]
env {
name = "CACHE_DIR"
value = "/data/cache"
}
env {
name = "POISON_URL"
value = "https://rnsaffn.com/poison2/"
}
env {
name = "FETCH_COUNT"
value = "50"
}
volume_mount {
name = "scripts"
mount_path = "/scripts"
read_only = true
}
volume_mount {
name = "data"
mount_path = "/data"
}
}
volume {
name = "scripts"
config_map {
name = kubernetes_config_map.poison_fountain_fetcher.metadata[0].name
default_mode = "0755"
}
}
volume {
name = "data"
nfs {
server = "10.0.10.15"
path = "/mnt/main/poison-fountain"
}
}
restart_policy = "Never"
}
}
}
}
}
}
Step 4: Format and validate
terraform fmt stacks/poison-fountain/main.tf
cd stacks/poison-fountain && terragrunt validate --non-interactive
Step 5: Commit
git add stacks/poison-fountain/
git commit -m "[ci skip] Add poison-fountain Terraform stack (deployment, service, ingress, CronJob)"
Task 6: Deploy the platform stack (Traefik middlewares + DNS)
Step 1: Plan
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | tail -40
Expected: New resources for the 3 middleware CRDs + Cloudflare DNS record for poison. Changes to existing ingress resources (new middleware annotations).
Review the plan output carefully. The key additions should be:
- kubernetes_manifest.middleware_ai_bot_block
- kubernetes_manifest.middleware_anti_ai_headers
- kubernetes_manifest.middleware_anti_ai_trap_links
- Cloudflare DNS record for poison
- Modified ingress annotations on all services in the platform stack
Step 2: Apply
cd stacks/platform && terragrunt apply --non-interactive 2>&1 | tail -40
Step 3: Verify middlewares exist
kubectl --kubeconfig $(pwd)/config get middlewares.traefik.io -n traefik | grep -E "ai-bot-block|anti-ai"
Expected: 3 middleware resources listed.
Task 7: Deploy the poison-fountain stack
Step 1: Plan
cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30
Expected: New namespace, configmaps, deployment, service, ingress, CronJob.
Step 2: Apply
cd stacks/poison-fountain && terragrunt apply --non-interactive 2>&1 | tail -30
Step 3: Monitor pod startup
Spawn a background agent to watch the pod come up:
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -w
Expected: Pod reaches Running state with 1/1 ready.
Step 4: Trigger the first poison cache fetch
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/poison-fountain-fetcher poison-fetch-initial -n poison-fountain
Watch the job complete:
kubectl --kubeconfig $(pwd)/config logs -n poison-fountain -l job-name=poison-fetch-initial -f
Expected: Fetched N poison documents.
Task 8: Verify the full system
Step 1: Verify ForwardAuth blocks AI bots
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: GPTBot/1.0" https://echo.viktorbarzin.me/
Expected: 403
Step 2: Verify legitimate users pass through
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: Mozilla/5.0" https://echo.viktorbarzin.me/
Expected: 200
Step 3: Verify X-Robots-Tag header
curl -sI https://echo.viktorbarzin.me/ 2>/dev/null | grep -i x-robots-tag
Expected: X-Robots-Tag: noai, noimageai
Step 4: Verify hidden trap links in HTML
curl -s https://echo.viktorbarzin.me/ | grep -o "poison.viktorbarzin.me"
Expected: Multiple matches (trap links injected before </body>).
Step 5: Verify poison service serves content with tarpit
timeout 10 curl -s -H "User-Agent: Mozilla/5.0" https://poison.viktorbarzin.me/article/test 2>/dev/null | head -5
Expected: HTML content starting to arrive slowly (only a few lines in 10 seconds due to tarpit).
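To quantify the drip rather than eyeball it, measure how many bytes arrive in a fixed window; at the default DRIP_BYTES=50 / DRIP_DELAY=0.5 that should be on the order of 1000 bytes in 10 seconds (curl still prints the -w summary when -m cuts the transfer off):
curl -s -m 10 -o /dev/null -w "%{size_download} bytes in %{time_total}s\n" \
  -H "User-Agent: Mozilla/5.0" https://poison.viktorbarzin.me/article/test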
Step 6: Run cluster health check
bash scripts/cluster_healthcheck.sh --quiet
Expected: No new WARN/FAIL related to poison-fountain.
Step 7: Commit all applied state
git add -A && git status
Review for any uncommitted changes, commit if needed.