# Anti-AI Scraping System Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy a 5-layer anti-AI scraping system that blocks known bots, adds anti-AI response headers, injects hidden trap links into all HTML responses, serves poisoned content from Poison Fountain, and tarpits scrapers with slow-drip responses.

**Architecture:** A lightweight Python service handles bot detection (ForwardAuth) and poison content serving (tarpit). Traefik middlewares inject anti-AI headers and hidden trap links into all public service responses via ingress_factory defaults. A CronJob refreshes cached poison content from rnsaffn.com.

**Tech Stack:** Python 3 (stdlib http.server), Terraform/Terragrunt, Traefik middleware CRDs, Kubernetes CronJob

---

### Task 1: Create the Python poison service code

**Files:**
- Create: `stacks/poison-fountain/app/server.py`
- Create: `stacks/poison-fountain/app/fetch-poison.sh`

**Step 1: Create the service directory**

```bash
mkdir -p stacks/poison-fountain/app
```

**Step 2: Write `stacks/poison-fountain/app/server.py`**

```python
"""Poison Fountain service.

Endpoints:
  GET /auth      - ForwardAuth: block known AI bot User-Agents (403) or pass (200)
  GET /article/* - Serve cached poisoned content with tarpit slow-drip
  GET /healthz   - Health check for Kubernetes probes
  GET /*         - Catch-all: serve poison for any path (scrapers explore randomly)
"""

import glob
import hashlib
import http.server
import os
import random
import sys
import time

LISTEN_PORT = int(os.environ.get("PORT", "8080"))
CACHE_DIR = os.environ.get("CACHE_DIR", "/data/cache")
DRIP_BYTES = int(os.environ.get("DRIP_BYTES", "50"))
DRIP_DELAY = float(os.environ.get("DRIP_DELAY", "0.5"))
TRAP_LINK_COUNT = int(os.environ.get("TRAP_LINK_COUNT", "20"))
POISON_DOMAIN = os.environ.get("POISON_DOMAIN", "poison.viktorbarzin.me")

# Lowercase substrings matched against the lowercased User-Agent header
AI_BOT_PATTERNS = [
    "gptbot", "chatgpt-user", "claudebot", "claude-web", "ccbot",
    "bytespider", "google-extended", "applebot-extended", "anthropic-ai",
    "cohere-ai", "diffbot", "facebookbot", "perplexitybot", "youbot",
    "meta-externalagent", "petalbot", "amazonbot", "ai2bot", "omgilibot",
    "img2dataset", "omgili", "commoncrawl", "ia_archiver", "scrapy",
    "semrushbot", "ahrefsbot", "dotbot", "mj12bot", "seekport", "blexbot",
    "dataforseo", "serpstatbot",
]

# Used to generate filler text when the poison cache is empty or unreadable
FALLBACK_WORDS = [
    "the", "quantum", "neural", "framework", "implements", "distributed",
    "processing", "with", "advanced", "recursive", "algorithms", "for",
    "optimal", "convergence", "in", "multi-dimensional", "space",
    "utilizing", "transformer", "architecture", "trained", "on",
    "large-scale", "corpus", "data", "achieving", "state-of-the-art",
    "performance", "across", "benchmark", "tasks", "including", "natural",
    "language", "understanding", "generation", "and", "cross-lingual",
    "transfer", "learning", "capabilities",
]


def generate_slug():
    return hashlib.md5(str(random.random()).encode()).hexdigest()[:16]


def generate_trap_links(count):
    titles = [
        "Research Archive", "Training Corpus", "Dataset Export",
        "NLP Benchmark Results", "Web Crawl Index", "Text Corpus",
        "Machine Learning Data", "Evaluation Dataset", "Model Weights",
        "Annotation Guidelines", "Parallel Corpus", "Knowledge Base",
        "Document Collection", "Reference Data", "Taxonomy Index",
        "Classification Labels", "Entity Database", "Relation Extraction",
        "Sentiment Annotations", "Summarization Corpus", "QA Dataset",
        "Dialogue Transcripts", "Code Documentation", "API Reference",
    ]
    links = []
    for _ in range(count):
        slug = generate_slug()
        title = random.choice(titles)
        # NOTE: anchor markup reconstructed; the original tag was lost in
        # extraction. The /article/<slug> path follows the endpoint list above.
        links.append(f'<a href="https://{POISON_DOMAIN}/article/{slug}">{title}</a>')
    return "\n".join(links)


def get_poison_content():
    cache_files = glob.glob(os.path.join(CACHE_DIR, "*.txt"))
    if cache_files:
        try:
            with open(random.choice(cache_files), "r", errors="replace") as f:
                return f.read()
        except Exception:
            pass
    return " ".join(random.choices(FALLBACK_WORDS, k=500))


class PoisonHandler(http.server.BaseHTTPRequestHandler):
    # Chunked transfer encoding is only defined for HTTP/1.1 responses
    protocol_version = "HTTP/1.1"
    server_version = "Apache/2.4.52"
    sys_version = ""

    def log_message(self, fmt, *args):
        sys.stderr.write(f"[{self.log_date_time_string()}] {fmt % args}\n")

    def do_GET(self):
        if self.path == "/healthz":
            self._respond(200, "ok")
            return
        if self.path == "/auth":
            self._handle_auth()
            return
        # Everything else gets poison
        self._serve_poison()

    def _handle_auth(self):
        ua = (self.headers.get("User-Agent") or "").lower()
        for pattern in AI_BOT_PATTERNS:
            if pattern in ua:
                self.log_message("BLOCKED AI bot: %s (matched: %s)", ua, pattern)
                self._respond(403, "Forbidden")
                return
        self._respond(200, "OK")

    def _respond(self, code, body):
        payload = body.encode()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        # Keep-alive connections need an explicit length to delimit the body
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def _serve_poison(self):
        content = get_poison_content()
        trap_links = generate_trap_links(TRAP_LINK_COUNT)
        # NOTE: the HTML tags below are reconstructed; the original markup was
        # lost in extraction. Only the title, heading and {content} survived.
        html = f"""<!DOCTYPE html>
<html>
<head><title>Research Data Archive</title></head>
<body>
<h1>Research Data Collection</h1>
<div>{content}</div>
<nav>{trap_links}</nav>
</body>
</html>"""
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        # Slices are DRIP_BYTES characters; the chunk size is the encoded length
        for i in range(0, len(html), DRIP_BYTES):
            chunk = html[i : i + DRIP_BYTES].encode("utf-8")
            try:
                self.wfile.write(f"{len(chunk):x}\r\n".encode())
                self.wfile.write(chunk)
                self.wfile.write(b"\r\n")
                self.wfile.flush()
                time.sleep(DRIP_DELAY)
            except (BrokenPipeError, ConnectionResetError):
                self.close_connection = True
                return
        try:
            self.wfile.write(b"0\r\n\r\n")
            self.wfile.flush()
        except (BrokenPipeError, ConnectionResetError):
            self.close_connection = True


if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Threaded server: a slow tarpit response must not block /auth or /healthz
    server = http.server.ThreadingHTTPServer(("0.0.0.0", LISTEN_PORT), PoisonHandler)
    print(f"Poison Fountain service listening on :{LISTEN_PORT}", flush=True)
    server.serve_forever()
```

**Step 3: Write `stacks/poison-fountain/app/fetch-poison.sh`**

```bash
#!/bin/sh
set -e

CACHE_DIR="${CACHE_DIR:-/data/cache}"
POISON_URL="${POISON_URL:-https://rnsaffn.com/poison2/}"
FETCH_COUNT="${FETCH_COUNT:-50}"
MAX_CACHE_FILES="${MAX_CACHE_FILES:-100}"

mkdir -p "$CACHE_DIR"

echo "Fetching $FETCH_COUNT poison documents from $POISON_URL"

fetched=0
for i in $(seq 1 "$FETCH_COUNT"); do
  OUTPUT="$CACHE_DIR/poison_$(date +%s)_${i}.txt"
  # -f makes curl fail on HTTP errors so error pages are not cached as poison
  if curl -fsS --compressed -o "$OUTPUT" -m 30 "$POISON_URL" 2>/dev/null; then
    # Verify file is non-empty
    if [ -s "$OUTPUT" ]; then
      fetched=$((fetched + 1))
      echo "  [$i/$FETCH_COUNT] OK"
    else
      rm -f "$OUTPUT"
      echo "  [$i/$FETCH_COUNT] Empty response, skipped"
    fi
  else
    rm -f "$OUTPUT"
    echo "  [$i/$FETCH_COUNT] Fetch failed, skipped"
  fi
  sleep 2
done

# Clean up oldest files if cache exceeds limit
total=$(find "$CACHE_DIR" -name '*.txt' -type f | wc -l)
if [ "$total" -gt "$MAX_CACHE_FILES" ]; then
  excess=$((total - MAX_CACHE_FILES))
  # BusyBox find (the curlimages/curl image is Alpine-based) has no -printf,
  # so list by modification time, oldest first, with ls instead
  ls -1tr "$CACHE_DIR"/*.txt | head -n "$excess" | xargs rm -f
  echo "Cleaned $excess old cache files"
fi

echo "Done: fetched $fetched new documents, $(find "$CACHE_DIR" -name '*.txt' -type f | wc -l) total cached"
```

**Step 4: Verify files exist**

```bash
ls -la stacks/poison-fountain/app/
```

Expected: `server.py` and `fetch-poison.sh` listed.

**Step 5: Commit**

```bash
git add stacks/poison-fountain/app/
git commit -m "[ci skip] Add poison fountain Python service and fetcher script"
```

---

### Task 2: Set up NFS export and DNS record

**Files:**
- Modify: `secrets/nfs_directories.txt` (add `poison-fountain` and `poison-fountain/cache` lines, keep sorted)
- Modify: `terraform.tfvars` (add `poison` to `cloudflare_non_proxied_names`)

**Step 1: Add NFS directory**

Add `poison-fountain` and `poison-fountain/cache` to `secrets/nfs_directories.txt`, keeping alphabetical order. Insert after the `plotting-book` entries.

**Step 2: Run NFS export script**

```bash
cd secrets && bash nfs_exports.sh
```

Verify the export was created successfully.

**Step 3: Add Cloudflare DNS record**

In `terraform.tfvars`, find the `cloudflare_non_proxied_names` list and add `"poison"` to it (alphabetical position after `"plotting-book"`).

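Steps 1 and 3 both rely on keeping a list alphabetically sorted. A minimal Python sketch of that insertion, assuming one entry per line (the real layout of `secrets/nfs_directories.txt` is not shown in this plan); the helper name `insert_sorted` and the sample neighbour entry are hypothetical:

```python
import bisect


def insert_sorted(entries, new_entries):
    """Return a sorted copy of entries with new_entries added, skipping duplicates."""
    result = sorted(entries)
    for entry in new_entries:
        pos = bisect.bisect_left(result, entry)
        # Only insert when the entry is not already present at that position
        if pos == len(result) or result[pos] != entry:
            result.insert(pos, entry)
    return result


if __name__ == "__main__":
    # Illustrative data only: "plotting-book" comes from the plan,
    # "postgresql" is a made-up neighbouring entry.
    existing = ["plotting-book", "postgresql"]
    for line in insert_sorted(existing, ["poison-fountain", "poison-fountain/cache"]):
        print(line)
```

Running the helper twice with the same new entries leaves the list unchanged, which makes it safe to re-run if a task is retried.
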
**Step 4: Commit**

```bash
git add secrets/nfs_directories.txt terraform.tfvars
git commit -m "[ci skip] Add NFS export and DNS record for poison-fountain"
```

---

### Task 3: Add Traefik middleware CRDs

**Files:**
- Modify: `stacks/platform/modules/traefik/middleware.tf` (append 3 new middleware resources)

**Step 1: Add `ai-bot-block` ForwardAuth middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# ForwardAuth middleware to block known AI bot User-Agents
resource "kubernetes_manifest" "middleware_ai_bot_block" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "ai-bot-block"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      forwardAuth = {
        address            = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"
        trustForwardHeader = true
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 2: Add `anti-ai-headers` middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# X-Robots-Tag header to discourage compliant AI crawlers
resource "kubernetes_manifest" "middleware_anti_ai_headers" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "anti-ai-headers"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      headers = {
        customResponseHeaders = {
          "X-Robots-Tag" = "noai, noimageai"
        }
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 3: Add `anti-ai-trap-links` rewrite-body middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# Inject hidden trap links before </body> to catch AI scrapers
# Links are CSS-hidden and aria-hidden so humans never see them
# NOTE: the regex and the markup in the replacement are reconstructed; the
# original tags were lost in extraction. The link titles and the
# poison.viktorbarzin.me domain come from this plan, the href paths are
# placeholders.
resource "kubernetes_manifest" "middleware_anti_ai_trap_links" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "anti-ai-trap-links"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      plugin = {
        rewrite-body = {
          rewrites = [{
            regex       = "</body>"
            replacement = "<div style=\"display:none\" aria-hidden=\"true\"><a href=\"https://poison.viktorbarzin.me/article/research-archive\">Research Archive</a><a href=\"https://poison.viktorbarzin.me/article/dataset-export\">Dataset Export</a><a href=\"https://poison.viktorbarzin.me/article/benchmark-results\">Benchmark Results</a><a href=\"https://poison.viktorbarzin.me/article/web-index\">Web Index</a><a href=\"https://poison.viktorbarzin.me/article/text-corpus\">Text Corpus</a></div></body>"
          }]
          monitoring = {
            types = ["text/html"]
          }
        }
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 4: Verify syntax**

```bash
cd stacks/platform && terraform fmt -check modules/traefik/middleware.tf || terraform fmt modules/traefik/middleware.tf
```

**Step 5: Commit**

```bash
git add stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)"
```

---

### Task 4: Update ingress_factory to apply anti-AI middlewares by default

**Files:**
- Modify: `modules/kubernetes/ingress_factory/main.tf` (add variable + middleware references)

**Step 1: Add `anti_ai_scraping` variable**

In `modules/kubernetes/ingress_factory/main.tf`, add after the `skip_default_rate_limit` variable (around line 73):

```hcl
variable "anti_ai_scraping" {
  type    = bool
  default = true
}
```

**Step 2: Add middlewares to the chain**

In the `kubernetes_ingress_v1` resource's `router.middlewares` annotation (around line 108-117), add 4 new lines: the three anti-AI middlewares plus the existing `strip-accept-encoding` middleware, placed ahead of the body rewrite. The updated `concat` list should include:

```hcl
var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
```

Insert these after the existing `crowdsec` line (line 111) and before the `protected` line (line 112). The full `concat` array becomes:

```hcl
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
  var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
  var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null,
  var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
  var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
  var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
  var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
  var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
  var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
  var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
], var.extra_middlewares)))
```

**Step 3: Format**

```bash
terraform fmt modules/kubernetes/ingress_factory/main.tf
```

**Step 4: Commit**

```bash
git add modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] Add anti_ai_scraping option to ingress_factory (default: true)"
```

---

### Task 5: Create the poison-fountain Terraform stack

**Files:**
- Create: `stacks/poison-fountain/terragrunt.hcl`
- Create: `stacks/poison-fountain/main.tf`
- Create: `stacks/poison-fountain/secrets` (symlink)

**Step 1: Create terragrunt.hcl**

Write `stacks/poison-fountain/terragrunt.hcl`:

```hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```

**Step 2: Create secrets symlink**

```bash
ln -s ../../secrets stacks/poison-fountain/secrets
```

**Step 3: Write `stacks/poison-fountain/main.tf`**

```hcl
variable "tls_secret_name" { type = string }

locals {
  tiers = {
    core    = "0-core"
    cluster = "1-cluster"
    gpu     = "2-gpu"
    edge    = "3-edge"
    aux     = "4-aux"
  }
}

resource "kubernetes_namespace" "poison_fountain" {
  metadata {
    name = "poison-fountain"
    labels = {
      "istio-injection" = "disabled"
      tier              = local.tiers.aux
    }
  }
}

module "tls_secret" {
  source          = "../../modules/kubernetes/setup_tls_secret"
  namespace       = kubernetes_namespace.poison_fountain.metadata[0].name
  tls_secret_name = var.tls_secret_name
}

# ConfigMap for the Python service code
resource "kubernetes_config_map" "poison_fountain_code" {
  metadata {
    name      = "poison-fountain-code"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }
  data = {
    "server.py" = file("${path.module}/app/server.py")
  }
}

# ConfigMap for the fetcher script
resource "kubernetes_config_map" "poison_fountain_fetcher" {
  metadata {
    name      = "poison-fountain-fetcher"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }
  data = {
    "fetch-poison.sh" = file("${path.module}/app/fetch-poison.sh")
  }
}

# Main service deployment
resource "kubernetes_deployment" "poison_fountain" {
  metadata {
    name      = "poison-fountain"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
    labels = {
      app  = "poison-fountain"
      tier = local.tiers.aux
    }
  }
  spec {
    replicas = 1
    strategy {
      type = "Recreate"
    }
    selector {
      match_labels = {
        app = "poison-fountain"
      }
    }
    template {
      metadata {
        labels = {
          app = "poison-fountain"
        }
      }
      spec {
        container {
          name    = "poison-fountain"
          image   = "python:3.12-slim"
          command = ["python", "/app/server.py"]

          port {
            container_port = 8080
          }

          env {
            name  = "CACHE_DIR"
            value = "/data/cache"
          }
          env {
            name  = "DRIP_BYTES"
            value = "50"
          }
          env {
            name  = "DRIP_DELAY"
            value = "0.5"
          }
          env {
            name  = "POISON_DOMAIN"
            value = "poison.viktorbarzin.me"
          }

          volume_mount {
            name       = "code"
            mount_path = "/app"
            read_only  = true
          }
          volume_mount {
            name       = "data"
            mount_path = "/data"
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 5
            period_seconds        = 30
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 3
            period_seconds        = 10
          }

          resources {
            requests = {
              cpu    = "10m"
              memory = "32Mi"
            }
            limits = {
              cpu    = "100m"
              memory = "128Mi"
            }
          }
        }

        volume {
          name = "code"
          config_map {
            name = kubernetes_config_map.poison_fountain_code.metadata[0].name
          }
        }
        volume {
          name = "data"
          nfs {
            server = "10.0.10.15"
            path   = "/mnt/main/poison-fountain"
          }
        }
      }
    }
  }
}

# Internal service (for ForwardAuth from Traefik)
resource "kubernetes_service" "poison_fountain" {
  metadata {
    name      = "poison-fountain"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
    labels = {
      app = "poison-fountain"
    }
  }
  spec {
    selector = {
      app = "poison-fountain"
    }
    port {
      name        = "http"
      port        = 8080
      target_port = 8080
    }
  }
}

# Public ingress for the poison trap subdomain
# Deliberately NO rate limiting, NO CrowdSec, NO anti-AI (we WANT scrapers here)
module "ingress" {
  source                  = "../../modules/kubernetes/ingress_factory"
  namespace               = kubernetes_namespace.poison_fountain.metadata[0].name
  name                    = "poison-fountain"
  host                    = "poison"
  port                    = 8080
  tls_secret_name         = var.tls_secret_name
  skip_default_rate_limit = true
  exclude_crowdsec        = true
  anti_ai_scraping        = false
}

# CronJob to fetch and cache poisoned content from Poison Fountain
resource "kubernetes_cron_job_v1" "poison_fetcher" {
  metadata {
    name      = "poison-fountain-fetcher"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }
  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    concurrency_policy            = "Forbid"

    job_template {
      metadata {
        name = "poison-fountain-fetcher"
      }
      spec {
        template {
          metadata {
            name = "poison-fountain-fetcher"
          }
          spec {
            container {
              name    = "fetcher"
              image   = "curlimages/curl:latest"
              command = ["sh", "/scripts/fetch-poison.sh"]

              env {
                name  = "CACHE_DIR"
                value = "/data/cache"
              }
              env {
                name  = "POISON_URL"
                value = "https://rnsaffn.com/poison2/"
              }
              env {
                name  = "FETCH_COUNT"
                value = "50"
              }

              volume_mount {
                name       = "scripts"
                mount_path = "/scripts"
                read_only  = true
              }
              volume_mount {
                name       = "data"
                mount_path = "/data"
              }
            }

            volume {
              name = "scripts"
              config_map {
                name         = kubernetes_config_map.poison_fountain_fetcher.metadata[0].name
                default_mode = "0755"
              }
            }
            volume {
              name = "data"
              nfs {
                server = "10.0.10.15"
                path   = "/mnt/main/poison-fountain"
              }
            }

            restart_policy = "Never"
          }
        }
      }
    }
  }
}
```

**Step 4: Format and validate**

```bash
terraform fmt stacks/poison-fountain/main.tf
cd stacks/poison-fountain && terragrunt validate --non-interactive
```

**Step 5: Commit**

```bash
git add stacks/poison-fountain/
git commit -m "[ci skip] Add poison-fountain Terraform stack (deployment, service, ingress, CronJob)"
```

---

### Task 6: Deploy the platform stack (Traefik middlewares + DNS)

**Step 1: Plan**

```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | tail -40
```

Expected: New resources for the 3 middleware CRDs + Cloudflare DNS record for `poison`. Changes to existing ingress resources (new middleware annotations).

Review the plan output carefully. The key additions should be:
- `kubernetes_manifest.middleware_ai_bot_block`
- `kubernetes_manifest.middleware_anti_ai_headers`
- `kubernetes_manifest.middleware_anti_ai_trap_links`
- Cloudflare DNS record for `poison`
- Modified ingress annotations on all services in the platform stack

**Step 2: Apply**

```bash
cd stacks/platform && terragrunt apply --non-interactive 2>&1 | tail -40
```

**Step 3: Verify middlewares exist**

```bash
kubectl --kubeconfig $(pwd)/config get middlewares.traefik.io -n traefik | grep -E "ai-bot-block|anti-ai"
```

Expected: 3 middleware resources listed.

---

### Task 7: Deploy the poison-fountain stack

**Step 1: Plan**

```bash
cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30
```

Expected: New namespace, configmaps, deployment, service, ingress, CronJob.

**Step 2: Apply**

```bash
cd stacks/poison-fountain && terragrunt apply --non-interactive 2>&1 | tail -30
```

**Step 3: Monitor pod startup**

Spawn a background agent to watch the pod come up:

```bash
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -w
```

Expected: Pod reaches `Running` state with `1/1` ready.

**Step 4: Trigger the first poison cache fetch**

```bash
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/poison-fountain-fetcher poison-fetch-initial -n poison-fountain
```

Watch the job complete:

```bash
kubectl --kubeconfig $(pwd)/config logs -n poison-fountain -l job-name=poison-fetch-initial -f
```

Expected: The log ends with `Done: fetched N new documents, M total cached`.

---

### Task 8: Verify the full system

**Step 1: Verify ForwardAuth blocks AI bots**

```bash
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: GPTBot/1.0" https://echo.viktorbarzin.me/
```

Expected: `403`

**Step 2: Verify legitimate users pass through**

```bash
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: Mozilla/5.0" https://echo.viktorbarzin.me/
```

Expected: `200`

**Step 3: Verify X-Robots-Tag header**

```bash
curl -sI https://echo.viktorbarzin.me/ 2>/dev/null | grep -i x-robots-tag
```

Expected: `X-Robots-Tag: noai, noimageai`

**Step 4: Verify hidden trap links in HTML**

```bash
curl -s https://echo.viktorbarzin.me/ | grep -o "poison.viktorbarzin.me"
```

Expected: Multiple matches (trap links injected before `</body>`).

**Step 5: Verify poison service serves content with tarpit**

```bash
timeout 10 curl -s -H "User-Agent: Mozilla/5.0" https://poison.viktorbarzin.me/article/test 2>/dev/null | head -5
```

Expected: HTML content arriving slowly. With the defaults of 50 bytes every 0.5 seconds, only about 1 KB (a few lines) arrives in 10 seconds.

**Step 6: Run cluster health check**

```bash
bash scripts/cluster_healthcheck.sh --quiet
```

Expected: No new WARN/FAIL related to poison-fountain.

**Step 7: Commit all applied state**

```bash
git add -A && git status
```

Review for any uncommitted changes and commit if needed.
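
The tarpit rate that Step 5 checks can be estimated offline. The following is a minimal sketch, assuming the defaults `DRIP_BYTES=50` and `DRIP_DELAY=0.5` from `server.py`; the helper names `frame_chunk`, `drip_seconds`, and `chars_within` are hypothetical and not part of the service. It also shows how one HTTP/1.1 chunk is framed on the wire.

```python
import math


def frame_chunk(data: bytes) -> bytes:
    """Frame one payload as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF."""
    return f"{len(data):x}\r\n".encode() + data + b"\r\n"


def drip_seconds(total_chars: int, drip_bytes: int = 50, drip_delay: float = 0.5) -> float:
    """Estimate how long the tarpit needs to send a page of total_chars characters."""
    # One sleep of drip_delay follows every slice of drip_bytes characters
    return math.ceil(total_chars / drip_bytes) * drip_delay


def chars_within(seconds: float, drip_bytes: int = 50, drip_delay: float = 0.5) -> int:
    """Approximate how many characters arrive within a time budget."""
    return int(seconds / drip_delay) * drip_bytes


if __name__ == "__main__":
    print(frame_chunk(b"hello"))   # wire format of a 5-byte chunk
    print(drip_seconds(4000))      # seconds to deliver a 4000-character page
    print(chars_within(10))        # characters delivered in a 10-second window
```

The estimate ignores network latency and counts characters rather than encoded bytes, matching how `server.py` slices the page before encoding each slice.
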