Viktor Barzin 5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]

14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.

2026-03-24 00:55:25 +02:00

Security & L7 Protection

Overview

The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 5-layer anti-AI scraping defense. All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.

Architecture Diagram

graph LR
    Internet[Internet]
    CF[Cloudflare WAF]
    Tunnel[Cloudflared Tunnel]
    CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
    AntiAI[Anti-AI Check<br/>poison-fountain]
    ForwardAuth[Authentik ForwardAuth]
    RateLimit[Rate Limit Middleware]
    Retry[Retry Middleware<br/>2 attempts, 100ms]
    Backend[Backend Service]

    LAPI[CrowdSec LAPI<br/>3 replicas]
    Agent[CrowdSec Agent]

    Internet -->|1| CF
    CF -->|2| Tunnel
    Tunnel -->|3| CrowdSec
    CrowdSec -.->|Query| LAPI
    Agent -.->|Report| LAPI
    CrowdSec -->|4. Pass/Block| AntiAI
    AntiAI -->|5. Human/Bot| ForwardAuth
    ForwardAuth -->|6. Authenticated| RateLimit
    RateLimit -->|7. Under Limit| Retry
    Retry -->|8. Success/Retry| Backend

    style CrowdSec fill:#f9f,stroke:#333
    style AntiAI fill:#ff9,stroke:#333
    style ForwardAuth fill:#9f9,stroke:#333
    style RateLimit fill:#99f,stroke:#333

Components

Component	Version	Location	Purpose
CrowdSec LAPI	Pinned	`stacks/crowdsec/`	Local API, threat intelligence aggregation (3 replicas)
CrowdSec Agent	Pinned	`stacks/crowdsec/`	Log parser, scenario detection
CrowdSec Traefik Bouncer	Plugin	Traefik config	Plugin-based IP reputation check
Kyverno	Pinned chart	`stacks/kyverno/`	Policy engine for K8s admission control
poison-fountain	Latest	`stacks/poison-fountain/`	Anti-AI bot detection and tarpit service
cert-manager/certbot	-	`stacks/cert-manager/`	TLS certificate management
Traefik	Latest	`stacks/platform/`	Ingress controller with HTTP/3 (QUIC)

How It Works

Request Security Layers

Every incoming request passes through seven stages in order: six security layers followed by the retry middleware:

Cloudflare WAF - DDoS protection, bot detection, firewall rules (external)
Cloudflared Tunnel - Zero Trust tunnel, hides origin IP
CrowdSec Bouncer - IP reputation check against LAPI (fail-open on error)
Anti-AI Scraping - 5-layer bot defense (optional per service)
Authentik ForwardAuth - Authentication check (if protected = true)
Rate Limiting - Per-source IP rate limits (returns 429 on breach)
Retry Middleware - Auto-retry on transient errors (2 attempts, 100ms delay)
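
The ordering above can be read as a chain in which each stage either passes the request on or short-circuits with a response. The following is a minimal Python sketch of that idea only; the `Request` fields, the stage functions, the ban list, and the 403/401 status codes are illustrative assumptions, not the actual Traefik middleware configuration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    ip: str
    authenticated: bool = False


# A stage returns None to pass the request on, or an HTTP status to stop it.
Stage = Callable[[Request], Optional[int]]


def run_chain(request: Request, stages: list[Stage]) -> int:
    """Run stages in order; the first non-None status ends the chain."""
    for stage in stages:
        status = stage(request)
        if status is not None:
            return status
    return 200  # every stage passed, request reaches the backend


BANNED = {"203.0.113.7"}  # hypothetical ban list


def crowdsec(req: Request) -> Optional[int]:
    # Assumed status code for a banned IP.
    return 403 if req.ip in BANNED else None


def forward_auth(req: Request) -> Optional[int]:
    # Assumed status code for an unauthenticated request.
    return None if req.authenticated else 401
```

Because the chain stops at the first rejection, a banned IP never reaches the authentication check, which matches the order shown in the architecture diagram.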

CrowdSec Threat Intelligence

CrowdSec operates in a hub-and-agent model:

LAPI (Local API):

3 replicas for high availability
Aggregates threat intelligence from agent + community
Maintains ban list (IP reputation database)
Version pinned to prevent breaking changes

Agent:

Parses Traefik access logs
Detects attack scenarios (SQL injection, directory traversal, brute force)
Reports malicious IPs to LAPI
Shares threat intel with CrowdSec community (anonymized)

Traefik Bouncer Plugin:

Integrated as Traefik middleware
Queries LAPI for IP reputation on each request
Fail-open mode: If LAPI is unreachable, the bouncer allows traffic (graceful degradation)
Blocks IPs on ban list, allows others
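
The fail-open behaviour described for the bouncer can be sketched as follows. This is an illustration of the decision logic only, not the plugin's code; `lookup` is a hypothetical stand-in for the LAPI query, assumed to return `True` for a banned IP and to raise `ConnectionError` when LAPI cannot be reached.

```python
from typing import Callable


def is_allowed(ip: str, lookup: Callable[[str], bool]) -> bool:
    """Return True if the request may proceed.

    `lookup` is a hypothetical LAPI client: True means the IP is banned,
    ConnectionError means LAPI is unreachable.
    """
    try:
        banned = lookup(ip)
    except ConnectionError:
        # Fail-open: an unreachable LAPI must not block traffic.
        return True
    return not banned


def lapi_down(ip: str) -> bool:
    """Simulates an unreachable LAPI."""
    raise ConnectionError("LAPI unreachable")
```

The trade-off is explicit in the `except` branch: while LAPI is down, banned IPs are let through along with everyone else.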

Metabase (disabled by default):

Dashboard for CrowdSec analytics
CPU-intensive, only enable when investigating incidents

Kyverno Policy Engine

Kyverno enforces cluster-wide policies via admission webhooks. All policies use failurePolicy=Ignore to prevent blocking cluster operations.

5-Tier Resource Governance

Namespaces are labeled with a tier (tier: 0 through tier: 4). Kyverno auto-generates:

LimitRange - Per-container CPU/memory limits
ResourceQuota - Namespace-wide resource caps

Tier	CPU Limit/Container	Memory Limit/Container	Namespace CPU Quota	Namespace Memory Quota
0	100m	128Mi	500m	512Mi
1	250m	256Mi	1000m	1Gi
2	500m	512Mi	2000m	2Gi
3	1000m	1Gi	4000m	4Gi
4	2000m	2Gi	8000m	8Gi

This prevents resource exhaustion and enforces governance without manual quota management.
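
One way to sanity-check the tier table is to ask how many containers running at the per-container limit fit inside the namespace quota. The sketch below is a small Python calculation under the assumption that quotas are counted against the sum of container limits (the table does not say whether they apply to requests or limits); `TIERS` and `max_containers_at_limit` are hypothetical names.

```python
# Values copied from the tier table; CPU in millicores, memory in MiB.
TIERS = {
    0: {"cpu_limit": 100, "mem_limit": 128, "cpu_quota": 500, "mem_quota": 512},
    1: {"cpu_limit": 250, "mem_limit": 256, "cpu_quota": 1000, "mem_quota": 1024},
    2: {"cpu_limit": 500, "mem_limit": 512, "cpu_quota": 2000, "mem_quota": 2048},
    3: {"cpu_limit": 1000, "mem_limit": 1024, "cpu_quota": 4000, "mem_quota": 4096},
    4: {"cpu_limit": 2000, "mem_limit": 2048, "cpu_quota": 8000, "mem_quota": 8192},
}


def max_containers_at_limit(tier: int) -> int:
    """Containers at the full per-container limit that fit in the quota."""
    t = TIERS[tier]
    by_cpu = t["cpu_quota"] // t["cpu_limit"]
    by_mem = t["mem_quota"] // t["mem_limit"]
    return min(by_cpu, by_mem)  # the tighter resource is the binding one
```

With these values every tier fits four such containers. On tier 0 memory is the binding constraint, since CPU alone would allow five.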

Security Policies (ALL in Audit Mode)

Why audit mode? It allows a gradual rollout without breaking existing workloads. Policies first collect violations, and are then selectively switched to enforce mode once the violations have been cleaned up.

Policy	Purpose	Enforcement
`deny-privileged-containers`	Block privileged pods	Audit
`deny-host-namespaces`	Block hostNetwork/hostPID/hostIPC	Audit
`restrict-sys-admin`	Block CAP_SYS_ADMIN	Audit
`require-trusted-registries`	Only allow approved image registries	Audit

Operational Policies

Policy	Purpose	Mode
`inject-priority-class-from-tier`	Set pod priorityClass based on namespace tier	Enforce (CREATE only)
`inject-ndots`	Set DNS `ndots:2` for faster lookups	Enforce
`sync-tier-label`	Propagate tier label to child resources	Enforce
`goldilocks-vpa-auto-mode`	Disable VPA globally (VPA off)	Enforce

Anti-AI Scraping (5-Layer Defense)

Enabled by default via ingress_factory module. Disable per-service with anti_ai_scraping = false.

Layer 1: Bot Blocking (ForwardAuth)

Middleware calls poison-fountain service before backend
Analyzes User-Agent, request patterns, timing
Blocks known AI scrapers (GPTBot, CCBot, etc.)
Fail-open: If poison-fountain is down, traffic is allowed
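
The User-Agent part of this check can be sketched as a simple substring match. This Python example covers only that one signal and only the two scraper names given above; the real service is described as also looking at request patterns and timing, and its actual block list is not shown here. `BLOCKED_AGENTS` and `is_ai_scraper` are hypothetical names.

```python
# Only the two scrapers named in the text; the real list is presumably longer.
BLOCKED_AGENTS = ("gptbot", "ccbot")


def is_ai_scraper(user_agent: str) -> bool:
    """Case-insensitive match of the User-Agent against known scraper tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENTS)
```

A check like this only stops bots that identify themselves honestly, which is why the later layers exist.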

Layer 2: X-Robots-Tag Header

HTTP response header: X-Robots-Tag: noai, noindex, nofollow
Instructs compliant bots to skip content
Lightweight, no performance impact

Layer 3: Trap Links

JavaScript injects invisible links before </body>
Links point to honeypot endpoints
Legitimate browsers don't click, bots follow
Triggered bots get added to ban list

Layer 4: Tarpit

Serves AI bots extremely slowly (~100 bytes/sec)
Wastes bot resources, makes scraping uneconomical
Humans see normal speed (only applies to detected bots)
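
To get a feel for what ~100 bytes/sec means in practice, the sketch below computes the transfer time for a response and shows one way a body could be split into drip-fed chunks. It is a Python illustration under assumed names (`tarpit_seconds`, `tarpit_chunks`) and does not reflect how poison-fountain implements its tarpit.

```python
from typing import Iterator


def tarpit_seconds(size_bytes: int, rate_bytes_per_sec: float = 100.0) -> float:
    """Time needed to deliver a response at the tarpit rate."""
    return size_bytes / rate_bytes_per_sec


def tarpit_chunks(body: bytes, chunk_size: int = 100) -> Iterator[bytes]:
    """Split a body into chunks; a server would pause ~1s between them."""
    for start in range(0, len(body), chunk_size):
        yield body[start:start + chunk_size]
```

At this rate a 100 kB page takes about 1,000 seconds, roughly 17 minutes, per request.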

Layer 5: Poison Content

CronJob every 6 hours generates fake content
Injects misleading/nonsense data into pages shown to bots
Degrades AI training data quality
Requires --http1.1 flag to work with current HTTP/2 setup

Implementation: See stacks/poison-fountain/ and stacks/platform/modules/traefik/middleware.tf

TLS & HTTP/3

Traefik handles TLS termination:

HTTP/3 (QUIC) enabled for performance
Automatic HTTP → HTTPS redirect
cert-manager/certbot manages certificate lifecycle
Let's Encrypt integration for automatic renewal

Rate Limiting

Per-source IP limits:

Default: 100 requests/minute
Returns 429 Too Many Requests (not 503)
Higher limits for upload-heavy services:
- Immich: 500 req/min (photo uploads)
- Nextcloud: 300 req/min (file sync)
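
The per-source limits with per-service overrides can be modelled with a fixed-window counter. This Python sketch is a simplification for illustration: Traefik's own rate limiter is based on a token bucket, and the class and dictionary names here are hypothetical. Only the limits (100, 500, 300 requests/minute) and the 429 status come from the text above.

```python
from collections import defaultdict

DEFAULT_LIMIT = 100  # requests per minute
SERVICE_LIMITS = {"immich": 500, "nextcloud": 300}


class FixedWindowLimiter:
    """Counts requests per (service, source IP) in fixed 60-second windows."""

    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        # (service, ip, window index) -> hits; old windows are never pruned
        # in this sketch.
        self.counts = defaultdict(int)

    def check(self, service: str, ip: str, now: float) -> int:
        """Return 200 if the request is under the limit, 429 otherwise."""
        limit = SERVICE_LIMITS.get(service, DEFAULT_LIMIT)
        key = (service, ip, int(now // self.window))
        self.counts[key] += 1
        return 200 if self.counts[key] <= limit else 429
```

Usage: a source that sends 101 requests to a default-limit service within the same minute gets 200 for the first 100 and 429 for the last one; the counter resets in the next window.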

Retry Middleware:

2 attempts max
100ms delay between retries
Applied after rate limiting
Handles transient backend errors
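
The retry behaviour (2 attempts, 100ms between them) can be sketched as below. This is a Python illustration, not Traefik's implementation; it assumes a transient error shows up as a `ConnectionError` raised by a hypothetical `send` callable, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    attempts: int = 2,
    delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send`, retrying on ConnectionError up to `attempts` times total."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts, surface the error
            sleep(delay)  # wait before the next attempt
    raise ValueError("attempts must be at least 1")
```

Note that every retry is a second request to the backend, so non-idempotent endpoints may see duplicates.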

Fallback Proxies

Authentik Fallback:

If Authentik is down, falls back to basicAuth
Prevents total service outage during IdP maintenance
Temporary credentials stored in Vault

Poison-Fountain Fallback:

If the anti-AI service is down, allows all traffic
Fail-open prevents blocking legitimate users
Monitors for service health, auto-recovers

Configuration

Key Config Files

Path	Purpose
`stacks/crowdsec/`	CrowdSec LAPI, agent, bouncer config
`stacks/kyverno/`	Kyverno deployment + policies
`stacks/poison-fountain/`	Anti-AI service + CronJob
`stacks/platform/modules/traefik/middleware.tf`	Security middleware definitions
`stacks/platform/modules/ingress_factory/`	Per-service security toggles

Vault Paths

CrowdSec API key: secret/crowdsec/api-key - LAPI authentication
BasicAuth fallback: secret/authentik/fallback-creds - Emergency auth
TLS certificates: secret/tls/ - Certificate private keys

Terraform Stacks

stacks/crowdsec/ - CrowdSec infrastructure
stacks/kyverno/ - Policy engine
stacks/poison-fountain/ - Anti-AI defense
stacks/platform/ - Traefik + middleware

Per-Service Security Config

module "myapp_ingress" {
  source = "./modules/ingress_factory"

  name      = "myapp"
  host      = "myapp.viktorbarzin.me"

  # Security toggles
  protected         = true   # Enable ForwardAuth
  anti_ai_scraping  = false  # Disable anti-AI (e.g., for public API)
  rate_limit        = 200    # Custom rate limit (req/min)
}

Kyverno Policy Example

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-ndots
spec:
  background: false
  rules:
  - name: inject-ndots
    match:
      resources:
        kinds:
        - Pod
    mutate:
      patchStrategicMerge:
        spec:
          dnsConfig:
            options:
            - name: ndots
              value: "2"

Decisions & Rationale

Why CrowdSec over ModSecurity?

Community threat intelligence: Shared ban lists, crowdsourced attack detection
Easier management: YAML scenarios vs complex ModSecurity rules
Better performance: Lightweight Go agent vs resource-heavy Apache module
Active development: More frequent updates, responsive community

Why Audit-Only Security Policies?

Gradual rollout: Identify violations without breaking existing workloads
Risk reduction: Prevents policy bugs from blocking critical deployments
Better observability: Collect violation metrics before enforcing
Selective enforcement: Move to enforce mode per-policy after validation

Why 5-Layer Anti-AI Defense?

Defense in depth: Each layer catches different bot types
Compliant bots: Layer 2 (X-Robots-Tag) handles respectful crawlers
Dumb bots: Layer 3 (trap links) catches simple scrapers
Persistent bots: Layer 4 (tarpit) makes scraping uneconomical
Sophisticated bots: Layer 5 (poison content) degrades training data

Why Fail-Open Mode?

Availability over security: Homelab prioritizes uptime
Graceful degradation: Single component failure doesn't cascade
Manual intervention: Security incidents are rare and can be handled manually
Layer redundancy: If one layer fails, others still protect

Why Pin CrowdSec/Kyverno Versions?

Breaking changes: Both projects have had breaking config changes in the past
Controlled upgrades: Test in staging before upgrading production
Stability: Prevents auto-upgrade during outages
Rollback: Easy to revert if upgrade causes issues

Why HTTP/3 (QUIC)?

Performance: Lower latency, better mobile performance
Connection migration: Survives IP changes (mobile networks)
0-RTT: Faster TLS handshake for repeat visitors
Future-proof: Industry moving to HTTP/3

Troubleshooting

CrowdSec Blocking Legitimate IP

Problem: Legitimate user IP on ban list.

Fix:

Check LAPI decisions: kubectl exec -it crowdsec-lapi-0 -- cscli decisions list
Remove ban: kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>
Whitelist if needed: Add to stacks/crowdsec/whitelist.yaml

Kyverno Policy Blocking Deployment

Problem: Pod creation fails with policy violation.

Fix:

Check policy reports: kubectl get policyreport -A
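Verify failurePolicy=Ignore is set. This only prevents blocking when the Kyverno webhook itself is unavailable, so also confirm the policy has not been switched from audit to enforce mode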
If blocking, temporarily disable policy: kubectl annotate clusterpolicy <policy> kyverno.io/exclude=true
Investigate root cause, fix workload or update policy

Anti-AI Service Down, Traffic Blocked

Problem: poison-fountain service unhealthy, all traffic blocked.

Fix:

Verify fail-open config: Check stacks/platform/modules/traefik/middleware.tf for failurePolicy: allow
Restart service: kubectl rollout restart deployment/poison-fountain -n poison-fountain
Temporary disable: Set anti_ai_scraping = false in ingress_factory for affected services

Rate Limit Too Aggressive

Problem: Legitimate users getting 429 errors.

Fix:

Check Traefik logs for rate limit hits: kubectl logs -n traefik -l app=traefik | grep 429
Increase limit in ingress_factory: rate_limit = 300
Apply: terraform apply

HTTP/3 Not Working

Problem: Browser shows HTTP/2, not HTTP/3.

Fix:

Verify Traefik HTTP/3 enabled: kubectl get cm traefik-config -o yaml | grep http3
Check UDP port 443 accessible: nc -vzu <public-ip> 443 (UDP probes are best-effort; a reported success does not prove that QUIC is answering)
Browser support: Use Chrome/Firefox dev tools, check Protocol column

TLS Certificate Expired

Problem: Browser shows certificate expired.

Fix:

Check cert-manager: kubectl get certificate -A
Force renewal: kubectl delete secret <tls-secret> -n <namespace>
cert-manager will auto-renew within 5 minutes
If renewal fails, check Let's Encrypt rate limits

Traefik Retry Loop

Problem: Backend logs show duplicate requests.

Fix:

Check retry middleware config: Should be 2 attempts max
Verify backend isn't returning transient errors: Check for 5xx responses
Disable retry for specific service: Remove retry middleware from ingress_factory

Poison Content Not Injecting

Problem: Bots not receiving poisoned content.

Fix:

Verify CronJob running: kubectl get cronjob -n poison-fountain
Check logs: kubectl logs -n poison-fountain -l app=poison-content-injector
Ensure --http1.1 flag set (required for HTTP/2 backends)
Manually trigger: kubectl create job --from=cronjob/poison-content manual-poison -n poison-fountain

Related

Authentication & Authorization - Authentik, OIDC, ForwardAuth
Networking - Ingress, DNS, load balancing
Monitoring - Prometheus, Grafana, alerting
CrowdSec Runbook - CrowdSec operations
Kyverno Policy Management - Policy authoring and troubleshooting
