Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.
## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
`types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
`cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
SRC/DST/PROTO/PORT for every NEW egress connection.
## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`
The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).
## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.
Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
505 lines
24 KiB
Markdown
505 lines
24 KiB
Markdown
# Security & L7 Protection
|
||
|
||
## Overview
|
||
|
||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||
|
||
## Architecture Diagram
|
||
|
||
```mermaid
|
||
graph LR
|
||
Internet[Internet]
|
||
CF[Cloudflare WAF]
|
||
Tunnel[Cloudflared Tunnel]
|
||
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
|
||
AntiAI[Anti-AI Check<br/>poison-fountain]
|
||
ForwardAuth[Authentik ForwardAuth]
|
||
RateLimit[Rate Limit Middleware]
|
||
Retry[Retry Middleware<br/>2 attempts, 100ms]
|
||
Backend[Backend Service]
|
||
|
||
LAPI[CrowdSec LAPI<br/>3 replicas]
|
||
Agent[CrowdSec Agent]
|
||
|
||
Internet -->|1| CF
|
||
CF -->|2| Tunnel
|
||
Tunnel -->|3| CrowdSec
|
||
CrowdSec -.->|Query| LAPI
|
||
Agent -.->|Report| LAPI
|
||
CrowdSec -->|4. Pass/Block| AntiAI
|
||
AntiAI -->|5. Human/Bot| ForwardAuth
|
||
ForwardAuth -->|6. Authenticated| RateLimit
|
||
RateLimit -->|7. Under Limit| Retry
|
||
Retry -->|8. Success/Retry| Backend
|
||
|
||
style CrowdSec fill:#f9f,stroke:#333
|
||
style AntiAI fill:#ff9,stroke:#333
|
||
style ForwardAuth fill:#9f9,stroke:#333
|
||
style RateLimit fill:#99f,stroke:#333
|
||
```
|
||
|
||
## Components
|
||
|
||
| Component | Version | Location | Purpose |
|
||
|-----------|---------|----------|---------|
|
||
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
|
||
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
|
||
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
|
||
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
|
||
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
|
||
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
|
||
| Traefik | Latest | `stacks/platform/` | Ingress controller with HTTP/3 (QUIC) |
|
||
|
||
## How It Works
|
||
|
||
### Request Security Layers
|
||
|
||
Every incoming request passes through 6 security layers:
|
||
|
||
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
|
||
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
|
||
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
|
||
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
|
||
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
|
||
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
|
||
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
|
||
|
||
### CrowdSec Threat Intelligence
|
||
|
||
CrowdSec operates in a hub-and-agent model:
|
||
|
||
**LAPI (Local API)**:
|
||
- 3 replicas for high availability
|
||
- Aggregates threat intelligence from agent + community
|
||
- Maintains ban list (IP reputation database)
|
||
- Version pinned to prevent breaking changes
|
||
|
||
**Agent**:
|
||
- Parses Traefik access logs
|
||
- Detects attack scenarios (SQL injection, directory traversal, brute force)
|
||
- Reports malicious IPs to LAPI
|
||
- Shares threat intel with CrowdSec community (anonymized)
|
||
|
||
**Traefik Bouncer Plugin**:
|
||
- Integrated as Traefik middleware
|
||
- Queries LAPI for IP reputation on each request
|
||
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
|
||
- Blocks IPs on ban list, allows others
|
||
|
||
**Metabase** (disabled by default):
|
||
- Dashboard for CrowdSec analytics
|
||
- CPU-intensive, only enable when investigating incidents
|
||
|
||
### Kyverno Policy Engine
|
||
|
||
Kyverno enforces cluster-wide policies via admission webhooks. All policies use `failurePolicy=Ignore` to prevent blocking cluster operations.
|
||
|
||
#### 5-Tier Resource Governance
|
||
|
||
Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-generates:
|
||
|
||
- **LimitRange** - Per-container CPU/memory limits
|
||
- **ResourceQuota** - Namespace-wide resource caps
|
||
|
||
| Tier | CPU Limit/Container | Memory Limit/Container | Namespace CPU Quota | Namespace Memory Quota |
|
||
|------|---------------------|------------------------|---------------------|------------------------|
|
||
| 0 | 100m | 128Mi | 500m | 512Mi |
|
||
| 1 | 250m | 256Mi | 1000m | 1Gi |
|
||
| 2 | 500m | 512Mi | 2000m | 2Gi |
|
||
| 3 | 1000m | 1Gi | 4000m | 4Gi |
|
||
| 4 | 2000m | 2Gi | 8000m | 8Gi |
|
||
|
||
This prevents resource exhaustion and enforces governance without manual quota management.
|
||
|
||
#### Security Policies
|
||
|
||
**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
|
||
|
||
**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
|
||
|
||
| Policy | Purpose | Current | Planned (wave 1) |
|
||
|--------|---------|---------|------------------|
|
||
| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
|
||
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
|
||
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
|
||
| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
|
||
|
||
Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.
|
||
|
||
#### Operational Policies
|
||
|
||
| Policy | Purpose | Mode |
|
||
|--------|---------|------|
|
||
| `inject-priority-class-from-tier` | Set pod priorityClass based on namespace tier | Enforce (CREATE only) |
|
||
| `inject-ndots` | Set DNS `ndots:2` for faster lookups | Enforce |
|
||
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
|
||
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
|
||
|
||
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||
|
||
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
|
||
|
||
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
|
||
|
||
#### Layer 1: Bot Blocking (ForwardAuth)
|
||
|
||
- Middleware calls `poison-fountain` service before backend
|
||
- Analyzes User-Agent, request patterns, timing
|
||
- Blocks known AI scrapers (GPTBot, CCBot, etc.)
|
||
- **Fail-open**: If poison-fountain down, allows traffic
|
||
|
||
#### Layer 2: X-Robots-Tag Header
|
||
|
||
- HTTP response header: `X-Robots-Tag: noai, noindex, nofollow`
|
||
- Instructs compliant bots to skip content
|
||
- Lightweight, no performance impact
|
||
|
||
#### ~~Layer 3: Trap Links~~ (REMOVED)
|
||
|
||
Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap links broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
|
||
|
||
#### Layer 3 (formerly 4): Tarpit / Poison Content
|
||
|
||
- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
|
||
- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
|
||
- CronJob every 6 hours generates fake content
|
||
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
|
||
|
||
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
|
||
|
||
### Audit Logging & Anomaly Detection (Wave 1)
|
||
|
||
Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
|
||
|
||
| Item | State |
|
||
|---|---|
|
||
| W1.2 Vault `file` audit device | **LIVE** — `vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
|
||
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
|
||
| W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
|
||
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
|
||
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
|
||
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
|
||
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
|
||
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
|
||
| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
|
||
|
||
The block below documents the locked design.
|
||
|
||
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||
|
||
#### Detection sources
|
||
|
||
| Source | Mechanism | Ships via | Loki job label |
|
||
|---|---|---|---|
|
||
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
|
||
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
|
||
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
|
||
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
|
||
|
||
#### Alert rules (16 total)
|
||
|
||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
|
||
|
||
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
||
|
||
| # | Event | Severity |
|
||
|---|---|---|
|
||
| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
|
||
| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
|
||
| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
|
||
| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
|
||
| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
|
||
| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
|
||
| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
|
||
| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
|
||
|
||
**Vault audit (V1-V7):**
|
||
|
||
| # | Event | Severity |
|
||
|---|---|---|
|
||
| V1 | Root token created | critical |
|
||
| V2 | Audit device disabled or modified | critical |
|
||
| V3 | Seal status changed (`sys/seal` write) | critical |
|
||
| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
|
||
| V5 | Authentication failure spike >10/min on any auth method | warning |
|
||
| V6 | Token created with policies different from parent (privilege escalation) | critical |
|
||
| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
|
||
|
||
**Host (S1):**
|
||
|
||
| # | Event | Severity |
|
||
|---|---|---|
|
||
| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
|
||
|
||
#### Allowlist — "expected source IPs" for K2, K9, V7, S1
|
||
|
||
| CIDR | Source |
|
||
|---|---|
|
||
| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
|
||
| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
|
||
| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
|
||
| K8s service CIDR | Service-to-apiserver traffic |
|
||
| Headscale tailnet | VPN-connected devices |
|
||
|
||
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
|
||
|
||
#### Why no canary tokens
|
||
|
||
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
|
||
|
||
#### Why no K1 (cluster-admin grant detection)
|
||
|
||
Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
|
||
|
||
#### IOPS / disk-wear
|
||
|
||
Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
|
||
|
||
### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
|
||
|
||
Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
|
||
|
||
**Approach (γ): cluster-wide observe-then-enforce.**
|
||
|
||
1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
|
||
2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
|
||
3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
|
||
|
||
**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
|
||
|
||
**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
|
||
|
||
**Known risks:**
|
||
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
|
||
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
|
||
|
||
### TLS & HTTP/3
|
||
|
||
**Traefik** handles TLS termination:
|
||
- HTTP/3 (QUIC) enabled for performance
|
||
- Automatic HTTP → HTTPS redirect
|
||
- cert-manager/certbot manages certificate lifecycle
|
||
- Let's Encrypt integration for automatic renewal
|
||
|
||
### Rate Limiting
|
||
|
||
**Per-source IP limits**:
|
||
- Default: 100 requests/minute
|
||
- Returns **429 Too Many Requests** (not 503)
|
||
- Higher limits for upload-heavy services:
|
||
- Immich: 500 req/min (photo uploads)
|
||
- Nextcloud: 300 req/min (file sync)
|
||
|
||
**Retry Middleware**:
|
||
- 2 attempts max
|
||
- 100ms delay between retries
|
||
- Applied after rate limiting
|
||
- Handles transient backend errors
|
||
|
||
### Fallback Proxies
|
||
|
||
**Authentik Fallback**:
|
||
- If Authentik down, falls back to basicAuth
|
||
- Prevents total service outage during IdP maintenance
|
||
- Temporary credentials stored in Vault
|
||
|
||
**Poison-Fountain Fallback**:
|
||
- If anti-AI service down, allows all traffic
|
||
- Fail-open prevents blocking legitimate users
|
||
- Monitors for service health, auto-recovers
|
||
|
||
## Configuration
|
||
|
||
### Key Config Files
|
||
|
||
| Path | Purpose |
|
||
|------|---------|
|
||
| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
|
||
| `stacks/kyverno/` | Kyverno deployment + policies |
|
||
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
|
||
| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
|
||
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
|
||
|
||
### Vault Paths
|
||
|
||
- **CrowdSec API key**: `secret/crowdsec/api-key` - LAPI authentication
|
||
- **BasicAuth fallback**: `secret/authentik/fallback-creds` - Emergency auth
|
||
- **TLS certificates**: `secret/tls/` - Certificate private keys
|
||
|
||
### Terraform Stacks
|
||
|
||
- `stacks/crowdsec/` - CrowdSec infrastructure
|
||
- `stacks/kyverno/` - Policy engine
|
||
- `stacks/poison-fountain/` - Anti-AI defense
|
||
- `stacks/platform/` - Traefik + middleware
|
||
|
||
### Per-Service Security Config
|
||
|
||
```hcl
|
||
module "myapp_ingress" {
|
||
source = "./modules/ingress_factory"
|
||
|
||
name = "myapp"
|
||
host = "myapp.viktorbarzin.me"
|
||
|
||
# Security toggles
|
||
protected = true # Enable ForwardAuth
|
||
anti_ai_scraping = false # Disable anti-AI (e.g., for public API)
|
||
rate_limit = 200 # Custom rate limit (req/min)
|
||
}
|
||
```
|
||
|
||
### Kyverno Policy Example
|
||
|
||
```yaml
|
||
apiVersion: kyverno.io/v1
|
||
kind: ClusterPolicy
|
||
metadata:
|
||
name: inject-ndots
|
||
spec:
|
||
background: false
|
||
rules:
|
||
- name: inject-ndots
|
||
match:
|
||
resources:
|
||
kinds:
|
||
- Pod
|
||
mutate:
|
||
patchStrategicMerge:
|
||
spec:
|
||
dnsConfig:
|
||
options:
|
||
- name: ndots
|
||
value: "2"
|
||
```
|
||
|
||
## Decisions & Rationale
|
||
|
||
### Why CrowdSec over ModSecurity?
|
||
|
||
- **Community threat intelligence**: Shared ban lists, crowdsourced attack detection
|
||
- **Easier management**: YAML scenarios vs complex ModSecurity rules
|
||
- **Better performance**: Lightweight Go agent vs resource-heavy Apache module
|
||
- **Active development**: More frequent updates, responsive community
|
||
|
||
### Why Audit-Only Security Policies?
|
||
|
||
- **Gradual rollout**: Identify violations without breaking existing workloads
|
||
- **Risk reduction**: Prevents policy bugs from blocking critical deployments
|
||
- **Better observability**: Collect violation metrics before enforcing
|
||
- **Selective enforcement**: Move to enforce mode per-policy after validation
|
||
|
||
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
|
||
|
||
- **Defense in depth**: Each layer catches different bot types
|
||
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
|
||
- **Persistent bots**: Tarpit makes scraping uneconomical
|
||
- **Poison content**: Degrades training data for bots that reach poison-fountain
|
||
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
|
||
|
||
### Why Fail-Open Mode?
|
||
|
||
- **Availability over security**: Homelab prioritizes uptime
|
||
- **Graceful degradation**: Single component failure doesn't cascade
|
||
- **Manual intervention**: Security incidents are rare, can handle manually
|
||
- **Layer redundancy**: If one layer fails, others still protect
|
||
|
||
### Why Pin CrowdSec/Kyverno Versions?
|
||
|
||
- **Breaking changes**: Both projects had breaking config changes in past
|
||
- **Controlled upgrades**: Test in staging before upgrading production
|
||
- **Stability**: Prevents auto-upgrade during outages
|
||
- **Rollback**: Easy to revert if upgrade causes issues
|
||
|
||
### Why HTTP/3 (QUIC)?
|
||
|
||
- **Performance**: Lower latency, better mobile performance
|
||
- **Connection migration**: Survives IP changes (mobile networks)
|
||
- **0-RTT**: Faster TLS handshake for repeat visitors
|
||
- **Future-proof**: Industry moving to HTTP/3
|
||
|
||
## Troubleshooting
|
||
|
||
### CrowdSec Blocking Legitimate IP
|
||
|
||
**Problem**: Legitimate user IP on ban list.
|
||
|
||
**Fix**:
|
||
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
|
||
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
|
||
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
|
||
|
||
### Kyverno Policy Blocking Deployment
|
||
|
||
**Problem**: Pod creation fails with policy violation.
|
||
|
||
**Fix**:
|
||
1. Check policy reports: `kubectl get policyreport -A`
|
||
2. Verify `failurePolicy=Ignore` is set (should never block)
|
||
3. If blocking, temporarily disable policy: `kubectl annotate clusterpolicy <policy> kyverno.io/exclude=true`
|
||
4. Investigate root cause, fix workload or update policy
|
||
|
||
### Anti-AI Service Down, Traffic Blocked
|
||
|
||
**Problem**: `poison-fountain` service unhealthy, all traffic blocked.
|
||
|
||
**Fix**:
|
||
1. Verify fail-open config: Check `stacks/platform/modules/traefik/middleware.tf` for `failurePolicy: allow`
|
||
2. Restart service: `kubectl rollout restart deployment/poison-fountain -n poison-fountain`
|
||
3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services
|
||
|
||
### Rate Limit Too Aggressive
|
||
|
||
**Problem**: Legitimate users getting 429 errors.
|
||
|
||
**Fix**:
|
||
1. Check Traefik logs for rate limit hits: `kubectl logs -n traefik -l app=traefik | grep 429`
|
||
2. Increase limit in `ingress_factory`: `rate_limit = 300`
|
||
3. Apply: `terraform apply`
|
||
|
||
### HTTP/3 Not Working
|
||
|
||
**Problem**: Browser shows HTTP/2, not HTTP/3.
|
||
|
||
**Fix**:
|
||
1. Verify Traefik HTTP/3 enabled: `kubectl get cm traefik-config -o yaml | grep http3`
|
||
2. Check UDP port 443 accessible: `nc -u <public-ip> 443`
|
||
3. Browser support: Use Chrome/Firefox dev tools, check Protocol column
|
||
|
||
### TLS Certificate Expired
|
||
|
||
**Problem**: Browser shows certificate expired.
|
||
|
||
**Fix**:
|
||
1. Check cert-manager: `kubectl get certificate -A`
|
||
2. Force renewal: `kubectl delete secret <tls-secret> -n <namespace>`
|
||
3. cert-manager will auto-renew within 5 minutes
|
||
4. If fails, check Let's Encrypt rate limits
|
||
|
||
### Traefik Retry Loop
|
||
|
||
**Problem**: Backend logs show duplicate requests.
|
||
|
||
**Fix**:
|
||
1. Check retry middleware config: Should be 2 attempts max
|
||
2. Verify backend isn't returning transient errors: Check for 5xx responses
|
||
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
|
||
|
||
### Poison Content Not Serving (Updated 2026-04-17)
|
||
|
||
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
|
||
|
||
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
|
||
|
||
**Fix**:
|
||
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
|
||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
|
||
3. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
|
||
|
||
## Related
|
||
|
||
- [Authentication & Authorization](./authentication.md) - Authentik, OIDC, ForwardAuth
|
||
- [Networking](./networking.md) - Ingress, DNS, load balancing
|
||
- [Monitoring](./monitoring.md) - Prometheus, Grafana, alerting
|
||
- [CrowdSec Runbook](../runbooks/crowdsec.md) - CrowdSec operations
|
||
- [Kyverno Policy Management](../runbooks/kyverno.md) - Policy authoring and troubleshooting
|