infra/docs/architecture/security.md
Viktor Barzin 2f9ac0110a security(wave1): W1.6 observe phase LIVE — Calico GNP action:Log pilot on recruiter-responder
Replaces the abandoned FelixConfiguration.flowLogsFileEnabled approach (Calico
Enterprise-only field, rejected by OSS v3.26) with the supported primitive:
Calico GlobalNetworkPolicy with `action: Log`.

## Mechanics (verified end-to-end on 2026-05-19)
1. kubectl_manifest applies GNP `wave1-egress-observe-recruiter-responder`
   with `namespaceSelector: kubernetes.io/metadata.name == 'recruiter-responder'`,
   `types: [Egress]`, `egress: [{action: Log}, {action: Allow}]`.
2. Felix translates to iptables LOG rule in
   `cali-po-_ZEv_aILlvyT9fbgWN58` chain with prefix `calico-packet: ` log-level=5.
3. Linux kernel emits LOG entries to ring buffer with transport=kernel.
4. systemd-journald captures kernel transport entries.
5. Alloy DaemonSet ships journal to Loki with `job=node-journal,transport=kernel`.
6. LogQL: `{job="node-journal"} |~ "calico-packet"` returns entries showing
   SRC/DST/PROTO/PORT for every NEW egress connection.

## Verified output sample
`calico-packet: IN=cali6cfdec4abc1 OUT=ens18 MAC=... SRC=10.10.122.132
DST=9.9.9.9 LEN=60 TOS=0x00 PREC=0x00 TTL=...`

The Allow rule in the GNP keeps egress functional (recruiter-responder
remained 1/1 Running through the apply — verified Python TCP connections to
1.1.1.1, 8.8.8.8, 9.9.9.9 succeed).

## Wave 1 status
W1.6 observation infra is LIVE for the recruiter-responder pilot. W1.7
remains pending: collect 1 week of `{job="node-journal"} |~ "calico-packet"`
samples, build empirical egress allowlist, flip the GNP rules from
`[Log, Allow]` to `[Allow <specific dests>, Deny]`.

Expand observation to additional namespaces by adding entries to
`spec.namespaceSelector` (e.g. `kubernetes.io/metadata.name in {recruiter-responder,X,Y}`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00

505 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Security & L7 Protection
## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram
```mermaid
graph LR
Internet[Internet]
CF[Cloudflare WAF]
Tunnel[Cloudflared Tunnel]
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
AntiAI[Anti-AI Check<br/>poison-fountain]
ForwardAuth[Authentik ForwardAuth]
RateLimit[Rate Limit Middleware]
Retry[Retry Middleware<br/>2 attempts, 100ms]
Backend[Backend Service]
LAPI[CrowdSec LAPI<br/>3 replicas]
Agent[CrowdSec Agent]
Internet -->|1| CF
CF -->|2| Tunnel
Tunnel -->|3| CrowdSec
CrowdSec -.->|Query| LAPI
Agent -.->|Report| LAPI
CrowdSec -->|4. Pass/Block| AntiAI
AntiAI -->|5. Human/Bot| ForwardAuth
ForwardAuth -->|6. Authenticated| RateLimit
RateLimit -->|7. Under Limit| Retry
Retry -->|8. Success/Retry| Backend
style CrowdSec fill:#f9f,stroke:#333
style AntiAI fill:#ff9,stroke:#333
style ForwardAuth fill:#9f9,stroke:#333
style RateLimit fill:#99f,stroke:#333
```
## Components
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
| Traefik | Latest | `stacks/platform/` | Ingress controller with HTTP/3 (QUIC) |
## How It Works
### Request Security Layers
Every incoming request passes through 6 security layers:
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
### CrowdSec Threat Intelligence
CrowdSec operates in a hub-and-agent model:
**LAPI (Local API)**:
- 3 replicas for high availability
- Aggregates threat intelligence from agent + community
- Maintains ban list (IP reputation database)
- Version pinned to prevent breaking changes
**Agent**:
- Parses Traefik access logs
- Detects attack scenarios (SQL injection, directory traversal, brute force)
- Reports malicious IPs to LAPI
- Shares threat intel with CrowdSec community (anonymized)
**Traefik Bouncer Plugin**:
- Integrated as Traefik middleware
- Queries LAPI for IP reputation on each request
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
- Blocks IPs on ban list, allows others
**Metabase** (disabled by default):
- Dashboard for CrowdSec analytics
- CPU-intensive, only enable when investigating incidents
### Kyverno Policy Engine
Kyverno enforces cluster-wide policies via admission webhooks. All policies use `failurePolicy=Ignore` to prevent blocking cluster operations.
#### 5-Tier Resource Governance
Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-generates:
- **LimitRange** - Per-container CPU/memory limits
- **ResourceQuota** - Namespace-wide resource caps
| Tier | CPU Limit/Container | Memory Limit/Container | Namespace CPU Quota | Namespace Memory Quota |
|------|---------------------|------------------------|---------------------|------------------------|
| 0 | 100m | 128Mi | 500m | 512Mi |
| 1 | 250m | 256Mi | 1000m | 1Gi |
| 2 | 500m | 512Mi | 2000m | 2Gi |
| 3 | 1000m | 1Gi | 4000m | 4Gi |
| 4 | 2000m | 2Gi | 8000m | 8Gi |
This prevents resource exhaustion and enforces governance without manual quota management.
#### Security Policies
**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
| Policy | Purpose | Current | Planned (wave 1) |
|--------|---------|---------|------------------|
| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.
#### Operational Policies
| Policy | Purpose | Mode |
|--------|---------|------|
| `inject-priority-class-from-tier` | Set pod priorityClass based on namespace tier | Enforce (CREATE only) |
| `inject-ndots` | Set DNS `ndots:2` for faster lookups | Enforce |
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
#### Layer 1: Bot Blocking (ForwardAuth)
- Middleware calls `poison-fountain` service before backend
- Analyzes User-Agent, request patterns, timing
- Blocks known AI scrapers (GPTBot, CCBot, etc.)
- **Fail-open**: If poison-fountain down, allows traffic
#### Layer 2: X-Robots-Tag Header
- HTTP response header: `X-Robots-Tag: noai, noindex, nofollow`
- Instructs compliant bots to skip content
- Lightweight, no performance impact
#### ~~Layer 3: Trap Links~~ (REMOVED)
Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap links broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
#### Layer 3 (formerly 4): Tarpit / Poison Content
- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
- CronJob every 6 hours generates fake content
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
### Audit Logging & Anomaly Detection (Wave 1)
Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
| Item | State |
|---|---|
| W1.2 Vault `file` audit device | **LIVE**`vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
| W1.2 Vault audit log shipping to Loki | **LIVE**`audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
| Source | Mechanism | Ships via | Loki job label |
|---|---|---|---|
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
| # | Event | Severity |
|---|---|---|
| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
**Vault audit (V1-V7):**
| # | Event | Severity |
|---|---|---|
| V1 | Root token created | critical |
| V2 | Audit device disabled or modified | critical |
| V3 | Seal status changed (`sys/seal` write) | critical |
| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
| V5 | Authentication failure spike >10/min on any auth method | warning |
| V6 | Token created with policies different from parent (privilege escalation) | critical |
| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
**Host (S1):**
| # | Event | Severity |
|---|---|---|
| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
#### Allowlist — "expected source IPs" for K2, K9, V7, S1
| CIDR | Source |
|---|---|
| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
| K8s service CIDR | Service-to-apiserver traffic |
| Headscale tailnet | VPN-connected devices |
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
#### Why no K1 (cluster-admin grant detection)
Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
#### IOPS / disk-wear
Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
**Approach (γ): cluster-wide observe-then-enforce.**
1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
**Known risks:**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
### TLS & HTTP/3
**Traefik** handles TLS termination:
- HTTP/3 (QUIC) enabled for performance
- Automatic HTTP → HTTPS redirect
- cert-manager/certbot manages certificate lifecycle
- Let's Encrypt integration for automatic renewal
### Rate Limiting
**Per-source IP limits**:
- Default: 100 requests/minute
- Returns **429 Too Many Requests** (not 503)
- Higher limits for upload-heavy services:
- Immich: 500 req/min (photo uploads)
- Nextcloud: 300 req/min (file sync)
**Retry Middleware**:
- 2 attempts max
- 100ms delay between retries
- Applied after rate limiting
- Handles transient backend errors
### Fallback Proxies
**Authentik Fallback**:
- If Authentik down, falls back to basicAuth
- Prevents total service outage during IdP maintenance
- Temporary credentials stored in Vault
**Poison-Fountain Fallback**:
- If anti-AI service down, allows all traffic
- Fail-open prevents blocking legitimate users
- Monitors for service health, auto-recovers
## Configuration
### Key Config Files
| Path | Purpose |
|------|---------|
| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
| `stacks/kyverno/` | Kyverno deployment + policies |
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
### Vault Paths
- **CrowdSec API key**: `secret/crowdsec/api-key` - LAPI authentication
- **BasicAuth fallback**: `secret/authentik/fallback-creds` - Emergency auth
- **TLS certificates**: `secret/tls/` - Certificate private keys
### Terraform Stacks
- `stacks/crowdsec/` - CrowdSec infrastructure
- `stacks/kyverno/` - Policy engine
- `stacks/poison-fountain/` - Anti-AI defense
- `stacks/platform/` - Traefik + middleware
### Per-Service Security Config
```hcl
module "myapp_ingress" {
source = "./modules/ingress_factory"
name = "myapp"
host = "myapp.viktorbarzin.me"
# Security toggles
protected = true # Enable ForwardAuth
anti_ai_scraping = false # Disable anti-AI (e.g., for public API)
rate_limit = 200 # Custom rate limit (req/min)
}
```
### Kyverno Policy Example
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: inject-ndots
spec:
background: false
rules:
- name: inject-ndots
match:
resources:
kinds:
- Pod
mutate:
patchStrategicMerge:
spec:
dnsConfig:
options:
- name: ndots
value: "2"
```
## Decisions & Rationale
### Why CrowdSec over ModSecurity?
- **Community threat intelligence**: Shared ban lists, crowdsourced attack detection
- **Easier management**: YAML scenarios vs complex ModSecurity rules
- **Better performance**: Lightweight Go agent vs resource-heavy Apache module
- **Active development**: More frequent updates, responsive community
### Why Audit-Only Security Policies?
- **Gradual rollout**: Identify violations without breaking existing workloads
- **Risk reduction**: Prevents policy bugs from blocking critical deployments
- **Better observability**: Collect violation metrics before enforcing
- **Selective enforcement**: Move to enforce mode per-policy after validation
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
- **Defense in depth**: Each layer catches different bot types
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
- **Persistent bots**: Tarpit makes scraping uneconomical
- **Poison content**: Degrades training data for bots that reach poison-fountain
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
### Why Fail-Open Mode?
- **Availability over security**: Homelab prioritizes uptime
- **Graceful degradation**: Single component failure doesn't cascade
- **Manual intervention**: Security incidents are rare, can handle manually
- **Layer redundancy**: If one layer fails, others still protect
### Why Pin CrowdSec/Kyverno Versions?
- **Breaking changes**: Both projects had breaking config changes in past
- **Controlled upgrades**: Test in staging before upgrading production
- **Stability**: Prevents auto-upgrade during outages
- **Rollback**: Easy to revert if upgrade causes issues
### Why HTTP/3 (QUIC)?
- **Performance**: Lower latency, better mobile performance
- **Connection migration**: Survives IP changes (mobile networks)
- **0-RTT**: Faster TLS handshake for repeat visitors
- **Future-proof**: Industry moving to HTTP/3
## Troubleshooting
### CrowdSec Blocking Legitimate IP
**Problem**: Legitimate user IP on ban list.
**Fix**:
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
### Kyverno Policy Blocking Deployment
**Problem**: Pod creation fails with policy violation.
**Fix**:
1. Check policy reports: `kubectl get policyreport -A`
2. Verify `failurePolicy=Ignore` is set (should never block)
3. If blocking, temporarily disable policy: `kubectl annotate clusterpolicy <policy> kyverno.io/exclude=true`
4. Investigate root cause, fix workload or update policy
### Anti-AI Service Down, Traffic Blocked
**Problem**: `poison-fountain` service unhealthy, all traffic blocked.
**Fix**:
1. Verify fail-open config: Check `stacks/platform/modules/traefik/middleware.tf` for `failurePolicy: allow`
2. Restart service: `kubectl rollout restart deployment/poison-fountain -n poison-fountain`
3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services
### Rate Limit Too Aggressive
**Problem**: Legitimate users getting 429 errors.
**Fix**:
1. Check Traefik logs for rate limit hits: `kubectl logs -n traefik -l app=traefik | grep 429`
2. Increase limit in `ingress_factory`: `rate_limit = 300`
3. Apply: `terraform apply`
### HTTP/3 Not Working
**Problem**: Browser shows HTTP/2, not HTTP/3.
**Fix**:
1. Verify Traefik HTTP/3 enabled: `kubectl get cm traefik-config -o yaml | grep http3`
2. Check UDP port 443 accessible: `nc -u <public-ip> 443`
3. Browser support: Use Chrome/Firefox dev tools, check Protocol column
### TLS Certificate Expired
**Problem**: Browser shows certificate expired.
**Fix**:
1. Check cert-manager: `kubectl get certificate -A`
2. Force renewal: `kubectl delete secret <tls-secret> -n <namespace>`
3. cert-manager will auto-renew within 5 minutes
4. If fails, check Let's Encrypt rate limits
### Traefik Retry Loop
**Problem**: Backend logs show duplicate requests.
**Fix**:
1. Check retry middleware config: Should be 2 attempts max
2. Verify backend isn't returning transient errors: Check for 5xx responses
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
### Poison Content Not Serving (Updated 2026-04-17)
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
**Fix**:
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
3. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
## Related
- [Authentication & Authorization](./authentication.md) - Authentik, OIDC, ForwardAuth
- [Networking](./networking.md) - Ingress, DNS, load balancing
- [Monitoring](./monitoring.md) - Prometheus, Grafana, alerting
- [CrowdSec Runbook](../runbooks/crowdsec.md) - CrowdSec operations
- [Kyverno Policy Management](../runbooks/kyverno.md) - Policy authoring and troubleshooting