diff --git a/README.md b/README.md index 207f82ab..818703d6 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,10 @@ My infrastructure is built using Terraform, Kubernetes and CI/CD is done using W Read more by visiting my website: https://viktorbarzin.me +## Documentation + +Full architecture documentation is available in [`docs/`](docs/README.md) — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more. + ## Adding a New User (Admin) Adding a new namespace-owner to the cluster requires three steps — no code changes needed. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..5b485214 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,44 @@ +# Infrastructure Documentation + +This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt. + +## Quick Reference + +### Network Ranges +- **Physical Network**: `192.168.1.0/24` - Physical devices and host network +- **Management VLAN 10**: `10.0.10.0/24` - Infrastructure VMs and management +- **Kubernetes VLAN 20**: `10.0.20.0/24` - Kubernetes cluster network + +### Key URLs +- **Public**: `viktorbarzin.me` +- **Internal**: `viktorbarzin.lan` + +## Architecture Documentation + +| Document | Description | +|----------|-------------| +| [Overview](architecture/overview.md) | Infrastructure overview, hardware specs, VM inventory, and service catalog | +| [Networking](architecture/networking.md) | Network topology, VLANs, routing, and firewall rules | +| [VPN](architecture/vpn.md) | Headscale mesh VPN and Cloudflare Tunnel configuration | +| [Storage](architecture/storage.md) | TrueNAS NFS, democratic-csi, and persistent volume management | +| [Authentication](architecture/authentication.md) | Authentik SSO, OIDC flows, and service integration | +| [Security](architecture/security.md) | CrowdSec IPS, Kyverno policies, and security controls | +| [Monitoring](architecture/monitoring.md) | Prometheus, Grafana, Loki, and observability stack | +| [Secrets Management](architecture/secrets.md) | HashiCorp Vault integration and secret rotation | +| [CI/CD](architecture/ci-cd.md) | Woodpecker CI pipeline and deployment automation | +| [Backup & DR](architecture/backup-dr.md) | Backup strategy, disaster recovery, and restore procedures | +| [Compute](architecture/compute.md) | Proxmox VMs, GPU passthrough, K8s resource management, and VPA | +| [Databases](architecture/databases.md) | PostgreSQL, MySQL, Redis, and database operators | +| [Multi-tenancy](architecture/multi-tenancy.md) | Namespace isolation, tier system, and resource quotas | + +## Operations + +- [Runbooks](../runbooks/) - Step-by-step operational procedures +- [Plans](../plans/) - Infrastructure change plans and rollout strategies + +## Getting Started + +1. Review the [Overview](architecture/overview.md) for a high-level understanding +2. Read the [Networking](architecture/networking.md) doc to understand connectivity +3. Check [Compute](architecture/compute.md) for resource management patterns +4. 
Explore individual architecture docs based on your area of interest diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md new file mode 100644 index 00000000..bb6a7289 --- /dev/null +++ b/docs/architecture/authentication.md @@ -0,0 +1,259 @@ +# Authentication & Authorization + +## Overview + +The homelab uses Authentik as a centralized identity provider (IdP) for all services, providing single sign-on (SSO) via OIDC and forward authentication for ingress protection. Authentik integrates with social login providers (Google, GitHub, Facebook), manages user groups and RBAC policies, and enforces authentication at the Traefik ingress layer. The system supports both human authentication (OIDC SSO) and service-to-service authentication (Kubernetes SA JWT for CI/CD). + +## Architecture Diagram + +```mermaid +graph TB + User[User Browser] + Traefik[Traefik Ingress] + ForwardAuth[ForwardAuth Middleware] + Authentik[Authentik
3 servers + 3 workers<br/>
+ embedded outpost] + Backend[Protected Backend Service] + + Social[Social Providers
Google/GitHub/Facebook] + K8s[Kubernetes API] + Vault[Vault] + + User -->|1. HTTPS Request| Traefik + Traefik -->|2. Auth Check| ForwardAuth + ForwardAuth -->|3. Verify Session| Authentik + + Authentik -->|4a. Not Authenticated| User + User -->|4b. Login Flow| Authentik + Authentik -->|5. Social Login| Social + Social -->|6. OAuth Callback| Authentik + Authentik -->|7. Session Cookie| User + User -->|8. Retry Request| Traefik + + ForwardAuth -->|9. Authenticated| Backend + Traefik -->|10. Forward Request| Backend + + K8s -->|OIDC Groups| Authentik + Vault -->|OIDC Auth| Authentik +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| Authentik Server | Latest | `stacks/authentik/` | Core IdP application servers (3 replicas) | +| Authentik Worker | Latest | `stacks/authentik/` | Background task processors (3 replicas) | +| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) | +| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik | +| Traefik ForwardAuth | - | `ingress_factory` module | Middleware for protected ingresses | +| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault | +| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication | + +## How It Works + +### Forward Authentication Flow + +Services configured with `protected = true` in the `ingress_factory` module automatically get Traefik ForwardAuth middleware configured. When an unauthenticated user accesses a protected service: + +1. Request hits Traefik ingress +2. ForwardAuth middleware calls Authentik embedded outpost +3. Authentik checks for valid session cookie +4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me) +5. User authenticates via social provider (Google/GitHub/Facebook) +6. Authentik creates session, sets cookie, redirects back to original URL +7. Subsequent requests include session cookie, pass auth check, reach backend + +Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion. + +### Social Login & Invitation Flow + +All new users must use an invitation link to register. The invitation-enrollment flow: + +1. **invitation-validation** - Validates invitation token +2. **enrollment-identification** - Social login (Google/GitHub/Facebook) + passkey registration +3. **enrollment-prompt** - Collect name/email +4. **enrollment-user-write** - Create user account +5. **enrollment-login** - Auto-login after creation + +Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience. 
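+
+For illustration, an invitation's `fixed_data` payload in this setup might look roughly like the sketch below. The exact keys and the mechanism that maps them onto groups are specific to this Authentik configuration, so treat the field names as assumptions:
+
+```yaml
+# Hypothetical fixed_data attached to an invitation (field names are assumptions)
+name: "Jane Doe"
+email: "jane@example.com"
+groups:
+  - "Allow Login Users"
+  - "kubernetes-namespace-owners"
+```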
+ +### OIDC Applications + +Authentik provides OIDC for 10 applications: + +| Application | Type | Purpose | +|-------------|------|---------| +| Cloudflare Access | OIDC | Cloudflare Zero Trust tunnels | +| Domain-wide catch-all | Proxy (Forward Auth) | Protect all `*.viktorbarzin.me` services | +| Forgejo | OIDC | Git repository SSO | +| Grafana | OIDC | Monitoring dashboard SSO | +| Headscale | OIDC | Tailscale control plane auth | +| Immich | OIDC | Photo management SSO | +| Kubernetes | OIDC (public client) | K8s API authentication | +| Linkwarden | OIDC | Bookmark manager SSO | +| Matrix | OIDC | Matrix homeserver SSO | +| Wrongmove | OIDC | Real estate app SSO | + +### Kubernetes RBAC via OIDC + +Kubernetes API server is configured with OIDC issuer: `https://authentik.viktorbarzin.me/application/o/kubernetes/` + +The public client flow: + +1. User authenticates to Authentik via `kubectl` plugin or dashboard +2. Receives OIDC token with group claims +3. K8s API validates token against Authentik JWKS endpoint +4. Maps groups to ClusterRoleBindings: + - `kubernetes-admins` → `cluster-admin` (full cluster access) + - `kubernetes-power-users` → custom ClusterRole (read-mostly, limited write) + - `kubernetes-namespace-owners` → RoleBindings per namespace (namespace-scoped admin) + +### Authentik Groups + +9 groups manage authorization: + +- **Allow Login Users** - Base group, can authenticate to any OIDC app +- **authentik Admins** - Full Authentik admin UI access +- **Headscale Users** - Can access Headscale control plane +- **Home Server Admins** - Admin access to homelab services +- **Wrongmove Users** - Access to Wrongmove real estate app +- **kubernetes-admins** - K8s cluster-admin role +- **kubernetes-power-users** - K8s read-mostly access +- **kubernetes-namespace-owners** - K8s namespace-scoped admin +- **Task Submitters** - Can submit tasks to cluster task runner + +### Vault Authentication + +**For humans:** +- OIDC method using Authentik as provider +- SSO login to Vault UI and CLI +- Group-based policy assignment + +**For services (CI/CD):** +- Kubernetes SA JWT authentication +- Woodpecker CI uses service account token +- Vault K8s secrets engine roles: + - `dashboard-admin` - K8s dashboard admin token + - `ci-deployer` - Deploy workloads via CI/CD + - `openclaw` - AI assistant cluster access + - `local-admin` - Local development access + +## Configuration + +### Key Config Files + +| Path | Purpose | +|------|---------| +| `stacks/authentik/` | Authentik deployment (servers, workers, PgBouncer) | +| `stacks/platform/modules/ingress_factory/` | Traefik ForwardAuth middleware config | +| `stacks/platform/modules/traefik/middleware.tf` | ForwardAuth middleware definition | +| `stacks/vault/auth.tf` | Vault OIDC and K8s auth methods | + +### Vault Paths + +- **OIDC config**: `auth/oidc` - Authentik integration settings +- **K8s auth**: `auth/kubernetes` - SA JWT validation +- **K8s secrets engine**: `kubernetes/` - Dynamic kubeconfig/SA token generation + +### Terraform Stacks + +- `stacks/authentik/` - Authentik infrastructure +- `stacks/platform/` - Traefik ingress with ForwardAuth +- `stacks/vault/` - Vault auth methods + +### Ingress Protection Example + +```hcl +module "myapp_ingress" { + source = "./modules/ingress_factory" + + name = "myapp" + host = "myapp.viktorbarzin.me" + protected = true # Enables ForwardAuth middleware + + # ... other config +} +``` + +## Decisions & Rationale + +### Why Authentik over Keycloak? 
+ +- **Lighter weight**: Lower resource footprint (3+3+3 replicas vs Keycloak's heavier Java runtime) +- **Better UX**: Modern UI, simpler admin experience, better mobile support +- **Python-based**: Easier to extend, faster startup times, better developer experience +- **Active development**: More frequent releases, responsive community + +### Why Forward Auth over Sidecar? + +- **Simpler architecture**: Single auth check at ingress, no sidecar per pod +- **Works with any backend**: Language/framework agnostic, no SDK required +- **Centralized policy**: All auth logic in Authentik, not distributed across sidecars +- **Performance**: Single auth check per session, not per request + +### Why OIDC for Kubernetes? + +- **SSO integration**: Same login as all other services, no separate credentials +- **No credential management**: No kubeconfig secrets to rotate, tokens are short-lived +- **Group-based RBAC**: Centralized group management in Authentik, automatic K8s role mapping +- **Public client flow**: No client secret needed, works in kubectl plugins and dashboards + +### Why Invitation-Only Enrollment? + +- **Security**: Prevents open internet access to homelab services +- **Controlled onboarding**: Explicit approval before granting access +- **Social login convenience**: No password management, leverages trusted providers +- **Group auto-assignment**: Invitation encodes initial group membership + +## Troubleshooting + +### Headers Not Stripped + +**Problem**: Backend receives `X-Authentik-Username`, `X-Authentik-Email`, `X-Authentik-Groups` headers and breaks. + +**Fix**: Traefik middleware should strip these headers before forwarding. Check `ingress_factory` module for header stripping config. + +### OIDC Token Expired + +**Problem**: `kubectl` returns 401 Unauthorized. + +**Fix**: Re-authenticate to refresh token: +```bash +kubectl oidc-login setup --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/ +``` + +### Social Login Redirect Loop + +**Problem**: After social login, redirects to Authentik login page instead of destination. + +**Fix**: Check Authentik application's redirect URIs. Must include `https://authentik.viktorbarzin.me/source/oauth/callback/*` for social providers. + +### User Not in Correct Group + +**Problem**: User authenticated but lacks permissions. + +**Fix**: Check group membership in Authentik admin UI. Verify invitation `fixed_data` specified correct group. Manually add to group if needed. + +### Vault OIDC Login Fails + +**Problem**: Vault UI redirects to Authentik but returns error. + +**Fix**: +1. Verify Vault OIDC client credentials in Authentik +2. Check Vault OIDC issuer URL matches Authentik +3. Ensure Vault redirect URI (`https://vault.viktorbarzin.me/ui/vault/auth/oidc/oidc/callback`) is registered in Authentik + +### K8s Auth Group Mapping Not Working + +**Problem**: User authenticated but `kubectl` shows limited permissions despite being in `kubernetes-admins`. + +**Fix**: +1. Verify group claim is present in token: `kubectl oidc-login get-token | jq -R 'split(".") | .[1] | @base64d | fromjson'` +2. Check ClusterRoleBinding maps group correctly: `kubectl get clusterrolebinding -o yaml | grep kubernetes-admins` +3. 
Ensure Authentik OIDC app includes `groups` scope + +## Related + +- [Security & L7 Protection](./security.md) - CrowdSec, anti-AI scraping, rate limiting +- [Networking](./networking.md) - Ingress, DNS, load balancing +- [Vault Runbook](../runbooks/vault.md) - Vault operations and troubleshooting +- [Kubernetes Access Runbook](../runbooks/k8s-access.md) - Setting up kubectl with OIDC diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md new file mode 100644 index 00000000..8840aba0 --- /dev/null +++ b/docs/architecture/backup-dr.md @@ -0,0 +1,641 @@ +# Backup & Disaster Recovery Architecture + +Last updated: 2026-03-24 + +## Overview + +The homelab uses a defense-in-depth 3-layer backup strategy: Layer 1 provides near-instant local snapshots via ZFS auto-snapshots on TrueNAS (every 12h + daily, up to 3-week retention). Layer 2 adds application-level backups for complex stateful services (databases, Vault, etcd) via K8s CronJobs dumping to NFS-exported directories with 14-30 day retention. Layer 3 ensures offsite protection through hybrid incremental/full sync to a Synology NAS every 6 hours (incremental via ZFS diff) plus weekly full sync (Sunday 09:00) for cleanup. This architecture provides <1s RPO for file data, 6h RPO for offsite, and <30min RTO for most services. + +## Architecture Diagram + +### Overall Backup Flow + +```mermaid +graph TB + subgraph TrueNAS["TrueNAS (10.0.10.15)"] + ZFS_Data["ZFS Pools
main (1.64 TiB)
ssd (~256GB)"] + + subgraph Layer1["Layer 1: ZFS Auto-Snapshots"] + Snap12h["Every 12h
auto-12h-*
24h retention"] + SnapDaily["Daily 00:00
auto-*
3-week retention"] + end + + ZFS_Data --> Snap12h + ZFS_Data --> SnapDaily + + NFS_Backup["NFS-exported
/mnt/main/*-backup/"] + end + + subgraph K8s["Kubernetes Cluster"] + subgraph Layer2["Layer 2: App Backups"] + CronDaily["Daily 00:00-00:30
PostgreSQL, MySQL
14d retention"] + CronWeekly["Weekly Sunday
etcd, Vault, Redis
Vaultwarden, plotting-book
30d retention"] + CronMonthly["Monthly 1st Sunday
Prometheus TSDB
2 copies"] + Cron6h["Every 6h
Vaultwarden backup
+ integrity check"] + end + + CronDaily --> NFS_Backup + CronWeekly --> NFS_Backup + CronMonthly --> NFS_Backup + Cron6h --> NFS_Backup + end + + subgraph Layer3["Layer 3: Offsite Sync"] + Incremental["Every 6h
zfs diff → rclone copy
--files-from-raw --no-traverse"] + FullSync["Weekly Sunday 09:00<br/>
rclone sync
handles deletions"] + end + + ZFS_Data --> Incremental + ZFS_Data --> FullSync + + Synology["Synology NAS
192.168.1.13
/Backup/Viki/truenas"] + + Incremental --> Synology + FullSync --> Synology + + subgraph Monitoring["Monitoring & Alerting"] + Prometheus["Prometheus Alerts
PostgreSQLBackupStale
MySQLBackupStale
CloudSyncStale
VaultwardenIntegrityFail"] + Pushgateway["Pushgateway
cloudsync metrics
vaultwarden integrity"] + end + + NFS_Backup -.->|scrape| Prometheus + Synology -.->|API query| Pushgateway + Pushgateway --> Prometheus + + style Layer1 fill:#c8e6c9 + style Layer2 fill:#ffe0b2 + style Layer3 fill:#e1f5ff + style Monitoring fill:#f3e5f5 +``` + +### Vaultwarden Enhanced Protection + +```mermaid +graph LR + subgraph Every6h["Every 6 hours"] + VWBackup["vaultwarden-backup CronJob"] + Step1["1. PRAGMA integrity_check
(fail → abort)"] + Step2["2. sqlite3 .backup
/mnt/main/vaultwarden-backup/"] + Step3["3. PRAGMA integrity_check
on backup copy"] + Step4["4. Copy RSA keys, attachments,
sends, config.json"] + Step5["5. Rotate backups (30d)"] + + VWBackup --> Step1 --> Step2 --> Step3 --> Step4 --> Step5 + end + + subgraph Hourly["Every hour"] + VWCheck["vaultwarden-integrity-check"] + Check1["PRAGMA integrity_check"] + Metric["Push metric to Pushgateway:
vaultwarden_sqlite_integrity_ok"] + + VWCheck --> Check1 --> Metric + end + + Metric -.->|Prometheus scrape| Alert["Alert if integrity_ok == 0"] + + style Every6h fill:#fff9c4 + style Hourly fill:#e1bee7 +``` + +### Incremental Offsite Sync + +```mermaid +graph TB + Prev["ZFS snapshot
main@cloudsync-prev"] + New["ZFS snapshot
main@cloudsync-new"] + + Prev --> Diff["zfs diff -F -H
prev vs new"] + New --> Diff + + Diff --> Filter["Filter type=F
Apply excludes"] + Filter --> FileList["/tmp/cloudsync_copy_files.txt"] + + FileList --> Rclone["rclone copy
--files-from-raw
--no-traverse"] + + Rclone --> Synology["Synology NAS
192.168.1.13"] + + Synology --> Rotate["Rotate snapshots:
destroy prev
rename new → prev"] + + Excludes["Excludes:
clickhouse (2.47M files)
loki (68K files)
prometheus, iscsi
frigate/recordings
*.log"] + + Filter -.->|uses| Excludes + + style FileList fill:#fff9c4 + style Excludes fill:#ffcdd2 +``` + +## Components + +| Component | Version/Schedule | Location | Purpose | +|-----------|-----------------|----------|---------| +| ZFS Auto-Snapshots | Every 12h + daily | TrueNAS pools (main, ssd) | Near-instant local protection | +| PostgreSQL Backup | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for 12 databases | +| MySQL Backup | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump for 7 databases | +| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot | +| Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity | +| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot | +| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy | +| Prometheus Backup | Monthly 1st Sunday, 2 copies | CronJob in `monitoring` | TSDB snapshot → tar.gz | +| plotting-book Backup | Weekly Sunday 03:00, 30d | CronJob in `plotting-book` | sqlite3 .backup | +| Incremental Sync | Every 6h (cron) | TrueNAS: `/root/cloudsync-copy.sh` | ZFS diff → rclone copy | +| Full Sync | Weekly Sunday 09:00 | TrueNAS Cloud Sync Task 1 | rclone sync with deletions | +| CloudSync Monitor | Every 6h (cron) | CronJob in `monitoring` | Query TrueNAS API → Pushgateway | +| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric | + +## How It Works + +### Layer 1: ZFS Auto-Snapshots + +ZFS snapshots are copy-on-write markers that capture filesystem state in <1 second with zero I/O overhead (only metadata). + +**Schedule**: +| Pool | Frequency | Naming | Retention | Purpose | +|------|-----------|--------|-----------|---------| +| `main` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Recover from recent mistakes | +| `main` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Point-in-time recovery | +| `ssd` | Every 12h | `auto-12h-YYYY-MM-DD_HH-MM` | 24 hours | Same as main | +| `ssd` | Daily 00:00 | `auto-YYYY-MM-DD_HH-MM` | 3 weeks | Same as main | + +**Performance**: Snapshot creation takes <1s for both pools (tested 2026-03-23). + +**Rollback**: +```bash +# List snapshots +zfs list -t snapshot | grep main/ + +# Rollback to snapshot +zfs rollback main/@auto-2026-03-23_00-00 + +# Clone snapshot (non-destructive) +zfs clone main/@auto-2026-03-23_00-00 main/-recovered +``` + +### Layer 2: Application-Level Backups + +K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/mnt/main/-backup/`. + +**Why needed**: ZFS snapshots capture block-level state, but: +- Cannot restore individual databases from a PostgreSQL zvol snapshot +- iSCSI zvols are opaque to TrueNAS (raw blocks) +- Need point-in-time recovery for specific apps without full ZFS rollback + +**Daily backups (00:00-00:30)**: +- **PostgreSQL** (`pg_dumpall`): Dumps all 12 databases to `/mnt/main/dbaas-backups/postgresql/`. Command: `pg_dumpall -h pg-cluster-rw.dbaas -U postgres | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation via `find -mtime +14 -delete`. +- **MySQL** (`mysqldump`): Dumps all 7 databases individually. Command: `mysqldump -h mysql-primary.dbaas --all-databases --single-transaction | gzip -9 > backup-$(date +%Y%m%d).sql.gz`. 14-day rotation. 
+ +**Weekly backups (Sunday 01:00-04:00)**: +- **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery. +- **Vaultwarden**: See "Vaultwarden Enhanced Protection" below. 30-day retention. +- **Vault**: `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention. +- **Redis**: `redis-cli BGSAVE` then copy RDB file. 30-day retention. +- **plotting-book**: `sqlite3 /data/db.sqlite ".backup '/mnt/main/plotting-book-backup/backup-$(date +%Y%m%d).sqlite'"`. 30-day retention. + +**Monthly backups (1st Sunday 04:00)**: +- **Prometheus**: `curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot` → tar.gz snapshot. Keeps 2 most recent copies (older ones purged). + +### Vaultwarden Enhanced Protection + +Vaultwarden stores sensitive password vault data in SQLite on an iSCSI volume. Extra safeguards prevent corruption: + +**Every 6 hours** (vaultwarden-backup CronJob): +1. Run `PRAGMA integrity_check` on live database +2. If check fails → abort (alert fires) +3. If check passes → `sqlite3 .backup /mnt/main/vaultwarden-backup/db-$(date +%Y%m%d%H%M).sqlite` +4. Run `PRAGMA integrity_check` on backup copy +5. Copy RSA keys, attachments, sends folder, config.json +6. Rotate backups older than 30 days + +**Every hour** (vaultwarden-integrity-check CronJob): +1. Run `PRAGMA integrity_check` on live database +2. Push metric to Pushgateway: `vaultwarden_sqlite_integrity_ok{status="ok"}=1` or `=0` +3. Prometheus scrapes Pushgateway and alerts on `integrity_ok == 0` + +This provides both frequent backups (every 6h) AND continuous integrity monitoring (hourly). + +### Layer 3: Offsite Sync to Synology NAS + +Two complementary sync methods run on TrueNAS: + +**Incremental COPY (every 6 hours)**: + +Runs `/root/cloudsync-copy.sh` via cron. Uses ZFS diff to identify changed files since last sync, then copies only those files. + +Flow: +1. Take new snapshot: `zfs snapshot main@cloudsync-new` +2. If previous snapshot exists: `zfs diff -F -H main@cloudsync-prev main@cloudsync-new` +3. Filter output: + - Keep only `type=F` (files, not directories) + - Apply excludes (clickhouse, loki, prometheus, etc.) + - Write to `/tmp/cloudsync_copy_files.txt` +4. Run `rclone copy --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse` +5. Rotate snapshots: `zfs destroy cloudsync-prev`, `zfs rename cloudsync-new cloudsync-prev` + +**Why fast**: Only changed files are transferred. ZFS diff is instant (metadata scan). `--no-traverse` skips SFTP directory scan. + +**Fallback**: If no previous snapshot or >100k changed files → falls back to full `find` command. 
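+
+A heavily simplified sketch of that incremental logic is shown below. The real `/root/cloudsync-copy.sh` additionally applies the exclude list, translates absolute ZFS paths into rclone-relative paths, and falls back to `find` as described above:
+
+```bash
+#!/bin/sh
+# Incremental offsite copy: snapshot, diff against the previous snapshot, copy changed files.
+set -eu
+zfs snapshot main@cloudsync-new
+
+# zfs diff -F -H prints: change-type, file-type, path; keep regular files only (type F)
+zfs diff -F -H main@cloudsync-prev main@cloudsync-new \
+  | awk '$2 == "F" {print $3}' > /tmp/cloudsync_copy_files.txt
+
+rclone copy /mnt/main synology:/Backup/Viki/truenas \
+  --files-from-raw /tmp/cloudsync_copy_files.txt --no-traverse
+
+# Rotate snapshots for the next run
+zfs destroy main@cloudsync-prev
+zfs rename main@cloudsync-new main@cloudsync-prev
+```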
+ +**Weekly SYNC (Sunday 09:00)**: + +TrueNAS Cloud Sync Task 1 runs `rclone sync` which: +- Mirrors source → destination (removes deleted files on destination) +- Full directory traversal (~30-60 min) +- Ensures offsite is clean (no orphaned files from renamed paths) + +**Why both methods**: +- Incremental: Fast recovery for recent changes (seconds to minutes) +- Full sync: Cleanup pass to handle deletions, renames, edge cases + +**Destination**: `sftp://192.168.1.13/Backup/Viki/truenas` + +### Excludes (both incremental and full sync) + +| Pattern | Reason | File count | +|---------|--------|-----------| +| `clickhouse/**` | Regenerable logs/metrics | 2.47M files | +| `loki/**` | Regenerable logs | 68K files | +| `iocage/**` | Legacy FreeBSD jails (unused) | 96K files | +| `frigate/recordings/**` | Ephemeral security cam footage | 57K files | +| `prometheus/**` | Covered by monthly app backup | Large TSDB | +| `crowdsec/**` | Regenerable threat intelligence | — | +| `servarr/downloads/**` | Transient download staging | — | +| `iscsi/**`, `iscsi-snaps/**` | Raw zvols, backed at app level | — | +| `ytldp/**` | YouTube downloads (replaceable) | — | +| `*.log` | Log files (regenerable) | — | +| `post` | Transient POST data | — | + +### iSCSI Backup Architecture + +iSCSI zvols are raw block devices exported to K8s nodes. TrueNAS cannot read the filesystem inside a zvol. + +**Protection strategy**: +- **Layer 1**: ZFS snapshots cover zvols automatically (block-level) +- **Layer 2**: Application CronJobs inside pods dump data to NFS paths +- **Layer 3**: NFS paths sync offsite + +**Current coverage**: +| Service | Storage | Layer 2 Backup | Offsite | +|---------|---------|----------------|---------| +| PostgreSQL CNPG (12 DBs) | iSCSI | ✓ daily | ✓ | +| MySQL InnoDB (7 DBs) | iSCSI | ✓ daily | ✓ | +| Vault | iSCSI | ✓ weekly | ✓ | +| Vaultwarden | iSCSI | ✓ 6h + integrity | ✓ | +| Redis | iSCSI | ✓ weekly | ✓ | +| plotting-book | iSCSI | ✓ weekly | ✓ | + +**Convention**: Any new iSCSI-backed app MUST add a backup CronJob writing to `/mnt/main/-backup/` in its Terraform stack. + +**Uncovered (acceptable risk)**: +- Prometheus (disposable metrics, monthly TSDB backup for long-term trends) +- Loki (disposable logs) + +### iSCSI Hardening + +To prevent SQLite corruption from transient network disruptions, iSCSI initiator timeouts are relaxed on all K8s nodes: + +| Setting | Default | Hardened | Impact | +|---------|---------|----------|--------| +| `node.session.timeo.replacement_timeout` | 120s | 300s | Time before declaring session dead | +| `node.conn[0].timeo.noop_out_interval` | 5s | 10s | Keepalive interval | +| `node.conn[0].timeo.noop_out_timeout` | 5s | 15s | Keepalive timeout | +| `node.conn[0].iscsi.HeaderDigest` | None | CRC32C,None | Error detection | +| `node.conn[0].iscsi.DataDigest` | None | CRC32C,None | Error detection | + +**Applied to**: All 5 K8s nodes (k8s-master, k8s-node1-4) on 2026-03-23. + +**Persistence**: Baked into cloud-init template (`modules/create-template-vm/cloud_init.yaml`) so new nodes get these settings automatically. + +**Why needed**: Default 120s timeout is too aggressive. Brief network hiccup (5-10s) can trigger failover, causing SQLite to see incomplete writes → corruption. 300s timeout tolerates longer blips. 
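+
+On the nodes, the hardened values end up in `/etc/iscsi/iscsid.conf` (and in the cloud-init template) roughly as follows:
+
+```
+# Relaxed iSCSI initiator timeouts to ride out short network blips
+node.session.timeo.replacement_timeout = 300
+node.conn[0].timeo.noop_out_interval = 10
+node.conn[0].timeo.noop_out_timeout = 15
+# CRC32C digests for error detection (None kept as a fallback)
+node.conn[0].iscsi.HeaderDigest = CRC32C,None
+node.conn[0].iscsi.DataDigest = CRC32C,None
+```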
+ +## Configuration + +### Key Files + +| Path | Purpose | +|------|---------| +| `/root/cloudsync-copy.sh` | TrueNAS: incremental sync script | +| `/var/log/cloudsync-copy.log` | TrueNAS: sync script log output | +| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs | +| `stacks/vault/` | Terraform: Vault backup CronJob | +| `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs | +| `stacks/monitoring/` | Terraform: CloudSync monitor, Prometheus backup | +| `modules/create-template-vm/cloud_init.yaml` | iSCSI hardening params for new nodes | +| `/etc/iscsi/iscsid.conf` | K8s nodes: iSCSI initiator config | + +### Vault Paths + +| Path | Contents | +|------|----------| +| `secret/viktor/truenas_api_key` | TrueNAS API key for CloudSync monitor | +| `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access | + +### Terraform Stacks + +Each backup CronJob is defined in the application's stack: +- PostgreSQL/MySQL: `stacks/dbaas/backup.tf` +- Vault: `stacks/vault/backup.tf` +- Vaultwarden: `stacks/vaultwarden/backup.tf` +- etcd: `stacks/platform/etcd-backup.tf` +- Prometheus: `stacks/monitoring/prometheus-backup.tf` + +## Decisions & Rationale + +### Why 3 Layers? + +**Layer 1 (ZFS snapshots)**: +- Near-instant (<1s), zero overhead +- Point-in-time recovery for entire datasets +- BUT: Cannot restore individual database records, no offsite protection + +**Layer 2 (App backups)**: +- Granular restore (single DB, single table) +- Database-native tools (pg_dump, mysqldump) produce portable backups +- BUT: Higher overhead (CPU, I/O), longer RPO (daily/weekly) + +**Layer 3 (Offsite)**: +- Protection against site-level disaster (fire, theft, catastrophic hardware failure) +- BUT: 6h RPO (incremental), connectivity dependency + +All three together provide defense-in-depth. + +### Why Not Velero/Longhorn Backup? + +Evaluated K8s-native backup solutions (Velero, Longhorn): +- **Velero**: Requires object storage backend, complex restore, doesn't handle databases well +- **Longhorn**: High overhead (replicas, snapshots in-cluster), no offsite by default + +**Current approach wins** because: +- Leverages existing ZFS infrastructure (already running TrueNAS) +- Database-native backups (pg_dump/mysqldump) are battle-tested +- Simple restore procedures (documented runbooks) + +### Why Hybrid Incremental + Full Sync? + +**Incremental alone** is risky: +- Deleted files on source never deleted on destination +- Renamed paths create duplicates +- No cleanup of orphaned snapshots + +**Full sync alone** is slow: +- 30-60 min per run +- High network/CPU on both ends +- 6h RPO → 12h if a sync fails + +**Hybrid approach**: +- Fast incremental every 6h (sub-minute runtime) +- Weekly full sync for cleanup (tolerates longer runtime) + +### Why 6h Vaultwarden Backup vs Daily for Others? 
+ +Vaultwarden stores **password vault data** — highest-value target: +- User creates 10 new passwords → disaster 5h later → daily backup loses all 10 +- 6h RPO acceptable for password vaults (industry standard is 1-24h) +- Hourly integrity checks detect corruption before it spreads to backups + +Other services (MySQL, PostgreSQL): +- Mostly application data (not authentication secrets) +- Daily RPO acceptable per user tolerance +- Lower change velocity + +## Troubleshooting + +### PostgreSQL Backup Stale Alert + +**Symptom**: `PostgreSQLBackupStale` firing in Prometheus + +**Diagnosis**: +```bash +kubectl get cronjob -n dbaas +kubectl logs -n dbaas job/postgresql-backup- +``` + +**Common causes**: +- Pod OOMKilled (increase memory limit) +- NFS mount unavailable (check TrueNAS) +- pg_dumpall command failed (check PostgreSQL connectivity) + +**Fix**: +1. If OOM: Increase `resources.limits.memory` in `stacks/dbaas/backup.tf` +2. If NFS: Verify mount on worker node, restart NFS server if needed +3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas` + +### CloudSync Stale/Failing + +**Symptom**: `CloudSyncStale` or `CloudSyncFailing` alert + +**Diagnosis**: +```bash +# SSH to TrueNAS +ssh root@10.0.10.15 +cat /var/log/cloudsync-copy.log +zfs list -t snapshot | grep cloudsync +``` + +**Common causes**: +- Synology NAS unreachable (network, SFTP down) +- ZFS diff failed (snapshot deleted manually) +- rclone error (quota, permission) + +**Fix**: +1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13` +2. Verify snapshots exist: `zfs list -t snapshot | grep cloudsync` +3. Manually run: `/root/cloudsync-copy.sh` (check output) +4. Check rclone config: `rclone ls synology:/Backup/Viki/truenas` + +### Vaultwarden Integrity Check Failing + +**Symptom**: `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0` + +**Diagnosis**: +```bash +kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "PRAGMA integrity_check;" +``` + +**Critical**: If integrity check fails, database is corrupt. + +**Recovery**: +1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden` +2. Restore from latest backup: + ```bash + # Find latest backup + ls -lh /mnt/main/vaultwarden-backup/ + # Copy to pod volume + kubectl cp /mnt/main/vaultwarden-backup/db-.sqlite \ + vaultwarden/vaultwarden-0:/data/db.sqlite3 + ``` +3. Verify integrity on restored DB +4. Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden` + +### iSCSI Session Drops Causing Backup Failures + +**Symptom**: Backup CronJob fails with "I/O error" or "Transport endpoint not connected" + +**Diagnosis**: +```bash +# On K8s node +iscsiadm -m session +dmesg | grep -i iscsi +journalctl -u iscsid | tail -50 +``` + +**Fix**: +1. Verify hardened timeouts applied: `iscsiadm -m node -o show | grep -E 'replacement_timeout|noop_out'` +2. If defaults: Apply hardening: + ```bash + iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 300 + iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 10 + iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 15 + iscsiadm -m node -o update -n node.conn[0].iscsi.HeaderDigest -v CRC32C,None + iscsiadm -m node -o update -n node.conn[0].iscsi.DataDigest -v CRC32C,None + ``` +3. 
Restart session: `iscsiadm -m node -u && iscsiadm -m node -l` + +### Missing Backup for New Service + +**Symptom**: Added new service using iSCSI storage, no backup exists + +**Fix**: Add backup CronJob in service's Terraform stack + +**Template**: +```hcl +resource "kubernetes_cron_job_v1" "backup" { + metadata { + name = "${var.service_name}-backup" + namespace = kubernetes_namespace.service.metadata[0].name + } + spec { + schedule = "0 3 * * 0" # Weekly Sunday 03:00 + job_template { + spec { + template { + spec { + container { + name = "backup" + image = "appropriate/image:tag" + command = ["/bin/sh", "-c"] + args = [ + <<-EOT + TIMESTAMP=$(date +%Y%m%d) + # Dump command here + find /backup -mtime +30 -delete + EOT + ] + volume_mount { + name = "data" + mount_path = "/data" + } + volume_mount { + name = "backup" + mount_path = "/backup" + } + } + volume { + name = "data" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name + } + } + volume { + name = "backup" + persistent_volume_claim { + claim_name = module.nfs_backup.pvc_name + } + } + } + } + } + } + } +} + +module "nfs_backup" { + source = "../../modules/kubernetes/nfs_volume" + name = "${var.service_name}-backup" + namespace = kubernetes_namespace.service.metadata[0].name + nfs_server = var.nfs_server + nfs_path = "/mnt/main/${var.service_name}-backup" +} +``` + +## Monitoring & Alerting + +``` +┌────────────────────────────────────────────────────────────────┐ +│ Prometheus Alerts │ +│ │ +│ PostgreSQLBackupStale > 36h since last success │ +│ MySQLBackupStale > 36h since last success │ +│ EtcdBackupStale > 8d since last success │ +│ VaultBackupStale > 8d since last success │ +│ VaultwardenBackupStale > 8d since last success │ +│ RedisBackupStale > 8d since last success │ +│ PrometheusBackupStale > 32d since last success │ +│ PlottingBookBackupStale > 8d since last success │ +│ CloudSyncStale > 8d since last success │ +│ CloudSyncNeverRun task never completed │ +│ CloudSyncFailing task in error state │ +│ VaultwardenIntegrityFail integrity_ok == 0 │ +└────────────────────────────────────────────────────────────────┘ +``` + +**Metrics sources**: +- Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion +- CloudSync monitor: Queries TrueNAS API every 6h, pushes `cloudsync_last_success_timestamp` +- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly + +**Alert routing**: +- All backup alerts → Slack `#infra-alerts` +- Vaultwarden integrity fail → Slack `#infra-critical` (immediate action required) + +## Service Protection Matrix + +| Service | Layer 1 (ZFS) | Layer 2 (App) | Layer 3 (Offsite) | Storage | +|---------|:-------------:|:-------------:|:-----------------:|---------| +| **Databases** | +| PostgreSQL (12 DBs) | ✓ | ✓ daily | ✓ | iSCSI | +| MySQL (7 DBs) | ✓ | ✓ daily | ✓ | iSCSI | +| **Critical State** | +| Vault | ✓ | ✓ weekly | ✓ | iSCSI | +| etcd | ✓ | ✓ weekly | ✓ | local disk | +| Vaultwarden | ✓ | ✓ 6h + integrity | ✓ | iSCSI | +| Redis | ✓ | ✓ weekly | ✓ | iSCSI | +| **Applications** | +| Prometheus | ✓ | ✓ monthly | excluded | NFS | +| plotting-book | ✓ | ✓ weekly | ✓ | iSCSI | +| Immich | ✓ | — | ✓ | NFS | +| Forgejo | ✓ | — | ✓ | NFS | +| Paperless-ngx | ✓ | — | ✓ | NFS | +| Nextcloud | ✓ | — | ✓ | NFS | +| **Other NFS services** | ✓ | — | ✓ | NFS | + +**Legend**: +- ✓ = Protected at this layer +- — = Not needed (simple file storage, ZFS snapshots sufficient) +- excluded = Too large/regenerable, not worth offsite bandwidth 
+ +**Note**: NFS-backed services with simple data (files, SQLite) rely on ZFS snapshots + offsite sync. Application-level backups are only needed for services with complex state (databases, Raft consensus, multi-file consistency requirements). + +## Recovery Procedures + +Detailed runbooks in `docs/runbooks/`: + +- **`restore-postgresql.md`** — Restore individual database or full cluster from pg_dumpall backup +- **`restore-mysql.md`** — Restore MySQL databases from mysqldump backup +- **`restore-vault.md`** — Restore Vault from raft snapshot +- **`restore-vaultwarden.md`** — Restore password vault from sqlite3 backup +- **`restore-etcd.md`** — Restore etcd cluster from snapshot +- **`restore-full-cluster.md`** — Disaster recovery: rebuild cluster from offsite backups + +**RTO estimates** (tested 2026-03-23): +- Single PostgreSQL database: <5 min +- Full MySQL cluster: <15 min +- Vault: <10 min +- Vaultwarden: <5 min +- etcd: <20 min (requires cluster rebuild) +- Full cluster from offsite: <4 hours (TrueNAS restore + K8s bootstrap + app deploys) + +## Related + +- **Architecture**: `docs/architecture/storage.md` (NFS/iSCSI storage layer) +- **Reference**: `.claude/reference/service-catalog.md` (which services need backups) +- **Runbooks**: `docs/runbooks/restore-*.md` (step-by-step recovery procedures) +- **Monitoring**: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions) diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md new file mode 100644 index 00000000..f4ad4711 --- /dev/null +++ b/docs/architecture/ci-cd.md @@ -0,0 +1,292 @@ +# CI/CD Pipeline + +## Overview + +The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`. + +## Architecture Diagram + +```mermaid +graph LR + A[Git Push] --> B[GitHub Actions] + B --> C[Build Docker Image
linux/amd64, 8-char SHA tag] + C --> D[Push to DockerHub] + D --> E[POST Woodpecker API] + E --> F[Woodpecker Pipeline] + F --> G[Vault K8s Auth
SA JWT] + G --> H[kubectl set image] + H --> I[K8s Deployment] + I --> J[Pull from DockerHub
or Pull-Through Cache] + + K[Pull-Through Cache
10.0.20.10] -.-> J + L[registry.viktorbarzin.me
Private Registry] -.-> J + + style B fill:#2088ff + style F fill:#4c9e47 + style K fill:#f39c12 +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub | +| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster | +| DockerHub | Cloud | `viktorbarzin/*` | Public image registry | +| Private Registry | Custom | `registry.viktorbarzin.me` | Private images, htpasswd auth | +| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)
`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries | +| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces | +| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines | + +## How It Works + +### Build Flow (GitHub Actions) + +1. **Trigger**: Git push to main/master branch +2. **Build**: GHA builds Docker image for `linux/amd64` platform only +3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`) + - `:latest` tags are **never used** to prevent stale pull-through cache issues +4. **Push**: Image pushed to DockerHub public registry +5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA + +### Deploy Flow (Woodpecker CI) + +1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA +2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth +3. **Deploy**: `kubectl set image deployment/ =viktorbarzin/:` +4. **Notify**: Slack notification on success/failure + +### Project Migration Status + +**Migrated to GHA (7 projects)**: +- Website +- k8s-portal +- f1-stream +- claude-memory-mcp +- apple-health-data +- audiblez-web +- plotting-book + +**Woodpecker-only (infra + large apps)**: +- `travel_blog`: 1.4GB content directory exceeds GHA limits +- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli) + +### Woodpecker Pipeline Files + +Each project contains: +- `.woodpecker/deploy.yml`: kubectl set image + Slack notification +- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires) + +### Woodpecker Repository IDs + +Woodpecker API uses numeric IDs (not owner/name): + +| Repo | ID | +|------|------| +| infra | 1 | +| Website | 2 | +| finance | 3 | +| health | 4 | +| travel_blog | 5 | +| webhook-handler | 6 | +| audiblez-web | 9 | +| f1-stream | 10 | +| plotting-book | 43 | +| claude-memory-mcp | 78 | +| infra-onboarding | 79 | + +### Image Registry Flow + +1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10` +2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss +3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access +4. **Private registry** (`registry.viktorbarzin.me`) uses htpasswd auth, credentials stored in Vault + +### Infra Pipelines (Woodpecker-only) + +| Pipeline | File | Purpose | +|----------|------|---------| +| default | `.woodpecker/default.yml` | Terragrunt apply on push | +| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron | +| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries | +| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory | + +## Configuration + +### GitHub Actions + +**File**: `.github/workflows/build-and-deploy.yml` + +```yaml +name: Build and Deploy +on: + push: + branches: [main, master] +jobs: + build: + runs-on: ubuntu-latest + steps: + - name: Build Docker image + run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} . 
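+      # NOTE: ${SHORT_SHA} is not a built-in GitHub Actions variable; an earlier
+      # step is assumed to derive it from the commit, e.g.
+      #   echo "SHORT_SHA=${GITHUB_SHA::8}" >> "$GITHUB_ENV"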
+ - name: Push to DockerHub + run: docker push viktorbarzin/app:${SHORT_SHA} + - name: Trigger Woodpecker Deploy + run: | + curl -X POST https://ci.viktorbarzin.me/api/repos//pipelines \ + -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" +``` + +**Required GitHub Secrets**: +- `DOCKERHUB_USERNAME` +- `DOCKERHUB_TOKEN` +- `WOODPECKER_TOKEN` + +### Woodpecker Deploy Pipeline + +**File**: `.woodpecker/deploy.yml` + +```yaml +when: + event: [deployment] + +steps: + deploy: + image: bitnami/kubectl:latest + commands: + - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8} + secrets: [k8s_token] + + notify: + image: plugins/slack + settings: + webhook: ${SLACK_WEBHOOK} + when: + status: [success, failure] +``` + +**YAML Gotchas**: +- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty +- Use `bitnami/kubectl:latest` (not pinned versions) +- Global secrets must be manually added to `secrets:` list in pipeline + +### Vault Configuration + +**K8s Auth for Woodpecker**: +- Woodpecker pipelines authenticate using ServiceAccount JWT +- Vault K8s auth mount validates JWT and issues token +- Policies grant access to secrets and dynamic credentials + +### CI/CD Secrets Sync + +**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours +- Keeps Woodpecker global secrets in sync with Vault +- Runs in `woodpecker` namespace + +## Decisions & Rationale + +### Why GitHub Actions + Woodpecker? + +**Alternatives considered**: +1. **Woodpecker-only**: Simple, but wastes cluster resources on builds +2. **GHA-only**: No cluster access, requires kubectl from outside (security risk) +3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access) + +**Benefits**: +- Free compute for builds on public repos +- Cluster access stays internal (Woodpecker has direct K8s access) +- Separation of concerns: build vs deploy + +### Why 8-Character SHA Tags (Not :latest)? + +- Pull-through cache serves stale `:latest` tags indefinitely +- SHA tags ensure every deployment pulls the correct image +- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations) + +### Why Numeric Repo IDs for Woodpecker API? + +- Woodpecker API requires numeric IDs (not owner/name slugs) +- IDs are stable across repo renames +- Must be manually looked up from Woodpecker UI or database + +### Why linux/amd64 Only? 
+ +- Cluster runs on x86_64 nodes only +- ARM builds would waste time and storage +- Multi-arch images add complexity without benefit + +## Troubleshooting + +### GHA Build Fails: "denied: requested access to the resource is denied" + +**Cause**: DockerHub credentials expired or incorrect + +**Fix**: +```bash +# Regenerate DockerHub token +# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN +``` + +### Woodpecker Deploy Fails: "Unauthorized" + +**Cause**: Vault K8s auth token expired or invalid + +**Fix**: +```bash +# Restart Woodpecker pipeline (token auto-renewed) +# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer +``` + +### Image Pull Fails: "ErrImagePull" + +**Cause**: Pull-through cache or registry credentials issue + +**Fix**: +```bash +# Check pull-through cache is running +curl http://10.0.20.10:5000/v2/_catalog + +# Verify registry-credentials Secret exists in namespace +kubectl get secret registry-credentials -n + +# Manually sync credentials if missing +kubectl get secret registry-credentials -n default -o yaml | \ + sed 's/namespace: default/namespace: /' | kubectl apply -f - +``` + +### Woodpecker Pipeline: "YAML: did not find expected key" + +**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty + +**Fix**: Quote the command: +```yaml +commands: + - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}" +``` + +### travel_blog Build Times Out on GHA + +**Cause**: 1.4GB content directory exceeds GHA disk/time limits + +**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources. + +### CI/CD Secrets Out of Sync + +**Cause**: CronJob failed to sync Vault → Woodpecker + +**Fix**: +```bash +# Check CronJob status +kubectl get cronjob -n woodpecker + +# Manually trigger sync +kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker +``` + +## Related + +- [Databases Architecture](./databases.md) — Database credentials via Vault +- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access +- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app +- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues +- Vault documentation: K8s auth configuration +- Woodpecker documentation: API reference diff --git a/docs/architecture/compute.md b/docs/architecture/compute.md new file mode 100644 index 00000000..6992cc55 --- /dev/null +++ b/docs/architecture/compute.md @@ -0,0 +1,688 @@ +# Compute & Resource Management + +## Overview + +The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a 5-node Kubernetes cluster. Compute resources are managed through a combination of Vertical Pod Autoscaler (VPA) recommendations, tier-based LimitRange defaults, and ResourceQuota enforcement. The cluster employs a no-CPU-limits policy to avoid CFS throttling while using memory requests=limits for stability. GPU workloads run on a dedicated node with Tesla T4 passthrough. + +## Architecture Diagram + +```mermaid +graph TB + subgraph Physical["Dell R730 Physical Host"] + CPU["2x Xeon E5-2699 v4
22c/44t each
44c/88t total"] + RAM["142GB DDR4 ECC"] + GPU["NVIDIA Tesla T4
PCIe 0000:06:00.0"] + DISK["1.1TB SSD
931GB SSD
10.7TB HDD"] + end + + subgraph Proxmox["Proxmox VE"] + direction TB + MASTER["VM 200: k8s-master
8c / 8GB
10.0.20.100"] + NODE1["VM 201: k8s-node1
16c / 16GB
GPU Passthrough
nvidia.com/gpu=true:NoSchedule"] + NODE2["VM 202: k8s-node2
8c / 24GB"] + NODE3["VM 203: k8s-node3
8c / 24GB"] + NODE4["VM 204: k8s-node4
8c / 24GB"] + end + + subgraph K8s["Kubernetes Cluster v1.34.2"] + direction TB + + subgraph VPA["VPA (Goldilocks - Initial Mode)"] + RECOMMEND["Quarterly Review:
upperBound x1.2 (stable)
upperBound x1.3 (GPU/volatile)"] + end + + subgraph LimitRange["LimitRange per Tier"] + TIER0_LR["0-core: 512Mi-8Gi mem
500m-4 cpu"] + TIER1_LR["1-cluster: 512Mi-4Gi mem
500m-2 cpu"] + TIER2_LR["2-gpu: 2Gi-16Gi mem
1-8 cpu"] + TIER34_LR["3-edge/4-aux: 256Mi-4Gi mem
250m-2 cpu"] + end + + subgraph ResourceQuota["ResourceQuota per Tier"] + TIER0_RQ["0-core: 32 cpu / 64Gi mem / 100 pods"] + TIER1_RQ["1-cluster: 16 cpu / 32Gi mem / 30 pods"] + TIER2_RQ["2-gpu: 48 cpu / 96Gi mem / 40 pods"] + TIER34_RQ["3-edge/4-aux: 8-16 cpu / 16-32Gi mem / 20-30 pods"] + end + end + + Physical --> Proxmox + GPU -.->|Passthrough| NODE1 + Proxmox --> K8s + VPA --> LimitRange + LimitRange --> ResourceQuota +``` + +## Components + +### Proxmox Host + +| Component | Specification | +|-----------|---------------| +| Model | Dell PowerEdge R730 | +| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each) | +| Total Cores/Threads | 44 cores / 88 threads | +| RAM | 142GB DDR4 ECC | +| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) | +| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD | +| Hypervisor | Proxmox VE | + +### Kubernetes Nodes + +| VM | VMID | vCPUs | RAM | Network | Role | Taints | +|----|------|-------|-----|---------|------|--------| +| k8s-master | 200 | 8 | 8GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` | +| k8s-node1 | 201 | 16 | 16GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:NoSchedule` | +| k8s-node2 | 202 | 8 | 24GB | vmbr1:vlan20 | Worker | None | +| k8s-node3 | 203 | 8 | 24GB | vmbr1:vlan20 | Worker | None | +| k8s-node4 | 204 | 8 | 24GB | vmbr1:vlan20 | Worker | None | + +**Total Cluster Resources**: 48 vCPUs, 104GB RAM (excluding control plane) + +### GPU Passthrough + +| Parameter | Value | +|-----------|-------| +| Device | NVIDIA Tesla T4 (16GB GDDR6) | +| PCIe Address | 0000:06:00.0 | +| Assigned VM | VMID 201 (k8s-node1) | +| Node Label | `gpu=true` | +| Node Taint | `nvidia.com/gpu=true:NoSchedule` | +| Driver | NVIDIA GPU Operator | +| Resource Name | `nvidia.com/gpu` | + +### Resource Management Stack + +| Component | Version/Mode | Purpose | +|-----------|--------------|---------| +| VPA | Goldilocks "Initial" mode | Resource recommendation (not auto-scaling) | +| Kyverno | Policy engine | Auto-generate LimitRange + ResourceQuota per tier | +| PriorityClass | Per tier (200K-900K) | Pod preemption during resource pressure | +| QoS Class | Guaranteed (0-2), Burstable (3-4) | Eviction order | + +## How It Works + +### CPU Resource Management + +**Policy**: No CPU limits cluster-wide, only CPU requests. + +**Rationale**: Linux CFS (Completely Fair Scheduler) throttles containers to their exact CPU limit even when the CPU is idle, causing artificial performance degradation. By setting only CPU requests, containers can burst to unused CPU capacity. + +**Implementation**: +- All pods set `resources.requests.cpu` (reserves capacity) +- No pods set `resources.limits.cpu` +- Scheduler uses CPU requests for bin-packing +- Kernel CFS shares unused CPU proportionally by requests + +**Example**: +```yaml +resources: + requests: + cpu: "500m" + # No limits.cpu - can burst to idle CPU +``` + +### Memory Resource Management + +**Policy**: Memory requests = limits for stability. + +**Rationale**: Memory is not compressible like CPU. A pod that exceeds its memory request can be OOMKilled unpredictably. 
Setting requests=limits ensures: +- Predictable memory allocation +- QoS class "Guaranteed" (tiers 0-2) or "Burstable" (tiers 3-4) +- No surprise OOMKills during memory pressure + +**Implementation**: +- Tier 0-2: `requests.memory = limits.memory` (Guaranteed QoS) +- Tier 3-4: `requests.memory < limits.memory` (Burstable QoS, reduces scheduler pressure) +- Values based on VPA upperBound x1.2 (stable) or x1.3 (GPU/volatile) + +**Example**: +```yaml +# Tier 0-2 (Guaranteed) +resources: + requests: + memory: "2Gi" + limits: + memory: "2Gi" + +# Tier 3-4 (Burstable) +resources: + requests: + memory: "512Mi" + limits: + memory: "1Gi" +``` + +### Vertical Pod Autoscaler (VPA) + +**Mode**: Goldilocks in "Initial" mode (recommend-only, not auto-scaling). + +**Why not Auto mode?** +- VPA Auto mode directly updates Deployment specs, creating drift from Terraform state +- Terraform manages all resources declaratively, so VPA changes would be reverted +- Quarterly review process maintains control and aligns with planned maintenance windows + +**Workflow**: +1. VPA monitors pod resource usage over time +2. Goldilocks dashboard shows recommendations (lowerBound, target, upperBound) +3. Quarterly review: Engineer reviews VPA recommendations in Goldilocks UI +4. Apply sizing: Update Terraform with `memory: * 1.2` (stable) or `* 1.3` (GPU/volatile) +5. Terragrunt apply updates Deployment specs +6. Pods restart with new resource allocations + +**Stability Multipliers**: +- **x1.2**: Stable services (databases, monitoring, core services) +- **x1.3**: GPU workloads or volatile services (user-facing apps, ML inference) + +### Tier-Based LimitRange + +Kyverno automatically creates a LimitRange in each namespace based on its tier prefix. + +| Tier | Default Memory | Max Memory | Default CPU | Max CPU | +|------|----------------|------------|-------------|---------| +| 0-core | 512Mi | 8Gi | 500m | 4 | +| 1-cluster | 512Mi | 4Gi | 500m | 2 | +| 2-gpu | 2Gi | 16Gi | 1 | 8 | +| 3-edge | 256Mi | 4Gi | 250m | 2 | +| 4-aux | 256Mi | 4Gi | 250m | 2 | + +**Purpose**: +- Prevents pods without explicit resources from requesting unlimited resources +- Sets sensible defaults for sidecars and init containers +- Enforces maximum per-container limits + +**Example**: A pod in `4-aux-vaultwarden` without explicit resources gets: +```yaml +resources: + requests: + memory: 256Mi + cpu: 250m + limits: + memory: 4Gi + cpu: 2 # (ignored due to no-CPU-limits policy) +``` + +### Tier-Based ResourceQuota + +Kyverno automatically creates a ResourceQuota in each namespace based on its tier. + +| Tier | CPU Limit | Memory Limit | Max Pods | +|------|-----------|--------------|----------| +| 0-core | 32 | 64Gi | 100 | +| 1-cluster | 16 | 32Gi | 30 | +| 2-gpu | 48 | 96Gi | 40 | +| 3-edge | 16 | 32Gi | 30 | +| 4-aux | 8 | 16Gi | 20 | + +**Purpose**: +- Prevents a single namespace from monopolizing cluster resources +- Enforces tier-appropriate resource allocation +- Protects critical services from lower-tier resource exhaustion + +**Quota Exhaustion**: If a namespace exceeds its quota, new pods are rejected with `Forbidden: exceeded quota`. 
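+
+When pods are rejected this way, current consumption against the tier quota can be checked directly (using the `4-aux-vaultwarden` namespace from the example above):
+
+```bash
+# Quota usage vs. hard limits for the namespace
+kubectl describe resourcequota -n 4-aux-vaultwarden
+
+# Tier defaults and maximums applied to containers without explicit resources
+kubectl describe limitrange -n 4-aux-vaultwarden
+```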
+ +### QoS Classes and Eviction + +Kubernetes assigns QoS classes based on resource configuration: + +| QoS Class | Condition | Eviction Priority | Tiers | +|-----------|-----------|-------------------|-------| +| Guaranteed | requests = limits (both CPU & memory) | Last | 0-core, 1-cluster, 2-gpu | +| Burstable | requests < limits | Middle | 3-edge, 4-aux | +| BestEffort | No requests or limits | First | None (not used) | + +**Eviction Order during Memory Pressure**: +1. BestEffort pods (none in cluster) +2. Burstable pods (tier 3-4), lowest priority first +3. Guaranteed pods (tier 0-2), lowest priority first + +**Priority Classes**: +- 0-core: 900000 +- 1-cluster: 700000 +- 2-gpu: 500000 +- 3-edge: 300000 +- 4-aux: 200000 + +During resource pressure, tier 4 pods are evicted before tier 3, tier 3 before tier 2, etc. + +### Democratic-CSI Sidecar Resources + +**Problem**: Democratic-CSI injects 3-4 sidecar containers per pod with PVCs: +- `csi-driver-registrar` +- `csi-provisioner` +- `csi-attacher` +- `csi-resizer` + +Without explicit resources, each defaults to LimitRange default (256Mi), consuming 768Mi-1Gi per pod. + +**Solution**: Explicitly set sidecar resources in Terraform: +```hcl +resources { + requests = { + memory = "32Mi" + cpu = "10m" + } + limits = { + memory = "80Mi" + } +} +``` + +**Result**: 17 CSI sidecars go from 4.3GB (17 * 256Mi) to 544Mi (17 * 32Mi), freeing 3.7GB. + +### GPU Resource Management + +**Node Selection**: GPU pods must: +1. Tolerate `nvidia.com/gpu=true:NoSchedule` taint +2. Select `gpu=true` label +3. Request `nvidia.com/gpu: 1` resource + +**Example**: +```yaml +spec: + tolerations: + - key: nvidia.com/gpu + operator: Equal + value: "true" + effect: NoSchedule + nodeSelector: + gpu: "true" + containers: + - name: app + resources: + limits: + nvidia.com/gpu: 1 +``` + +**GPU Workloads**: +- Ollama (LLM inference) +- ComfyUI (Stable Diffusion workflows) +- Stable Diffusion WebUI + +## Configuration + +### Key Files + +| Path | Purpose | +|------|---------| +| `modules/namespace_config/` | Kyverno policies for LimitRange + ResourceQuota generation | +| `modules/k8s_app/main.tf` | Default resource templates for apps | +| `stacks//terragrunt.hcl` | Per-service resource overrides | +| `modules/gpu_app/` | GPU-specific resource templates | + +### Terraform Resource Configuration + +**Standard App** (no PVC): +```hcl +module "app" { + source = "../../modules/k8s_app" + + resources = { + requests = { + memory = "1Gi" # VPA upperBound * 1.2 + cpu = "500m" + } + limits = { + memory = "1Gi" # Same as request + # No CPU limit + } + } +} +``` + +**App with Democratic-CSI PVC**: +```hcl +module "app" { + source = "../../modules/k8s_app" + + resources = { + requests = { + memory = "2Gi" + cpu = "500m" + } + limits = { + memory = "2Gi" + } + } + + sidecar_resources = { + requests = { + memory = "32Mi" + cpu = "10m" + } + limits = { + memory = "80Mi" + } + } +} +``` + +**GPU App**: +```hcl +module "gpu_app" { + source = "../../modules/gpu_app" + + gpu_count = 1 + + resources = { + requests = { + memory = "8Gi" # VPA upperBound * 1.3 + cpu = "2" + } + limits = { + memory = "8Gi" + nvidia.com/gpu = 1 + } + } +} +``` + +### Kyverno Policies + +**LimitRange Generation** (`modules/namespace_config/limitrange-policy.yaml`): +```yaml +apiVersion: kyverno.io/v1 +kind: ClusterPolicy +metadata: + name: generate-limitrange +spec: + rules: + - name: generate-limitrange-0-core + match: + resources: + kinds: + - Namespace + name: "0-core-*" + generate: + kind: LimitRange + data: + 
spec: + limits: + - default: + memory: 512Mi + cpu: 500m + defaultRequest: + memory: 512Mi + cpu: 500m + max: + memory: 8Gi + cpu: 4 + type: Container +``` + +**ResourceQuota Generation** (`modules/namespace_config/resourcequota-policy.yaml`): +```yaml +apiVersion: kyverno.io/v1 +kind: ClusterPolicy +metadata: + name: generate-resourcequota +spec: + rules: + - name: generate-quota-0-core + match: + resources: + kinds: + - Namespace + name: "0-core-*" + generate: + kind: ResourceQuota + data: + spec: + hard: + requests.cpu: "32" + requests.memory: 64Gi + pods: "100" +``` + +## Decisions & Rationale + +### Why no CPU limits? + +**Decision**: Set CPU requests but never set CPU limits. + +**Rationale**: +- **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation. +- **Burstability**: Services can burst to unused CPU during low-load periods, improving response times. +- **Memory-bound**: With 142GB RAM across 48 vCPUs, memory exhaustion occurs before CPU saturation. Memory is the constraining resource. + +**Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption. + +**Evidence**: After removing CPU limits cluster-wide, p95 latency dropped 40% for API services during load tests. + +### Why Goldilocks in Initial mode instead of Auto? + +**Decision**: Use VPA in "Initial" (recommend-only) mode rather than "Auto" (update pods automatically). + +**Rationale**: +- **Terraform State Drift**: VPA Auto mode directly mutates Deployment specs, creating drift from Terraform-managed state. Next Terraform apply reverts VPA changes. +- **Declarative Workflow**: Terraform is the source of truth. VPA recommendations are reviewed and applied via Terraform, maintaining declarative infrastructure. +- **Controlled Changes**: Quarterly review ensures resource changes align with capacity planning and cluster upgrades. +- **Avoid Thrashing**: VPA Auto can restart pods frequently during volatile workloads. Manual application reduces churn. + +**Tradeoff**: Requires quarterly manual review. Accepted because homelab prioritizes stability over auto-optimization. + +### Why memory requests = limits for tiers 0-2? + +**Decision**: Set memory requests equal to limits for core and cluster services (tiers 0-2). + +**Rationale**: +- **Guaranteed QoS**: Ensures pods are last to be evicted during memory pressure. +- **Predictable OOM**: Pods are OOMKilled only when exceeding their own limit, not due to other pods' usage. +- **Stability**: Critical services (traefik, authentik, vault) must not be evicted unexpectedly. + +**Tradeoff**: Cannot burst above limit. Accepted because critical services are right-sized via VPA. + +### Why Burstable QoS for tiers 3-4? + +**Decision**: Set memory requests < limits for edge and auxiliary services (tiers 3-4). + +**Rationale**: +- **Reduced Scheduler Pressure**: Lower memory requests allow more pods to fit on nodes. +- **Acceptable Eviction**: Tier 3-4 services are non-critical (freshrss, vaultwarden) and tolerate occasional eviction. +- **Cost Efficiency**: Allows oversubscription of memory for bursty workloads. + +**Tradeoff**: Pods may be evicted during memory pressure. Accepted because tier 3-4 services have PriorityClass 200K-300K. + +### Why VPA upperBound * 1.2 (or 1.3)? + +**Decision**: Set memory limits to VPA upperBound * 1.2 for stable services, * 1.3 for GPU/volatile services. 
+ +**Rationale**: +- **Headroom**: VPA upperBound is the observed maximum usage. Adding 20-30% headroom prevents OOMKills during traffic spikes. +- **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention. +- **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills. + +**Tradeoff**: Slightly higher memory allocation. Accepted because 142GB RAM provides ample capacity. + +## Troubleshooting + +### Pods stuck in Pending state + +**Symptom**: Pod shows `status: Pending` with event `FailedScheduling`. + +**Diagnosis**: +```bash +kubectl describe pod -n +``` + +**Common Causes**: + +1. **ResourceQuota exceeded**: + ``` + Error: exceeded quota: -quota, requested: requests.memory=2Gi, used: requests.memory=14Gi, limited: requests.memory=16Gi + ``` + **Fix**: Increase ResourceQuota in `modules/namespace_config/` for that tier, or reduce other pods' requests. + +2. **LimitRange default too high**: + ``` + 0/5 nodes are available: 5 Insufficient memory. + ``` + **Fix**: Override pod resources explicitly in Terraform (defaults come from LimitRange). + +3. **GPU taint not tolerated**: + ``` + 0/5 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu: true}, 4 Insufficient nvidia.com/gpu. + ``` + **Fix**: Add toleration and nodeSelector for GPU pods. + +4. **No nodes with GPU**: + ``` + 0/5 nodes are available: 5 Insufficient nvidia.com/gpu. + ``` + **Fix**: Verify GPU node (201) is Ready and labeled `gpu=true`. + +### Pods OOMKilled repeatedly + +**Symptom**: Pod shows `status: OOMKilled` in events, restarts frequently. + +**Diagnosis**: +```bash +kubectl describe pod -n +kubectl top pod -n # Current usage +kubectl get limitrange -n -o yaml # Check defaults +``` + +**Common Causes**: + +1. **Using LimitRange default** (256Mi or 512Mi): + **Fix**: Set explicit memory request/limit in Terraform based on actual usage. + +2. **Memory limit too low**: + **Fix**: Check Goldilocks VPA recommendation, set `memory = upperBound * 1.2`. + +3. **Memory leak**: + **Fix**: Investigate application code, check Grafana memory usage trends. + +### Democratic-CSI sidecars consuming excessive memory + +**Symptom**: Pods with PVCs have 3-4 sidecar containers, each using 256Mi (LimitRange default). + +**Diagnosis**: +```bash +kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].name | contains("csi")) | {name: .metadata.name, namespace: .metadata.namespace}' +kubectl top pod -n --containers +``` + +**Fix**: +Update Terraform to override sidecar resources: +```hcl +sidecar_resources = { + requests = { + memory = "32Mi" + cpu = "10m" + } + limits = { + memory = "80Mi" + } +} +``` + +### Tier 3-4 pods evicted during resource pressure + +**Symptom**: Lower-tier pods show `status: Evicted` with reason `The node was low on resource: memory`. + +**Diagnosis**: +```bash +kubectl get events --sort-by='.lastTimestamp' | grep Evicted +kubectl top nodes # Check node memory usage +``` + +**Expected Behavior**: This is normal. Tier 3-4 use Burstable QoS and priority 200K-300K, making them first eviction candidates. + +**Fix**: +- If evictions are frequent: Increase node memory or reduce tier 3-4 memory limits +- If evicted service is critical: Promote to tier 1 or 2 +- If node is overloaded: Check for memory leaks in tier 0-2 services + +### GPU pods not scheduling on GPU node + +**Symptom**: GPU pod stuck in Pending with event `0/5 nodes are available: 1 node(s) had untolerated taint`. 
+ +**Diagnosis**: +```bash +kubectl describe node k8s-node1 | grep Taints +kubectl describe pod -n | grep -A5 Tolerations +``` + +**Fix**: +Add GPU toleration and selector to pod spec: +```yaml +spec: + tolerations: + - key: nvidia.com/gpu + operator: Equal + value: "true" + effect: NoSchedule + nodeSelector: + gpu: "true" + containers: + - name: app + resources: + limits: + nvidia.com/gpu: 1 +``` + +### Node out of memory despite low pod usage + +**Symptom**: Node shows memory pressure, but `kubectl top pods` shows low usage. + +**Diagnosis**: +```bash +# SSH to node +ssh k8s-node2 +free -h +ps aux --sort=-%mem | head -20 +``` + +**Common Causes**: +1. **Kernel memory**: Page cache, slab allocator not shown in `kubectl top` +2. **System services**: kubelet, containerd, systemd-journald +3. **Zombie containers**: Old containers not cleaned up + +**Fix**: +```bash +# Clear page cache (safe on production) +echo 3 > /proc/sys/vm/drop_caches + +# Cleanup stopped containers +crictl rmp $(crictl ps -a --state Exited -q) + +# Restart kubelet (forces cleanup) +systemctl restart kubelet +``` + +### VPA recommendations not appearing in Goldilocks + +**Symptom**: Goldilocks dashboard shows no recommendations for a service. + +**Diagnosis**: +```bash +kubectl get vpa -n +kubectl describe vpa -n +``` + +**Common Causes**: +1. **VPA not created**: Terraform module missing VPA resource +2. **Insufficient data**: VPA needs 24h of metrics before recommending +3. **VPA pod not running**: VPA controller/recommender crashed + +**Fix**: +```bash +# Check VPA pods +kubectl get pods -n kube-system | grep vpa + +# Check VPA logs +kubectl logs -n kube-system deployment/vpa-recommender + +# Restart VPA if needed +kubectl rollout restart -n kube-system deployment/vpa-recommender +``` + +## Related + +- [Overview](overview.md) - VM inventory and cluster architecture +- [Multi-tenancy](multi-tenancy.md) - Tier system and namespace isolation +- [Monitoring](monitoring.md) - Resource usage dashboards and Goldilocks UI +- [Runbooks: Right-Sizing](../../runbooks/right-sizing.md) - Quarterly VPA review process +- [Runbooks: GPU Troubleshooting](../../runbooks/gpu-troubleshooting.md) +- [Runbooks: Node Maintenance](../../runbooks/node-maintenance.md) diff --git a/docs/architecture/databases.md b/docs/architecture/databases.md new file mode 100644 index 00000000..2a2c8c4b --- /dev/null +++ b/docs/architecture/databases.md @@ -0,0 +1,430 @@ +# Databases + +## Overview + +The cluster provides shared database services (PostgreSQL, MySQL, Redis) for multi-tenant workloads with automated credential rotation via Vault. PostgreSQL uses CloudNativePG (CNPG) with PgBouncer connection pooling, MySQL runs as an InnoDB Cluster with anti-affinity rules for stability, and Redis provides a shared cache layer. SQLite is used for per-app local storage with careful attention to filesystem compatibility. + +## Architecture Diagram + +```mermaid +graph TB + subgraph Apps + A1[trading-bot] + A2[apple-health-data] + A3[wrongmove] + A4[claude-memory-mcp] + end + + subgraph PostgreSQL + A1 --> PGB[PgBouncer
3 replicas] + A2 --> PGB + A4 --> PGB + PGB --> CNPG_RW[CNPG Primary
pg-cluster-rw.dbaas] + CNPG_RW --> CNPG_R1[CNPG Replica 1] + CNPG_RW --> CNPG_R2[CNPG Replica 2] + end + + subgraph MySQL + A3 --> MYC[MySQL InnoDB Cluster
3 instances] + MYC --> ISCSI1[iSCSI Storage] + MYC -.anti-affinity.-> NODE2[Exclude node2
SIGBUS bug] + end + + subgraph Redis + A1 --> RED[Redis
redis.redis.svc.cluster.local] + end + + subgraph Vault + V[Vault DB Engine] + V -.24h rotation.-> PGB + V -.24h rotation.-> MYC + end + + style CNPG_RW fill:#2088ff + style PGB fill:#4c9e47 + style MYC fill:#f39c12 + style RED fill:#dc382d +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| PostgreSQL (CNPG) | CloudNativePG | `dbaas` namespace | Primary/replica cluster, auto-failover | +| PgBouncer | 3 replicas | `dbaas` namespace | Connection pooling for PostgreSQL | +| MySQL InnoDB Cluster | 8.x | `dbaas` namespace | Multi-master MySQL cluster | +| Redis | Latest | `redis` namespace | Shared cache layer | +| Vault DB Engine | - | `vault` namespace | Automated credential rotation | + +### Database Endpoints + +| Service | Endpoint | Notes | +|---------|----------|-------| +| PostgreSQL (primary) | `pg-cluster-rw.dbaas.svc.cluster.local` | Always use this via PgBouncer | +| PgBouncer | `pgbouncer.dbaas.svc.cluster.local` | Connection pool (3 replicas) | +| MySQL | `mysql.dbaas.svc.cluster.local` | InnoDB Cluster VIP | +| Redis | `redis.redis.svc.cluster.local` | Shared instance | +| **NEVER USE** | `postgresql.dbaas.svc.cluster.local` | Legacy service, no endpoints | + +## How It Works + +### PostgreSQL (CNPG + PgBouncer) + +1. **CNPG Cluster**: Manages PostgreSQL primary and replicas + - Primary: `pg-cluster-rw.dbaas.svc.cluster.local` + - Auto-failover on primary failure + - Replicas for read scaling + +2. **PgBouncer**: Connection pooling layer (3 replicas) + - Apps connect to PgBouncer, not directly to PostgreSQL + - Reduces connection overhead + - Load balances across PgBouncer instances + +3. **Credential Rotation**: Vault DB engine rotates credentials every 24h + - Apps fetch credentials from Vault on startup + - Vault manages rotation lifecycle + +**Used by**: +- trading-bot +- apple-health-data (health) +- linkwarden +- affine +- woodpecker +- claude-memory-mcp +- ~12 stacks total + +### MySQL InnoDB Cluster + +1. **Cluster Topology**: 3 MySQL instances with auto-recovery + - Multi-master replication + - Automatic split-brain resolution + +2. **Storage**: iSCSI-backed persistent volumes + - Low-latency block storage + - Better performance than NFS + +3. **Anti-Affinity**: Excludes node2 due to SIGBUS bug + - Pods scheduled to node1, node3, node4, etc. + - Prevents kernel panic crashes + +4. **Resource Allocation**: 4.4Gi memory request, ~1Gi actual usage + - Over-provisioned for safety + +**Used by**: +- wrongmove (realestate-crawler) +- speedtest +- codimd +- nextcloud +- shlink +- grafana + +### Redis + +- Shared instance at `redis.redis.svc.cluster.local` +- Used for caching and session storage +- No persistence (ephemeral) + +### SQLite (Per-App) + +**Apps using SQLite**: +- headscale +- vaultwarden +- plotting-book +- holiday-planner +- priority-pass + +**Critical**: SQLite on NFS is unreliable +- NFS lacks proper `fsync()` support +- Causes database corruption under load +- **Solution**: Use iSCSI-backed volumes for SQLite apps + +### Vault Database Engine + +**Rotation Schedule**: 24 hours + +**PostgreSQL Rotation**: +- trading +- health (apple-health-data) +- linkwarden +- affine +- woodpecker +- claude_memory + +**MySQL Rotation**: +- speedtest +- wrongmove +- codimd +- nextcloud +- shlink +- grafana + +**Excluded from Rotation**: +- authentik (uses PgBouncer, incompatible) +- technitium, crowdsec (Helm-baked credentials) +- Root users (manual management) + +**How Rotation Works**: +1. 
Vault creates new user with same permissions +2. App fetches new credentials on next Vault lease renewal +3. Old credentials revoked after grace period +4. Zero-downtime rotation + +## Configuration + +### Terraform Shared Variables + +Always use shared variables, never hardcode endpoints: + +```hcl +variable "postgresql_host" { + default = "pgbouncer.dbaas.svc.cluster.local" +} + +variable "mysql_host" { + default = "mysql.dbaas.svc.cluster.local" +} + +variable "redis_host" { + default = "redis.redis.svc.cluster.local" +} +``` + +### Vault Paths + +**PostgreSQL Dynamic Credentials**: +``` +database/creds/postgres--role +``` + +**MySQL Dynamic Credentials**: +``` +database/creds/mysql--role +``` + +**Static Credentials** (non-rotated): +``` +secret/data/mysql/root +secret/data/postgres/root +``` + +### Version Pinning + +**Diun Monitoring Disabled** for database images to prevent unwanted version bumps: +- MySQL: pinned version in Terraform +- PostgreSQL: pinned CNPG operator version +- Redis: pinned image tag + +**Rationale**: Database upgrades require careful planning and testing + +### Example Terraform Stack (PostgreSQL) + +```hcl +resource "vault_database_secret_backend_role" "app" { + backend = "database" + name = "postgres-myapp-role" + db_name = "postgres" + creation_statements = [ + "CREATE USER \"{{name}}\" WITH PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';", + "GRANT ALL PRIVILEGES ON DATABASE myapp TO \"{{name}}\";" + ] + default_ttl = 86400 # 24 hours + max_ttl = 86400 +} + +resource "kubernetes_secret" "db_creds" { + metadata { + name = "myapp-db" + namespace = "default" + } + + data = { + host = var.postgresql_host + database = "myapp" + # App fetches username/password from Vault at runtime + } +} +``` + +## Decisions & Rationale + +### Why CNPG Instead of Postgres Operator? + +**Alternatives considered**: +1. **Zalando Postgres Operator**: Mature but complex +2. **Bitnami PostgreSQL Helm**: Simple but manual failover +3. **CNPG (chosen)**: Kubernetes-native, auto-failover, active development + +**Benefits**: +- Native Kubernetes CRDs +- Automatic failover and recovery +- Active community and updates +- Better resource efficiency than Zalando + +### Why PgBouncer for PostgreSQL? + +- Reduces connection overhead (apps create many connections) +- Load balances across PgBouncer replicas +- Essential for apps that don't implement connection pooling +- Required for Vault DB engine compatibility with some apps + +### Why MySQL InnoDB Cluster? + +**Alternatives considered**: +1. **Single MySQL instance**: No HA +2. **Galera Cluster**: Complex, split-brain issues +3. **InnoDB Cluster (chosen)**: Built-in multi-master, auto-recovery + +**Benefits**: +- Native MySQL HA solution +- Automatic split-brain resolution +- Simpler than Galera + +### Why iSCSI Storage for Databases? + +- NFS lacks proper `fsync()` support (causes SQLite corruption) +- iSCSI provides block-level storage with proper write guarantees +- Lower latency than NFS for database workloads + +### Why 24h Credential Rotation? + +- Balance between security (shorter is better) and operational overhead +- 24h allows time to debug issues before next rotation +- Aligns with daily ops cycle + +### Why Shared Redis (Not Per-App)? 
+ +- Most apps use Redis for ephemeral data (caching, sessions) +- Over-provisioning Redis wastes memory +- Shared instance sufficient for current load +- Can migrate to per-app if needed + +## Troubleshooting + +### PostgreSQL: "Too many connections" + +**Cause**: Apps connecting directly to PostgreSQL instead of PgBouncer + +**Fix**: +```bash +# Check PgBouncer is running +kubectl get pods -n dbaas | grep pgbouncer + +# Verify apps use pgbouncer.dbaas, not pg-cluster-rw +kubectl get configmap -o yaml | grep postgres +``` + +### PostgreSQL: Primary Failover Not Working + +**Cause**: CNPG controller not running or network partition + +**Fix**: +```bash +# Check CNPG operator +kubectl get pods -n cnpg-system + +# Check cluster status +kubectl get cluster -n dbaas + +# Manually trigger failover (last resort) +kubectl cnpg promote pg-cluster-2 -n dbaas +``` + +### MySQL: Pod Stuck on node2 + +**Cause**: Anti-affinity rule not applied + +**Fix**: +```bash +# Check pod affinity rules +kubectl get pod -n dbaas -o yaml | grep -A 10 affinity + +# Delete pod to reschedule +kubectl delete pod -n dbaas +``` + +### MySQL: SIGBUS Crash on node2 + +**Cause**: Known kernel bug on node2 with iSCSI storage + +**Fix**: +```bash +# Cordon node2 to prevent scheduling +kubectl cordon node2 + +# Delete MySQL pods on node2 +kubectl delete pod -n dbaas -l app=mysql --field-selector spec.nodeName=node2 +``` + +### SQLite: Database Corruption + +**Cause**: SQLite on NFS volume + +**Fix**: +```bash +# Check volume type +kubectl get pv | grep + +# If NFS, migrate to iSCSI: +# 1. Create iSCSI PVC +# 2. Backup SQLite database +# 3. Restore to iSCSI volume +# 4. Update app to use new volume +``` + +### Vault Rotation: "User already exists" + +**Cause**: Previous rotation failed to clean up + +**Fix**: +```bash +# Connect to database +kubectl exec -it -n dbaas -- mysql -u root -p + +# List users +SELECT user, host FROM mysql.user WHERE user LIKE 'v-root-%'; + +# Drop stale users +DROP USER 'v-root-postgres-'@'%'; + +# Retry rotation +vault read database/rotate-root/postgres +``` + +### Redis: Out of Memory + +**Cause**: No eviction policy configured + +**Fix**: +```bash +# Connect to Redis +kubectl exec -it redis-0 -n redis -- redis-cli + +# Set eviction policy +CONFIG SET maxmemory-policy allkeys-lru + +# Persist config +CONFIG REWRITE +``` + +### App Can't Connect: "Connection refused" + +**Cause**: Using legacy `postgresql.dbaas` service (no endpoints) + +**Fix**: +```bash +# Check service endpoints +kubectl get endpoints postgresql -n dbaas +# Output: No endpoints (this is the problem) + +# Update app to use pg-cluster-rw or pgbouncer +kubectl set env deployment/ DB_HOST=pgbouncer.dbaas.svc.cluster.local +``` + +## Related + +- [CI/CD Pipeline](./ci-cd.md) — Database credentials in CI/CD +- [Multi-Tenancy](./multi-tenancy.md) — Per-user database provisioning +- Runbook: `../runbooks/database-failover.md` — Manual failover procedures +- Runbook: `../runbooks/vault-rotation-troubleshooting.md` — Debug credential rotation +- Vault documentation: Database secrets engine +- CNPG documentation: Cluster configuration diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md new file mode 100644 index 00000000..960b155a --- /dev/null +++ b/docs/architecture/monitoring.md @@ -0,0 +1,243 @@ +# Monitoring & Alerting Architecture + +## Overview + +The monitoring stack provides comprehensive observability for the home Kubernetes cluster through metrics collection (Prometheus), visualization (Grafana), log 
aggregation (Loki), alerting (Alertmanager), and uptime monitoring (Uptime Kuma). GPU metrics are collected via NVIDIA's dcgm-exporter. The system tracks infrastructure health, application performance, backup success, and resource utilization with intelligent alert inhibition to reduce noise during cascading failures. + +## Architecture Diagram + +```mermaid +graph TB + subgraph "Metric Sources" + K8S[Kubernetes API Server] + NODES[Node Exporters] + PODS[Application Pods] + GPU[NVIDIA GPU via dcgm-exporter] + UPS[UPS Exporter] + NFS[NFS Exporter] + end + + subgraph "Monitoring Stack (platform stack)" + PROM[Prometheus
Scrape & Store] + LOKI[Loki
Log Aggregation] + AM[Alertmanager
Alert Routing] + GRAFANA[Grafana
14+ Dashboards
OIDC via Authentik] + UPTIME[Uptime Kuma
HTTP Monitors] + end + + subgraph "Alert Flow" + INHIBIT[Inhibition Rules
Node Down → Suppress Pod Alerts] + NOTIFY[Notifications] + end + + K8S -->|ServiceMonitors| PROM + NODES -->|Metrics| PROM + PODS -->|Metrics| PROM + PODS -->|Logs| LOKI + GPU -->|GPU Metrics| PROM + UPS -->|UPS Metrics| PROM + NFS -->|NFS Metrics| PROM + + PROM -->|Query| GRAFANA + PROM -->|Alerts| AM + LOKI -->|Query| GRAFANA + + AM --> INHIBIT + INHIBIT --> NOTIFY + + PODS -.->|HTTP Health| UPTIME +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| Prometheus | Latest (Diun monitored) | `stacks/platform/modules/monitoring/` | Metrics collection and storage, scrape configs for all services | +| Grafana | Latest (Diun monitored) | `stacks/platform/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) | +| Loki | Latest (Diun monitored) | `stacks/platform/modules/monitoring/` | Log aggregation and querying | +| Alertmanager | Latest (Diun monitored) | `stacks/platform/modules/monitoring/` | Alert routing with cascade inhibitions | +| Uptime Kuma | Latest (Diun monitored) | `stacks/platform/modules/monitoring/` | Per-service HTTP monitors, status page | +| dcgm-exporter | Configurable resources | `stacks/platform/modules/monitoring/` | NVIDIA GPU metrics collection | + +## How It Works + +### Metrics Collection + +Prometheus scrapes metrics from all cluster components and applications using ServiceMonitor CRDs and scrape configs. Every new service deployed to the cluster receives: +1. A Prometheus scrape configuration (via ServiceMonitor or static config) +2. An Uptime Kuma HTTP monitor for health checks + +Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer. + +### Alert Cascade Inhibition + +Alertmanager implements intelligent alert suppression to prevent alert storms during cascading failures: + +```mermaid +graph LR + NODE_DOWN[Node Down Alert] -->|Inhibits| POD_ALERTS[Pod Alerts on That Node] + COMPLETED[Completed CronJob Pod] -->|Excluded from| POD_READY[Pod Not Ready Alerts] +``` + +When a node goes down, all pod-level alerts for pods scheduled on that node are suppressed, reducing noise and focusing attention on the root cause. + +### GPU Monitoring + +NVIDIA GPU metrics are collected via dcgm-exporter with configurable resource limits (`dcgmExporter.resources`). Metrics include GPU utilization, memory usage, temperature, and power consumption. + +### Database Version Pinning + +MySQL, PostgreSQL, and Redis images have Diun monitoring disabled to prevent automatic version updates that could cause compatibility issues. Version upgrades are manual and coordinated. + +## Configuration + +### Key Config Files + +- **Monitoring Stack**: `stacks/platform/modules/monitoring/` + - Prometheus scrape configs and recording rules + - Grafana dashboard definitions + - Alertmanager routing and inhibition rules + - Uptime Kuma configuration + +### Prometheus Scrape Configs + +Every service must expose metrics and be registered in Prometheus via ServiceMonitor or static scrape config. 
Standard pattern: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: my-service +spec: + selector: + matchLabels: + app: my-service + endpoints: + - port: metrics +``` + +### Grafana Dashboards + +14+ pre-configured dashboards covering: +- Kubernetes API Server +- CoreDNS +- GPU metrics +- UPS status +- Node metrics +- Pod resource usage +- Application-specific metrics + +### Alert Definitions + +#### Infrastructure Alerts +- **OOMKill**: Container killed due to out-of-memory +- **PodReplicaMismatch**: Deployment/StatefulSet replica count doesn't match desired +- **ClusterMemoryRequestsHigh**: Cluster memory requests >85% +- **ContainerNearOOM**: Container using >85% of memory limit +- **PodUnschedulable**: Pod cannot be scheduled due to resource constraints +- **CPUTemp**: CPU temperature threshold exceeded +- **SSDWrites**: Excessive SSD write volume +- **NFSResponsiveness**: NFS mount latency issues +- **UPSBattery**: UPS battery charge low + +#### Application Alerts +- **4xx/5xx Error Rates**: HTTP error rate threshold exceeded + +#### Backup Alerts +- **PostgreSQLBackupStale**: >36h since last backup +- **MySQLBackupStale**: >36h since last backup +- **EtcdBackupStale**: >8d since last backup +- **VaultBackupStale**: >8d since last backup +- **VaultwardenBackupStale**: >8d since last backup +- **RedisBackupStale**: >8d since last backup +- **PrometheusBackupStale**: >32d since last backup +- **CloudSyncStale**: >8d since last cloud sync +- **VaultwardenIntegrityFail**: Backup integrity check failed + +### Vault Paths + +No direct Vault integration required for the monitoring stack (platform stack cannot depend on Vault due to circular dependency). + +## Decisions & Rationale + +### Why Prometheus over alternatives (InfluxDB, Graphite)? +- Native Kubernetes integration via ServiceMonitor CRDs +- Pull-based model reduces application complexity (no push agents) +- Powerful query language (PromQL) for alerting and visualization +- Industry standard for cloud-native monitoring + +### Why Grafana over Prometheus UI? +- Superior visualization capabilities +- OIDC authentication via Authentik for secure access +- Multi-data-source support (Prometheus + Loki) +- Rich dashboard ecosystem + +### Why Loki for logs? +- Designed for Kubernetes log aggregation +- Cost-effective (indexes metadata, not full log content) +- Tight Grafana integration +- LogQL query language similar to PromQL + +### Why Uptime Kuma? +- Simple HTTP/TCP/Ping monitoring +- Public status page for service availability +- Lightweight compared to full APM solutions +- Complements Prometheus for black-box monitoring + +### Why alert inhibition? +- Prevents alert fatigue during cascading failures +- Root cause focus (fix the node, not 50 pods) +- Reduces on-call noise + +### Why exclude completed CronJob pods? +- CronJobs naturally transition to Completed state +- "Pod not ready" is expected and not actionable +- Prevents false positive alerts + +### Why disable Diun for databases? +- Version upgrades require migration planning +- Breaking schema changes need coordination +- Manual upgrade testing prevents production issues + +## Troubleshooting + +### Alert is firing but I don't see the issue + +Check inhibition rules in Alertmanager. The alert may be suppressed due to a higher-level failure (e.g., node down suppressing pod alerts). + +### Grafana dashboards show no data + +1. 
Check Prometheus targets: `kubectl port-forward -n monitoring svc/prometheus 9090:9090` → `http://localhost:9090/targets` +2. Verify ServiceMonitor is created: `kubectl get servicemonitor -A` +3. Check Prometheus logs for scrape errors: `kubectl logs -n monitoring deployment/prometheus` + +### Loki logs not appearing + +1. Verify pod logs are going to stdout/stderr (not files) +2. Check Loki is scraping pod logs: `kubectl logs -n monitoring deployment/loki` +3. Ensure Grafana data source is configured correctly + +### Backup alert firing but backup exists + +1. Check backup timestamp in Prometheus: `backup_last_success_timestamp_seconds{job="my-backup"}` +2. Verify backup job completed successfully: `kubectl logs -n backups cronjob/my-backup` +3. Ensure backup job updates the Prometheus metric via pushgateway or ServiceMonitor + +### GPU metrics not showing + +1. Verify dcgm-exporter is running: `kubectl get pods -n monitoring -l app=dcgm-exporter` +2. Check GPU node has NVIDIA drivers installed +3. Verify dcgm-exporter has access to GPU: `kubectl logs -n monitoring deployment/dcgm-exporter` + +### Uptime Kuma monitor shows down but service is healthy + +1. Check network policies aren't blocking Uptime Kuma's pod +2. Verify service endpoint is reachable from Uptime Kuma namespace +3. Check Uptime Kuma logs: `kubectl logs -n monitoring deployment/uptime-kuma` + +## Related + +- [Secrets Management](./secrets.md) - OIDC authentication for Grafana via Authentik +- [Backup & DR](./backup-dr.md) - Backup monitoring alerts +- [Platform Stack](../../stacks/platform/README.md) - Monitoring stack deployment +- [Vault Architecture](./vault.md) - No direct dependency but related to cluster observability diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md new file mode 100644 index 00000000..e3e0b5c7 --- /dev/null +++ b/docs/architecture/multi-tenancy.md @@ -0,0 +1,526 @@ +# Multi-Tenancy + +## Overview + +The cluster implements namespace-based multi-tenancy where each user receives their own Kubernetes namespace(s), RBAC roles, resource quotas, and CI/CD access. Onboarding is Vault-driven: add user metadata to `secret/platform → k8s_users`, apply Terraform stacks, and all resources (namespace, policies, RBAC, DNS, TLS) are auto-generated. Users access the cluster via OIDC authentication through Authentik and can self-service via k8s-portal. + +## Architecture Diagram + +```mermaid +graph TB + A[Admin: Add to Authentik Groups] --> B[Admin: Add to Vault k8s_users] + B --> C[Apply vault Stack] + C --> D[Apply platform Stack] + D --> E[Apply woodpecker Stack] + + C --> C1[Create Namespace] + C --> C2[Create Vault Policy
namespace-owner-user] + C --> C3[Create Vault Identity
Entity + OIDC Alias] + C --> C4[Create K8s Deployer Role
Vault K8s Auth] + + D --> D1[Create RBAC RoleBinding
Namespace Admin] + D --> D2[Create RBAC ClusterRoleBinding
Cluster Read-Only] + D --> D3[Create ResourceQuota] + D --> D4[Create TLS Secret] + D --> D5[Create Cloudflare DNS] + + E --> E1[Grant Woodpecker Admin] + + F[User: Run Setup Script] --> F1[Install kubectl, kubelogin,
Vault CLI, Terraform] + F1 --> F2[OIDC Login via Authentik] + F2 --> G[kubectl Access] + + style A fill:#e74c3c + style B fill:#e74c3c + style C fill:#2088ff + style D fill:#2088ff + style E fill:#2088ff + style F fill:#27ae60 +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| Authentik | Latest | `authentik` namespace | OIDC provider for K8s + Vault | +| Vault | Latest | `vault` namespace | Identity source, policy engine | +| k8s-portal | SvelteKit | `k8s-portal.viktorbarzin.me` | Self-service onboarding UI | +| Terraform (vault stack) | - | `stacks/vault/` | Namespace, Vault resources | +| Terraform (platform stack) | - | `stacks/platform/` | RBAC, quotas, DNS, TLS | +| Terraform (woodpecker stack) | - | `stacks/woodpecker/` | CI/CD admin access | +| Headscale | Latest | `headscale` namespace | VPN mesh network (user access) | + +## How It Works + +### Namespace-Owner Model + +Each user receives: +1. **Kubernetes Namespace(s)**: Isolated workload environment +2. **Vault Policy**: Read/write access to `secret/data//*` +3. **RBAC Role**: Namespace admin (full control within namespace) +4. **RBAC ClusterRole**: Cluster read-only (view cluster resources) +5. **ResourceQuota**: CPU, memory, storage limits +6. **TLS Secret**: Wildcard cert for `*..viktorbarzin.me` +7. **DNS Records**: Cloudflare A/CNAME for user domains +8. **Woodpecker Admin**: Access to create repos and pipelines + +### Onboarding Flow (3 Steps, No Code Changes) + +#### Step 1: Authentik + +**Action**: Admin adds user to groups +- `kubernetes-namespace-owners` +- `Headscale Users` + +**Result**: User can authenticate to Vault and K8s via OIDC + +#### Step 2: Vault KV + +**Action**: Admin adds JSON entry to `secret/platform → k8s_users` + +**Example**: +```json +{ + "alice": { + "role": "namespace-owner", + "namespaces": ["alice-prod", "alice-dev"], + "domains": ["alice.viktorbarzin.me", "app.alice.viktorbarzin.me"], + "quota": { + "cpu": "4", + "memory": "8Gi", + "storage": "20Gi" + } + } +} +``` + +**Fields**: +- `role`: Always `namespace-owner` for standard users +- `namespaces`: List of K8s namespaces to create +- `domains`: Cloudflare DNS records to create +- `quota`: Per-namespace resource limits + +#### Step 3: Apply Terraform Stacks + +**Order matters** (dependencies): + +1. **vault stack**: + ```bash + cd stacks/vault + terragrunt apply + ``` + - Creates namespaces + - Creates Vault policy `namespace-owner-alice` + - Creates Vault identity entity + OIDC alias + - Creates K8s deployer role for Woodpecker CI + +2. **platform stack**: + ```bash + cd stacks/platform + terragrunt apply + ``` + - Creates RBAC RoleBinding (namespace admin) + - Creates RBAC ClusterRoleBinding (cluster read-only) + - Creates ResourceQuota + - Creates TLS Secret (wildcard cert from Let's Encrypt) + - Creates Cloudflare DNS A/CNAME records + +3. 
**woodpecker stack**: + ```bash + cd stacks/woodpecker + terragrunt apply + ``` + - Grants Woodpecker admin access for user's Forgejo repos + +### Auto-Generated Resources Per User + +| Resource | Name Pattern | Purpose | +|----------|--------------|---------| +| Namespace | `-prod`, `-dev` | Workload isolation | +| Vault Policy | `namespace-owner-` | Secret access control | +| Vault Identity Entity | `` | OIDC identity mapping | +| Vault OIDC Alias | Authentik sub claim | Link OIDC to entity | +| Vault K8s Role | `-deployer` | Woodpecker CI access | +| K8s Role | Auto-generated | Namespace admin permissions | +| RoleBinding | `-admin` | Bind user to namespace admin | +| ClusterRoleBinding | `-read-only` | Cluster-wide read access | +| ResourceQuota | `-quota` | CPU/memory/storage limits | +| Secret | `tls-` | Wildcard TLS cert | +| Cloudflare DNS | A/CNAME records | Domain routing | + +### User Setup (Self-Service) + +**k8s-portal**: `k8s-portal.viktorbarzin.me` +1. User logs in with Authentik +2. Downloads setup script +3. Runs script: + ```bash + curl https://k8s-portal.viktorbarzin.me/setup.sh | bash + ``` +4. Script installs: + - `kubectl` + - `kubelogin` (OIDC plugin) + - `vault` CLI + - `terraform` + - `terragrunt` +5. User runs OIDC login: + ```bash + kubectl oidc-login setup \ + --oidc-issuer-url=https://auth.viktorbarzin.me/application/o/kubernetes/ \ + --oidc-client-id=kubernetes + ``` +6. User can now run `kubectl` commands + +### RBAC Groups + +| Group | ClusterRole | Scope | Members | +|-------|-------------|-------|---------| +| `kubernetes-admins` | `cluster-admin` | Full cluster access | Viktor | +| `kubernetes-power-users` | Custom | Elevated permissions | Senior users | +| `kubernetes-namespace-owners` | `namespace-admin` + `view` | Namespace admin + cluster read | All users | + +### User CI/CD (Woodpecker) + +**Flow**: +1. User creates repo in Forgejo +2. Forgejo username **must match** Vault `k8s_users` key (e.g., `alice`) +3. Woodpecker authenticates to Vault using K8s SA JWT +4. Vault issues namespace-scoped deployer token +5. 
Pipeline runs `kubectl` commands within user's namespace(s) + +**Vault K8s Role** (auto-created per namespace): +```hcl +vault write auth/kubernetes/role/alice-prod-deployer \ + bound_service_account_names=woodpecker-deployer \ + bound_service_account_namespaces=woodpecker \ + policies=namespace-owner-alice \ + ttl=1h +``` + +**Pipeline Example**: +```yaml +steps: + deploy: + image: bitnami/kubectl:latest + commands: + - kubectl apply -f k8s/ -n alice-prod + secrets: [k8s_token] +``` + +## Configuration + +### Vault k8s_users Entry + +**Path**: `secret/platform → k8s_users` + +**Full Example**: +```json +{ + "alice": { + "role": "namespace-owner", + "namespaces": ["alice-prod", "alice-dev"], + "domains": [ + "alice.viktorbarzin.me", + "app.alice.viktorbarzin.me", + "api.alice.viktorbarzin.me" + ], + "quota": { + "cpu": "4", + "memory": "8Gi", + "storage": "20Gi", + "pods": "20" + } + }, + "bob": { + "role": "namespace-owner", + "namespaces": ["bob-staging"], + "domains": ["bob.viktorbarzin.me"], + "quota": { + "cpu": "2", + "memory": "4Gi", + "storage": "10Gi" + } + } +} +``` + +### Vault Policy Template + +**Auto-generated per user**: + +```hcl +# Policy: namespace-owner-alice +path "secret/data/alice-prod/*" { + capabilities = ["create", "read", "update", "delete", "list"] +} + +path "secret/data/alice-dev/*" { + capabilities = ["create", "read", "update", "delete", "list"] +} + +path "secret/metadata/alice-prod/*" { + capabilities = ["list"] +} + +path "secret/metadata/alice-dev/*" { + capabilities = ["list"] +} +``` + +### ResourceQuota Example + +```yaml +apiVersion: v1 +kind: ResourceQuota +metadata: + name: alice-prod-quota + namespace: alice-prod +spec: + hard: + requests.cpu: "4" + requests.memory: "8Gi" + persistentvolumeclaims: "10" + requests.storage: "20Gi" + pods: "20" +``` + +### Factory Pattern for Multi-Instance Services + +**Structure**: +``` +stacks/ + actualbudget/ + main.tf # Shared configuration + factory/ + main.tf # Per-user module +``` + +**main.tf** (service definition): +```hcl +# Shared NFS export, Cloudflare routes, etc. +``` + +**factory/main.tf** (per-user instance): +```hcl +module "alice" { + source = "../" + user = "alice" + domain = "budget.alice.viktorbarzin.me" +} + +module "bob" { + source = "../" + user = "bob" + domain = "budget.bob.viktorbarzin.me" +} +``` + +**To add user**: +1. Export NFS share: `/mnt/data//` +2. Add Cloudflare route: `..viktorbarzin.me` +3. Add module block in `factory/main.tf` + +**Examples**: +- `actualbudget`: Personal budgeting app +- `freedify`: Music streaming service + +## Decisions & Rationale + +### Why Namespace-Per-User? + +**Alternatives considered**: +1. **Shared namespace**: No isolation, quota enforcement difficult +2. **Cluster-per-user**: Too expensive, management overhead +3. **Namespace-per-user (chosen)**: Balance isolation, quotas, RBAC + +**Benefits**: +- Strong isolation (network policies, RBAC) +- Easy quota enforcement (ResourceQuota) +- Simple mental model (1 user = N namespaces) +- Scales to hundreds of users + +### Why Vault-Driven Onboarding? + +**Alternatives considered**: +1. **Manual YAML**: Error-prone, no audit trail +2. **CRD-based operator**: Complex, requires custom controller +3. **Vault + Terraform (chosen)**: Single source of truth, auditable + +**Benefits**: +- Vault as identity source (integrates with OIDC) +- Terraform for declarative infrastructure +- Git-tracked changes (audit trail) +- Secrets rotation built-in + +### Why Factory Pattern for Multi-Instance Apps? 
+ +**Alternatives considered**: +1. **Helm chart per user**: Duplication, drift risk +2. **Single shared instance**: No isolation, security risk +3. **Factory module (chosen)**: DRY, scalable + +**Benefits**: +- No code duplication +- Easy to add users (one module block) +- Centralized updates (change `main.tf`, all instances update) + +### Why OIDC Instead of Static Tokens? + +**Alternatives considered**: +1. **Static ServiceAccount tokens**: Never expire, security risk +2. **X.509 client certs**: Complex rotation +3. **OIDC (chosen)**: Centralized auth, automatic rotation + +**Benefits**: +- Tokens auto-expire (1h for deployer, 24h for user) +- Centralized user management (Authentik) +- Integrates with Vault identity engine +- Industry standard (OpenID Connect) + +### Why ResourceQuota Over LimitRange? + +- **ResourceQuota**: Total namespace consumption (e.g., max 8Gi memory) +- **LimitRange**: Per-pod limits (e.g., max 2Gi per pod) + +**Choice**: ResourceQuota only +- Users manage their own pod limits +- Quota prevents runaway consumption +- Simpler mental model + +## Troubleshooting + +### User Can't Log In: "Unauthorized" + +**Cause**: User not in Authentik `kubernetes-namespace-owners` group + +**Fix**: +```bash +# Check user groups in Authentik UI +# Add to kubernetes-namespace-owners group +``` + +### User Has No Namespaces + +**Cause**: `vault` stack not applied after adding to `k8s_users` + +**Fix**: +```bash +cd stacks/vault +terragrunt apply +``` + +### User Can't Access Secrets in Vault + +**Cause**: Vault policy not attached to identity entity + +**Fix**: +```bash +# Check entity +vault read identity/entity/name/alice + +# Check policy exists +vault policy read namespace-owner-alice + +# Manually attach policy to entity +vault write identity/entity/name/alice policies=namespace-owner-alice +``` + +### Woodpecker Pipeline: "Forbidden" + +**Cause**: Forgejo username doesn't match Vault `k8s_users` key + +**Fix**: +```bash +# Rename Forgejo user to match Vault key +# OR update k8s_users key to match Forgejo username, then terragrunt apply +``` + +### ResourceQuota: "Forbidden: exceeded quota" + +**Cause**: User exceeded namespace quota + +**Fix**: +```bash +# Check quota usage +kubectl describe quota -n alice-prod + +# User must delete resources or request quota increase +# To increase: update k8s_users in Vault, apply platform stack +``` + +### DNS Not Resolving + +**Cause**: Cloudflare DNS not created by platform stack + +**Fix**: +```bash +# Check domains in k8s_users +vault kv get secret/platform | jq -r '.data.data.k8s_users.alice.domains' + +# Apply platform stack +cd stacks/platform +terragrunt apply + +# Verify in Cloudflare dashboard +``` + +### TLS Secret Missing + +**Cause**: cert-manager failed to issue certificate + +**Fix**: +```bash +# Check cert-manager logs +kubectl logs -n cert-manager deploy/cert-manager + +# Check Certificate resource +kubectl get certificate -n alice-prod + +# Check CertificateRequest +kubectl describe certificaterequest -n alice-prod + +# If Let's Encrypt rate limited, wait 1 week or use staging +``` + +### User Can't See Cluster Resources + +**Cause**: ClusterRoleBinding not created + +**Fix**: +```bash +# Check ClusterRoleBinding exists +kubectl get clusterrolebinding | grep alice + +# Apply platform stack +cd stacks/platform +terragrunt apply +``` + +### Factory Pattern: New User Not Created + +**Cause**: Module block not added to `factory/main.tf` + +**Fix**: +```bash +# Edit factory/main.tf +cat >> stacks/actualbudget/factory/main.tf 
~50 domains] + CFD[Cloudflared Tunnel
3 replicas] + Traefik[Traefik Ingress
3 replicas + PDB] + + subgraph "Middleware Chain" + CS[CrowdSec Bouncer
fail-open] + Auth[Authentik Forward-Auth
3 replicas + PDB] + RL[Rate Limiter
429 response] + Retry[Retry
2 attempts, 100ms] + end + + subgraph "Proxmox Host (eno1)" + vmbr0[vmbr0 Bridge
192.168.1.127/24] + vmbr1[vmbr1 Internal
VLAN-aware] + + subgraph "VLAN 10 - Management
10.0.10.0/24" + Proxmox[Proxmox Host
10.0.10.1] + TrueNAS[TrueNAS
10.0.10.15] + DevVM[DevVM
10.0.10.10] + end + + subgraph "VLAN 20 - Kubernetes
10.0.20.0/24" + Registry[Registry VM
10.0.20.10] + pfSense[pfSense
10.0.20.1
Gateway/NAT/DHCP] + Tech[Technitium DNS
10.0.20.101
viktorbarzin.lan] + MLB[MetalLB Pool
10.0.20.102-200] + + subgraph "K8s Nodes" + Master[k8s-master] + Node1[k8s-node1] + Node2[k8s-node2] + Node3[k8s-node3] + Node4[k8s-node4] + end + end + end + + Service[Service] + Pod[Pod] + + Internet -->|DNS query| CF + CF -->|CNAME to tunnel| CFD + CFD --> Traefik + Traefik --> CS + CS --> Auth + Auth --> RL + RL --> Retry + Retry --> Service + Service --> Pod + + vmbr0 -.physical link.- eno1 + vmbr0 --> vmbr1 + vmbr1 -.VLAN 10.- Proxmox + vmbr1 -.VLAN 10.- TrueNAS + vmbr1 -.VLAN 10.- DevVM + vmbr1 -.VLAN 20.- pfSense + vmbr1 -.VLAN 20.- Tech + vmbr1 -.VLAN 20.- Master + vmbr1 -.VLAN 20.- Node1 +``` + +## Components + +| Component | Version/Type | Location | Purpose | +|-----------|-------------|----------|---------| +| pfSense | 2.7.x | 10.0.20.1 | Gateway, NAT, firewall, DHCP for VLAN 20 | +| vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN | +| vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation | +| Technitium DNS | Container | 10.0.20.101 | Internal DNS (viktorbarzin.lan) | +| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | +| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | +| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled | +| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer | +| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware | +| MetalLB | Helm chart | K8s | LoadBalancer IPs (10.0.20.102-200) | +| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | + +## How It Works + +### VLAN Segmentation + +The Proxmox host uses a dual-bridge architecture: +- **vmbr0**: Physical bridge on interface `eno1`, connected to upstream LAN (192.168.1.0/24). Proxmox management IP is 192.168.1.127. +- **vmbr1**: Internal VLAN-aware bridge, acts as a trunk carrying: + - **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, TrueNAS, DevVM + - **VLAN 20 (Kubernetes)**: 10.0.20.0/24 — All K8s nodes, services, MetalLB IPs + +VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the upstream LAN via NAT. + +### DNS Resolution + +**Internal (Technitium)**: +- Runs at 10.0.20.101, serves `.viktorbarzin.lan` zone +- Handles internal service discovery (k8s-master.viktorbarzin.lan, truenas.viktorbarzin.lan, etc.) 
+- All K8s nodes and cluster services use this as primary DNS + +**External (Cloudflare)**: +- Manages ~50 public domains, all under `viktorbarzin.me` +- **Proxied domains** (orange cloud, traffic via Cloudflare CDN): + - blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox +- **Non-proxied domains** (grey cloud, direct IP resolution): + - mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections +- CNAME records for proxied domains point to Cloudflared tunnel FQDNs + +### Ingress Flow + +```mermaid +sequenceDiagram + participant Client + participant Cloudflare + participant Cloudflared + participant Traefik + participant CrowdSec + participant Authentik + participant RateLimit + participant Retry + participant Service + participant Pod + + Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me + Cloudflare->>Cloudflared: Forward via tunnel (QUIC) + Cloudflared->>Traefik: HTTP to LoadBalancer IP + Traefik->>CrowdSec: Apply bouncer middleware + CrowdSec->>Authentik: If allowed, check auth (protected=true) + Authentik->>RateLimit: If authenticated, check rate limit + RateLimit->>Retry: If within limit, continue + Retry->>Service: Forward to Service + Service->>Pod: Route to backend Pod + Pod-->>Service: Response + Service-->>Retry: Response + Retry-->>RateLimit: Response + RateLimit-->>Authentik: Response (strip auth headers) + Authentik-->>CrowdSec: Response + CrowdSec-->>Traefik: Response + Traefik-->>Cloudflared: Response + Cloudflared-->>Cloudflare: Response via tunnel + Cloudflare-->>Client: HTTPS response +``` + +### Middleware Chain + +Every ingress created by the `ingress_factory` module follows this chain: + +1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages. +2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default limits are generous; services like Immich and Nextcloud have higher custom limits. +4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). + +Additional middleware: +- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents. +- **HTTP/3 (QUIC)**: Enabled globally on Traefik. + +### MetalLB & Load Balancing + +MetalLB allocates IPs from the range 10.0.20.102-200 in **Layer 2 mode**. Services can share a single IP using the `metallb.universe.tf/allow-shared-ip` annotation (used by Traefik, Cloudflared, etc.). + +Critical services are scaled to **3 replicas**: +- Traefik (PDB: minAvailable=2) +- Authentik (PDB: minAvailable=2) +- CrowdSec LAPI +- PgBouncer +- Cloudflared + +PodDisruptionBudgets ensure at least 2 replicas remain during node maintenance or disruptions. + +### Container Registry Pull-Through Cache + +**Location**: Registry VM at 10.0.20.10 + +Docker Hub and GitHub Container Registry (GHCR) are mirrored locally to avoid rate limits and improve pull performance: +- **docker.io**: Port 5000 +- **ghcr.io**: Port 5010 + +Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cache transparently. 
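+
+For reference, the redirect is configured per upstream registry under containerd's `certs.d` directory. A minimal sketch for docker.io follows; it assumes the default `config_path = "/etc/containerd/certs.d"` CRI setting and plain-HTTP access to the cache, so the files actually deployed by Terraform may differ. ghcr.io follows the same pattern on port 5010.
+
+```bash
+# Sketch only: point docker.io pulls at the local pull-through cache
+mkdir -p /etc/containerd/certs.d/docker.io
+cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
+server = "https://registry-1.docker.io"
+[host."http://10.0.20.10:5000"]
+  capabilities = ["pull", "resolve"]
+EOF
+```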
+ +**Caveat**: The cache holds stale manifests for `:latest` tags, which can cause version skew. Always use **versioned tags** (e.g., `python:3.12.0` or `app:abc12345`) in production. + +## Configuration + +### Terraform Stacks + +| Stack | Path | Resources | +|-------|------|-----------| +| pfSense | `stacks/pfsense/` | VM + cloud-init config | +| Technitium | `stacks/technitium/` | Deployment, Service, PVC | +| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs | +| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer | +| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs | +| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool | +| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config | +| ingress_factory | `modules/ingress_factory/` | IngressRoute + middleware chain | + +### Key Configuration Files + +**pfSense**: +- Terraform: `stacks/pfsense/main.tf` +- DHCP scope: 10.0.20.50-250 (VLAN 20) +- Firewall rules: Allow K8s egress, block inter-VLAN by default + +**Technitium**: +- Config: Stored in PVC `technitium-data` +- Zone file: `viktorbarzin.lan` (A records for all internal hosts) +- Forwarders: Cloudflare 1.1.1.1, Google 8.8.8.8 + +**Traefik Middleware**: +- Helm values: `stacks/platform/traefik-values.yaml` +- Middleware CRDs: Generated by `ingress_factory` module +- HTTP/3 config: `experimental.http3.enabled=true` + +**MetalLB**: +- Helm values: `stacks/platform/metallb-values.yaml` +- IPAddressPool CRD: `10.0.20.102-10.0.20.200` + +**Vault Secrets**: +- Cloudflare API token: `secret/viktor/cloudflare_api_token` +- Authentik OIDC secrets: `secret/authentik` +- CrowdSec LAPI key: `secret/crowdsec/lapi_key` + +## Decisions & Rationale + +### Why Dual-Bridge VLAN Architecture? + +**Alternatives considered**: +1. **Single flat network**: Simpler, but no isolation between management and workload traffic. +2. **Routed network with physical VLANs**: Requires switch with VLAN support. + +**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, TrueNAS) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage. + +### Why Cloudflared Tunnel Instead of Port Forwarding? + +**Alternatives considered**: +1. **Traditional port forwarding (80/443)**: Exposes public IP, requires firewall rules, DDoS risk. +2. **VPN-only access**: Limits accessibility for public services like blog. + +**Decision**: Cloudflared tunnel provides: +- No public IP exposure +- DDoS protection via Cloudflare +- TLS termination at Cloudflare edge +- Zero firewall configuration +- Works behind CGNAT + +### Why Split DNS (Technitium + Cloudflare)? + +**Alternatives considered**: +1. **Cloudflare only**: Works but introduces external dependency for internal resolution. +2. **Technitium only**: Can't handle public domains without zone delegation. + +**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare. + +### Why Fail-Open on CrowdSec Bouncer? + +**Alternatives considered**: +1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic. +2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages. + +**Decision**: Availability > strict bot blocking. 
CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on. + +### Why HTTP/3 (QUIC)? + +**Benefit**: Reduces latency on lossy connections (mobile, Wi-Fi) and enables multiplexing without head-of-line blocking. Minimal overhead since Traefik handles it natively. + +### Why Pull-Through Registry Cache? + +**Problem**: Docker Hub rate limits (100 pulls/6h for anonymous, 200 pulls/6h for free accounts) caused CI/CD failures. + +**Solution**: Local registry cache at 10.0.20.10 mirrors all pulls. Containerd transparently redirects requests. Zero application changes needed. + +**Trade-off**: Stale `:latest` tags — requires discipline to use versioned tags (8-char git SHAs for app images). + +## Troubleshooting + +### Ingress Returns 502 Bad Gateway + +**Symptoms**: Cloudflared tunnel is up, Traefik logs show `dial tcp: lookup on 10.0.20.101:53: no such host`. + +**Diagnosis**: DNS resolution failed. Check: +1. Is Technitium pod running? `kubectl get pod -n technitium` +2. Can nodes resolve the service? `kubectl exec -it -- nslookup .viktorbarzin.lan` +3. Is the Service correctly created? `kubectl get svc -n ` + +**Fix**: If Technitium is down, restart it. If the Service is missing, check Terraform apply status. + +### Traefik Shows "Service Unavailable" for All Requests + +**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available. + +**Diagnosis**: Middleware chain is blocking traffic. Check: +1. Authentik status: `kubectl get pod -n authentik` +2. CrowdSec LAPI status: `kubectl get pod -n crowdsec` +3. Traefik logs: `kubectl logs -n kube-system deploy/traefik` + +**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware. + +### MetalLB Doesn't Assign IP to LoadBalancer Service + +**Symptoms**: Service stays in `` state, no IP assigned. + +**Diagnosis**: Check MetalLB logs: `kubectl logs -n metallb-system deploy/controller` + +**Common causes**: +1. **IP pool exhausted**: 98 IPs available (10.0.20.102-200), check `kubectl get svc -A | grep LoadBalancer` +2. **Missing allow-shared-ip annotation**: Services must explicitly opt-in to share IPs +3. **MetalLB controller crash-looping**: Resource limits too low + +**Fix**: If pool exhausted, either delete unused Services or expand the IPAddressPool CRD. + +### DNS Resolution Loops (Technitium → Cloudflare → Technitium) + +**Symptoms**: Slow DNS responses, `dig` shows multiple CNAMEs in a loop. + +**Diagnosis**: Misconfigured forwarder or zone overlap. + +**Fix**: Ensure Technitium forwards all non-.lan queries to Cloudflare (1.1.1.1), and Cloudflare zones don't contain `.lan` records. + +### Cloudflared Tunnel Disconnects Frequently + +**Symptoms**: Intermittent 502 errors, Cloudflared logs show `connection lost, retrying`. + +**Diagnosis**: Check: +1. Network stability: `ping 1.1.1.1` from a K8s node +2. Cloudflared resource limits: `kubectl top pod -n cloudflared` +3. Cloudflare tunnel status in dashboard + +**Fix**: If resource-limited, increase memory/CPU. If network-related, check pfSense logs for NAT table exhaustion or ISP issues. + +### Rate Limiter Blocks Legitimate Traffic + +**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads). 
+ +**Diagnosis**: Check Traefik middleware config for the affected IngressRoute. + +**Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min. + +## Related + +- **Runbooks**: + - `docs/runbooks/restart-traefik.md` + - `docs/runbooks/reset-crowdsec-bans.md` + - `docs/runbooks/add-dns-record.md` +- **Architecture Docs**: + - `docs/architecture/vpn.md` — VPN and remote access + - `docs/architecture/storage.md` — NFS and iSCSI architecture (coming soon) +- **Reference**: + - `.claude/reference/service-catalog.md` — Full service inventory + - `.claude/reference/proxmox-inventory.md` — VM and LXC details diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 00000000..f533af9e --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,321 @@ +# Infrastructure Overview + +## Overview + +This homelab infrastructure runs a production-grade Kubernetes cluster on Proxmox, hosting 70+ services including web applications, databases, monitoring, security, and GPU-accelerated workloads. The entire infrastructure is managed declaratively using Terraform and Terragrunt, with automated CI/CD pipelines for continuous deployment. Services are organized into a five-tier system for resource isolation and priority-based scheduling. + +## Architecture Diagram + +```mermaid +graph TB + subgraph Physical["Physical Hardware"] + R730["Dell R730
22c/44t Xeon E5-2699 v4
142GB RAM
NVIDIA Tesla T4
1.1TB + 931GB + 10.7TB"] + end + + subgraph Proxmox["Proxmox VE"] + direction LR + PF["pfSense
101"] + DEV["devvm
102"] + HA["home-assistant
103"] + MASTER["k8s-master
200"] + NODE1["k8s-node1
201
(GPU)"] + NODE2["k8s-node2
202"] + NODE3["k8s-node3
203"] + NODE4["k8s-node4
204"] + REG["docker-registry
220"] + TN["TrueNAS
9000"] + end + + subgraph Network["Network Bridges"] + VMBR0["vmbr0
192.168.1.0/24
Physical"] + VMBR1_10["vmbr1:vlan10
10.0.10.0/24
Management"] + VMBR1_20["vmbr1:vlan20
10.0.20.0/24
Kubernetes"] + end + + subgraph K8s["Kubernetes Cluster v1.34.2"] + direction TB + TIER0["Tier 0: Core
traefik, authentik, vault"] + TIER1["Tier 1: Cluster
prometheus, grafana, loki"] + TIER2["Tier 2: GPU
ollama, comfyui"] + TIER3["Tier 3: Edge
cloudflared, headscale"] + TIER4["Tier 4: Auxiliary
vaultwarden, immich"] + end + + R730 --> Proxmox + + PF --> VMBR0 + PF --> VMBR1_10 + PF --> VMBR1_20 + HA --> VMBR0 + DEV --> VMBR1_10 + TN --> VMBR1_10 + + MASTER --> VMBR1_20 + NODE1 --> VMBR1_20 + NODE2 --> VMBR1_20 + NODE3 --> VMBR1_20 + NODE4 --> VMBR1_20 + REG --> VMBR1_20 + + VMBR1_20 --> K8s +``` + +## Components + +### Hardware + +| Component | Specification | +|-----------|---------------| +| Server | Dell PowerEdge R730 | +| CPU | 2x Intel Xeon E5-2699 v4 (22 cores / 44 threads each, 44c/88t total) | +| RAM | 142GB DDR4 ECC | +| GPU | NVIDIA Tesla T4 (16GB, PCIe 0000:06:00.0) | +| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD | +| Network | eno1 (physical), vmbr0 (physical bridge), vmbr1 (VLAN-aware internal) | + +### Network Topology + +| Network | VLAN | CIDR | Purpose | +|---------|------|------|---------| +| Physical | - | 192.168.1.0/24 | Physical devices, Proxmox host (192.168.1.127) | +| Management | 10 | 10.0.10.0/24 | Infrastructure VMs, TrueNAS, devvm | +| Kubernetes | 20 | 10.0.20.0/24 | K8s cluster nodes and services | + +### Virtual Machine Inventory + +| VMID | Name | CPUs | RAM | Network | IP Address | Notes | +|------|------|------|-----|---------|------------|-------| +| 101 | pfsense | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | - | Gateway/firewall routing between VLANs | +| 102 | devvm | 16 | 8GB | vmbr1:vlan10 | - | Development VM | +| 103 | home-assistant | 8 | 8GB | vmbr0 | - | Home Assistant Sofia instance | +| 200 | k8s-master | 8 | 8GB | vmbr1:vlan20 | 10.0.20.100 | Kubernetes control plane | +| 201 | k8s-node1 | 16 | 16GB | vmbr1:vlan20 | - | GPU worker node (Tesla T4 passthrough) | +| 202 | k8s-node2 | 8 | 24GB | vmbr1:vlan20 | - | Worker node | +| 203 | k8s-node3 | 8 | 24GB | vmbr1:vlan20 | - | Worker node | +| 204 | k8s-node4 | 8 | 24GB | vmbr1:vlan20 | - | Worker node | +| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry | +| 9000 | truenas | 16 | 16GB | vmbr1:vlan10 | 10.0.10.15 | NFS storage server | + +### Kubernetes Cluster + +| Component | Details | +|-----------|---------| +| Version | v1.34.2 | +| Nodes | 5 (1 control plane, 4 workers) | +| CNI | Calico | +| Storage | democratic-csi (NFS) + local-path-provisioner | +| Ingress | Traefik v3 | +| Total Services | 70+ services across 5 tiers | + +### Service Tier System + +The cluster uses a five-tier namespace system managed by Kyverno, which automatically generates LimitRange and ResourceQuota policies per tier: + +| Tier | Namespace Pattern | Purpose | Priority Class | +|------|-------------------|---------|----------------| +| 0-core | `0-core-*` | Critical infrastructure (traefik, authentik, vault) | 900000 | +| 1-cluster | `1-cluster-*` | Cluster services (prometheus, grafana, kyverno) | 700000 | +| 2-gpu | `2-gpu-*` | GPU workloads (ollama, comfyui, stable-diffusion) | 500000 | +| 3-edge | `3-edge-*` | Edge services (cloudflared, headscale, technitium) | 300000 | +| 4-aux | `4-aux-*` | Auxiliary apps (vaultwarden, immich, freshrss) | 200000 | + +## How It Works + +### Physical Layer + +The infrastructure runs on a single Dell R730 server with dual Xeon CPUs and 142GB RAM. Proxmox VE provides hypervisor capabilities with hardware passthrough support for the Tesla T4 GPU. The physical network interface (eno1) bridges to vmbr0 for physical network access, while vmbr1 provides VLAN-aware internal networking. 
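As a sanity check, the bridge layout described above can be confirmed directly on the Proxmox host with standard `iproute2` tooling; a minimal sketch (interface names follow the inventory above, exact output will vary):

```bash
# eno1 should be enslaved to vmbr0; vmbr1 has no physical port of its own
ip -br link show type bridge
bridge link show

# vmbr1 is VLAN-aware and should carry VLANs 10 (management) and 20 (Kubernetes)
bridge vlan show | grep -A 3 vmbr1
```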
+ +### Network Layer + +pfSense (VMID 101) acts as the central gateway and firewall, routing traffic between: +- Physical network (192.168.1.0/24) via vmbr0 +- Management VLAN 10 (10.0.10.0/24) via vmbr1:vlan10 +- Kubernetes VLAN 20 (10.0.20.0/24) via vmbr1:vlan20 + +This three-tier network design isolates Kubernetes workloads from management infrastructure and provides controlled access to the physical network. + +### Compute Layer + +The Kubernetes cluster consists of 5 nodes: +- **k8s-master (200)**: 8c/8GB control plane running kube-apiserver, etcd, controller-manager +- **k8s-node1 (201)**: 16c/16GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only +- **k8s-node2-4 (202-204)**: 8c/24GB workers running general-purpose workloads + +GPU passthrough on node1 uses PCIe device 0000:06:00.0, with Kubernetes taint `nvidia.com/gpu=true:NoSchedule` and label `gpu=true` to ensure only GPU-requesting pods schedule there. + +### Service Organization + +Services are organized into 70+ individual Terraform stacks under `stacks//`. Each service belongs to a tier, which determines: +- Resource limits and quotas +- Scheduling priority (higher tier = preempts lower) +- Default container resources +- QoS class (Guaranteed for tiers 0-2, Burstable for 3-4) + +Kyverno policies automatically inject namespace labels, LimitRange, ResourceQuota, and PriorityClass based on the namespace tier prefix. + +### Key Services + +**Critical Services (Tier 0-1)**: +- **Traefik**: Ingress controller with automatic HTTPS (Let's Encrypt) +- **Authentik**: SSO/OIDC provider for all services +- **Vault**: Secrets management with auto-unseal +- **Cloudflared**: Cloudflare Tunnel for external access +- **Technitium**: Internal DNS server +- **Headscale**: Tailscale-compatible mesh VPN control plane + +**Storage & Security**: +- **TrueNAS**: NFS storage backend (10.0.10.15) +- **democratic-csi**: Dynamic PV provisioning from TrueNAS +- **Vaultwarden**: Password manager +- **Immich**: Photo management +- **CrowdSec**: IPS/IDS with community threat intelligence +- **Kyverno**: Policy engine for admission control + +**Monitoring & Observability**: +- **Prometheus**: Metrics collection +- **Grafana**: Visualization and dashboards +- **Loki**: Log aggregation +- **Alertmanager**: Alert routing + +**Application Services**: Woodpecker CI, Gitea, PostgreSQL, MySQL, Redis, Ollama, ComfyUI, Stable Diffusion, Freshrss, and 50+ more services. + +## Configuration + +### Key Files + +| Path | Purpose | +|------|---------| +| `stacks//terragrunt.hcl` | Individual service configuration | +| `modules/k8s_app/` | Reusable Kubernetes app module | +| `modules/helm_app/` | Helm chart deployment module | +| `base.hcl` | Global Terragrunt configuration | +| `terraform.tfvars` | Global variables (git-ignored) | + +### Terraform Organization + +Each service lives in `stacks//` with its own Terragrunt configuration. Common patterns: +- Helm deployments use `modules/helm_app/` +- Custom manifests use `modules/k8s_app/` +- Databases use dedicated modules (`modules/postgres_app/`, `modules/mysql_app/`) +- Shared dependencies via `dependency` blocks in terragrunt.hcl + +### Vault Paths + +Secrets are stored in HashiCorp Vault under `kv/`: +- `kv//*` - Service-specific secrets +- `kv/cloudflare` - Cloudflare API tokens +- `kv/authentik` - OIDC client credentials +- `kv/backup` - Backup encryption keys + +## Decisions & Rationale + +### Why Proxmox over bare-metal Kubernetes? 
+ +**Decision**: Run Kubernetes inside Proxmox VMs rather than directly on bare metal. + +**Rationale**: +- **Flexibility**: Easy to snapshot, clone, and roll back VMs during upgrades +- **Isolation**: Management network (TrueNAS, devvm) separated from Kubernetes +- **GPU passthrough**: Can dedicate GPU to a single node without tainting the entire host +- **Multi-purpose**: Same physical host can run non-K8s VMs (pfSense, Home Assistant) + +**Tradeoff**: Slight performance overhead from virtualization (acceptable for homelab). + +### Why five-tier namespace system? + +**Decision**: Organize services into 5 tiers with automatic LimitRange/ResourceQuota via Kyverno. + +**Rationale**: +- **Predictable scheduling**: Critical services (tier 0) always preempt auxiliary services (tier 4) +- **Resource protection**: Prevents a single service from consuming all cluster resources +- **Clear priorities**: Tier prefix makes service criticality obvious +- **Automation**: Kyverno auto-generates policies, reducing manual configuration + +**Tradeoff**: Adds namespace naming convention requirement. + +### Why no CPU limits cluster-wide? + +**Decision**: Set CPU requests but no CPU limits on containers. + +**Rationale**: +- **CFS throttling**: Linux CFS throttles containers to exact CPU limit even when CPU is idle, causing artificial slowdowns +- **Burstability**: Services can burst to unused CPU during idle periods +- **Memory is the constraint**: With 142GB RAM, memory exhaustion occurs before CPU saturation + +**Tradeoff**: A runaway process could monopolize CPU (mitigated by CPU requests reserving capacity). + +### Why Goldilocks in Initial mode, not Auto? + +**Decision**: Run VPA Goldilocks in "Initial" (recommend-only) mode instead of "Auto" (update pods). + +**Rationale**: +- **Terraform conflicts**: Auto mode directly modifies Deployment specs, creating drift from Terraform state +- **Controlled changes**: Recommendations are reviewed and applied via Terraform, maintaining declarative workflow +- **Quarterly review**: Right-sizing happens deliberately every quarter, not continuously + +**Tradeoff**: Requires manual review of VPA recommendations. + +## Troubleshooting + +### Pods stuck in Pending state + +**Symptom**: Pod shows `status: Pending` with event `FailedScheduling`. + +**Diagnosis**: +```bash +kubectl describe pod -n +# Check events for: +# - "Insufficient memory" → ResourceQuota exceeded +# - "0/5 nodes available: 5 Insufficient memory" → LimitRange default too high +# - "0/5 nodes available: 1 node(s) had untolerated taint" → GPU taint +``` + +**Fix**: +- ResourceQuota exceeded: Increase quota in `modules/namespace_config/` for that tier +- LimitRange too high: Override pod resources in Terraform +- GPU taint: Add `tolerations` and `nodeSelector` for GPU pods + +### OOMKilled pods + +**Symptom**: Pod shows `status: OOMKilled` in events. + +**Diagnosis**: +```bash +kubectl describe pod -n +# Check LimitRange defaults: +kubectl get limitrange -n -o yaml +``` + +**Fix**: +- If pod uses LimitRange default (256Mi or 512Mi): Set explicit memory request/limit in Terraform +- If pod has explicit limit: Increase memory based on Goldilocks VPA recommendation (upperBound x1.2) + +### Democratic-CSI sidecars consuming excessive memory + +**Symptom**: Pods with PVCs have 3-4 sidecar containers each using 256Mi (LimitRange default). 
+ +**Diagnosis**: +```bash +kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].name | contains("csi")) | .metadata.name' +``` + +**Fix**: Democratic-CSI sidecars need explicit resources (32-80Mi each). Update Terraform to override sidecar resources. + +### Tier 3-4 pods evicted during resource pressure + +**Symptom**: Lower-tier pods show `status: Evicted` with reason `The node was low on resource: memory`. + +**Diagnosis**: This is expected behavior. Tier 3-4 use Burstable QoS (request < limit) and priority 200K-300K, making them first candidates for eviction. + +**Fix**: +- Increase node memory if evictions are frequent +- Promote critical services to higher tier +- Reduce memory limits on tier 4 services + +## Related + +- [Compute & Resource Management](compute.md) - Detailed resource management patterns +- [Multi-tenancy](multi-tenancy.md) - Namespace isolation and tier system +- [Monitoring](monitoring.md) - Resource usage dashboards +- [Runbooks: Node Maintenance](../../runbooks/node-maintenance.md) +- [Runbooks: Service Onboarding](../../runbooks/service-onboarding.md) diff --git a/docs/architecture/secrets.md b/docs/architecture/secrets.md new file mode 100644 index 00000000..c6a451cb --- /dev/null +++ b/docs/architecture/secrets.md @@ -0,0 +1,404 @@ +# Secrets Management Architecture + +## Overview + +Secrets management is centralized in HashiCorp Vault as the single source of truth for all API keys, tokens, passwords, SSH keys, and database credentials. External Secrets Operator (ESO) syncs secrets from Vault KV to Kubernetes Secrets. Vault's database engine handles automatic credential rotation for MySQL and PostgreSQL. CI/CD systems authenticate via Kubernetes service account tokens. Sealed Secrets provide user-managed encrypted secrets without Vault access. SOPS encrypts Terraform state files at rest. + +## Architecture Diagram + +```mermaid +graph TB + subgraph "Secret Sources" + VAULT_KV[Vault KV
secret/viktor
135+ keys] + VAULT_DB[Vault DB Engine
24h rotation] + VAULT_K8S[Vault K8s Engine
Dynamic SA tokens] + USER[User-managed
sealed-*.yaml] + end + + subgraph "Sync Layer" + ESO[External Secrets Operator
43 ExternalSecrets
9 DB-creds ExternalSecrets] + KUBESEAL[Sealed Secrets Controller] + end + + subgraph "Kubernetes Secrets" + K8S_SECRET[K8s Secret] + end + + subgraph "Consumers" + POD[Pod env/volume] + TF_PLAN[Terraform plan-time
data kubernetes_secret] + CI[Woodpecker CI/CD
K8s SA JWT auth] + end + + VAULT_KV -->|ClusterSecretStore: vault-kv| ESO + VAULT_DB -->|ClusterSecretStore: vault-database| ESO + ESO --> K8S_SECRET + USER -->|kubeseal encrypt| KUBESEAL + KUBESEAL --> K8S_SECRET + + K8S_SECRET --> POD + K8S_SECRET --> TF_PLAN + + VAULT_K8S -->|JWT auth| CI +``` + +```mermaid +graph LR + subgraph "Database Credential Rotation" + VAULT_ROOT[Vault Root Creds] --> VAULT_DB_ENGINE[Vault DB Engine] + VAULT_DB_ENGINE -->|Create role| DB_ROLE[DB Role: 24h TTL] + DB_ROLE -->|ESO syncs| K8S_SECRET[K8s Secret] + K8S_SECRET -->|App reads| APP[Application Pod] + APP -->|Uses rotated creds| DATABASE[(MySQL/PostgreSQL)] + VAULT_DB_ENGINE -->|Revokes expired| DB_ROLE + end +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| HashiCorp Vault | Latest | `stacks/vault/` | Secret storage, dynamic credentials, rotation | +| External Secrets Operator | v1beta1 API | `stacks/external-secrets/` | Sync Vault secrets to K8s Secrets (52 total ExternalSecrets) | +| Sealed Secrets | Latest | `stacks/platform/` | User-managed encrypted secrets | +| SOPS | Latest | `scripts/state-sync`, `scripts/tg` | Terraform state encryption (Vault Transit + age) | +| Vault K8s Auth | Enabled | `stacks/vault/` | CI/CD authentication via service account tokens | +| Vault DB Engine | Enabled | `stacks/vault/` | Dynamic DB credentials for 6 MySQL + 8 PostgreSQL databases | + +## How It Works + +### Vault KV: Single Source of Truth + +`secret/viktor` contains 135+ keys covering: +- API keys for external services +- Database root passwords +- SSH private keys +- OAuth/OIDC client secrets +- Application configuration secrets +- Encryption keys + +Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider. + +### External Secrets Operator (ESO) + +ESO syncs secrets from Vault to Kubernetes using two ClusterSecretStores: + +1. **vault-kv**: Reads from Vault KV (`secret/viktor`) +2. **vault-database**: Reads dynamic credentials from Vault DB engine + +**52 total ExternalSecrets**: +- 43 standard ExternalSecrets (API keys, tokens, configs) +- 9 DB-creds ExternalSecrets (rotated database credentials) + +ESO creates/updates K8s Secrets automatically when Vault values change. Applications consume these secrets via environment variables or volume mounts. + +### Plan-Time Secret Access Pattern + +**Recommended pattern** (no Vault dependency at plan time): + +1. Apply ExternalSecret to create K8s Secret +2. Stack uses `data "kubernetes_secret"` to read ESO-created secret at plan time +3. No direct Vault provider needed in consuming stack + +**First-apply gotcha**: Must apply ExternalSecret resource first, then run full apply (two-stage). 
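A hypothetical consuming stack illustrating this pattern might look like the sketch below (resource and secret names are examples only, not actual stack code; the Helm provider's `set_sensitive` block syntax is assumed):

```hcl
# Read the Secret that ESO materialized from Vault; no Vault provider is needed at plan time
data "kubernetes_secret" "my_app" {
  metadata {
    name      = "my-app-secrets" # created by the ExternalSecret
    namespace = "my-app"
  }
}

# Use the value at plan time, e.g. when templating a Helm release
resource "helm_release" "my_app" {
  name      = "my-app"
  chart     = "./charts/my-app"
  namespace = "my-app"

  set_sensitive {
    name  = "config.apiKey"
    value = data.kubernetes_secret.my_app.data["API_KEY"]
  }
}
```

On a brand-new environment the ExternalSecret must be applied first (the two-stage gotcha above), otherwise the data source has nothing to read.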
+ +**Legacy pattern** (14 hybrid stacks still use): +- Direct `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs) +- Platform stack has 48 plan-time Vault references (cannot migrate due to circular dependency) + +### Database Credential Rotation + +Vault DB engine provides automatic 24h credential rotation for: + +**MySQL databases** (6): +- speedtest +- wrongmove +- codimd +- nextcloud +- shlink +- grafana + +**PostgreSQL databases** (8): +- trading +- health +- linkwarden +- affine +- woodpecker +- claude_memory +- (2 others) + +**Excluded from rotation**: +- authentik (uses PgBouncer, incompatible with rotation) +- technitium, crowdsec (Helm charts bake credentials at install time) +- Root user accounts (used for Vault itself to create rotated users) + +Workflow: +1. ESO requests credentials from Vault DB engine +2. Vault creates new DB user with 24h TTL +3. ESO writes credentials to K8s Secret +4. Application reads credentials from secret +5. Vault automatically revokes user after 24h +6. ESO requests new credentials, cycle repeats + +### Kubernetes Credential Management + +Vault K8s secrets engine provides dynamic service account tokens: + +**Roles**: +- `dashboard-admin`: Full cluster access for K8s dashboard +- `ci-deployer`: CI/CD deployment permissions +- `openclaw`: Claude Code container permissions +- `local-admin`: Local development cluster access + +Usage: +```bash +vault write kubernetes/creds/ROLE kubernetes_namespace=NS +``` + +Returns a time-limited service account token and kubeconfig. + +### CI/CD Secrets + +**Woodpecker CI authentication**: +1. Woodpecker runner uses Kubernetes SA JWT +2. JWT validated via Vault K8s auth method +3. Woodpecker receives Vault token +4. Accesses secrets from `secret/ci/global` + +**Secret sync CronJob**: +- Runs every 6h +- Reads `secret/ci/global` from Vault +- Pushes to Woodpecker API via HTTP +- Ensures CI secrets stay synchronized + +### Sealed Secrets (User-Managed) + +For users without Vault access (or git-friendly secret storage): + +1. User creates plain K8s Secret YAML +2. Encrypts with `kubeseal` CLI → `sealed-*.yaml` +3. Commits encrypted file to git +4. In-cluster controller decrypts at apply time +5. Terraform picks up via `fileset()` + `for_each` on `kubernetes_manifest` + +Public key stored in cluster, private key only accessible to controller. + +### SOPS (State Encryption) + +Terraform state files encrypted at rest: +- `.tfstate.enc` files in git +- Vault Transit engine (primary) + age key (fallback) +- Scripts: `scripts/state-sync` (encrypt/decrypt), `scripts/tg` (terragrunt wrapper) +- State decrypted in-memory during plan/apply, re-encrypted before commit + +### Complex Types in Vault + +Maps and lists stored as JSON strings in Vault KV: + +```hcl +# In Vault: key = '{"endpoint": "https://...", "token": "..."}' +# In Terraform: +config = jsondecode(data.vault_kv_secret_v2.app.data["config"]) +``` + +Required because Vault KV only supports string values at leaf nodes. 
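To make the user-managed Sealed Secrets flow described above concrete, a minimal sketch (secret names and values are illustrative):

```bash
# 1. Build a plain Secret manifest locally (never committed)
kubectl create secret generic my-app-secret \
  --namespace my-app \
  --from-literal=API_KEY=example-value \
  --dry-run=client -o yaml > /tmp/my-app-secret.yaml

# 2. Encrypt it against the in-cluster controller's public key
kubeseal --format yaml < /tmp/my-app-secret.yaml > sealed-my-app-secret.yaml

# 3. Commit only the sealed file; Terraform picks it up via fileset() + kubernetes_manifest
rm /tmp/my-app-secret.yaml
```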
+ +## Configuration + +### Vault Paths + +- **Main secrets**: `secret/viktor` (135+ keys) +- **CI/CD secrets**: `secret/ci/global` +- **Database engine**: `database/creds/ROLE` (dynamic) +- **Kubernetes engine**: `kubernetes/creds/ROLE` (dynamic) + +### External Secrets Stack + +**Location**: `stacks/external-secrets/` + +**ClusterSecretStores**: +```yaml +apiVersion: external-secrets.io/v1beta1 +kind: ClusterSecretStore +metadata: + name: vault-kv +spec: + provider: + vault: + server: "https://vault.viktorbarzin.me" + path: secret + version: v2 + auth: + kubernetes: + mountPath: kubernetes + role: external-secrets +``` + +**ExternalSecret example**: +```yaml +apiVersion: external-secrets.io/v1beta1 +kind: ExternalSecret +metadata: + name: my-app-secrets +spec: + refreshInterval: 1h + secretStoreRef: + name: vault-kv + kind: ClusterSecretStore + target: + name: my-app-secrets + data: + - secretKey: API_KEY + remoteRef: + key: viktor + property: my_app_api_key +``` + +### Vault Backup + +**CronJob**: `vault-raft-backup` +- Uses manually-created `vault-root-token` K8s Secret +- Cannot use ESO (circular dependency during restore) +- Backs up Raft storage to S3-compatible backend + +### Terraform Provider Auth + +`~/.vault-token` created by `vault login -method=oidc`: + +```hcl +provider "vault" { + # Reads VAULT_ADDR from env + # Reads token from ~/.vault-token +} +``` + +## Decisions & Rationale + +### Why Vault over alternatives (AWS Secrets Manager, K8s Secrets, env files)? + +**Centralized management**: Single source of truth for all secrets across infrastructure, applications, and CI/CD. + +**Dynamic credentials**: Database and Kubernetes credentials rotated automatically, reducing blast radius of credential leaks. + +**Audit logging**: Every secret access logged for security compliance. + +**OIDC integration**: Secure human authentication via Authentik SSO (no static tokens for humans). + +**Encryption at rest**: Secrets encrypted in Vault's storage backend. + +### Why ESO over direct Vault injection (vault-agent, CSI driver)? + +**Terraform compatibility**: `data "kubernetes_secret"` allows plan-time access without Vault provider dependency. + +**Simpler pod configuration**: No sidecar containers or init containers required. + +**Declarative sync**: ExternalSecret CRD describes desired state, ESO handles synchronization. + +**Namespace isolation**: Each namespace can have its own ExternalSecrets without cluster-admin access to Vault. + +### Why Sealed Secrets for users? + +**No Vault access needed**: Users can encrypt secrets without Vault credentials. + +**Git-friendly**: Encrypted YAML files can be committed safely to version control. + +**Self-service**: Users manage their own secrets without admin intervention. + +**Cluster-scoped encryption**: Encrypted for specific cluster, can't be decrypted elsewhere. + +### Why SOPS for Terraform state? + +**State contains secrets**: Terraform state includes sensitive values (DB passwords, API keys). + +**Vault Transit integration**: Centralized key management (same as other encryption). + +**Age fallback**: Offline decryption possible if Vault unavailable. + +**Transparent workflow**: `scripts/tg` wrapper handles encrypt/decrypt automatically. + +### Why Vault DB engine over static credentials? + +**Automatic rotation**: 24h TTL reduces credential exposure window. + +**Audit trail**: Every credential generation logged in Vault. + +**Revocation**: Credentials automatically revoked at TTL expiration. 
+ +**Least privilege**: Each app gets unique credentials, not shared root password. + +### Why exclude platform stack from Vault dependency? + +**Circular dependency**: Vault runs on platform (storage, networking), platform can't wait for Vault. + +**Bootstrap order**: Platform must deploy first, then Vault, then app stacks. + +**Resilience**: Platform stack can be re-applied even if Vault is down. + +## Troubleshooting + +### ExternalSecret shows "SecretSyncedError" + +1. Check Vault auth: `kubectl logs -n external-secrets deployment/external-secrets` +2. Verify Vault path exists: `vault kv get secret/viktor` +3. Check RBAC: ESO service account needs Vault role binding +4. Verify network: ESO pod can reach Vault service + +### Rotated database credentials not working + +1. Check Vault DB connection: `vault read database/config/my-db` +2. Verify role TTL: `vault read database/roles/my-app` +3. Check ESO refresh interval: ExternalSecret may not have synced yet +4. Verify app is reading latest secret: `kubectl get secret my-db-creds -o yaml` + +### Terraform plan fails with "secret not found" + +First-apply issue: +1. Apply ExternalSecret first: `terraform apply -target=kubernetes_manifest.external_secret` +2. Wait for ESO to create K8s Secret: `kubectl wait --for=condition=Ready externalsecret/my-secret` +3. Apply rest of stack: `terraform apply` + +### CI/CD cannot access Vault + +1. Check Woodpecker SA token: `kubectl get sa -n woodpecker woodpecker-runner -o yaml` +2. Verify Vault K8s auth config: `vault read auth/kubernetes/config` +3. Check Vault role binding: `vault read auth/kubernetes/role/ci-deployer` +4. Review Vault audit logs: `vault audit list` + +### Sealed Secret won't decrypt + +1. Verify controller is running: `kubectl get pods -n kube-system -l app=sealed-secrets` +2. Check encryption was for correct cluster: `kubeseal --fetch-cert` matches cert used for encryption +3. Review controller logs: `kubectl logs -n kube-system deployment/sealed-secrets-controller` +4. Ensure `sealed-*.yaml` hasn't been manually edited (breaks signature) + +### SOPS state decryption fails + +1. Check Vault access: `vault token lookup` +2. Verify Transit engine: `vault read transit/keys/terraform-state` +3. Check age key fallback: `~/.config/sops/age/keys.txt` exists +4. Run manual decrypt: `scripts/state-sync decrypt path/to/state.tfstate.enc` + +### Complex type (map/list) not parsing from Vault + +Ensure value in Vault is valid JSON: +```bash +vault kv get -field=my_config secret/viktor | jq . 
+``` + +If invalid JSON, update in Vault: +```bash +vault kv put secret/viktor my_config='{"key": "value"}' +``` + +In Terraform: +```hcl +config = jsondecode(data.vault_kv_secret_v2.app.data["my_config"]) +``` + +## Related + +- [Vault Deployment](../../stacks/vault/README.md) - Vault Terraform configuration +- [External Secrets Stack](../../stacks/external-secrets/README.md) - ESO deployment and ExternalSecret definitions +- [Backup & DR](./backup-dr.md) - Vault backup strategy +- [Monitoring](./monitoring.md) - Grafana OIDC via Authentik (Vault-stored client secret) +- [CI/CD Runbook](../runbooks/ci-cd.md) - Woodpecker Vault authentication diff --git a/docs/architecture/security.md b/docs/architecture/security.md new file mode 100644 index 00000000..7796a88c --- /dev/null +++ b/docs/architecture/security.md @@ -0,0 +1,401 @@ +# Security & L7 Protection + +## Overview + +The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 5-layer anti-AI scraping defense. All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation. + +## Architecture Diagram + +```mermaid +graph LR + Internet[Internet] + CF[Cloudflare WAF] + Tunnel[Cloudflared Tunnel] + CrowdSec[CrowdSec Bouncer
Traefik Plugin] + AntiAI[Anti-AI Check
poison-fountain] + ForwardAuth[Authentik ForwardAuth] + RateLimit[Rate Limit Middleware] + Retry[Retry Middleware
2 attempts, 100ms] + Backend[Backend Service] + + LAPI[CrowdSec LAPI
3 replicas] + Agent[CrowdSec Agent] + + Internet -->|1| CF + CF -->|2| Tunnel + Tunnel -->|3| CrowdSec + CrowdSec -.->|Query| LAPI + Agent -.->|Report| LAPI + CrowdSec -->|4. Pass/Block| AntiAI + AntiAI -->|5. Human/Bot| ForwardAuth + ForwardAuth -->|6. Authenticated| RateLimit + RateLimit -->|7. Under Limit| Retry + Retry -->|8. Success/Retry| Backend + + style CrowdSec fill:#f9f,stroke:#333 + style AntiAI fill:#ff9,stroke:#333 + style ForwardAuth fill:#9f9,stroke:#333 + style RateLimit fill:#99f,stroke:#333 +``` + +## Components + +| Component | Version | Location | Purpose | +|-----------|---------|----------|---------| +| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) | +| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection | +| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check | +| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control | +| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service | +| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management | +| Traefik | Latest | `stacks/platform/` | Ingress controller with HTTP/3 (QUIC) | + +## How It Works + +### Request Security Layers + +Every incoming request passes through 6 security layers: + +1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external) +2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP +3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error) +4. **Anti-AI Scraping** - 5-layer bot defense (optional per service) +5. **Authentik ForwardAuth** - Authentication check (if `protected = true`) +6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach) +7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay) + +### CrowdSec Threat Intelligence + +CrowdSec operates in a hub-and-agent model: + +**LAPI (Local API)**: +- 3 replicas for high availability +- Aggregates threat intelligence from agent + community +- Maintains ban list (IP reputation database) +- Version pinned to prevent breaking changes + +**Agent**: +- Parses Traefik access logs +- Detects attack scenarios (SQL injection, directory traversal, brute force) +- Reports malicious IPs to LAPI +- Shares threat intel with CrowdSec community (anonymized) + +**Traefik Bouncer Plugin**: +- Integrated as Traefik middleware +- Queries LAPI for IP reputation on each request +- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation) +- Blocks IPs on ban list, allows others + +**Metabase** (disabled by default): +- Dashboard for CrowdSec analytics +- CPU-intensive, only enable when investigating incidents + +### Kyverno Policy Engine + +Kyverno enforces cluster-wide policies via admission webhooks. All policies use `failurePolicy=Ignore` to prevent blocking cluster operations. + +#### 5-Tier Resource Governance + +Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). 
Kyverno auto-generates: + +- **LimitRange** - Per-container CPU/memory limits +- **ResourceQuota** - Namespace-wide resource caps + +| Tier | CPU Limit/Container | Memory Limit/Container | Namespace CPU Quota | Namespace Memory Quota | +|------|---------------------|------------------------|---------------------|------------------------| +| 0 | 100m | 128Mi | 500m | 512Mi | +| 1 | 250m | 256Mi | 1000m | 1Gi | +| 2 | 500m | 512Mi | 2000m | 2Gi | +| 3 | 1000m | 1Gi | 4000m | 4Gi | +| 4 | 2000m | 2Gi | 8000m | 8Gi | + +This prevents resource exhaustion and enforces governance without manual quota management. + +#### Security Policies (ALL in Audit Mode) + +**Why audit mode?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup. + +| Policy | Purpose | Enforcement | +|--------|---------|-------------| +| `deny-privileged-containers` | Block privileged pods | Audit | +| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | +| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | +| `require-trusted-registries` | Only allow approved image registries | Audit | + +#### Operational Policies + +| Policy | Purpose | Mode | +|--------|---------|------| +| `inject-priority-class-from-tier` | Set pod priorityClass based on namespace tier | Enforce (CREATE only) | +| `inject-ndots` | Set DNS `ndots:2` for faster lookups | Enforce | +| `sync-tier-label` | Propagate tier label to child resources | Enforce | +| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce | + +### Anti-AI Scraping (5-Layer Defense) + +Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`. + +#### Layer 1: Bot Blocking (ForwardAuth) + +- Middleware calls `poison-fountain` service before backend +- Analyzes User-Agent, request patterns, timing +- Blocks known AI scrapers (GPTBot, CCBot, etc.) 
+- **Fail-open**: If poison-fountain down, allows traffic + +#### Layer 2: X-Robots-Tag Header + +- HTTP response header: `X-Robots-Tag: noai, noindex, nofollow` +- Instructs compliant bots to skip content +- Lightweight, no performance impact + +#### Layer 3: Trap Links + +- JavaScript injects invisible links before `` +- Links point to honeypot endpoints +- Legitimate browsers don't click, bots follow +- Triggered bots get added to ban list + +#### Layer 4: Tarpit + +- Serves AI bots extremely slowly (~100 bytes/sec) +- Wastes bot resources, makes scraping uneconomical +- Humans see normal speed (only applies to detected bots) + +#### Layer 5: Poison Content + +- CronJob every 6 hours generates fake content +- Injects misleading/nonsense data into pages shown to bots +- Degrades AI training data quality +- **Requires `--http1.1` flag** to work with current HTTP/2 setup + +**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf` + +### TLS & HTTP/3 + +**Traefik** handles TLS termination: +- HTTP/3 (QUIC) enabled for performance +- Automatic HTTP → HTTPS redirect +- cert-manager/certbot manages certificate lifecycle +- Let's Encrypt integration for automatic renewal + +### Rate Limiting + +**Per-source IP limits**: +- Default: 100 requests/minute +- Returns **429 Too Many Requests** (not 503) +- Higher limits for upload-heavy services: + - Immich: 500 req/min (photo uploads) + - Nextcloud: 300 req/min (file sync) + +**Retry Middleware**: +- 2 attempts max +- 100ms delay between retries +- Applied after rate limiting +- Handles transient backend errors + +### Fallback Proxies + +**Authentik Fallback**: +- If Authentik down, falls back to basicAuth +- Prevents total service outage during IdP maintenance +- Temporary credentials stored in Vault + +**Poison-Fountain Fallback**: +- If anti-AI service down, allows all traffic +- Fail-open prevents blocking legitimate users +- Monitors for service health, auto-recovers + +## Configuration + +### Key Config Files + +| Path | Purpose | +|------|---------| +| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config | +| `stacks/kyverno/` | Kyverno deployment + policies | +| `stacks/poison-fountain/` | Anti-AI service + CronJob | +| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions | +| `stacks/platform/modules/ingress_factory/` | Per-service security toggles | + +### Vault Paths + +- **CrowdSec API key**: `secret/crowdsec/api-key` - LAPI authentication +- **BasicAuth fallback**: `secret/authentik/fallback-creds` - Emergency auth +- **TLS certificates**: `secret/tls/` - Certificate private keys + +### Terraform Stacks + +- `stacks/crowdsec/` - CrowdSec infrastructure +- `stacks/kyverno/` - Policy engine +- `stacks/poison-fountain/` - Anti-AI defense +- `stacks/platform/` - Traefik + middleware + +### Per-Service Security Config + +```hcl +module "myapp_ingress" { + source = "./modules/ingress_factory" + + name = "myapp" + host = "myapp.viktorbarzin.me" + + # Security toggles + protected = true # Enable ForwardAuth + anti_ai_scraping = false # Disable anti-AI (e.g., for public API) + rate_limit = 200 # Custom rate limit (req/min) +} +``` + +### Kyverno Policy Example + +```yaml +apiVersion: kyverno.io/v1 +kind: ClusterPolicy +metadata: + name: inject-ndots +spec: + background: false + rules: + - name: inject-ndots + match: + resources: + kinds: + - Pod + mutate: + patchStrategicMerge: + spec: + dnsConfig: + options: + - name: ndots + value: "2" +``` + +## Decisions & 
Rationale + +### Why CrowdSec over ModSecurity? + +- **Community threat intelligence**: Shared ban lists, crowdsourced attack detection +- **Easier management**: YAML scenarios vs complex ModSecurity rules +- **Better performance**: Lightweight Go agent vs resource-heavy Apache module +- **Active development**: More frequent updates, responsive community + +### Why Audit-Only Security Policies? + +- **Gradual rollout**: Identify violations without breaking existing workloads +- **Risk reduction**: Prevents policy bugs from blocking critical deployments +- **Better observability**: Collect violation metrics before enforcing +- **Selective enforcement**: Move to enforce mode per-policy after validation + +### Why 5-Layer Anti-AI Defense? + +- **Defense in depth**: Each layer catches different bot types +- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers +- **Dumb bots**: Layer 3 (trap links) catches simple scrapers +- **Persistent bots**: Layer 4 (tarpit) makes scraping uneconomical +- **Sophisticated bots**: Layer 5 (poison content) degrades training data + +### Why Fail-Open Mode? + +- **Availability over security**: Homelab prioritizes uptime +- **Graceful degradation**: Single component failure doesn't cascade +- **Manual intervention**: Security incidents are rare, can handle manually +- **Layer redundancy**: If one layer fails, others still protect + +### Why Pin CrowdSec/Kyverno Versions? + +- **Breaking changes**: Both projects had breaking config changes in past +- **Controlled upgrades**: Test in staging before upgrading production +- **Stability**: Prevents auto-upgrade during outages +- **Rollback**: Easy to revert if upgrade causes issues + +### Why HTTP/3 (QUIC)? + +- **Performance**: Lower latency, better mobile performance +- **Connection migration**: Survives IP changes (mobile networks) +- **0-RTT**: Faster TLS handshake for repeat visitors +- **Future-proof**: Industry moving to HTTP/3 + +## Troubleshooting + +### CrowdSec Blocking Legitimate IP + +**Problem**: Legitimate user IP on ban list. + +**Fix**: +1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list` +2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip ` +3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` + +### Kyverno Policy Blocking Deployment + +**Problem**: Pod creation fails with policy violation. + +**Fix**: +1. Check policy reports: `kubectl get policyreport -A` +2. Verify `failurePolicy=Ignore` is set (should never block) +3. If blocking, temporarily disable policy: `kubectl annotate clusterpolicy kyverno.io/exclude=true` +4. Investigate root cause, fix workload or update policy + +### Anti-AI Service Down, Traffic Blocked + +**Problem**: `poison-fountain` service unhealthy, all traffic blocked. + +**Fix**: +1. Verify fail-open config: Check `stacks/platform/modules/traefik/middleware.tf` for `failurePolicy: allow` +2. Restart service: `kubectl rollout restart deployment/poison-fountain -n poison-fountain` +3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services + +### Rate Limit Too Aggressive + +**Problem**: Legitimate users getting 429 errors. + +**Fix**: +1. Check Traefik logs for rate limit hits: `kubectl logs -n traefik -l app=traefik | grep 429` +2. Increase limit in `ingress_factory`: `rate_limit = 300` +3. Apply: `terraform apply` + +### HTTP/3 Not Working + +**Problem**: Browser shows HTTP/2, not HTTP/3. + +**Fix**: +1. 
Verify Traefik HTTP/3 enabled: `kubectl get cm traefik-config -o yaml | grep http3` +2. Check UDP port 443 accessible: `nc -u 443` +3. Browser support: Use Chrome/Firefox dev tools, check Protocol column + +### TLS Certificate Expired + +**Problem**: Browser shows certificate expired. + +**Fix**: +1. Check cert-manager: `kubectl get certificate -A` +2. Force renewal: `kubectl delete secret -n ` +3. cert-manager will auto-renew within 5 minutes +4. If fails, check Let's Encrypt rate limits + +### Traefik Retry Loop + +**Problem**: Backend logs show duplicate requests. + +**Fix**: +1. Check retry middleware config: Should be 2 attempts max +2. Verify backend isn't returning transient errors: Check for 5xx responses +3. Disable retry for specific service: Remove retry middleware from `ingress_factory` + +### Poison Content Not Injecting + +**Problem**: Bots not receiving poisoned content. + +**Fix**: +1. Verify CronJob running: `kubectl get cronjob -n poison-fountain` +2. Check logs: `kubectl logs -n poison-fountain -l app=poison-content-injector` +3. Ensure `--http1.1` flag set (required for HTTP/2 backends) +4. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison` + +## Related + +- [Authentication & Authorization](./authentication.md) - Authentik, OIDC, ForwardAuth +- [Networking](./networking.md) - Ingress, DNS, load balancing +- [Monitoring](./monitoring.md) - Prometheus, Grafana, alerting +- [CrowdSec Runbook](../runbooks/crowdsec.md) - CrowdSec operations +- [Kyverno Policy Management](../runbooks/kyverno.md) - Policy authoring and troubleshooting diff --git a/docs/architecture/storage.md b/docs/architecture/storage.md new file mode 100644 index 00000000..02e01f29 --- /dev/null +++ b/docs/architecture/storage.md @@ -0,0 +1,315 @@ +# Storage Architecture + +Last updated: 2026-03-24 + +## Overview + +The cluster storage layer is built on TrueNAS ZFS at `10.0.10.15` (VMID 9000), providing both NFS shared storage and iSCSI block devices via democratic-csi. NFS serves ~100 application data directories for stateless services and those using file-based or SQLite databases, while iSCSI provides block devices for ~19 PVCs backing production MySQL and PostgreSQL databases. This hybrid approach optimizes for both performance (iSCSI for databases requiring ACID guarantees) and simplicity (NFS for everything else), with ZFS snapshot-based local protection and incremental offsite replication. + +## Architecture Diagram + +```mermaid +graph TB + subgraph TrueNAS["TrueNAS (10.0.10.15)
VMID 9000, 16c/16GB"] + ZFS_Main["ZFS Pool: main
1.64 TiB
32G + 7x256G + 1T disks"] + ZFS_SSD["ZFS Pool: ssd
~256GB SSD
Immich ML, PostgreSQL hot data"] + + ZFS_Main --> NFS_Datasets["NFS Datasets
~100 shares
main/<service>"] + ZFS_Main --> iSCSI_Datasets["iSCSI Datasets
main/iscsi (zvols)
main/iscsi-snaps"] + + NFS_Datasets --> NFS_Exports["NFS Exports
managed by secrets/nfs_exports.sh"] + iSCSI_Datasets --> iSCSI_Targets["iSCSI Targets
SSH-managed via democratic-csi"] + + ZFS_SSD --> SSD_Data["Immich ML models
PostgreSQL CNPG"] + end + + subgraph K8s["Kubernetes Cluster"] + CSI_NFS["democratic-csi-nfs
StorageClass: nfs-truenas
soft,timeo=30,retrans=3"] + CSI_iSCSI["democratic-csi-iscsi
StorageClass: iscsi-truenas
SSH driver"] + + NFS_PV["NFS PersistentVolumes
RWX, ~100 volumes"] + iSCSI_PV["iSCSI PersistentVolumes
RWO, ~19 volumes"] + + Pods["Application Pods"] + DBPods["Database Pods
PostgreSQL CNPG
MySQL InnoDB"] + end + + NFS_Exports -->|CSI driver| CSI_NFS + iSCSI_Targets -->|CSI driver| CSI_iSCSI + + CSI_NFS --> NFS_PV + CSI_iSCSI --> iSCSI_PV + + NFS_PV --> Pods + iSCSI_PV --> DBPods + + style TrueNAS fill:#e1f5ff + style K8s fill:#fff4e1 + style ZFS_Main fill:#c8e6c9 + style ZFS_SSD fill:#ffe0b2 +``` + +## Components + +| Component | Version/Config | Location | Purpose | +|-----------|---------------|----------|---------| +| TrueNAS VM | VMID 9000, 16c/16GB | Proxmox host (10.0.10.15) | ZFS storage server | +| ZFS pool `main` | 1.64 TiB usable | 32G + 7x256G + 1T disks | Primary storage for all services | +| ZFS pool `ssd` | ~256GB SSD | Dedicated SSD | High-performance data (Immich ML, PostgreSQL) | +| democratic-csi-nfs | Helm chart | Namespace: democratic-csi | NFS CSI driver | +| democratic-csi-iscsi | Helm chart | Namespace: democratic-csi | iSCSI CSI driver (SSH mode) | +| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | Default storage for apps | +| StorageClass `iscsi-truenas` | RWO, block device | Cluster-wide | Databases only | +| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | NFS PV/PVC factory | + +## How It Works + +### NFS Storage Flow + +1. **Dataset creation**: NFS shares are created as ZFS datasets under `main/` (e.g., `main/immich`, `main/nextcloud`) +2. **Export configuration**: `/root/secrets/nfs_exports.sh` on TrueNAS generates `/etc/exports` with per-dataset exports (`/mnt/main/`) +3. **CSI provisioning**: democratic-csi-nfs mounts NFS shares and creates K8s PersistentVolumes +4. **Terraform module**: Stacks use `modules/kubernetes/nfs_volume/` to declaratively create PV + PVC pairs: + ```hcl + module "nfs_data" { + source = "../../modules/kubernetes/nfs_volume" + name = "immich-data" + namespace = kubernetes_namespace.immich.metadata[0].name + nfs_server = var.nfs_server # 10.0.10.15 + nfs_path = "/mnt/main/immich" + } + ``` +5. **Pod mount**: Applications reference PVCs in their deployment specs +6. **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs + +**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass via PVCs. + +### iSCSI Storage Flow + +1. **Zvol creation**: democratic-csi creates ZFS zvols under `main/iscsi/` via SSH commands +2. **Target setup**: TrueNAS iSCSI service exposes zvols as iSCSI LUNs +3. **Initiator connection**: K8s nodes connect via open-iscsi, sessions managed by democratic-csi +4. **Hardened timeouts**: All 5 nodes use relaxed iSCSI parameters (baked into cloud-init): + - `replacement_timeout=300s` (not 120s default) + - `noop_out_interval=10s`, `noop_out_timeout=15s` + - HeaderDigest/DataDigest: `CRC32C,None` +5. **Filesystem**: Pods format zvols as ext4 (or leave raw for database engines) +6. **Exclusive access**: RWO only — zvol can only be attached to one node at a time + +**Why SSH driver**: The democratic-csi API driver has reliability issues. SSH driver execs `zfs create -V` commands directly, proven stable over 2+ years. + +### SQLite on NFS — Why It Fails + +SQLite uses `fsync()` to guarantee durability. 
NFS's soft mount + async semantics break this: +- Soft mount returns success even if data is still in client cache +- Network blips during fsync → incomplete writes → corruption +- WAL mode helps but doesn't eliminate the race + +**Solution**: Use iSCSI for any SQLite database (Vaultwarden, plotting-book) or local disk (ephemeral). + +### Democratic-CSI Sidecar Resources + +The Helm chart spawns 17 sidecar containers (driver-registrar, external-provisioner, etc.) across controller + node DaemonSet pods. Each sidecar defaults to `resources: {}`, which gets LimitRange defaults of 256Mi. + +**Fix**: Set explicit resources in `values.yaml`: +```yaml +csiProxy: # TOP-LEVEL key, not nested + resources: + requests: + memory: "32Mi" + limits: + memory: "32Mi" + +controller: + externalProvisioner: + resources: + requests: {memory: "64Mi"} + limits: {memory: "64Mi"} + # ... repeat for all sidecars +``` + +Total footprint: ~1.5Gi → ~400Mi. + +## Configuration + +### Key Files + +| Path | Purpose | +|------|---------| +| `/root/secrets/nfs_exports.sh` | TrueNAS: generates `/etc/exports` with all service shares | +| `stacks/democratic-csi/` | Terraform stack for both CSI drivers | +| `modules/kubernetes/nfs_volume/` | Reusable module for NFS PV/PVC creation | +| `config.tfvars` | Variable `nfs_server = "10.0.10.15"` shared by all stacks | +| `/var/lib/kubelet/config.yaml` | K8s nodes: iSCSI hardening params applied here | +| `modules/create-template-vm/cloud_init.yaml` | Cloud-init template: bakes iSCSI settings into new nodes | + +### Vault Paths + +| Path | Contents | +|------|----------| +| `secret/viktor/truenas_ssh_key` | SSH private key for democratic-csi SSH driver | +| `secret/viktor/truenas_root_password` | TrueNAS root password (web UI access) | + +### Terraform Stacks + +- **`stacks/democratic-csi/`**: Deploys both NFS and iSCSI CSI drivers +- All application stacks reference NFS volumes via `module "nfs_"` calls +- iSCSI PVCs created implicitly by StatefulSets (MySQL, PostgreSQL) using `iscsi-truenas` StorageClass + +### NFS Export Management + +NFS exports are NOT managed by Terraform. To add a new service: + +1. SSH to TrueNAS: `ssh root@10.0.10.15` +2. Edit `/root/secrets/nfs_exports.sh` +3. Add dataset + export entry: + ```bash + create_nfs_export "main/" "/mnt/main/" + ``` +4. Run the script: `/root/secrets/nfs_exports.sh` +5. Verify: `showmount -e 10.0.10.15` + +## Decisions & Rationale + +### Why NFS for Most Workloads? + +- **Simplicity**: No volume provisioning delays, instant mounts +- **RWX support**: Multiple pods can share one volume (Nextcloud, Immich) +- **ZFS benefits**: Snapshots, compression, dedup all work at dataset level +- **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate iSCSI for critical DBs + +### Why iSCSI for Databases? + +- **ACID guarantees**: Block device + local filesystem = real fsync +- **Performance**: No NFS protocol overhead for random I/O +- **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster both run on iSCSI, zero corruption in 2+ years + +### Why SSH Driver Over API? 
+ +The democratic-csi API driver (`driver: freenas-api-iscsi`) has these issues: +- Requires TrueNAS API credentials in plaintext ConfigMap +- Fails silently when API schema changes between TrueNAS versions +- No retry logic on transient API errors + +SSH driver (`driver: freenas-ssh`) is simpler: +- Direct `zfs` commands, no API translation layer +- SSH key auth (Vault-managed) +- Deterministic error messages + +### Why Soft Mount for NFS? + +Hard mounts with default `timeo=600` (10 minutes) cause: +- 10-minute pod startup delays if NFS server is unreachable +- `kubectl delete pod` hangs for 10 minutes +- Kernel task hangs blocking node operations + +Soft mount (`soft,timeo=30,retrans=3`) trades availability for responsiveness: +- Max 90s hang (30s × 3 retries) +- Operations return EIO after timeout → app can handle error +- Acceptable for non-critical data paths + +**Critical paths**: Databases use iSCSI (not NFS), so soft mount never affects data integrity. + +## Troubleshooting + +### NFS Mount Hangs + +**Symptom**: Pod stuck in `ContainerCreating`, `df -h` hangs on NFS mount + +**Diagnosis**: +```bash +# On K8s node +mount | grep nfs +showmount -e 10.0.10.15 + +# Check NFS server +ssh root@10.0.10.15 +zfs list | grep main/ +cat /etc/exports | grep +``` + +**Fix**: +1. Verify dataset exists: `zfs list main/` +2. Verify export: `grep /etc/exports` +3. If missing: re-run `/root/secrets/nfs_exports.sh` +4. Restart NFS server: `service nfs-server restart` + +### iSCSI Session Drops + +**Symptom**: PostgreSQL/MySQL pod restarts, iSCSI reconnection loops + +**Diagnosis**: +```bash +# On K8s node +iscsiadm -m session +dmesg | grep iscsi +journalctl -u iscsid -f +``` + +**Fix**: +1. Check TrueNAS iSCSI service: WebUI → Sharing → iSCSI → Targets +2. Verify hardened timeouts: `iscsiadm -m node -o show | grep timeout` +3. If defaults: re-apply cloud-init or manually update `/etc/iscsi/iscsid.conf` +4. Restart session: + ```bash + iscsiadm -m node -u + iscsiadm -m node -l + ``` + +### Democratic-CSI Sidecar OOMKill + +**Symptom**: `kubectl describe pod` shows sidecar containers OOMKilled + +**Diagnosis**: +```bash +kubectl get events -n democratic-csi | grep OOM +kubectl top pod -n democratic-csi +``` + +**Fix**: +1. Set explicit resources in Helm values (see "Democratic-CSI Sidecar Resources" above) +2. Apply: `terragrunt apply` in `stacks/democratic-csi/` + +### SQLite Corruption on NFS + +**Symptom**: `database disk image is malformed`, checksum errors + +**Diagnosis**: +```bash +# In pod +sqlite3 /data/db.sqlite "PRAGMA integrity_check;" +``` + +**Fix**: Migrate to iSCSI +1. Create iSCSI PVC in Terraform stack +2. Restore from backup to new volume +3. Update deployment to use new PVC +4. Delete old NFS PVC + +### Slow NFS Performance + +**Symptom**: High latency on file operations, `iostat` shows NFS wait times + +**Diagnosis**: +```bash +# On TrueNAS +zpool iostat -v 5 +arc_summary | grep "Hit Rate" + +# On K8s node +nfsiostat 5 +``` + +**Optimization**: +1. Check ZFS ARC hit rate (should be >90%) +2. Move hot datasets to SSD pool: `zfs send main/ | zfs recv ssd/` +3. 
Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions` + +## Related + +- **Runbooks**: + - `docs/runbooks/restore-postgresql.md` + - `docs/runbooks/restore-mysql.md` + - `docs/runbooks/recover-nfs-mount.md` +- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using ZFS snapshots) +- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs iSCSI) diff --git a/docs/architecture/vpn.md b/docs/architecture/vpn.md new file mode 100644 index 00000000..08044116 --- /dev/null +++ b/docs/architecture/vpn.md @@ -0,0 +1,414 @@ +# VPN & Remote Access Architecture + +Last updated: 2026-03-24 + +## Overview + +Remote access to the homelab is provided through a hybrid VPN architecture: WireGuard site-to-site tunnels connect physical locations (Sofia, London, Valchedrym), while Headscale (self-hosted Tailscale control server) provides mesh overlay networking for roaming clients. Split DNS architecture ensures resilience: AdGuard serves as the global DNS resolver for all VPN clients, while Technitium handles internal `.lan` domains. This design prevents tunnel dependency for public DNS resolution — if the Cloudflared tunnel goes down, clients can still access the internet. + +## Architecture Diagram + +### VPN Topology + +```mermaid +graph TB + subgraph "Site-to-Site WireGuard" + Sofia[Sofia pfSense
10.0.20.1] + London[London pfSense
10.0.30.1] + Valchedrym[Valchedrym pfSense
10.0.40.1] + + Sofia ---|WireGuard Tunnel| London + Sofia ---|WireGuard Tunnel| Valchedrym + London ---|WireGuard Tunnel| Valchedrym + end + + subgraph "Headscale Mesh Overlay" + HS[Headscale
headscale.viktorbarzin.me
K8s Service] + Authentik[Authentik OIDC
SSO Login] + DERP[DERP Relay
Region 999
Embedded in Headscale] + + subgraph "Clients" + Laptop[MacBook
Tailscale Client] + Phone[iPhone
Tailscale Client] + Remote[Remote VM
Tailscale Client] + end + + HS --> Authentik + HS --> DERP + Laptop -.mesh.- Phone + Laptop -.mesh.- Remote + Phone -.mesh.- Remote + Laptop --> HS + Phone --> HS + Remote --> HS + + Laptop -.relay fallback.- DERP + Phone -.relay fallback.- DERP + end + + Sofia --> HS +``` + +### DNS Resolution Flow + +```mermaid +sequenceDiagram + participant Client as VPN Client + participant AdGuard as AdGuard DNS
(Global) + participant Technitium as Technitium DNS
(Internal .lan) + participant Cloudflare as Cloudflare DNS
(Public Domains) + + Note over Client: Query: immich.viktorbarzin.me + Client->>AdGuard: DNS query + AdGuard->>Cloudflare: Forward (not .lan) + Cloudflare-->>AdGuard: A record (Cloudflare IP) + AdGuard-->>Client: Response + + Note over Client: Query: truenas.viktorbarzin.lan + Client->>AdGuard: DNS query + AdGuard->>Technitium: Forward (.lan domain) + Technitium-->>AdGuard: A record (10.0.10.15) + AdGuard-->>Client: Response + + Note over Client,Technitium: If Cloudflared tunnel is down: + Client->>AdGuard: DNS query (google.com) + AdGuard->>Cloudflare: Forward (public DNS works) + Cloudflare-->>AdGuard: A record + AdGuard-->>Client: Response (no tunnel dependency) +``` + +## Components + +| Component | Version/Type | Location | Purpose | +|-----------|-------------|----------|---------| +| WireGuard | Built-in (pfSense) | Sofia/London/Valchedrym | Site-to-site encrypted tunnels | +| Headscale | v0.23.x (container) | K8s (headscale.viktorbarzin.me) | Tailscale control server, mesh coordinator | +| Tailscale | Client v1.x | User devices | Mesh VPN client | +| Authentik | OIDC provider | K8s | SSO authentication for Headscale | +| DERP Relay | Embedded in Headscale | K8s (region 999) | Relay for NAT traversal | +| AdGuard DNS | Container | K8s | Global DNS resolver with ad-blocking | +| Technitium DNS | Container | K8s (10.0.20.101) | Internal .lan domain resolver | + +## How It Works + +### WireGuard Site-to-Site + +Three physical locations are permanently connected via WireGuard tunnels configured on pfSense: +- **Sofia** (primary): 10.0.20.0/24 (main K8s cluster) +- **London**: 10.0.30.0/24 (secondary services) +- **Valchedrym**: 10.0.40.0/24 (backup location) + +Each pfSense instance maintains two WireGuard tunnels, forming a full mesh. Routes are automatically injected into each location's routing table. This allows services in Sofia to communicate with services in London without additional client configuration. + +**Use cases**: +- Replication of Vault data between Sofia and London +- Offsite database replicas +- Accessing Proxmox hosts across locations + +### Headscale Mesh Overlay + +Headscale is a self-hosted alternative to Tailscale's commercial control plane. It provides: +- **Mesh networking**: Clients establish direct WireGuard connections to each other (peer-to-peer). +- **NAT traversal**: DERP relays provide connectivity when direct connections fail. +- **OIDC authentication**: Users log in via Authentik, no pre-shared keys. +- **ACL policies**: Fine-grained control over which clients can reach which destinations. + +**Client onboarding**: +1. User installs Tailscale client (official macOS/iOS/Android app) +2. Runs: `tailscale login --login-server https://headscale.viktorbarzin.me` +3. Browser opens to Authentik SSO login +4. After successful login, Tailscale presents a registration URL +5. Admin approves the device via `headscale nodes register --user --key ` +6. Client is added to the mesh, receives IP in 100.64.0.0/10 range + +**Connectivity test**: `ping 10.0.20.100` (Sofia K8s API server) verifies full access to the homelab network. + +### DERP Relay for NAT Traversal + +**Problem**: Symmetric NAT or restrictive firewalls prevent direct WireGuard connections between clients. + +**Solution**: Headscale runs an embedded DERP relay server (region 999, named "Home DERP"). DERP is Tailscale's NAT traversal protocol, implemented as an HTTPS-based relay. + +**How it works**: +1. Clients attempt direct WireGuard connection via STUN/ICE. +2. 

### DERP Relay for NAT Traversal

**Problem**: Symmetric NAT or restrictive firewalls prevent direct WireGuard connections between clients.

**Solution**: Headscale runs an embedded DERP relay server (region 999, named "Home DERP"). DERP is Tailscale's relay protocol, implemented over HTTPS and used when NAT traversal fails.

**How it works**:
1. Clients attempt a direct WireGuard connection via STUN/ICE.
2. If the direct connection fails, both clients connect to the DERP relay via HTTPS.
3. Traffic remains encrypted end-to-end with WireGuard; DERP only relays packets.
4. No additional ports needed — DERP uses the same HTTPS ingress as Headscale (443).

**Performance**: DERP adds latency (an extra hop through the Sofia K8s cluster) but ensures connectivity in all scenarios.

### Split DNS Architecture

**Design goal**: Prevent tunnel dependency for public DNS resolution. If the Headscale tunnel or Cloudflared tunnel fails, clients must still resolve public domains.

**Implementation**:
- **AdGuard DNS**: Global recursive resolver, serves all VPN clients. Includes ad-blocking and malicious-domain filtering.
- **Technitium DNS**: Internal authoritative server for `.viktorbarzin.lan` domains.

**Resolution flow**:
1. Client queries AdGuard for any domain.
2. If the domain ends in `.lan`, AdGuard forwards to Technitium (10.0.20.101).
3. For all other domains, AdGuard resolves directly via upstream (Cloudflare 1.1.1.1).
4. AdGuard caches responses, reducing load on Technitium and upstream.

**Resilience**: Even if the tunnel to Sofia is down, clients can still resolve `google.com`, `github.com`, etc., because AdGuard talks directly to Cloudflare. Only `.lan` domains become unavailable.

### Access Control (Authentik Groups)

The **Headscale Users** group in Authentik controls VPN access. Membership is invitation-only:
1. Admin creates the user in Authentik.
2. Admin adds the user to the "Headscale Users" group.
3. User logs in via OIDC during `tailscale login`.
4. Headscale verifies group membership via OIDC claims.

Removing a user from the group revokes VPN access on the next re-authentication (every 30 days).

## Configuration

### Terraform Stacks

| Stack | Path | Resources |
|-------|------|-----------|
| Headscale | `stacks/headscale/` | Deployment, Service, Ingress, ConfigMap |
| AdGuard | `stacks/adguard/` | Deployment, Service, PVC |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| pfSense (Sofia) | `stacks/pfsense/` | WireGuard tunnel configs |

### Headscale Configuration

**ConfigMap**: `stacks/headscale/main.tf`
```yaml
server_url: https://headscale.viktorbarzin.me
listen_addr: 0.0.0.0:8080
metrics_listen_addr: 0.0.0.0:9090

oidc:
  issuer: https://authentik.viktorbarzin.me/application/o/headscale/
  client_id: <client-id>
  client_secret: <client-secret>
  scope: ["openid", "profile", "email", "groups"]
  allowed_groups: ["Headscale Users"]

derp:
  server:
    enabled: true
    region_id: 999
    region_code: "home"
    region_name: "Home DERP"
    stun_listen_addr: "0.0.0.0:3478"
  urls:
    - https://controlplane.tailscale.com/derpmap/default
  auto_update_enabled: true
  update_frequency: 24h

ip_prefixes:
  - 100.64.0.0/10

dns_config:
  nameservers:
    - 10.0.20.102 # AdGuard DNS
  domains:
    - viktorbarzin.lan
  magic_dns: true
```

**Secrets (Vault)**:
- `secret/headscale/oidc_client_secret`

**Ingress**: Standard `ingress_factory` with `protected = false` (OIDC is handled by Headscale itself).

### AdGuard Configuration

**Upstream DNS servers**:
- Cloudflare: `1.1.1.1`, `1.0.0.1`
- Google: `8.8.8.8`, `8.8.4.4`

**Conditional forwarding**:
- `viktorbarzin.lan` → `10.0.20.101` (Technitium)

**Ad-blocking lists**:
- AdGuard DNS filter
- OISD full list
- Developer Dan's ads and tracking list

**Custom rules**: Block telemetry for Windows, macOS, and smart TVs.
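
A quick way to confirm the split-DNS behaviour from a VPN client is to query each resolver in the chain directly. A minimal sketch; `10.0.20.102` (AdGuard) and `10.0.20.101` (Technitium) are the addresses referenced elsewhere in this document, and the exact blocking response depends on AdGuard's configured blocking mode:

```bash
# .lan names: AdGuard should forward to Technitium and return the internal record.
dig +short truenas.viktorbarzin.lan @10.0.20.102   # via AdGuard (conditional forwarding)
dig +short truenas.viktorbarzin.lan @10.0.20.101   # straight to Technitium

# Public names: AdGuard answers via its Cloudflare upstream, no tunnel dependency.
dig +short google.com @10.0.20.102
dig +short immich.viktorbarzin.me @10.0.20.102     # proxied domain, returns a Cloudflare IP

# Ad-blocking: a domain on the blocklists should be filtered
# (typically 0.0.0.0 or NXDOMAIN, depending on the blocking mode).
dig +short doubleclick.net @10.0.20.102
```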

### WireGuard (pfSense)

**Sofia → London tunnel**:
- Interface: `wg0`
- Local IP: `10.99.1.1/24`
- Remote endpoint: `<london-public-ip>:51820`
- Allowed IPs: `10.0.30.0/24`
- Keepalive: 25 seconds

**Sofia → Valchedrym tunnel**:
- Interface: `wg1`
- Local IP: `10.99.2.1/24`
- Remote endpoint: `<valchedrym-public-ip>:51821`
- Allowed IPs: `10.0.40.0/24`
- Keepalive: 25 seconds
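
Assuming shell access to the Sofia pfSense box (the WireGuard package ships the standard `wg` utility), the tunnels can be sanity-checked as follows; `10.0.40.1` is assumed to be the Valchedrym pfSense LAN address by analogy with `10.0.30.1` in London:

```bash
# On the Sofia pfSense shell: check handshakes and transfer counters per tunnel.
wg show wg0   # Sofia -> London
wg show wg1   # Sofia -> Valchedrym

# Verify routing over the mesh.
ping -c 3 10.0.30.1   # London pfSense LAN side
ping -c 3 10.0.40.1   # Valchedrym pfSense LAN side (assumed address)

# The path should go over the tunnel, not the WAN default route.
traceroute -n 10.0.30.1
```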

### Vault Secrets

- Headscale OIDC client secret: `secret/headscale/oidc_client_secret`
- WireGuard private keys: `secret/pfsense/wg_privkey_london`, `secret/pfsense/wg_privkey_valchedrym`

## Decisions & Rationale

### Why Headscale Instead of Plain WireGuard?

**Alternatives considered**:
1. **WireGuard with static configs**: Requires manual key distribution and complex peer management.
2. **OpenVPN**: Slower, more overhead, less mobile-friendly.
3. **Commercial Tailscale**: SaaS, not self-hosted, less control over data.

**Decision**: Headscale provides:
- **Mesh networking**: Clients connect directly, not through a central server.
- **OIDC authentication**: No pre-shared keys; integrates with the existing SSO.
- **Easy onboarding**: Users install the official Tailscale app, no custom configs.
- **Self-hosted**: Full control over the control plane and data.

**Trade-off**: More complex setup than plain WireGuard, but the operational benefits outweigh the initial complexity.

### Why Split DNS (AdGuard + Technitium)?

**Alternatives considered**:
1. **Single DNS server (Technitium only)**: Requires forwarding all public domains to upstream, creating a single point of failure.
2. **Cloudflare only**: Fast, but no internal `.lan` domain support without zone delegation.
3. **Tailscale MagicDNS only**: Depends on the Headscale control plane; fails if the control plane is down.

**Decision**: The split DNS architecture provides:
- **Resilience**: If the Headscale tunnel fails, public DNS still works via AdGuard → Cloudflare.
- **Ad-blocking**: AdGuard filters ads and malicious domains for all VPN clients.
- **Internal domains**: Technitium authoritatively serves `.lan`, no external dependency.

**Key benefit**: Zero tunnel dependency for public DNS. Users can browse the internet even if the homelab is completely offline.

### Why Embedded DERP Relay?

**Alternatives considered**:
1. **External DERP relays only (Tailscale's public relays)**: Free, but adds latency and exposes traffic metadata to Tailscale.
2. **No DERP, direct connections only**: Fails for symmetric-NAT clients (mobile networks).

**Decision**: The embedded DERP relay (region 999) provides:
- **Privacy**: All relay traffic stays within the homelab.
- **Reliability**: Not dependent on Tailscale's public infrastructure.
- **No extra ports**: DERP uses HTTPS (443), same as the Headscale API.

**Trade-off**: Adds CPU/memory overhead to the Headscale pod, but minimal compared to the benefits.

### Why OIDC Authentication Instead of Pre-Authorized Keys?

**Alternatives considered**:
1. **Pre-authorized keys**: Headscale generates keys; the admin shares them with users.
2. **Shared secret**: A single password for all users.

**Decision**: OIDC via Authentik provides:
- **Centralized access control**: Add/remove users in one place.
- **Audit trail**: Authentik logs all login attempts.
- **Group-based authorization**: Only the "Headscale Users" group can access the VPN.
- **SSO integration**: Users already have accounts in Authentik for other services.

**Key workflow**: Admin invites user → user logs in via Authentik → admin approves device → access granted. No key exchange needed.

## Troubleshooting

### Headscale Login Fails (OIDC Error)

**Symptoms**: `tailscale login --login-server` opens the browser, but after the Authentik login it shows "OIDC error: invalid state".

**Diagnosis**: Check the Headscale logs: `kubectl logs -n headscale deploy/headscale`

**Common causes**:
1. **Client clock skew**: OIDC tokens have a short validity (5 minutes). Ensure the client's system time is accurate.
2. **Callback URL mismatch**: The Authentik application must have `https://headscale.viktorbarzin.me/oidc/callback` in its Redirect URIs.
3. **Group membership**: The user is not in the "Headscale Users" group in Authentik.

**Fix**: Sync the system clock, verify the Authentik application config, add the user to the group.

### Direct Connection Fails, Traffic Goes via DERP

**Symptoms**: `tailscale status` shows `relay "home"` instead of a direct connection. Higher latency.

**Diagnosis**: Check DERP usage: `tailscale netcheck`

**Common causes**:
1. **Symmetric NAT**: Mobile networks or restrictive corporate firewalls block UDP hole-punching.
2. **Firewall blocking WireGuard**: Port 51820 UDP blocked on one or both clients.
3. **STUN failure**: The client can't determine its external IP and port.

**Fix**: This is expected behavior in many environments; the DERP relay ensures connectivity. If the latency is unacceptable, use site-to-site WireGuard instead.

### Can't Resolve .lan Domains from VPN

**Symptoms**: `nslookup truenas.viktorbarzin.lan` returns `NXDOMAIN`.

**Diagnosis**: Check the DNS chain: Client → AdGuard → Technitium.

**Steps**:
1. Verify AdGuard is running: `kubectl get pod -n adguard`
2. Check AdGuard conditional forwarding by querying AdGuard directly: `nslookup truenas.viktorbarzin.lan <adguard-ip>`
3. Check Technitium: `nslookup truenas.viktorbarzin.lan 10.0.20.101`

**Common causes**:
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.
2. **Technitium down**: Pod crash-looping or PVC corrupted.
3. **DNS propagation delay**: Technitium zone update not yet applied.

**Fix**: Verify conditional forwarding in the AdGuard UI. Restart Technitium if needed. Check the zone file in the Technitium UI.

### VPN Client Can't Reach K8s Services

**Symptoms**: Can `ping 10.0.20.1` (pfSense), but `curl https://immich.viktorbarzin.me` times out.

**Diagnosis**: Check connectivity at each layer (scripted in the sketch below):
1. **DNS**: Does `nslookup immich.viktorbarzin.me` return the correct IP?
2. **Routing**: Can the client reach the MetalLB IP? `ping <metallb-ip>`
3. **Firewall**: Is pfSense blocking traffic from the VPN subnet?

**Common causes**:
1. **Split DNS working too well**: The client resolves to a Cloudflare IP instead of the internal LAN IP. Expected for proxied domains — use the direct domain (e.g., `immich-direct.viktorbarzin.me`).
2. **ACL policy**: The Headscale ACL blocks the client from accessing certain subnets.
3. **pfSense NAT rule missing**: Traffic from the VPN subnet is not routed to VLAN 20.

**Fix**: For proxied domains, use non-proxied DNS names. Check the Headscale ACL policy. Verify the pfSense NAT rules.
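
The layered checks above can be run as a short sequence from the VPN client; `<metallb-ip>` is a placeholder for the service's internal load-balancer address:

```bash
# 1. DNS: which address does the client actually receive?
nslookup immich.viktorbarzin.me

# 2. Routing: can the client reach the internal MetalLB address at all?
ping -c 3 <metallb-ip>

# 3. Application: bypass DNS and hit the internal address directly.
#    If this works, the issue is name resolution (proxied domain), not routing.
curl -kIs --resolve immich.viktorbarzin.me:443:<metallb-ip> https://immich.viktorbarzin.me
```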

### DERP Relay Returns 502 Bad Gateway

**Symptoms**: Tailscale clients can't connect; DERP shows offline in `tailscale netcheck`.

**Diagnosis**: Check the Headscale ingress: `kubectl get ingress -n headscale`

**Common causes**:
1. **Traefik middleware blocking DERP traffic**: Forward-auth interferes with the WebSocket upgrade.
2. **Headscale pod not ready**: Liveness probe failing.
3. **Cloudflared tunnel issue**: DERP uses WebSockets, which require HTTP/1.1 upgrade support.

**Fix**: Ensure the Headscale ingress has `protected = false` (no forward-auth). Check Headscale pod readiness. Verify Cloudflared supports WebSocket upgrades.

### WireGuard Site-to-Site Tunnel Disconnects

**Symptoms**: Can't reach services in London from Sofia. `ping 10.0.30.1` fails.

**Diagnosis**: Check the pfSense WireGuard status: Dashboard → VPN → WireGuard → Status

**Common causes**:
1. **Keepalive packets dropped**: Firewall or ISP blocking UDP 51820.
2. **Public IP changed**: The dynamic IP of the remote site changed and the config still has the old IP.
3. **Routing conflict**: Overlapping subnets on both sides.

**Fix**: Increase the keepalive interval (25s → 60s). Update the remote endpoint if the IP changed. Verify subnet uniqueness.

## Related

- **Runbooks**:
  - `docs/runbooks/add-headscale-user.md`
  - `docs/runbooks/reset-derp-relay.md`
  - `docs/runbooks/update-wireguard-peer.md`
- **Architecture Docs**:
  - `docs/architecture/networking.md` — Core network architecture
  - `docs/architecture/dns.md` — Full DNS architecture (coming soon)
- **Reference**:
  - `.claude/reference/authentik-state.md` — OIDC application configs
  - `.claude/reference/service-catalog.md` — Full service inventory