fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
6d224861c4
commit
fd0f4a0365
1166 changed files with 358546 additions and 0 deletions
151
docs/architecture/agent-task-tracking.md
Normal file
151
docs/architecture/agent-task-tracking.md
Normal file
|
|
@ -0,0 +1,151 @@
|
|||
# Agent Task Tracking
|
||||
|
||||
## Overview
|
||||
|
||||
All Claude Code sessions share a centralized task database powered by [Beads](https://github.com/steveyegge/beads) (`bd` CLI) backed by a Dolt SQL server running in the Kubernetes cluster. This prevents agents from duplicating work across sessions and provides persistent cross-session task tracking.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────┐
|
||||
│ Dolt SQL Server (k8s) │
|
||||
│ beads-server namespace │
|
||||
│ 10.0.20.200:3306 │
|
||||
│ proxmox-lvm PVC (2Gi) │
|
||||
└────────┬──────────────────┘
|
||||
│ MySQL protocol
|
||||
┌──────────────┼──────────────────┐
|
||||
│ │ │
|
||||
┌──────────▼──┐ ┌───────▼────────┐ ┌──────▼──────────┐
|
||||
│ wizard │ │ emo │ │ future agents │
|
||||
│ session 1 │ │ session 1 │ │ (any machine │
|
||||
│ session 2 │ │ session 2 │ │ with network │
|
||||
│ session N │ │ │ │ access) │
|
||||
└─────────────┘ └────────────────┘ └─────────────────┘
|
||||
```
|
||||
|
||||
### Components
|
||||
|
||||
| Component | Location | Purpose |
|
||||
|-----------|----------|---------|
|
||||
| Dolt server | `beads-server` namespace, `10.0.20.200:3306` | Centralized MySQL-compatible database |
|
||||
| Root `.beads/` | `/home/wizard/code/.beads/` | Client config (server mode, prefix `code`) |
|
||||
| Task context hook | `/home/wizard/.claude/hooks/beads-task-context.sh` | Injects in-progress tasks into every prompt |
|
||||
| Task blocker hook | `/home/wizard/.claude/hooks/beads-block-builtin-tasks.py` | Blocks TaskCreate/TodoWrite, redirects to `bd` |
|
||||
| Project settings | `/home/wizard/code/.claude/settings.json` | Shared hooks (inherited by all users) |
|
||||
| Terraform stack | `stacks/beads-server/` | Deployment, Service (MetalLB LB), PVC |
|
||||
|
||||
### Settings Hierarchy
|
||||
|
||||
```
|
||||
Project-level (.claude/settings.json) ← Shared: beads hooks + TaskCreate blocker
|
||||
└─ User-level (~/.claude/settings.json) ← Per-user: memory plugin, model, statusline
|
||||
```
|
||||
|
||||
Both `wizard` and `emo` inherit project-level settings automatically. User-specific hooks (e.g., wizard's memory plugin) stay in the user-level settings.
|
||||
|
||||
## Agent Session Lifecycle
|
||||
|
||||
### 1. Session Start (automatic)
|
||||
|
||||
The `UserPromptSubmit` hook fires on every prompt:
|
||||
- Queries `bd list --status in_progress` from the centralized DB
|
||||
- Queries `bd list --status open | head -10` for available work
|
||||
- Injects results into the agent's context as `additionalContext`
|
||||
|
||||
The agent sees what's currently being worked on before processing any request.
|
||||
|
||||
### 2. Before Starting Work
|
||||
|
||||
```bash
|
||||
bd list --status in_progress # What others are working on
|
||||
bd ready # Unblocked tasks available
|
||||
bd create "Task description" # Register your work
|
||||
bd update <id> --claim # Set status to in_progress
|
||||
```
|
||||
|
||||
### 3. During Work
|
||||
|
||||
```bash
|
||||
bd note <id> "progress update" # Log progress
|
||||
bd link <child> <parent> # Add dependencies
|
||||
```
|
||||
|
||||
### 4. After Completing Work
|
||||
|
||||
```bash
|
||||
bd close <id> # Mark complete
|
||||
bd create "Follow-up task" # File remaining work for next session
|
||||
```
|
||||
|
||||
### 5. Enforcement
|
||||
|
||||
Two layers prevent agents from using built-in task tools:
|
||||
|
||||
1. **CLAUDE.md instruction** (soft): "Do NOT use TaskCreate, TaskUpdate, TodoWrite"
|
||||
2. **PermissionRequest hook** (hard): Blocks the tool call with a deny decision and redirect message
|
||||
|
||||
## Infrastructure
|
||||
|
||||
### Dolt Server
|
||||
|
||||
- **Image**: `dolthub/dolt-sql-server:latest`
|
||||
- **Storage**: `proxmox-lvm` PVC, 2Gi initial, auto-resize to 10Gi
|
||||
- **Service**: LoadBalancer via MetalLB on shared IP `10.0.20.200`
|
||||
- `metallb.io/allow-shared-ip: shared`
|
||||
- `externalTrafficPolicy: Cluster`
|
||||
- **Port**: 3306 (MySQL protocol)
|
||||
- **Users**: `root@%` and `beads@%` (no password, internal network)
|
||||
- **Init**: `/docker-entrypoint-initdb.d/` via ConfigMap, `DOLT_ROOT_HOST=%`
|
||||
- **Terraform**: `stacks/beads-server/main.tf`
|
||||
|
||||
### Client Configuration
|
||||
|
||||
The root `.beads/metadata.json`:
|
||||
```json
|
||||
{
|
||||
"backend": "dolt",
|
||||
"dolt_mode": "server",
|
||||
"dolt_server_host": "10.0.20.200",
|
||||
"dolt_server_port": 3306,
|
||||
"dolt_server_user": "beads",
|
||||
"dolt_database": "code"
|
||||
}
|
||||
```
|
||||
|
||||
### Multi-User Access
|
||||
|
||||
- Directory permissions: `2770 wizard:code-shared` (setgid)
|
||||
- Both `wizard` and `emo` are in the `code-shared` group
|
||||
- `bd` binary: `/home/wizard/.local/bin/bd` (symlinked for emo at `/home/emo/.local/bin/bd`)
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Subdirectory Shadow
|
||||
|
||||
Per-project `.beads/` directories exist in 7 subdirectories (finance, infra, Website, etc.). When an agent `cd`s into one of these, `bd` auto-discovers the **local** `.beads/` instead of the centralized one.
|
||||
|
||||
**Fix**: Always use `bd --db /home/wizard/code/.beads` when working from a subdirectory. The hook and CLAUDE.md instructions document this.
|
||||
|
||||
### Hook Network Failure
|
||||
|
||||
The task context hook suppresses errors (`2>/dev/null`). If the Dolt server is unreachable, the hook silently exits without injecting context. Agents won't see current tasks but won't be blocked either.
|
||||
|
||||
### Permissions Warning
|
||||
|
||||
`bd` warns about `.beads` directory permissions (`0770 vs recommended 0700`). This is expected — we use `0770` for group access. The warning is harmless.
|
||||
|
||||
## Verification
|
||||
|
||||
Run the E2E test:
|
||||
```bash
|
||||
bash /home/wizard/code/test-beads-e2e.sh
|
||||
```
|
||||
|
||||
This tests all 11 phases: hook injection, task CRUD, cross-user visibility, subdirectory shadowing, and multi-agent coordination. Expects 11/11 PASS.
|
||||
|
||||
## Related
|
||||
|
||||
- `CLAUDE.md` (root) — Mandatory task protocol section
|
||||
- Per-project `CLAUDE.md` files — Beads integration block
|
||||
- `stacks/beads-server/main.tf` — Terraform deployment
|
||||
325
docs/architecture/authentication.md
Normal file
325
docs/architecture/authentication.md
Normal file
|
|
@ -0,0 +1,325 @@
|
|||
# Authentication & Authorization
|
||||
|
||||
## Overview
|
||||
|
||||
The homelab uses Authentik as a centralized identity provider (IdP) for all services, providing single sign-on (SSO) via OIDC and forward authentication for ingress protection. Authentik integrates with social login providers (Google, GitHub, Facebook), manages user groups and RBAC policies, and enforces authentication at the Traefik ingress layer. The system supports both human authentication (OIDC SSO) and service-to-service authentication (Kubernetes SA JWT for CI/CD).
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
User[User Browser]
|
||||
Traefik[Traefik Ingress]
|
||||
ForwardAuth[ForwardAuth Middleware]
|
||||
Authentik[Authentik<br/>3 server + 3 worker<br/>+ embedded outpost]
|
||||
Backend[Protected Backend Service]
|
||||
|
||||
Social[Social Providers<br/>Google/GitHub/Facebook]
|
||||
K8s[Kubernetes API]
|
||||
Vault[Vault]
|
||||
|
||||
User -->|1. HTTPS Request| Traefik
|
||||
Traefik -->|2. Auth Check| ForwardAuth
|
||||
ForwardAuth -->|3. Verify Session| Authentik
|
||||
|
||||
Authentik -->|4a. Not Authenticated| User
|
||||
User -->|4b. Login Flow| Authentik
|
||||
Authentik -->|5. Social Login| Social
|
||||
Social -->|6. OAuth Callback| Authentik
|
||||
Authentik -->|7. Session Cookie| User
|
||||
User -->|8. Retry Request| Traefik
|
||||
|
||||
ForwardAuth -->|9. Authenticated| Backend
|
||||
Traefik -->|10. Forward Request| Backend
|
||||
|
||||
K8s -->|OIDC Groups| Authentik
|
||||
Vault -->|OIDC Auth| Authentik
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) |
|
||||
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
|
||||
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
|
||||
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
|
||||
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
|
||||
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
|
||||
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Forward Authentication Flow
|
||||
|
||||
Services pick an auth tier via the `auth` enum on the `ingress_factory` module (default `"required"`, fail-closed):
|
||||
|
||||
| Tier | Effect | When to use |
|
||||
|------|--------|-------------|
|
||||
| `"required"` | Authentik forward-auth gates every request | Backend has no own user auth — Authentik is the only gate |
|
||||
| `"app"` | No Authentik middleware; backend's own login is the gate | Backend handles its own user auth (NextAuth, Django, OAuth, bearer-token API) |
|
||||
| `"public"` | Authentik anonymous binding via `public` outpost | Audit trail without gating; only works for top-level browser navigation |
|
||||
| `"none"` | No Authentik middleware at all | Anubis-fronted content, webhooks, OAuth callbacks, native-client APIs (CalDAV, WebDAV, Git) |
|
||||
|
||||
When `auth = "required"`, an unauthenticated request flows:
|
||||
|
||||
1. Request hits Traefik ingress
|
||||
2. ForwardAuth middleware calls Authentik embedded outpost
|
||||
3. Authentik checks for valid session cookie
|
||||
4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me)
|
||||
5. User authenticates via social provider (Google/GitHub/Facebook)
|
||||
6. Authentik creates session, sets cookie, redirects back to original URL
|
||||
7. Subsequent requests include session cookie, pass auth check, reach backend
|
||||
|
||||
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
|
||||
|
||||
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
|
||||
|
||||
### Social Login & Invitation Flow
|
||||
|
||||
All new users must use an invitation link to register. The invitation-enrollment flow:
|
||||
|
||||
1. **invitation-validation** - Validates invitation token
|
||||
2. **enrollment-identification** - Social login (Google/GitHub/Facebook) + passkey registration
|
||||
3. **enrollment-prompt** - Collect name/email
|
||||
4. **enrollment-user-write** - Create user account
|
||||
5. **enrollment-login** - Auto-login after creation
|
||||
|
||||
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
|
||||
|
||||
### OIDC Applications
|
||||
|
||||
Authentik provides OIDC for 10 applications:
|
||||
|
||||
| Application | Type | Purpose |
|
||||
|-------------|------|---------|
|
||||
| Cloudflare Access | OIDC | Cloudflare Zero Trust tunnels |
|
||||
| Domain-wide catch-all | Proxy (Forward Auth) | Protect all `*.viktorbarzin.me` services |
|
||||
| Forgejo | OIDC | Git repository SSO |
|
||||
| Grafana | OIDC | Monitoring dashboard SSO |
|
||||
| Headscale | OIDC | Tailscale control plane auth |
|
||||
| Immich | OIDC | Photo management SSO |
|
||||
| Kubernetes | OIDC (public client) | K8s API authentication (kubectl / kubelogin CLI) |
|
||||
| Kubernetes Dashboard | OIDC (confidential) | Built for dashboard SSO — currently **idle** (apiserver OIDC blocked; dashboard uses forward-auth + token-paste) |
|
||||
| Linkwarden | OIDC | Bookmark manager SSO |
|
||||
| Wrongmove | OIDC | Real estate app SSO |
|
||||
|
||||
### Kubernetes API authentication (OIDC) — CURRENTLY NON-FUNCTIONAL
|
||||
|
||||
> ⚠️ **apiserver OIDC does not work in this cluster** (as of 2026-06-04). The
|
||||
> kube-apiserver rejects *every* valid Authentik OIDC token — with both the
|
||||
> legacy `--oidc-*` flags AND a structured `AuthenticationConfiguration`, for
|
||||
> both the `kubernetes` and `k8s-dashboard` issuers — despite verified
|
||||
> signature, issuer, audience, `email_verified=true`, synced clock, and a
|
||||
> reachable + publicly-trusted JWKS. Root cause is still open; see
|
||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12. A kubeadm v1.34
|
||||
> upgrade had earlier silently wiped the apiserver `--oidc-*` flags, so OIDC
|
||||
> CLI/dashboard login has effectively been off. **Do not assume `kubectl`
|
||||
> OIDC (kubelogin) works until this is resolved.**
|
||||
|
||||
The intended model (binds by `email`, see `stacks/rbac/modules/rbac/main.tf`):
|
||||
`admin` → `cluster-admin`; `power-user` → custom read-mostly ClusterRole;
|
||||
`namespace-owner` → `admin` RoleBinding in their namespace(s) + cluster read-only.
|
||||
|
||||
### Kubernetes Dashboard access (auto-injected SA token)
|
||||
|
||||
Because OIDC SSO is blocked, the web dashboard at `k8s.viktorbarzin.me` uses a
|
||||
**token-injector** instead — users never see the dashboard's token prompt:
|
||||
|
||||
1. **Authentik forward-auth** (`auth=required`) gates access AND injects
|
||||
`X-authentik-username` (the user's email). The `admin-services-restriction`
|
||||
policy admits `Home Server Admins` plus `kubernetes-admins` /
|
||||
`kubernetes-power-users` / `kubernetes-namespace-owners` for this host
|
||||
(`stacks/authentik/admin-services-restriction.tf`).
|
||||
2. **Token-injector** (`stacks/k8s-dashboard/dashboard_injector.tf`): an nginx
|
||||
that maps `X-authentik-username` → that user's ServiceAccount token and sets
|
||||
`Authorization: Bearer` before proxying to kong-proxy, so the dashboard
|
||||
auto-authenticates. Namespace-owners → `dashboard-<user>` SA (admin on their
|
||||
namespace + read-only on the namespace list & nodes only (dashboard-nav-readonly,
|
||||
NOT cross-tenant resource reads); `stacks/rbac/modules/rbac/dashboard-sa.tf`),
|
||||
auto-derived from `k8s_users`. Admins → the cluster-admin `kubernetes-dashboard`
|
||||
SA token (admin identities listed explicitly in `dashboard_injector.tf`, since
|
||||
their Authentik login email ≠ their `k8s_users` email).
|
||||
The injected token is the per-namespace security boundary; the map lives in a
|
||||
**Secret** (namespace-owners' cluster-read covers configmaps, not secrets).
|
||||
|
||||
> Manual token (fallback / break-glass): `kubectl -n <ns> get secret dashboard-<user>-token -o jsonpath='{.data.token}' | base64 -d`, or `kubectl create token kubernetes-dashboard -n kubernetes-dashboard` for admin.
|
||||
|
||||
The oauth2-proxy + `k8s-dashboard` Authentik OIDC app (built for the
|
||||
seamless-SSO design) remain deployed but **idle/unwired** pending the
|
||||
apiserver-OIDC fix.
|
||||
|
||||
### Authentik Groups
|
||||
|
||||
9 groups manage authorization:
|
||||
|
||||
- **Allow Login Users** - Base group, can authenticate to any OIDC app
|
||||
- **authentik Admins** - Full Authentik admin UI access
|
||||
- **Headscale Users** - Can access Headscale control plane
|
||||
- **Home Server Admins** - Admin access to homelab services
|
||||
- **Wrongmove Users** - Access to Wrongmove real estate app
|
||||
- **kubernetes-admins** - K8s cluster-admin role
|
||||
- **kubernetes-power-users** - K8s read-mostly access
|
||||
- **kubernetes-namespace-owners** - K8s namespace-scoped admin
|
||||
- **Task Submitters** - Can submit tasks to cluster task runner
|
||||
|
||||
### Vault Authentication
|
||||
|
||||
**For humans:**
|
||||
- OIDC method using Authentik as provider
|
||||
- SSO login to Vault UI and CLI
|
||||
- Group-based policy assignment
|
||||
|
||||
**For services (CI/CD):**
|
||||
- Kubernetes SA JWT authentication
|
||||
- Woodpecker CI uses service account token
|
||||
- Vault K8s secrets engine roles:
|
||||
- `dashboard-admin` - K8s dashboard admin token
|
||||
- `ci-deployer` - Deploy workloads via CI/CD
|
||||
- `openclaw` - AI assistant cluster access
|
||||
- `local-admin` - Local development access
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Config Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `stacks/authentik/` | Authentik deployment (servers, workers, PgBouncer) |
|
||||
| `modules/kubernetes/ingress_factory/` | Auth-tier enum + per-ingress middleware composition |
|
||||
| `stacks/traefik/modules/traefik/middleware.tf` | ForwardAuth middleware definitions (required + public outposts) |
|
||||
| `scripts/check-ingress-auth-comments.py` | Comment-convention guard wired into `scripts/tg` |
|
||||
| `stacks/vault/auth.tf` | Vault OIDC and K8s auth methods |
|
||||
|
||||
### Vault Paths
|
||||
|
||||
- **OIDC config**: `auth/oidc` - Authentik integration settings
|
||||
- **K8s auth**: `auth/kubernetes` - SA JWT validation
|
||||
- **K8s secrets engine**: `kubernetes/` - Dynamic kubeconfig/SA token generation
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
- `stacks/authentik/` - Authentik infrastructure
|
||||
- `stacks/platform/` - Traefik ingress with ForwardAuth
|
||||
- `stacks/vault/` - Vault auth methods
|
||||
|
||||
### Ingress Protection Examples
|
||||
|
||||
Authentik-gated admin UI (default):
|
||||
```hcl
|
||||
module "myapp_ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
name = "myapp"
|
||||
namespace = "myapp"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# auth = "required" is the default — Authentik forward-auth is the gate.
|
||||
}
|
||||
```
|
||||
|
||||
Backend with its own user auth (no Authentik in the way):
|
||||
```hcl
|
||||
module "myapp_ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
name = "myapp"
|
||||
namespace = "myapp"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# auth = "app": myapp uses NextAuth + Google OAuth; mobile clients can't follow Authentik 302.
|
||||
auth = "app"
|
||||
}
|
||||
```
|
||||
|
||||
Intentionally public webhook receiver:
|
||||
```hcl
|
||||
module "myapp_ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
name = "webhook"
|
||||
namespace = "webhooks"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# auth = "none": upstream signs payloads with HMAC; no user identity expected.
|
||||
auth = "none"
|
||||
}
|
||||
```
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Authentik over Keycloak?
|
||||
|
||||
- **Lighter weight**: Lower resource footprint (3+3+3 replicas vs Keycloak's heavier Java runtime)
|
||||
- **Better UX**: Modern UI, simpler admin experience, better mobile support
|
||||
- **Python-based**: Easier to extend, faster startup times, better developer experience
|
||||
- **Active development**: More frequent releases, responsive community
|
||||
|
||||
### Why Forward Auth over Sidecar?
|
||||
|
||||
- **Simpler architecture**: Single auth check at ingress, no sidecar per pod
|
||||
- **Works with any backend**: Language/framework agnostic, no SDK required
|
||||
- **Centralized policy**: All auth logic in Authentik, not distributed across sidecars
|
||||
- **Performance**: Single auth check per session, not per request
|
||||
|
||||
### Why OIDC for Kubernetes?
|
||||
|
||||
- **SSO integration**: Same login as all other services, no separate credentials
|
||||
- **No credential management**: No kubeconfig secrets to rotate, tokens are short-lived
|
||||
- **Group-based RBAC**: Centralized group management in Authentik, automatic K8s role mapping
|
||||
- **Public client flow**: No client secret needed, works in kubectl plugins and dashboards
|
||||
|
||||
### Why Invitation-Only Enrollment?
|
||||
|
||||
- **Security**: Prevents open internet access to homelab services
|
||||
- **Controlled onboarding**: Explicit approval before granting access
|
||||
- **Social login convenience**: No password management, leverages trusted providers
|
||||
- **Group auto-assignment**: Invitation encodes initial group membership
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Headers Not Stripped
|
||||
|
||||
**Problem**: Backend receives `X-Authentik-Username`, `X-Authentik-Email`, `X-Authentik-Groups` headers and breaks.
|
||||
|
||||
**Fix**: Traefik middleware should strip these headers before forwarding. Check `ingress_factory` module for header stripping config.
|
||||
|
||||
### OIDC Token Expired
|
||||
|
||||
**Problem**: `kubectl` returns 401 Unauthorized.
|
||||
|
||||
**Fix**: Re-authenticate to refresh token:
|
||||
```bash
|
||||
kubectl oidc-login setup --oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/
|
||||
```
|
||||
|
||||
### Social Login Redirect Loop
|
||||
|
||||
**Problem**: After social login, redirects to Authentik login page instead of destination.
|
||||
|
||||
**Fix**: Check Authentik application's redirect URIs. Must include `https://authentik.viktorbarzin.me/source/oauth/callback/*` for social providers.
|
||||
|
||||
### User Not in Correct Group
|
||||
|
||||
**Problem**: User authenticated but lacks permissions.
|
||||
|
||||
**Fix**: Check group membership in Authentik admin UI. Verify invitation `fixed_data` specified correct group. Manually add to group if needed.
|
||||
|
||||
### Vault OIDC Login Fails
|
||||
|
||||
**Problem**: Vault UI redirects to Authentik but returns error.
|
||||
|
||||
**Fix**:
|
||||
1. Verify Vault OIDC client credentials in Authentik
|
||||
2. Check Vault OIDC issuer URL matches Authentik
|
||||
3. Ensure Vault redirect URI (`https://vault.viktorbarzin.me/ui/vault/auth/oidc/oidc/callback`) is registered in Authentik
|
||||
|
||||
### K8s Auth Group Mapping Not Working
|
||||
|
||||
**Problem**: User authenticated but `kubectl` shows limited permissions despite being in `kubernetes-admins`.
|
||||
|
||||
**Fix**:
|
||||
1. Verify group claim is present in token: `kubectl oidc-login get-token | jq -R 'split(".") | .[1] | @base64d | fromjson'`
|
||||
2. Check ClusterRoleBinding maps group correctly: `kubectl get clusterrolebinding -o yaml | grep kubernetes-admins`
|
||||
3. Ensure Authentik OIDC app includes `groups` scope
|
||||
|
||||
## Related
|
||||
|
||||
- [Security & L7 Protection](./security.md) - CrowdSec, anti-AI scraping, rate limiting
|
||||
- [Networking](./networking.md) - Ingress, DNS, load balancing
|
||||
- [Vault Runbook](../runbooks/vault.md) - Vault operations and troubleshooting
|
||||
- [Kubernetes Access Runbook](../runbooks/k8s-access.md) - Setting up kubectl with OIDC
|
||||
355
docs/architecture/automated-upgrades.md
Normal file
355
docs/architecture/automated-upgrades.md
Normal file
|
|
@ -0,0 +1,355 @@
|
|||
# Automated Upgrades
|
||||
|
||||
This doc covers three independent automation paths:
|
||||
|
||||
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
|
||||
2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
|
||||
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
|
||||
|
||||
## Overview
|
||||
|
||||
OSS services are automatically upgraded via a pipeline that detects new container image versions, analyzes changelogs for breaking changes, backs up databases, applies version bumps through Terraform, and verifies health post-upgrade with automatic rollback on failure.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
DIUN (every 6h)
|
||||
│ detects new image tags
|
||||
│
|
||||
▼
|
||||
n8n Webhook (POST /webhook/<uuid>)
|
||||
│ filters: skip databases, custom images, infra, :latest
|
||||
│ rate limit: max 5 upgrades per 6h window
|
||||
│
|
||||
▼
|
||||
HTTP POST → claude-agent-service (K8s)
|
||||
│
|
||||
▼
|
||||
claude -p "upgrade agent prompt" (in-cluster)
|
||||
│
|
||||
▼
|
||||
Service Upgrade Agent
|
||||
├── 1. Identify service + .tf files (grep stacks/)
|
||||
├── 2. Resolve GitHub repo (config overrides + auto-detect)
|
||||
├── 3. Fetch changelogs via GitHub API (authenticated, 5000 req/hr)
|
||||
├── 4. Classify risk (SAFE / CAUTION / UNKNOWN)
|
||||
├── 5. Slack notification — starting
|
||||
├── 6. DB backup (if DB-backed service)
|
||||
├── 7. Edit .tf files (version bump + config changes)
|
||||
├── 8. Commit + push (Woodpecker CI applies)
|
||||
├── 9. Wait for CI (poll Woodpecker API)
|
||||
├── 10. Verify (pod ready + HTTP + Uptime Kuma)
|
||||
├── 11a. SUCCESS → Slack report
|
||||
└── 11b. FAILURE → git revert + CI re-applies → Slack alert
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### DIUN (Docker Image Update Notifier)
|
||||
- **Stack**: `stacks/diun/`
|
||||
- **Schedule**: Every 6 hours (`DIUN_WATCH_SCHEDULE=0 */6 * * *`)
|
||||
- **Role**: Detection only — fires a webhook to n8n when a new image tag is found
|
||||
- **Skip patterns**: Databases, `viktorbarzin/*`, `registry.viktorbarzin.me/*`, infrastructure images
|
||||
- **Webhook**: `DIUN_NOTIF_WEBHOOK_ENDPOINT` from Vault `secret/diun` → `n8n_webhook_url`
|
||||
|
||||
### n8n Workflow ("DIUN Upgrade Agent")
|
||||
- **Stack**: `stacks/n8n/`
|
||||
- **Workflow backup**: `stacks/n8n/workflows/diun-upgrade.json`
|
||||
- **Webhook path**: UUID-based (`/webhook/<uuid>`)
|
||||
- **Filters**:
|
||||
- Only `status=update` (skip `new`, `unchanged`)
|
||||
- Skip databases, custom images, infra images, `:latest`
|
||||
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
|
||||
- **Action**: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt
|
||||
|
||||
### Upgrade Agent
|
||||
- **Prompt**: `.claude/agents/service-upgrade.md`
|
||||
- **Config**: `.claude/reference/upgrade-config.json`
|
||||
- Contains:
|
||||
- 50+ Docker image → GitHub repo mappings
|
||||
- 22 Helm chart → GitHub repo mappings
|
||||
- 27 DB-backed service definitions with backup metadata
|
||||
- Skip patterns and breaking change keywords
|
||||
|
||||
## Risk Classification
|
||||
|
||||
| Risk | Criteria | Verification | Version Jump |
|
||||
|------|----------|-------------|-------------|
|
||||
| **SAFE** | Patch/minor bump, no breaking keywords in release notes | 2 minutes | Direct to target |
|
||||
| **CAUTION** | Major bump, or breaking change keywords found, or in `version_jump_always_step` list | 10 minutes | Step through each version |
|
||||
| **UNKNOWN** | Changelog unavailable | 2 minutes (SAFE defaults) | Direct to target |
|
||||
|
||||
**Breaking change keywords**: `breaking`, `BREAKING`, `migration required`, `schema change`, `database migration`, `manual intervention`, `action required`, `removed`, `deprecated`, `renamed`, `incompatible`
|
||||
|
||||
## Database Backup
|
||||
|
||||
DB-backed services trigger a pre-upgrade backup automatically:
|
||||
- **Shared PostgreSQL**: `kubectl create job --from=cronjob/postgresql-backup -n dbaas`
|
||||
- **Shared MySQL**: `kubectl create job --from=cronjob/mysql-backup -n dbaas`
|
||||
- **Dedicated databases** (e.g., Immich): Trigger existing backup CronJob in the service's namespace
|
||||
|
||||
If the backup fails, the upgrade is **aborted**.
|
||||
|
||||
## Rollback
|
||||
|
||||
On verification failure:
|
||||
1. `git revert --no-edit <upgrade-commit-sha>`
|
||||
2. `git push` → Woodpecker CI re-applies the old version
|
||||
3. Re-verify rollback succeeded
|
||||
4. If rollback also fails → CRITICAL Slack alert for manual intervention
|
||||
|
||||
## Version Patterns
|
||||
|
||||
The agent handles all three version patterns in Terraform:
|
||||
|
||||
| Pattern | Example | Agent Action |
|
||||
|---------|---------|-------------|
|
||||
| Variable-based | `variable "immich_version" { default = "v2.7.4" }` | Edit the `default` value |
|
||||
| Hardcoded | `image = "vaultwarden/server:1.35.4"` | Replace tag in image string |
|
||||
| Helm chart | `version = "2026.2.2"` in `helm_release` | Bump chart version |
|
||||
|
||||
## Configuration
|
||||
|
||||
### Excluding images (handled by DIUN + n8n)
|
||||
- Databases: `*postgres*`, `*mysql*`, `*redis*`, `*clickhouse*`, `*etcd*`
|
||||
- Custom: `viktorbarzin/*`, `registry.viktorbarzin.me/*`, `ancamilea/*`, `mghee/*`
|
||||
- Infrastructure: `registry.k8s.io/*`, `quay.io/tigera/*`, `nvcr.io/*`, `reg.kyverno.io/*`
|
||||
- `:latest` tags
|
||||
|
||||
### Rate limiting
|
||||
- Max 5 upgrades per 6-hour DIUN scan cycle
|
||||
- Counter resets when the window expires
|
||||
- Configurable in the n8n "Filter and Rate Limit" code node
|
||||
|
||||
### Services that always step through versions
|
||||
- Authentik, Nextcloud, Immich (configured in `upgrade-config.json` → `version_jump_always_step`)
|
||||
|
||||
## Monitoring
|
||||
|
||||
- **Slack**: All upgrade events reported (start, success, failure, rollback)
|
||||
- **Git**: Detailed commit messages with changelog summaries, risk level, backup status
|
||||
- **DIUN Slack**: Independent Slack channel for raw version detection (separate from upgrade agent)
|
||||
|
||||
## Bulk Upgrades
|
||||
|
||||
To upgrade all outdated services at once, fire webhooks for each service:
|
||||
|
||||
```bash
|
||||
WEBHOOK="https://n8n.viktorbarzin.me/webhook/<uuid>"
|
||||
curl -s -X POST "$WEBHOOK" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"diun_entry_status":"update","diun_entry_image":"<image>","diun_entry_imagetag":"<new_tag>","diun_entry_provider":"kubernetes"}'
|
||||
```
|
||||
|
||||
n8n processes all webhooks in parallel (one `claude -p` per webhook); `claude-agent-service` runs them concurrently via a bounded pool (`MAX_CONCURRENCY`, default 10, excess queued) — it no longer single-flight-locks. Before bulk runs, increase the rate limit in the n8n Code node (`MAX_UPGRADES_PER_WINDOW`) and reset the counter:
|
||||
|
||||
```sql
|
||||
-- Reset rate limiter
|
||||
UPDATE workflow_entity SET "staticData" = '{}'::json WHERE name = 'DIUN Upgrade Agent';
|
||||
```
|
||||
|
||||
### First Bulk Run (2026-04-16)
|
||||
|
||||
12 services upgraded in ~30 minutes, fully automated:
|
||||
|
||||
| Service | From | To | Notes |
|
||||
|---------|------|----|-------|
|
||||
| audiobookshelf | 2.32.1 | 2.33.1 | Security fixes (IDOR) |
|
||||
| owntracks | 0.9.9 | 1.0.1 | Major version bump |
|
||||
| open-webui | v0.7.2 | v0.8.12 | |
|
||||
| immich | v2.7.4 | v2.7.5 | Patch, DB backup taken |
|
||||
| coturn | 4.6.3-r1 | 4.10.0-r1 | Major version bump |
|
||||
| shlink | 4.3.4 | 5.0.2 | Major, DB-backed |
|
||||
| phpipam | v1.7.0 | v1.7.4 | Patch, DB-backed |
|
||||
| onlyoffice | 8.2.3 | 9.3.1 | Major version bump |
|
||||
| paperless-ngx | 2.16.4 | 2.20.14 | Agent also bumped memory 1Gi → 2Gi |
|
||||
| linkwarden | v2.9.1 | v2.14.0 | 23 intermediate releases, 254M DB backup |
|
||||
| synapse | v1.125.0 | v1.151.0 | Large jump, DB-backed |
|
||||
| dawarich | 0.37.1 | 1.6.1 | Upgraded → verification failed → auto-rolled back → forward-fixed |
|
||||
|
||||
Key behaviors observed:
|
||||
- **Auto-rollback works**: Dawarich upgrade failed verification, agent reverted, then re-applied with a forward fix
|
||||
- **Resource awareness**: Paperless-ngx agent detected the new version needed more memory and bumped limits
|
||||
- **DB backups**: All DB-backed services had pre-upgrade dumps taken automatically
|
||||
- **Changelog analysis**: Linkwarden commit summarized 23 intermediate releases; vaultwarden (earlier test) identified 3 CVEs
|
||||
- **Parallel execution**: 11 agents ran concurrently, handled git rebase conflicts automatically
|
||||
|
||||
## Secrets
|
||||
|
||||
| Secret | Vault Path | Purpose |
|
||||
|--------|-----------|---------|
|
||||
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
|
||||
| Agent API bearer token | `secret/claude-agent-service` → `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both `claude-agent` ns (consumer) and `n8n` ns (caller) via ESO. n8n exposes it to the container as `CLAUDE_AGENT_API_TOKEN` env var. |
|
||||
| Claude OAuth (primary) | `secret/claude-agent-service` → `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json` — CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
|
||||
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` → `claude_oauth_token` | Failover tokens. Minted alongside primary (verified Anthropic does NOT revoke earlier sessions on new mint). Swap into primary if revocation or compromise. |
|
||||
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
|
||||
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
|
||||
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |
|
||||
|
||||
## OAuth token lifecycle
|
||||
|
||||
The CLI supports two auth modes. We use the second — long-lived.
|
||||
|
||||
| Mode | How minted | TTL | Needs refresh? | When to use |
|
||||
|------|-----------|-----|----------------|-------------|
|
||||
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes — CLI auto-refreshes on startup if refresh token valid | Human dev machines |
|
||||
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | **1 year** | No — expires hard | **Headless / service accounts (us)** |
|
||||
|
||||
When both are present on disk, `CLAUDE_CODE_OAUTH_TOKEN` env var wins.
|
||||
|
||||
**Harvesting headless**: `setup-token` uses Ink (React for terminals) and needs a real PTY with **≥300-column width**. At 80-col, Ink wraps and DROPS one character at the wrap boundary (107-char invalid instead of 108-char valid). Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.
|
||||
|
||||
**Monitoring**: CronJob `claude-oauth-expiry-monitor` (claude-agent ns, every 6h) pushes `claude_oauth_token_expiry_timestamp{path="..."}` to Pushgateway. Alerts: `ClaudeOAuthTokenExpiringSoon` (30d, warn), `ClaudeOAuthTokenCritical` (7d, crit), `ClaudeOAuthTokenMonitorStale` (48h no push, warn), `ClaudeOAuthTokenMonitorNeverRun` (metric absent, warn).
|
||||
|
||||
**Rotation**: on alert, harvest a new token, `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`, update the `claude_oauth_token_mint_epochs` local in `stacks/claude-agent-service/main.tf`, `scripts/tg apply` → alert clears on next cron tick.
|
||||
|
||||
## n8n workflow gotchas
|
||||
|
||||
The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **not** Terraform-managed. The JSON at `stacks/n8n/workflows/diun-upgrade.json` is a backup; the live state lives in `workflow_entity.nodes`. Drift between the two is possible.
|
||||
|
||||
- **HTTP Request node header expressions must use template-literal form**: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
|
||||
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
|
||||
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
|
||||
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
|
||||
|
||||
## K8s Node OS Upgrades
|
||||
|
||||
Independent of the service-upgrade pipeline above. Drives apt package updates + reboots on the 5 K8s VMs (master + 4 workers).
|
||||
|
||||
### Stack
|
||||
- **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
|
||||
- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
|
||||
- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window). The gate runs as an immortal `bash` loop that forks `kubectl` each cycle; the pod whose host has a pending reboot runs the full kubectl-heavy path indefinitely and slowly leaks. Mitigated 2026-05-31 (limit 64Mi→256Mi + `MAX_ITER=72` self-exit ≈6h so kubelet restarts it fresh) — see PM `2026-05-31-kured-sentinel-gate-oom.md`.
|
||||
- **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
|
||||
- **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
|
||||
- **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
|
||||
|
||||
### Source of truth
|
||||
| Concern | Location |
|
||||
|---|---|
|
||||
| Package config (uu, holds, blacklist) | `modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`) |
|
||||
| kured Helm release + sentinel-gate DS | `stacks/kured/main.tf` |
|
||||
| Upgrade Gates alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
|
||||
|
||||
### Day-2 changes
|
||||
Cloud-init only runs on first boot. Existing nodes are brought into compliance with a one-shot SSH push — see the runbook section "Restore / re-apply unattended-upgrades config to existing nodes" in `docs/runbooks/k8s-node-auto-upgrades.md`.
|
||||
|
||||
### Why this design
|
||||
The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades kernel push that corrupted containerd's overlayfs snapshotter cluster-wide. The remediations:
|
||||
- 24h soak (sentinel-gate Check 4) gives a full day of observation between consecutive node reboots — broken updates show up as Prometheus alerts before any other node restarts.
|
||||
- Prometheus halt-on-alert turns ANY firing alert into a hard block — including the 6 Node Runtime Health alerts and the 10 Upgrade Gates alerts that explicitly model "the cluster is in a bad state."
|
||||
- Package-Blacklist on runtime components prevents the exact failure mode (containerd/runc auto-bumps).
|
||||
- `Automatic-Reboot=false` keeps reboot policy in kured (window, ordering, gating), not in apt.
|
||||
|
||||
### Operational reference
|
||||
See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
|
||||
|
||||
## K8s Version Upgrades
|
||||
|
||||
Independent of the OS-upgrade and service-upgrade pipelines. Drives
|
||||
kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
|
||||
|
||||
### Architecture
|
||||
|
||||
```
|
||||
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
|
||||
│ probe apt-cache madison kubeadm (master) → latest available patch
|
||||
│ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
|
||||
│ push k8s_upgrade_available metric to Pushgateway
|
||||
│
|
||||
▼ if a target is detected
|
||||
envsubst on /template/job-template.yaml | kubectl apply -f -
|
||||
│ spawns Job 0 = k8s-upgrade-preflight-<target_version>
|
||||
▼
|
||||
|
||||
Job 0 — preflight (pinned: k8s-node1)
|
||||
Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master
|
||||
Job 2 — worker (pinned: k8s-node1) drains k8s-node4
|
||||
Job 3 — worker (pinned: k8s-node1) drains k8s-node3
|
||||
Job 4 — worker (pinned: k8s-node1) drains k8s-node2
|
||||
Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration
|
||||
Job 6 — postflight (no pinning)
|
||||
```
|
||||
|
||||
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
|
||||
by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
|
||||
apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
|
||||
so `apply` reconciles to a single Job per run — re-running a failed Job
|
||||
won't duplicate downstream Jobs.
|
||||
|
||||
### Self-preemption history (the reason for the Job-chain rewrite)
|
||||
|
||||
The v1 design ran the whole upgrade inside the `claude-agent-service`
|
||||
Deployment (1 replica, no nodeSelector). On 2026-05-11 the agent's pod was
|
||||
scheduled to k8s-node4. When the agent ran `kubectl drain k8s-node4` during
|
||||
Stage 6, it evicted itself — the bash process died after the drain but
|
||||
before the SSH-pipe to install kubeadm on node4. The cluster ended up
|
||||
half-upgraded (master at v1.34.7, workers at v1.34.2). The rewrite to a
|
||||
chain of `nodeSelector`-pinned Jobs eliminates this failure mode because
|
||||
each Job's pod and its drain target are always different nodes.
|
||||
|
||||
### Components
|
||||
|
||||
- **Detection CronJob + ConfigMaps + RBAC**: `infra/stacks/k8s-version-upgrade/main.tf`.
|
||||
- Image is the claude-agent-service image (kubectl + ssh-client + curl + jq + envsubst).
|
||||
- One unified ServiceAccount `k8s-upgrade-job` serves both the detection CronJob and every chain Job.
|
||||
- **Phase body**: `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`.
|
||||
Dispatches on `$PHASE` (preflight | master | worker | postflight). Computes
|
||||
`NEXT_PHASE` / `NEXT_TARGET_NODE` / `NEXT_RUN_ON` and spawns the next Job.
|
||||
Includes a `predrain_unstick` helper that pre-deletes pods on the target
|
||||
node whose PDB has `disruptionsAllowed=0` (otherwise drain loops forever on
|
||||
single-replica deployments like Anubis instances).
|
||||
- **Job template**: `infra/stacks/k8s-version-upgrade/job-template.yaml`.
|
||||
envsubst-rendered at runtime. Mounts a `creds` Secret, a `scripts`
|
||||
ConfigMap, and a `template` ConfigMap into each Job pod.
|
||||
- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
|
||||
`--role master|worker --release X.Y.Z`. Piped via SSH into each node by
|
||||
upgrade-step.sh.
|
||||
- **Three Upgrade Gates alerts**:
|
||||
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
|
||||
- `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
|
||||
- `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
|
||||
- **Pushgateway metrics**:
|
||||
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
|
||||
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
|
||||
- `k8s_upgrade_started_timestamp` (set in preflight; used by `K8sUpgradeStalled`)
|
||||
- `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
|
||||
- `k8s_version_check_last_run_timestamp` (staleness watchdog)
|
||||
|
||||
### Source of truth
|
||||
|
||||
| Concern | Location |
|
||||
|---|---|
|
||||
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `stacks/k8s-version-upgrade/main.tf` |
|
||||
| Phase orchestration | `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
|
||||
| Job template | `stacks/k8s-version-upgrade/job-template.yaml` |
|
||||
| Per-node upgrade script | `scripts/update_k8s.sh` |
|
||||
| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
|
||||
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
|
||||
| Deprecated agent prompt (reference) | `.claude/agents/k8s-version-upgrade.deprecated.md` |
|
||||
|
||||
### Why this design
|
||||
|
||||
The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
|
||||
|
||||
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
|
||||
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
|
||||
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
|
||||
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
|
||||
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
|
||||
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
|
||||
- **PDB-blocked pods don't stall the chain**. `predrain_unstick` deletes PDB=0 pods on the target node directly (bypassing the eviction API), so the parent Deployment recreates them elsewhere. This was the workaround applied manually during the 2026-05-11 recovery for Anubis single-replica instances.
|
||||
|
||||
### Secrets
|
||||
|
||||
| Secret | Vault Path | Purpose |
|
||||
|--------|-----------|---------|
|
||||
| SSH private key | `secret/k8s-upgrade.ssh_key` | Jobs SSH `wizard@<node>` |
|
||||
| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
|
||||
| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
|
||||
|
||||
The previous `api_bearer_token` entry is gone — the chain does not POST to `claude-agent-service`.
|
||||
|
||||
### Operational reference
|
||||
|
||||
See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection, killing a stuck Job, skipping a phase, rollback paths (master / worker / mid-flight abort), and SSH key rotation.
|
||||
961
docs/architecture/backup-dr.md
Normal file
961
docs/architecture/backup-dr.md
Normal file
|
|
@ -0,0 +1,961 @@
|
|||
# Backup & Disaster Recovery Architecture
|
||||
|
||||
Last updated: 2026-06-01
|
||||
|
||||
> **2026-06-01 — regenerable services carved back out** (offsite Synology hit
|
||||
> 97%; the `Backup` share had grown +670 G in a week, traced to the 2026-05-26
|
||||
> change below that started mirroring large regenerable data offsite):
|
||||
> - **`nfs-mirror` re-excludes** `ollama` (20 G), `prometheus-backup` (64 G),
|
||||
> `audiblez` (24 G), `ebook2audiobook` (11 G). Live copy stays on sdc; no
|
||||
> sda/Synology copy. `--delete` reaps them from sda on the next run.
|
||||
> `*-backup` DB dumps (sqlite-backup etc.) are KEPT — real DB safety copies.
|
||||
> - **`offsite-sync` Step 2 nfs-ssd → immich-only**: `ollama` (59 G) +
|
||||
> `llamacpp` (26 G) on the SSD no longer ship to Synology (re-pullable
|
||||
> models). Was a blanket `/srv/nfs-ssd/` sync; now immich-only like nfs/.
|
||||
> - **`daily-backup` skips `nextcloud/nextcloud-data-proxmox`** — orphaned
|
||||
> pre-encryption PV (Released, Retain) that was still backed up weekly.
|
||||
> - **Nextcloud backup shrunk**: the dedicated nextcloud-backup CronJob
|
||||
> (`stacks/nextcloud`) kept 7 full copies incl. a 10 GB+ `nextcloud.log`
|
||||
> (87 G total). Now: `log_rotate_size=10 MB` caps the log at source, backup
|
||||
> excludes `nextcloud.log*` + preview cache, retention 7 → 1 (pvc-data holds
|
||||
> the version history). Footprint < 5 G.
|
||||
> - **Nextcloud image pinned to `32.0.9`** in chart_values — the 2026-05-26
|
||||
> Keel bump (32.0.3 → 32.0.9, data migrated to 32.0.9.2) was never pinned in
|
||||
> TF, so this session's apply rolled a 32.0.3 pod and CrashLooped on the
|
||||
> downgrade. Pinning eliminates the drift.
|
||||
> - **One-off Synology delete** of the existing copies above + emptied the
|
||||
> `Backup`/`Emo shared` recycle bins (~31 G). ~340 G total; reclaims as the
|
||||
> 3-day `Backup`-share snapshots roll off (or via manual snapshot expiry).
|
||||
|
||||
> **2026-05-26 — bypass list pruned to a single path** (follow-up to the
|
||||
> 2026-05-24 changes below):
|
||||
> - `nfs-mirror` now copies ollama, audiblez, ebook2audiobook, and every
|
||||
> `*-backup` CronJob output onto sda. Previously these went sdc → Synology
|
||||
> DIRECT via Step 2; now they ride leg 1 like everything else.
|
||||
> - **Bypass list (leg 2)** is now just `/srv/nfs/immich/` — too big for sda
|
||||
> (1.5 T), no other choice.
|
||||
> - **frigate and temp**: dropped from BOTH legs — intentionally not backed up.
|
||||
> frigate is a 14-day camera ring, temp is scratch space. User explicit ask
|
||||
> 2026-05-26.
|
||||
> - **prometheus, loki, alertmanager**: live-orphan dirs that no longer
|
||||
> exist on `/srv/nfs`. Dropped from the exclude/include lists as no-ops.
|
||||
> - `/mnt/backup/anca-elements` (423 G) deleted — canonical copy lives in
|
||||
> Immich since the 2026-05-24 ingest.
|
||||
> - **`nfs-mirror.timer`: weekly Mon 04:00 → daily 02:00.** Steady-state
|
||||
> delta is 10-20 min of mostly-metadata rsync, so the IO cost is
|
||||
> negligible. RPO for non-CronJob app data (nextcloud shared files,
|
||||
> audiobookshelf library, mailserver Maildir, real-estate-crawler scraped
|
||||
> data, etc.) drops from 7 days to ~24h.
|
||||
> - Aftermath: sda 87% → 46% used; Synology `/Viki/nfs/` shrinks to
|
||||
> immich-only on next monthly `--delete` pass (or manual cleanup —
|
||||
> see runbook).
|
||||
>
|
||||
> **2026-05-24 session — what changed**:
|
||||
> - **anca-elements archive direction inverted** — Synology `/Backup/Anca/Elements` (770G) deleted; PVE `/srv/nfs/anca-elements` is now source of truth. `anca-elements-sync.sh` retired.
|
||||
> - **`anca-elements-mirror.{sh,service,timer}` retired**, subsumed into the new **`nfs-mirror`** weekly job covering all critical NFS subtrees (anca-elements + ~80 services) → sda.
|
||||
> - **Synology `/Backup/Viki/nfs/<svc>/` orphan cleanup** — 84 dirs renamed in-place (btrfs metadata-only) to `/Backup/Viki/pve-backup/<svc>/` so daily-incremental Step 1 sees them as pre-existing and only ships deltas. No re-transfer.
|
||||
> - **Synology snapshot retention 7d → 3d**, all 8 backlog snapshots deleted via `sudo synosharesnapshot delete Backup ...`. Reclaimed ~800G btrfs (98% → 83% used). DSM API was blocked by 2FA; `sudo` over the existing `Administrator` SSH key worked with the Vault-stored password.
|
||||
> - **Manifest mechanism extended**: `nfs-mirror` now appends its transferred file list to `/mnt/backup/.changed-files` so daily Step 1 incremental picks it up (was previously only fed by `daily-backup`).
|
||||
|
||||
## Overview
|
||||
|
||||
The homelab runs a 3-2-1 strategy with a **two-leg** path to Synology so every NFS byte takes exactly one route to offsite (no duplication, no gaps):
|
||||
|
||||
```
|
||||
sdc /srv/nfs/<svc>/ ──nfs-mirror daily 02:00──→ sda /mnt/backup/<svc>/ ──offsite-sync Step 1──→ Synology /Backup/Viki/pve-backup/<svc>/ [leg 1]
|
||||
sdc /srv/nfs/immich/ ──inotify (nfs-change-tracker)──→ offsite-sync Step 2 ──→ Synology /Backup/Viki/nfs/immich/ [leg 2]
|
||||
sdc PVCs (LVM thin) ──daily-backup~snapshot~rsync──→ sda /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config}/ ──Step 1──→ Synology /Backup/Viki/pve-backup/
|
||||
```
|
||||
|
||||
The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5 T). **Not backed up at all**: `/srv/nfs/frigate/` (camera ring buffer), `/srv/nfs/temp/` (scratch). Everything else rides leg 1 via `nfs-mirror`.
|
||||
|
||||
**3-2-1 Breakdown**:
|
||||
- **Copy 1** (live): all PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD); all NFS data at `/srv/nfs[-ssd]/`
|
||||
- **Copy 2** (local backup): sda `/mnt/backup` (1.1TB RAID1 SAS) — **46% used** post-2026-05-26 (was 87% before anca-elements cleanup; bypass-list pruning added ~260 G of *-backup + ollama + audiblez + ebook2audiobook)
|
||||
- **Copy 3** (offsite): Synology NAS at 192.168.1.13
|
||||
- `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs incl. `*-backup` DB dumps. **ollama/audiblez/ebook2audiobook/prometheus-backup excluded 2026-06-01** — regenerable, live-only)
|
||||
- `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26)
|
||||
- `Synology/Backup/Viki/nfs-ssd/` — **immich-ML only (2026-06-01)**; ollama/llamacpp dropped (re-pullable models, live-only on the SSD)
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
### Data Routing — where each path goes (post-2026-05-26)
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
classDef live fill:#e1f5ff,stroke:#01579b
|
||||
classDef sda fill:#fff9c4,stroke:#f57f17
|
||||
classDef syn fill:#c8e6c9,stroke:#1b5e20
|
||||
classDef none fill:#ffcdd2,stroke:#b71c1c
|
||||
|
||||
subgraph sdc["sdc /srv/nfs/ — Tier 1 live"]
|
||||
IMM["immich/ 1.5T"]:::live
|
||||
FRI["frigate/ 131G"]:::live
|
||||
TMP["temp/ 12G"]:::live
|
||||
ANE["anca-elements/ 771G<br/>legacy"]:::live
|
||||
APP["everything else<br/>(mysql, postgresql, nextcloud,<br/>mailserver, servarr, audiobookshelf,<br/>ollama, audiblez, ebook2audiobook,<br/>*-backup CronJob outputs, …)"]:::live
|
||||
end
|
||||
|
||||
subgraph sdcssd["sdc /srv/nfs-ssd/"]
|
||||
IMM_ML["immich/ 62G"]:::live
|
||||
OLL_S["ollama/ 59G"]:::live
|
||||
LLA["llamacpp/ 26G"]:::live
|
||||
end
|
||||
|
||||
SDA[("sda /mnt/backup/<br/>Tier 2 local")]:::sda
|
||||
SYN_PVE[("Synology<br/>/Viki/pve-backup/")]:::syn
|
||||
SYN_NFS[("Synology<br/>/Viki/nfs/")]:::syn
|
||||
SYN_SSD[("Synology<br/>/Viki/nfs-ssd/")]:::syn
|
||||
NOPE([NOT BACKED UP]):::none
|
||||
|
||||
APP -- "nfs-mirror daily 02:00" --> SDA
|
||||
SDA -- "offsite-sync Step 1<br/>daily 06:00" --> SYN_PVE
|
||||
IMM -- "Step 2 inotify direct<br/>daily 06:00" --> SYN_NFS
|
||||
IMM_ML --> SYN_SSD
|
||||
OLL_S --> SYN_SSD
|
||||
LLA --> SYN_SSD
|
||||
FRI --- NOPE
|
||||
TMP --- NOPE
|
||||
ANE --- NOPE
|
||||
```
|
||||
|
||||
### Overall Backup Flow
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
|
||||
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>65 proxmox-lvm PVCs"]
|
||||
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
|
||||
|
||||
subgraph Layer1["Layer 1: LVM Thin Snapshots"]
|
||||
Snap["Twice daily 00:00, 12:00<br/>7-day retention<br/>62 PVCs (excludes dbaas+monitoring)"]
|
||||
end
|
||||
|
||||
subgraph Layer2a["Layer 2a: Daily NFS Mirror (nfs-mirror)"]
|
||||
NFSMirror["Daily 02:00<br/>/srv/nfs/* → /mnt/backup/<svc>/<br/>excludes: immich, frigate, temp, anca-elements"]
|
||||
end
|
||||
|
||||
subgraph Layer2b["Layer 2b: Daily PVC File Backup (daily-backup)"]
|
||||
PVCBackup["PVC File Copy<br/>Daily 05:00<br/>4 weekly versions via --link-dest<br/>/mnt/backup/pvc-data/<YYYY-WW>/"]
|
||||
SQLiteBackup["Auto SQLite Backup<br/>magic number check + ?mode=ro<br/>from PVC snapshots"]
|
||||
PfsenseBackup["pfSense Backup<br/>config.xml + full tar<br/>4 weekly versions"]
|
||||
PVEConfig["PVE Config<br/>/etc/pve + scripts"]
|
||||
end
|
||||
|
||||
sdc --> Snap
|
||||
sdc --> NFSMirror
|
||||
sdc --> PVCBackup
|
||||
NFSMirror --> sda
|
||||
PVCBackup --> sda
|
||||
SQLiteBackup --> sda
|
||||
PfsenseBackup --> sda
|
||||
PVEConfig --> sda
|
||||
end
|
||||
|
||||
subgraph NFS_Storage["Proxmox NFS (/srv/nfs)"]
|
||||
NFS_Backup["NFS *-backup dirs<br/>(populated by in-cluster CronJobs)"]
|
||||
|
||||
subgraph AppBackups["App-Level Backup CronJobs"]
|
||||
CronDaily["Daily 00:00-00:30<br/>PostgreSQL, MySQL<br/>14d retention"]
|
||||
CronWeekly["Weekly Sunday<br/>etcd, Vault, Redis<br/>Vaultwarden 6h<br/>30d retention"]
|
||||
end
|
||||
|
||||
CronDaily --> NFS_Backup
|
||||
CronWeekly --> NFS_Backup
|
||||
NFS_Backup --> NFSMirror
|
||||
end
|
||||
|
||||
subgraph Layer3["Layer 3: Offsite Sync (offsite-sync-backup, daily 06:00)"]
|
||||
PVEOffsite["Step 1: sda → Synology<br/>/Viki/pve-backup/<br/>incremental via manifest"]
|
||||
NFSOffsite["Step 2: sdc/immich + nfs-ssd → Synology<br/>/Viki/nfs/ + /Viki/nfs-ssd/<br/>inotify change-tracked"]
|
||||
end
|
||||
|
||||
sda --> PVEOffsite
|
||||
NFS_Storage -. "/srv/nfs/immich only" .-> NFSOffsite
|
||||
|
||||
Synology["Synology NAS<br/>192.168.1.13<br/>520 GB free / 5.3 TB total"]
|
||||
|
||||
PVEOffsite --> Synology
|
||||
NFSOffsite --> Synology
|
||||
|
||||
subgraph Monitoring["Monitoring & Alerting"]
|
||||
Prometheus["Prometheus Alerts<br/>PostgreSQLBackupStale, MySQLBackupStale<br/>NfsMirrorStale, OffsiteBackupSyncStale<br/>LVMSnapshotStale, BackupDiskFull<br/>VaultwardenIntegrityFail"]
|
||||
Pushgateway["Pushgateway<br/>backup script metrics<br/>vaultwarden integrity"]
|
||||
end
|
||||
|
||||
PVCBackup -.->|push metrics| Pushgateway
|
||||
NFSMirror -.->|push metrics| Pushgateway
|
||||
PVEOffsite -.->|push metrics| Pushgateway
|
||||
Snap -.->|push metrics| Pushgateway
|
||||
Pushgateway --> Prometheus
|
||||
|
||||
style Layer1 fill:#c8e6c9
|
||||
style Layer2a fill:#ffe0b2
|
||||
style Layer2b fill:#ffe0b2
|
||||
style Layer3 fill:#e1f5ff
|
||||
style Monitoring fill:#f3e5f5
|
||||
```
|
||||
|
||||
### Daily Backup Timeline (EEST)
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph Continuous["Continuous"]
|
||||
INO["nfs-change-tracker<br/>inotify on /srv/nfs[-ssd]<br/>writes /mnt/backup/.nfs-changes.log"]
|
||||
end
|
||||
|
||||
subgraph Nightly["Nightly Timeline"]
|
||||
T0000["00:00 LVM thin snapshots<br/>(lvm-pvc-snapshot)<br/>sdc PVCs CoW"]
|
||||
T0015["00:15 PostgreSQL per-DB dumps<br/>(CronJob)"]
|
||||
T0045["00:45 MySQL per-DB dumps<br/>(CronJob)"]
|
||||
T0200["02:00 nfs-mirror (daily)<br/>sdc /srv/nfs/* → sda /mnt/backup/<svc>/<br/>~10-20 min steady state"]
|
||||
T0500["05:00 daily-backup<br/>mount LVM snapshots ro<br/>rsync PVC files → /mnt/backup/pvc-data/<br/>+ sqlite + pfsense + pve-config"]
|
||||
T0600["06:00 offsite-sync-backup<br/>Step 1: sda → Synology /Viki/pve-backup/<br/>Step 2: sdc/immich + nfs-ssd → /Viki/nfs[-ssd]/"]
|
||||
T1200["12:00 LVM thin snapshots (midday)<br/>second daily snapshot"]
|
||||
end
|
||||
|
||||
T0000 --> T0015 --> T0045 --> T0200 --> T0500 --> T0600 --> T1200
|
||||
INO -.->|change events feed Step 2| T0600
|
||||
|
||||
style Nightly fill:#ffe0b2
|
||||
style Continuous fill:#e1f5ff
|
||||
```
|
||||
|
||||
### Physical Disk Layout
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph PVE["Proxmox Host (192.168.1.127)"]
|
||||
subgraph sda["sda: 1.1TB RAID1 SAS — 70% used (315 GB free)"]
|
||||
sda_vg["VG: backup<br/>LV: data (ext4)<br/>/mnt/backup"]
|
||||
sda_content["pvc-data/<YYYY-WW>/<ns>/<pvc>/<br/>sqlite-backup/, pfsense/<YYYY-WW>/, pve-config/<br/>+ daily mirror of /srv/nfs/<svc>/ via nfs-mirror"]
|
||||
end
|
||||
|
||||
subgraph sdb["sdb: 931GB SSD"]
|
||||
sdb_vg["VG: pve<br/>LV: root (ext4)<br/>PVE host OS"]
|
||||
end
|
||||
|
||||
subgraph sdc["sdc: 10.7TB RAID1 HDD — 2.8 TB used"]
|
||||
sdc_vg["VG: pve<br/>LV: data (thin pool)<br/>/srv/nfs/* (live NFS)<br/>65 proxmox-lvm PVCs<br/>+ VM disks"]
|
||||
end
|
||||
|
||||
sda_vg --> sda_content
|
||||
end
|
||||
|
||||
sdc -. "daily snapshot ro + nfs-mirror" .-> sda
|
||||
sdc -. "immich only<br/>(inotify, daily 06:00)" .-> Synology
|
||||
sda -. "daily 06:00<br/>incremental rsync" .-> Synology
|
||||
|
||||
Synology["Synology NAS 192.168.1.13<br/>91% used / 520 GB free<br/>/Backup/Viki/{pve-backup, nfs (immich), nfs-ssd}"]
|
||||
|
||||
style sda fill:#fff9c4
|
||||
style sdb fill:#c8e6c9
|
||||
style sdc fill:#e1f5ff
|
||||
```
|
||||
|
||||
### Restore Decision Tree
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
Start["Data loss detected"]:::start
|
||||
Age{"How old is<br/>the lost data?"}
|
||||
Type{"What type<br/>of data?"}
|
||||
|
||||
Start --> Age
|
||||
|
||||
Age -->|"< 12 h"| LVM["LVM thin snapshot on sdc<br/>lvm-pvc-snapshot restore <lv> <snap><br/>RTO: <5 min<br/>(7-day retention, 2x daily)"]:::fast
|
||||
Age -->|"12 h - 4 weeks"| FileBackup["sda file backup<br/>/mnt/backup/pvc-data/<YYYY-WW>/ (PVCs)<br/>/mnt/backup/<svc>/ (NFS dirs)<br/>RTO: <15 min"]:::med
|
||||
Age -->|"> 4 weeks or<br/>site disaster"| Offsite["Synology /Viki/pve-backup/<br/>(or /Viki/nfs/immich for photos)<br/>RTO: <4 hours"]:::slow
|
||||
|
||||
LVM --> Type
|
||||
FileBackup --> Type
|
||||
Offsite --> Type
|
||||
|
||||
Type -->|"Database (logical)"| AppBackup["App-level dump<br/>/srv/nfs/<service>-backup/<br/>OR Synology /Viki/pve-backup/<service>-backup/<br/>RTO: <10 min (single-DB or full)"]:::db
|
||||
Type -->|"PVC binary state"| Proceed["Proceed with<br/>selected restore method"]
|
||||
Type -->|"NFS files (nextcloud,<br/>audiobookshelf, …)"| NFSRestore["sda /mnt/backup/<svc>/<br/>OR Synology /Viki/pve-backup/<svc>/<br/>RTO: varies by size"]:::med
|
||||
Type -->|"Immich photos"| ImmichRestore["Synology /Viki/nfs/immich<br/>(only offsite copy)<br/>RTO: varies by size"]:::slow
|
||||
|
||||
classDef start fill:#ffcdd2,stroke:#b71c1c
|
||||
classDef fast fill:#c8e6c9,stroke:#1b5e20
|
||||
classDef med fill:#fff9c4,stroke:#f57f17
|
||||
classDef slow fill:#e1f5ff,stroke:#01579b
|
||||
classDef db fill:#e1bee7,stroke:#4a148c
|
||||
```
|
||||
|
||||
### Vaultwarden Enhanced Protection
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph Every6h["Every 6 hours"]
|
||||
VWBackup["vaultwarden-backup CronJob"]
|
||||
Step1["1. PRAGMA integrity_check<br/>(fail → abort)"]
|
||||
Step2["2. sqlite3 .backup<br/>/mnt/main/vaultwarden-backup/"]
|
||||
Step3["3. PRAGMA integrity_check<br/>on backup copy"]
|
||||
Step4["4. Copy RSA keys, attachments,<br/>sends, config.json"]
|
||||
Step5["5. Rotate backups (30d)"]
|
||||
|
||||
VWBackup --> Step1 --> Step2 --> Step3 --> Step4 --> Step5
|
||||
end
|
||||
|
||||
subgraph Hourly["Every hour"]
|
||||
VWCheck["vaultwarden-integrity-check"]
|
||||
Check1["PRAGMA integrity_check"]
|
||||
Metric["Push metric to Pushgateway:<br/>vaultwarden_sqlite_integrity_ok"]
|
||||
|
||||
VWCheck --> Check1 --> Metric
|
||||
end
|
||||
|
||||
Metric -.->|Prometheus scrape| Alert["Alert if integrity_ok == 0"]
|
||||
|
||||
style Every6h fill:#fff9c4
|
||||
style Hourly fill:#e1bee7
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version/Schedule | Location | Purpose |
|
||||
|-----------|-----------------|----------|---------|
|
||||
| LVM Thin Snapshots | Daily 03:00, 7d retention | PVE host: `lvm-pvc-snapshot` | CoW snapshots of 62 proxmox-lvm PVCs |
|
||||
| Daily PVC Backup | Daily 05:00, 4 weeks | PVE host: `daily-backup` | File-level PVC copy to sda |
|
||||
| Auto SQLite Backup | Daily 05:00 + daily-backup | PVE host: magic number check + ?mode=ro | Safe SQLite backup from PVC snapshots |
|
||||
| NFS Change Tracker | Continuous (inotifywait) | PVE host: `nfs-change-tracker.service` | Logs changed NFS file paths to `/mnt/backup/.nfs-changes.log` |
|
||||
| pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar |
|
||||
| Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: `offsite-sync-backup` | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify |
|
||||
| PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases |
|
||||
| PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in `dbaas` namespace | pg_dump -Fc per database → `/backup/per-db/<db>/` |
|
||||
| MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump --all-databases |
|
||||
| MySQL Backup (per-db) | Daily 00:45, 14d retention | CronJob in `dbaas` namespace | mysqldump per database → `/backup/per-db/<db>/` |
|
||||
| etcd Backup | Weekly Sunday 01:00, 30d | CronJob in `kube-system` | etcdctl snapshot |
|
||||
| Vaultwarden Backup | Every 6h, 30d retention | CronJob in `vaultwarden` | sqlite3 .backup + integrity |
|
||||
| Vault Backup | Weekly Sunday 02:00, 30d | CronJob in `vault` | raft snapshot |
|
||||
| Redis Backup | Weekly Sunday 03:00, 30d | CronJob in `redis` | BGSAVE + copy |
|
||||
| Vaultwarden Integrity Check | Hourly | CronJob in `vaultwarden` | PRAGMA integrity_check → metric |
|
||||
| ~~TrueNAS Cloud Sync~~ | **DECOMMISSIONED 2026-04-13** | Was TrueNAS Cloud Sync Task 1 | Replaced by offsite-sync-backup + inotify change tracking on Proxmox host NFS |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Layer 1: LVM Thin Snapshots (Fast Local Recovery)
|
||||
|
||||
Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 Proxmox CSI PVCs. These are CoW snapshots — instant creation, minimal overhead, sharing the thin pool's free space.
|
||||
|
||||
**Script**: `/usr/local/bin/lvm-pvc-snapshot` on PVE host (source: `infra/scripts/lvm-pvc-snapshot.sh`). Deploy: `scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot`
|
||||
**Schedule**: Daily 03:00 via systemd timer, 7-day retention
|
||||
**Discovery**: Auto-discovers PVC LVs matching `vm-*-pvc-*` pattern in VG `pve` thin pool `data`
|
||||
|
||||
**Coverage**: All 65 proxmox-lvm PVCs **except** `dbaas` and `monitoring` namespaces. These are excluded because:
|
||||
- MySQL InnoDB, PostgreSQL, and Prometheus are high-churn (50%+ CoW divergence/hour)
|
||||
- They already have app-level dumps (Layer 2)
|
||||
- Including them causes ~36% write amplification; excluding them reduces overhead to ~0%
|
||||
|
||||
**Monitoring**: Pushes metrics to Pushgateway via NodePort (30091). Alerts: `LVMSnapshotStale` (>30h since last run + 30m `for:`), `LVMSnapshotFailing`, `LVMThinPoolLow` (<15% free).
|
||||
|
||||
**Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`.
|
||||
|
||||
### Layer 2: Weekly File-Level Backup (sda Backup Disk)
|
||||
|
||||
**Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage.
|
||||
|
||||
**Script**: `/usr/local/bin/daily-backup` on PVE host (source: `infra/scripts/daily-backup.sh`)
|
||||
**Schedule**: Daily 05:00 via systemd timer
|
||||
**Retention**: 4 weekly versions (weeks 0-3 via `--link-dest` hardlink dedup)
|
||||
|
||||
#### What Gets Backed Up
|
||||
|
||||
**1. PVC File Copies** (`/mnt/backup/pvc-data/<YYYY-WW>/`):
|
||||
- Mount each LVM thin LV ro on PVE host → rsync files (not block) → unmount
|
||||
- 62 PVCs covered (all except dbaas + monitoring)
|
||||
- Organized as `/mnt/backup/pvc-data/<YYYY-WW>/<namespace>/<pvc-name>/`
|
||||
- 4 weekly versions with `--link-dest` hardlink dedup (unchanged files share inodes)
|
||||
|
||||
**2. Auto SQLite Backup** (`/mnt/backup/sqlite-backup/`):
|
||||
- Detects SQLite databases in PVC snapshots via magic number check (`SQLite format 3`)
|
||||
- Opens each database with `?mode=ro` (read-only, safe — no WAL replay)
|
||||
- Runs `.backup` to create a consistent copy
|
||||
- Covers all SQLite files across all PVC snapshots automatically
|
||||
|
||||
**3. pfSense Backup** (`/mnt/backup/pfsense/<YYYY-WW>/`):
|
||||
- `config.xml` via API (base64 decode)
|
||||
- Full filesystem tar via SSH (`tar czf /tmp/pfsense-full.tar.gz /cf /var/db /boot/loader.conf`)
|
||||
- 4 weekly versions
|
||||
|
||||
**4. PVE Config** (`/mnt/backup/pve-config/`):
|
||||
- `/etc/pve/` (cluster config, VM definitions)
|
||||
- `/usr/local/bin/` (custom scripts)
|
||||
- `/etc/systemd/system/` (timers)
|
||||
- Single copy (no rotation)
|
||||
|
||||
**Auto-discovered BACKUP_DIRS**: Uses glob-based discovery instead of a hardcoded list. Any new PVC LV matching `vm-*-pvc-*` is automatically included.
|
||||
|
||||
**Snapshot Pruning**: Deletes LVM snapshots older than 7 days (safety net for snapshots that outlive `lvm-pvc-snapshot` timer).
|
||||
|
||||
**Monitoring**: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, and `daily_backup_bytes_synced` to Pushgateway (job `daily-backup`). Alerts: `WeeklyBackupStale` (>9d on `daily_backup_last_run_timestamp`), `WeeklyBackupFailing` (`daily_backup_last_status != 0`). The metric is pushed both on clean exit AND from a `trap TERM INT` handler — a 2026-04-30 → 2026-05-09 silent-failure incident traced to systemd SIGTERMing the script before it reached its final push, leaving the alert blind.
|
||||
|
||||
### Layer 2b: Application-Level Backups
|
||||
|
||||
K8s CronJobs run inside the cluster, dumping database/state to NFS-exported backup directories. Each service writes to `/srv/nfs/<service>-backup/` (some legacy paths still use `/mnt/main/<service>-backup/`).
|
||||
|
||||
**Why needed**: LVM snapshots capture block-level state, but:
|
||||
- Cannot restore individual databases from a PostgreSQL snapshot
|
||||
- Proxmox CSI LVs are opaque raw block devices
|
||||
- Need point-in-time recovery for specific apps without full LVM rollback
|
||||
|
||||
**Daily backups (00:00-00:30)**:
|
||||
- **PostgreSQL full** (`pg_dumpall`, 00:00): Dumps all databases to `/mnt/main/postgresql-backup/dump_*.sql.gz`. 14-day rotation.
|
||||
- **PostgreSQL per-db** (`pg_dump -Fc`, 00:15): Dumps each database individually to `/mnt/main/postgresql-backup/per-db/<dbname>/dump_*.dump`. Enables single-database restore via `pg_restore -d <db> --clean --if-exists`. 14-day rotation.
|
||||
- **MySQL full** (`mysqldump --all-databases`, 00:30): Dumps all databases to `/mnt/main/mysql-backup/dump_*.sql.gz`. 14-day rotation.
|
||||
- **MySQL per-db** (`mysqldump`, 00:45): Dumps each database individually to `/mnt/main/mysql-backup/per-db/<dbname>/dump_*.sql.gz`. Enables single-database restore. 14-day rotation.
|
||||
|
||||
**Daily backups (Sunday 01:00-04:00)**:
|
||||
- **etcd**: `etcdctl snapshot save /mnt/main/etcd-backup/snapshot-$(date +%Y%m%d).db`. 30-day retention. Critical for cluster recovery.
|
||||
- **Vaultwarden**: See "Vaultwarden Enhanced Protection" below. 30-day retention.
|
||||
- **Vault**: `vault operator raft snapshot save /mnt/main/vault-backup/snapshot-$(date +%Y%m%d).snap`. 30-day retention.
|
||||
- **Redis**: `redis-cli BGSAVE` then copy RDB file. 30-day retention.
|
||||
|
||||
### Vaultwarden Enhanced Protection
|
||||
|
||||
Vaultwarden stores sensitive password vault data in SQLite on a proxmox-lvm volume. Extra safeguards prevent corruption:
|
||||
|
||||
**Every 6 hours** (vaultwarden-backup CronJob):
|
||||
1. Run `PRAGMA integrity_check` on live database
|
||||
2. If check fails → abort (alert fires)
|
||||
3. If check passes → `sqlite3 .backup /mnt/main/vaultwarden-backup/db-$(date +%Y%m%d%H%M).sqlite`
|
||||
4. Run `PRAGMA integrity_check` on backup copy
|
||||
5. Copy RSA keys, attachments, sends folder, config.json
|
||||
6. Rotate backups older than 30 days
|
||||
|
||||
**Every hour** (vaultwarden-integrity-check CronJob):
|
||||
1. Run `PRAGMA integrity_check` on live database
|
||||
2. Push metric to Pushgateway: `vaultwarden_sqlite_integrity_ok{status="ok"}=1` or `=0`
|
||||
3. Prometheus scrapes Pushgateway and alerts on `integrity_ok == 0`
|
||||
|
||||
This provides both frequent backups (every 6h) AND continuous integrity monitoring (hourly).
|
||||
|
||||
### Layer 3: Offsite Sync to Synology NAS
|
||||
|
||||
**Script**: `/usr/local/bin/offsite-sync-backup` on PVE host (source: `infra/scripts/offsite-sync-backup`)
|
||||
**Schedule**: Daily 06:00 via systemd timer (After=daily-backup.service)
|
||||
|
||||
Two-step offsite sync:
|
||||
|
||||
#### Step 1: sda to Synology pve-backup/
|
||||
|
||||
**Method**: `rsync` from `/mnt/backup/` to `synology.viktorbarzin.lan:/Backup/Viki/pve-backup/`
|
||||
**Content**: PVC snapshots (`pvc-data/`), pfSense backups, PVE config, SQLite backups, **plus the nfs-mirror output** (anca-elements + ~30 critical NFS subtrees) — see Layer 3a. After consolidation, sda is the single source for the bulk of Synology's payload.
|
||||
|
||||
**Destination**: `Synology/Backup/Viki/pve-backup/`:
|
||||
- `pvc-data/<YYYY-WW>/` — 4 weekly PVC file backups
|
||||
- `sqlite-backup/` — auto SQLite backups
|
||||
- `pfsense/<YYYY-WW>/` — 4 weekly pfSense backups
|
||||
- `pve-config/` — latest PVE config
|
||||
- `anca-elements/`, `mysql/`, `postgresql/`, `nextcloud/`, `health/`, `<other critical NFS dirs>/` — from nfs-mirror (Layer 3a)
|
||||
|
||||
#### Step 2: sda-bypass NFS to Synology nfs/ + nfs-ssd/ (inotify change-tracked, FILTERED)
|
||||
|
||||
**Role**: Carries the single path that bypasses sda — `/srv/nfs/immich/` (1.5 T, doesn't fit on sda). Plus the full `/srv/nfs-ssd/` (immich-ML + ollama + llamacpp; the SSD has no sda-mirror leg). Everything else under `/srv/nfs/` rides leg 1.
|
||||
|
||||
**Method**: `rsync --files-from /mnt/backup/.nfs-changes.log` with regex filter `^/srv/nfs/immich/`. The monthly full sync uses `--include='/immich/***' --exclude='*'` for the HDD leg, and a plain `--delete` for the SSD leg.
|
||||
|
||||
**Change tracking**: `nfs-change-tracker.service` (systemd, inotifywait) on PVE host watches `/srv/nfs` and `/srv/nfs-ssd` continuously. Changed file paths are logged to `/mnt/backup/.nfs-changes.log`. Step 2 reads this log and transfers only changed files matching the bypass regex. Incremental syncs complete in seconds.
|
||||
|
||||
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` with the immich-only include list. The `--delete` pass also reaps any stale Synology `/Viki/nfs/<dir>/` from the broader pre-2026-05-26 bypass list (ollama, audiblez, ebook2audiobook, *-backup, frigate, prometheus, loki, temp, alertmanager).
|
||||
|
||||
**`/srv/nfs/anca-elements/` history**: had its own dedicated Synology exclusion line earlier in 2026-05-24 because the original Synology source (`/volume1/Backup/Anca/Elements`) was being preserved while we moved canonical to PVE. After the original was deleted (same day), anca-elements joined the broader "NOT bypassing sda" category and is covered by Step 1 via `nfs-mirror`.
|
||||
|
||||
**Layer 3a: NFS local mirror on sda (3-2-1 second copy)**: `/usr/local/bin/nfs-mirror` rsyncs `/srv/nfs/` → `/mnt/backup/<service>/` daily at 02:00 (switched from weekly Mon 04:00 on 2026-05-26 — steady-state delta is 10-20 min of mostly-metadata rsync, cuts non-CronJob app-data RPO from 7d to ~24h). Single rsync invocation, single destination. As of 2026-05-26 the skip-list (in `nfs-mirror.sh` `EXCLUDES`) is intentionally minimal:
|
||||
|
||||
- **immich** (1.5 T) — too big for sda; ships sdc → Synology direct (leg 2)
|
||||
- **frigate** (camera ring buffer) — intentionally NOT backed up
|
||||
- **temp** (scratch) — intentionally NOT backed up
|
||||
- **anca-elements** (legacy) — now in Immich; `/mnt/backup/anca-elements` deleted 2026-05-26
|
||||
- **/srv/nfs-ssd** entirely — its three dirs (immich-ML, ollama, llamacpp) all ship direct to Synology nfs-ssd/
|
||||
|
||||
Everything else under `/srv/nfs/` — mysql, postgresql, nextcloud, health, real-estate-crawler, audiobookshelf, servarr, technitium, openclaw, ollama (HDD), audiblez, ebook2audiobook, every `*-backup` CronJob output, … — lands at `/mnt/backup/<svc>/`. Mirror size ≈ 400 GB post-2026-05-26 (was ~900 GB with anca-elements).
|
||||
|
||||
Pushes `nfs_mirror_last_run_timestamp` + `nfs_mirror_last_status` + `nfs_mirror_bytes` to Pushgateway. Alerts: `NfsMirrorStale` (>16d), `NfsMirrorFailing` (status != 0). `rsync -rlt --delete -H --no-perms --no-owner --no-group`; idempotent. Nice=10, IOSchedulingClass=idle (won't compete with foreground IO).
|
||||
|
||||
> History: `anca-elements-mirror.{sh,service,timer}` was a precursor (2026-05-24 morning) dedicated to /srv/nfs/anca-elements only. Subsumed by `nfs-mirror` later the same day to consolidate ad-hoc copy scripts into one.
|
||||
|
||||
**Destination**:
|
||||
- `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26)
|
||||
- `Synology/Backup/Viki/nfs-ssd/` — mirrors `/srv/nfs-ssd` (immich-ML, ollama, llamacpp)
|
||||
|
||||
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.
|
||||
|
||||
#### ~~TrueNAS Cloud Sync~~ — DECOMMISSIONED 2026-04-13
|
||||
|
||||
> TrueNAS Cloud Sync was decommissioned along with TrueNAS (2026-04-13). The current offsite path is inotify-change-tracked rsync from the Proxmox host NFS (`/srv/nfs`, `/srv/nfs-ssd`) to Synology.
|
||||
|
||||
### Synology snapshot management
|
||||
|
||||
Synology DSM keeps daily btrfs snapshots of every shared folder (the `Backup` share most importantly). Retention is configured per-share in DSM's Snapshot Replication app, and persists in `synosharesnapshot shareconf`.
|
||||
|
||||
**Current settings** (`Backup` share, 2026-05-24): daily at 02:00, **`snap_auto_remove_keep_days=3`** (tightened from 7 to reduce the window where deleted data continues to consume space).
|
||||
|
||||
Snapshots are CoW — deleting a file from the live filesystem does NOT free its blocks while any retained snapshot references them. Reclaim only happens after ALL referencing snapshots roll off.
|
||||
|
||||
**DSM Web API is gated by 2FA (FIDO/OTP)** — programmatic snapshot management has to go via SSH + sudo instead:
|
||||
|
||||
```bash
|
||||
# Password is in Vault: secret/viktor → synology_admin_password
|
||||
PASS=$(VAULT_ADDR=https://vault.viktorbarzin.me vault kv get -field=synology_admin_password secret/viktor)
|
||||
|
||||
# List snapshots on the Backup share
|
||||
ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup"
|
||||
|
||||
# Bulk delete ALL snapshots (reclaims everything once btrfs cleaner runs)
|
||||
ssh Administrator@192.168.1.13 "
|
||||
SNAPS=\$(echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot list Backup 2>/dev/null \
|
||||
| grep -oE 'GMT-[0-9]+\.[0-9]+\.[0-9]+-[0-9]+\.[0-9]+\.[0-9]+' | sort -u)
|
||||
echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot delete Backup \$SNAPS
|
||||
"
|
||||
|
||||
# Tighten retention
|
||||
ssh Administrator@192.168.1.13 "echo '$PASS' | sudo -S /usr/syno/sbin/synosharesnapshot shareconf set Backup snap_auto_remove_keep_days=3"
|
||||
```
|
||||
|
||||
The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by minutes (typical reclaim rate observed 2026-05-24: ~300 MB/s sustained, with bursts of 800 GB in 2 minutes).
|
||||
|
||||
> Memory: id=2673-2676 (Synology snapshot retention gotcha — deletion vs reclaim timing).
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore |
|
||||
| `/usr/local/bin/daily-backup` | PVE host: PVC file copy + auto SQLite backup + pfSense |
|
||||
| `/usr/local/bin/offsite-sync-backup` | PVE host: two-step rsync to Synology (sda + NFS via inotify) |
|
||||
| `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) |
|
||||
| `/mnt/backup/.nfs-changes.log` | NFS change log from inotifywait, consumed by offsite-sync |
|
||||
| `/etc/systemd/system/nfs-change-tracker.service` | inotifywait watcher for `/srv/nfs` + `/srv/nfs-ssd` |
|
||||
| `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) |
|
||||
| `/etc/systemd/system/daily-backup.timer` | Daily 05:00 (file backup) |
|
||||
| `/etc/systemd/system/offsite-sync-backup.timer` | Daily 06:00 (offsite sync) |
|
||||
| `/usr/local/bin/nfs-mirror` | PVE host: daily 02:00 mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) |
|
||||
| `/etc/systemd/system/nfs-mirror.timer` | Daily 02:00 (NFS local mirror to sda) |
|
||||
| `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs |
|
||||
| `stacks/vault/` | Terraform: Vault backup CronJob |
|
||||
| `stacks/vaultwarden/` | Terraform: Vaultwarden backup + integrity CronJobs |
|
||||
| `stacks/monitoring/` | Terraform: Prometheus alerts |
|
||||
| `synology:Administrator@192.168.1.13` | Synology SSH; sudo password = Vault `secret/viktor` `synology_admin_password`; DSM API itself gated by 2FA |
|
||||
| `/usr/syno/sbin/synosharesnapshot` | Synology: btrfs snapshot CLI — must run as root via sudo |
|
||||
|
||||
### Vault Paths
|
||||
|
||||
| Path | Contents |
|
||||
|------|----------|
|
||||
| `secret/viktor/synology_ssh_key` | SSH key for Synology NAS SFTP access |
|
||||
| `secret/viktor/pfsense_api_key` | pfSense API key + secret for config backup |
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
Each backup CronJob is defined in the application's stack:
|
||||
- PostgreSQL/MySQL: `stacks/dbaas/backup.tf`
|
||||
- Vault: `stacks/vault/backup.tf`
|
||||
- Vaultwarden: `stacks/vaultwarden/backup.tf`
|
||||
- etcd: `stacks/platform/etcd-backup.tf`
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why 3-2-1 Strategy?
|
||||
|
||||
**3 copies**:
|
||||
- Live PVCs (zero RTO for recent data)
|
||||
- sda local backup (fast recovery without network)
|
||||
- Synology offsite (site-level disaster protection)
|
||||
|
||||
**2 media types**:
|
||||
- sdc SSD (live, low latency)
|
||||
- sda HDD (backup, cost-effective bulk storage)
|
||||
|
||||
**1 offsite**:
|
||||
- Protection against fire, theft, catastrophic hardware failure
|
||||
- Weekly RPO acceptable for offsite (daily/weekly app backups reduce exposure)
|
||||
|
||||
### Why File-Level + Block-Level Snapshots?
|
||||
|
||||
**LVM snapshots** (Layer 1):
|
||||
- Near-instant (<1s), zero overhead
|
||||
- Point-in-time recovery for entire PVCs
|
||||
- BUT: Cannot restore individual files, no offsite protection, 7-day retention
|
||||
|
||||
**File-level backup** (Layer 2):
|
||||
- Can restore single files or directories
|
||||
- Offsite-compatible (rsync)
|
||||
- Longer retention (4 weeks local, unlimited offsite)
|
||||
- BUT: Slower RTO (rsync), higher storage overhead
|
||||
|
||||
Both together provide flexibility: fast local rollback for recent changes, granular recovery for older data.
|
||||
|
||||
### Why Dedicated Backup Disk (sda)?
|
||||
|
||||
**Isolation**: If sdc fails (thin pool corruption, controller failure), sda is independent (different disk, different VG).
|
||||
|
||||
**Performance**: Backup I/O doesn't compete with live PVC I/O.
|
||||
|
||||
**Simplicity**: Single mount point (`/mnt/backup/`) for all backup data, easy to monitor disk usage.
|
||||
|
||||
### Why Not Velero/Longhorn Backup?
|
||||
|
||||
Evaluated K8s-native backup solutions (Velero, Longhorn):
|
||||
- **Velero**: Requires object storage backend, complex restore, doesn't handle databases well
|
||||
- **Longhorn**: High overhead (replicas, snapshots in-cluster), no offsite by default
|
||||
|
||||
**Current approach wins** because:
|
||||
- Leverages existing Proxmox LVM infrastructure (already running)
|
||||
- Database-native backups (pg_dump/mysqldump) are battle-tested
|
||||
- Simple restore procedures (documented runbooks)
|
||||
- Lower resource overhead (no in-cluster replicas)
|
||||
|
||||
### Why Hybrid Incremental + Full Sync?
|
||||
|
||||
**Incremental alone** (rsync --files-from via inotify change log) is risky:
|
||||
- Deleted files on source never deleted on destination
|
||||
- Renamed paths create duplicates
|
||||
- No cleanup of orphaned files
|
||||
|
||||
**Full sync alone** (rsync --delete) is slow:
|
||||
- 30-60 min per run (all files scanned)
|
||||
- 7d RPO → 14d if a sync fails
|
||||
|
||||
**Hybrid approach**:
|
||||
- Fast incremental weekly via inotify change tracking (completes in seconds)
|
||||
- Monthly full `rsync --delete` for cleanup (tolerates longer runtime)
|
||||
|
||||
### Why 6h Vaultwarden Backup vs Daily for Others?
|
||||
|
||||
Vaultwarden stores **password vault data** — highest-value target:
|
||||
- User creates 10 new passwords → disaster 5h later → daily backup loses all 10
|
||||
- 6h RPO acceptable for password vaults (industry standard is 1-24h)
|
||||
- Hourly integrity checks detect corruption before it spreads to backups
|
||||
|
||||
Other services (MySQL, PostgreSQL):
|
||||
- Mostly application data (not authentication secrets)
|
||||
- Daily RPO acceptable per user tolerance
|
||||
- Lower change velocity
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### LVM Snapshot Restore Issues
|
||||
|
||||
See `docs/runbooks/restore-lvm-snapshot.md`.
|
||||
|
||||
### Weekly Backup Failing
|
||||
|
||||
**Symptom**: `WeeklyBackupStale` or `WeeklyBackupFailing` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
systemctl status daily-backup.service
|
||||
journalctl -u daily-backup.service --since "7 days ago"
|
||||
df -h /mnt/backup
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Backup disk full (check `df -h /mnt/backup`, alert: `BackupDiskFull`)
|
||||
- LV mount failed (check `lvs pve`, `dmesg | grep backup`)
|
||||
- NFS mount failed (check `showmount -e 192.168.1.127`)
|
||||
|
||||
**Fix**:
|
||||
1. If disk full: Clean up old weekly versions manually, adjust retention
|
||||
2. If LV mount failed: `lvchange -ay backup/data && mount /mnt/backup`
|
||||
3. If NFS failed: Check Proxmox NFS availability (`showmount -e 192.168.1.127`), verify exports
|
||||
4. Manually trigger: `systemctl start daily-backup.service`
|
||||
|
||||
### Offsite Sync Failing
|
||||
|
||||
**Symptom**: `OffsiteBackupSyncStale` or `OffsiteBackupSyncFailing` alert
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
systemctl status offsite-sync-backup.service
|
||||
journalctl -u offsite-sync-backup.service --since "7 days ago"
|
||||
wc -l /mnt/backup/.nfs-changes.log # verify change log exists
|
||||
systemctl status nfs-change-tracker.service # verify inotify watcher
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Synology NAS unreachable (network, SFTP down)
|
||||
- SSH key auth failed (permissions, expired key)
|
||||
- nfs-change-tracker.service stopped (no change log)
|
||||
|
||||
**Fix**:
|
||||
1. Verify Synology: `ping 192.168.1.13`, `ssh root@192.168.1.13`
|
||||
2. Verify SSH key: `ssh -i /root/.ssh/synology_backup root@192.168.1.13`
|
||||
3. Verify change tracker running: `systemctl status nfs-change-tracker.service`
|
||||
4. Manually trigger: `systemctl start offsite-sync-backup.service`
|
||||
|
||||
### PostgreSQL Backup Stale Alert
|
||||
|
||||
**Symptom**: `PostgreSQLBackupStale` firing in Prometheus
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl get cronjob -n dbaas
|
||||
kubectl logs -n dbaas job/postgresql-backup-<timestamp>
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- Pod OOMKilled (increase memory limit)
|
||||
- NFS mount unavailable (check Proxmox NFS)
|
||||
- pg_dumpall command failed (check PostgreSQL connectivity)
|
||||
|
||||
**Fix**:
|
||||
1. If OOM: Increase `resources.limits.memory` in `stacks/dbaas/backup.tf`
|
||||
2. If NFS: Verify mount on worker node, restart NFS server on Proxmox host if needed (`systemctl restart nfs-server`)
|
||||
3. Manually trigger: `kubectl create job --from=cronjob/postgresql-backup manual-backup -n dbaas`
|
||||
|
||||
### Vaultwarden Integrity Check Failing
|
||||
|
||||
**Symptom**: `VaultwardenIntegrityFail` alert, `vaultwarden_sqlite_integrity_ok=0`
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl exec -n vaultwarden deployment/vaultwarden -- sqlite3 /data/db.sqlite3 "PRAGMA integrity_check;"
|
||||
```
|
||||
|
||||
**Critical**: If integrity check fails, database is corrupt.
|
||||
|
||||
**Recovery**:
|
||||
1. Stop writes: `kubectl scale deployment/vaultwarden --replicas=0 -n vaultwarden`
|
||||
2. Restore from latest backup (see `restore-vaultwarden.md`)
|
||||
3. Verify integrity on restored DB
|
||||
4. Scale back up: `kubectl scale deployment/vaultwarden --replicas=1 -n vaultwarden`
|
||||
|
||||
### pfSense Backup Failing
|
||||
|
||||
**Symptom**: `PfsenseBackupStale` alert (if implemented)
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
systemctl status daily-backup.service | grep -A5 pfsense
|
||||
```
|
||||
|
||||
**Common causes**:
|
||||
- API key expired/invalid
|
||||
- SSH auth failed (password changed, key rejected)
|
||||
- pfSense unreachable
|
||||
|
||||
**Fix**:
|
||||
1. Verify API key: `curl -k https://pfsense.viktorbarzin.me/api/v1/system/config -H "Authorization: <key>"`
|
||||
2. Verify SSH: `ssh root@pfsense.viktorbarzin.me`
|
||||
3. Update credentials in Vault `secret/viktor/pfsense_api_key`
|
||||
|
||||
### Backup Disk Full
|
||||
|
||||
**Symptom**: `BackupDiskFull` alert, `df -h /mnt/backup` >85%
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
ssh root@192.168.1.127
|
||||
|
||||
# Check space usage by component
|
||||
du -sh /mnt/backup/pvc-data/*
|
||||
du -sh /mnt/backup/pfsense/*
|
||||
du -sh /mnt/backup/sqlite-backup
|
||||
|
||||
# Clean up old weekly versions (keep latest 2)
|
||||
find /mnt/backup/pvc-data -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
|
||||
find /mnt/backup/pfsense -maxdepth 1 -type d -name "????-??" | sort | head -n -2 | xargs rm -rf
|
||||
```
|
||||
|
||||
### Missing Backup for New Service
|
||||
|
||||
**Symptom**: Added new service using proxmox-lvm storage, no backup exists
|
||||
|
||||
**Fix**: The service is automatically covered by:
|
||||
1. **LVM snapshots** (if not in dbaas/monitoring namespace) — automatic, no config needed
|
||||
2. **Weekly file backup** — automatic, no config needed
|
||||
|
||||
**If the service has a database that needs app-level dumps**:
|
||||
Add backup CronJob in service's Terraform stack (see template below).
|
||||
|
||||
**Template**:
|
||||
```hcl
|
||||
resource "kubernetes_cron_job_v1" "backup" {
|
||||
metadata {
|
||||
name = "${var.service_name}-backup"
|
||||
namespace = kubernetes_namespace.service.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
schedule = "0 3 * * 0" # Weekly Sunday 03:00
|
||||
job_template {
|
||||
spec {
|
||||
template {
|
||||
spec {
|
||||
container {
|
||||
name = "backup"
|
||||
image = "appropriate/image:tag"
|
||||
command = ["/bin/sh", "-c"]
|
||||
args = [
|
||||
<<-EOT
|
||||
TIMESTAMP=$(date +%Y%m%d)
|
||||
# Dump command here (sqlite3 .backup, pg_dump, etc.)
|
||||
find /backup -mtime +30 -delete
|
||||
EOT
|
||||
]
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
volume_mount {
|
||||
name = "backup"
|
||||
mount_path = "/backup"
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = kubernetes_persistent_volume_claim.data.metadata[0].name
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "backup"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_backup.pvc_name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "nfs_backup" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "${var.service_name}-backup"
|
||||
namespace = kubernetes_namespace.service.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/srv/nfs/${var.service_name}-backup"
|
||||
}
|
||||
```
|
||||
|
||||
## Monitoring & Alerting
|
||||
|
||||
```
|
||||
┌────────────────────────────────────────────────────────────────┐
|
||||
│ Prometheus Alerts │
|
||||
│ │
|
||||
│ PostgreSQLBackupStale > 36h since last success │
|
||||
│ MySQLBackupStale > 36h since last success │
|
||||
│ EtcdBackupStale > 8d since last success │
|
||||
│ VaultBackupStale > 8d since last success │
|
||||
│ VaultwardenBackupStale > 8d since last success │
|
||||
│ RedisBackupStale > 8d since last success │
|
||||
│ ~~CloudSyncStale~~ REMOVED (TrueNAS decommissioned) │
|
||||
│ ~~CloudSyncNeverRun~~ REMOVED (TrueNAS decommissioned) │
|
||||
│ ~~CloudSyncFailing~~ REMOVED (TrueNAS decommissioned) │
|
||||
│ VaultwardenIntegrityFail integrity_ok == 0 │
|
||||
│ LVMSnapshotStale > 30h since last snapshot │
|
||||
│ LVMSnapshotFailing snapshot creation failed │
|
||||
│ LVMThinPoolLow < 15% free space in thin pool │
|
||||
│ WeeklyBackupStale > 8d since last success │
|
||||
│ WeeklyBackupFailing backup script exited non-zero │
|
||||
│ PfsenseBackupStale > 8d since last success │
|
||||
│ OffsiteBackupSyncStale > 8d since last success │
|
||||
│ BackupDiskFull > 85% usage on /mnt/backup │
|
||||
└────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Metrics sources**:
|
||||
- Backup CronJobs: Push `backup_last_success_timestamp` to Pushgateway on completion
|
||||
- LVM snapshot script: Pushes `lvm_snapshot_last_run_timestamp`, `lvm_snapshot_last_status`, `lvm_snapshot_created_total`, `lvm_snapshot_failed_total`, `lvm_snapshot_pruned_total`, `lvm_snapshot_thinpool_free_pct` (job `lvm-pvc-snapshot`)
|
||||
- Daily backup script: Pushes `daily_backup_last_run_timestamp`, `daily_backup_last_status`, `daily_backup_bytes_synced` (job `daily-backup`). Disk-fullness alert (`BackupDiskFull`) does NOT use a script-pushed metric; it derives from node-exporter `node_filesystem_avail_bytes{job="proxmox-host", mountpoint="/mnt/backup"}`.
|
||||
- pfSense backup (step 3 of `daily-backup`): Pushes `backup_last_run_timestamp`, `backup_last_status`, and `backup_last_success_timestamp` (only on success) under job `pfsense-backup`. Pushed in BOTH success and failure paths so `PfsenseBackupStale` doesn't go silent when SSH-to-pfsense breaks.
|
||||
- Offsite sync script: Pushes `backup_last_success_timestamp`, `offsite_sync_last_status` (job `offsite-backup-sync`)
|
||||
- Prometheus backup (sidecar in prometheus-server pod, monthly 1st-Sunday 04:00 UTC): Pushes `prometheus_backup_last_success_timestamp` (job `prometheus-backup`)
|
||||
- ~~CloudSync monitor~~: Removed (TrueNAS decommissioned)
|
||||
- Vaultwarden integrity: Pushes `vaultwarden_sqlite_integrity_ok` hourly
|
||||
|
||||
**Pushgateway persistence**: The Pushgateway is configured with
|
||||
`--persistence.file=/data/pushgateway.bin --persistence.interval=1m`
|
||||
on a 2Gi `proxmox-lvm-encrypted` PVC (helm values:
|
||||
`prometheus-pushgateway.persistentVolume`). Without this, every pod
|
||||
restart drops in-memory metrics. Once-per-day pushers (offsite-sync,
|
||||
weekly backup) are otherwise invisible for up to 24h if the
|
||||
Pushgateway restarts between pushes — which is exactly what triggered
|
||||
the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at
|
||||
11:42 UTC terminated the Pushgateway 8h after the 03:12 UTC push).
|
||||
|
||||
**Alert routing**:
|
||||
- All backup alerts → Slack `#infra-alerts`
|
||||
- Vaultwarden integrity fail → Slack `#infra-critical` (immediate action required)
|
||||
|
||||
## Service Protection Matrix
|
||||
|
||||
| Service | LVM Snapshots (7d) | File Backup (4w) | App Backup | Offsite | Storage |
|
||||
|---------|:------------------:|:----------------:|:----------:|:-------:|---------|
|
||||
| **Databases** |
|
||||
| PostgreSQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
|
||||
| MySQL (all DBs) | — | — | ✓ daily | ✓ | proxmox-lvm |
|
||||
| **Critical State** |
|
||||
| Vault | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| etcd | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| Vaultwarden | ✓ | ✓ | ✓ 6h + integrity | ✓ | proxmox-lvm |
|
||||
| Redis | ✓ | ✓ | ✓ weekly | ✓ | proxmox-lvm |
|
||||
| **Applications (65 proxmox-lvm PVCs)** |
|
||||
| Prometheus | — | — | — | excluded | proxmox-lvm |
|
||||
| Nextcloud | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Calibre-Web | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Forgejo | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| FreshRSS | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| ActualBudget | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| NovelApp | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Headscale | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm |
|
||||
| **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted |
|
||||
| **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS |
|
||||
| **Media (NFS)** |
|
||||
| Immich (~800GB) | — | — | — | ✓ | NFS |
|
||||
| Audiobookshelf | — | — | — | ✓ | NFS |
|
||||
| Servarr | — | — | — | ✓ | NFS |
|
||||
| Navidrome | — | — | — | ✓ | NFS |
|
||||
|
||||
**Legend**:
|
||||
- ✓ = Protected at this layer
|
||||
- — = Not needed (other layers cover it, or data is regenerable/disposable)
|
||||
- excluded = Too large/regenerable, not worth offsite bandwidth
|
||||
|
||||
**Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking.
|
||||
|
||||
¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count.
|
||||
|
||||
**Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2:
|
||||
- **Postiz** PG and Redis (bundled bitnami chart) live on `local-path` (K8s node OS disk). PG covered by the postiz-postgres-backup CronJob (daily pg_dump → `/srv/nfs/postiz-backup/`, Layer 3 via offsite sync). Redis is regenerable cache — not backed up.
|
||||
- **Prometheus, Alertmanager, Pushgateway** — `monitoring` namespace excluded by policy; loss is acceptable (metrics regenerable, silences ephemeral, Pushgateway has on-disk persistence for 24h gap tolerance).
|
||||
|
||||
## Recovery Procedures
|
||||
|
||||
Detailed runbooks in `docs/runbooks/`:
|
||||
|
||||
- **`restore-lvm-snapshot.md`** — Instant rollback of a PVC using LVM snapshot (RTO <5 min)
|
||||
- **`restore-pvc-from-backup.md`** — Restore a PVC from sda file backup (when snapshots expired)
|
||||
- **`restore-postgresql.md`** — Restore individual database (from per-db `pg_dump -Fc`) or full cluster (from `pg_dumpall`)
|
||||
- **`restore-mysql.md`** — Restore individual database (from per-db `mysqldump`) or full cluster (from `mysqldump --all-databases`)
|
||||
- **`restore-vault.md`** — Restore Vault from raft snapshot
|
||||
- **`restore-vaultwarden.md`** — Restore password vault from sqlite3 backup
|
||||
- **`restore-etcd.md`** — Restore etcd cluster from snapshot
|
||||
- **`restore-full-cluster.md`** — Disaster recovery: rebuild cluster from offsite backups
|
||||
|
||||
**RTO estimates**:
|
||||
- LVM snapshot rollback: <5 min (instant swap)
|
||||
- File-level restore from sda: <15 min (depends on PVC size)
|
||||
- Single PostgreSQL database: <5 min
|
||||
- Full MySQL cluster: <15 min
|
||||
- Vault: <10 min
|
||||
- Vaultwarden: <5 min
|
||||
- etcd: <20 min (requires cluster rebuild)
|
||||
- Full cluster from offsite: <4 hours (NFS restore + K8s bootstrap + app deploys)
|
||||
|
||||
## Related
|
||||
|
||||
- **Architecture**: `docs/architecture/storage.md` (NFS/Proxmox storage layer)
|
||||
- **Reference**: `.claude/reference/service-catalog.md` (which services need backups)
|
||||
- **Runbooks**: `docs/runbooks/restore-*.md` (step-by-step recovery procedures)
|
||||
- **Monitoring**: `stacks/monitoring/alerts/backup-alerts.yaml` (Prometheus alert definitions)
|
||||
199
docs/architecture/chrome-service.md
Normal file
199
docs/architecture/chrome-service.md
Normal file
|
|
@ -0,0 +1,199 @@
|
|||
# chrome-service — In-cluster headed Chromium with persistent profile
|
||||
|
||||
## Overview
|
||||
|
||||
`chrome-service` is a single-replica, persistent-profile, headed
|
||||
Chromium browser exposed over the Chrome DevTools Protocol (CDP). It
|
||||
serves two distinct populations:
|
||||
|
||||
1. **In-cluster automation callers** — connect via
|
||||
`chromium.connect_over_cdp("http://chrome-service.chrome-service.svc:9222")`
|
||||
to drive a real browser when upstream anti-bot trips a headless one
|
||||
(`disable-devtool.js` redirect-to-google trap, `navigator.webdriver`
|
||||
checks, console-clear timing tricks). The only currently-active
|
||||
in-cluster caller is the `chrome-service-snapshot-harvester` CronJob;
|
||||
the `stacks/f1-stream/files/backend/playback_verifier.py` +
|
||||
`chrome_browser.py` tree is a vestigial design — the deployed
|
||||
f1-stream image (built from `github.com/ViktorBarzin/f1-stream`)
|
||||
does not use this code path.
|
||||
2. **External dev-box Claude Code sessions** — pull an hourly snapshot
|
||||
of cookies + localStorage from `chrome.viktorbarzin.me/api/snapshot`
|
||||
(bearer-gated) and seed local `@playwright/mcp` instances in
|
||||
`--isolated --storage-state=…` mode. This is how concurrent Claude
|
||||
Code sessions get their own isolated browser contexts without losing
|
||||
shared cookies for logged-in sites.
|
||||
|
||||
## Why a separate stack
|
||||
|
||||
In-process Chromium inside `f1-stream`:
|
||||
|
||||
- Runs **headless** by default (no `Xvfb`/`DISPLAY`).
|
||||
- Has the `HeadlessChromium/...` UA suffix and `navigator.webdriver === true`.
|
||||
- Trips `disable-devtool.js`'s **Performance** detector — Playwright's CDP
|
||||
adds latency to `console.log(largeArray)` vs `console.table(largeArray)`,
|
||||
which the lib reads as "DevTools is open" and redirects to
|
||||
`https://www.google.com/`.
|
||||
|
||||
`chrome-service` solves this by:
|
||||
|
||||
1. Running **headed** under `Xvfb :99` (chromium with `DISPLAY=:99`,
|
||||
not `--headless`).
|
||||
2. Living in a long-lived pod so JIT browser launch latency disappears.
|
||||
3. Allowing a per-context init script
|
||||
(`stacks/chrome-service/files/stealth.js` ~ 40 lines, vendored from
|
||||
`puppeteer-extra-plugin-stealth`) to spoof `webdriver`, `chrome.runtime`,
|
||||
`plugins`, `languages`, `Permissions.query`, WebGL renderer strings, and
|
||||
to hide the `disable-devtool-auto` script-tag attribute so the lib's
|
||||
IIFE exits early.
|
||||
|
||||
## Wire protocol — CDP (current, since 2026-06-04)
|
||||
|
||||
```text
|
||||
http://chrome-service.chrome-service.svc.cluster.local:9222
|
||||
│
|
||||
┌───────────────────────────────┼───────────────────────────────┐
|
||||
│ caller pod │ chrome-service pod
|
||||
│ (e.g. f1-stream) │ (single replica)
|
||||
│ │
|
||||
│ CHROME_CDP_URL ──────────────┘
|
||||
│
|
||||
│ await chromium.connect_over_cdp(cdp_url)
|
||||
│ context = await browser.new_context() ← incognito (no cookies)
|
||||
│ OR: context = browser.contexts[0] ← persistent (shared cookies)
|
||||
│ await context.add_init_script(STEALTH_JS)
|
||||
│ page.goto("https://upstream.com/embed/...")
|
||||
│
|
||||
└─── ←── pages render under Xvfb, headed Chromium ──── ─────────┘
|
||||
```
|
||||
|
||||
### Wire protocol — WS (legacy, removed 2026-06-04)
|
||||
|
||||
The previous design used `playwright launch-server --browser chromium`
|
||||
with a path-token (`ws://...:3000/<TOKEN>`). Callers used
|
||||
`chromium.connect(ws_url)`. **Problem**: `launch-server` creates
|
||||
ephemeral browser contexts per `connect()` call, so cookies never
|
||||
persisted to the PVC despite the `/profile` mount. We migrated to
|
||||
direct chromium launch with `--user-data-dir` + CDP exposed on :9222
|
||||
so cookies actually live across pod restarts.
|
||||
|
||||
## Cookie warming + snapshot pipeline
|
||||
|
||||
```text
|
||||
┌─────────── chrome-service pod ──────────────────────────────────────────┐
|
||||
│ │
|
||||
│ chrome-service container (chromium --user-data-dir=/profile/chromium-data
|
||||
│ --remote-debugging-port=9222) │
|
||||
│ ▲ │
|
||||
│ │ user logs in via noVNC ← chrome.viktorbarzin.me (Authentik) │
|
||||
│ │ │
|
||||
│ Cookies + localStorage land in /profile/chromium-data/Default/ │
|
||||
│ │
|
||||
│ snapshot-server sidecar (python stdlib HTTP server, :8088) │
|
||||
│ ↑ serves /profile/snapshots/storage-state.json (bearer-gated) │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
▲
|
||||
│ hourly (cron 23 * * * *)
|
||||
│
|
||||
┌──────┴── chrome-service-snapshot-harvester CronJob ─────────────────────┐
|
||||
│ podAffinity → same node as chrome-service (RWO PVC) │
|
||||
│ python: connect_over_cdp + ctx.storage_state(path=...) │
|
||||
│ writes /profile/snapshots/storage-state.json (atomic rename) │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
|
||||
External caller (dev box):
|
||||
systemd timer (hourly) → curl -H "Authorization: Bearer $TOKEN"
|
||||
https://chrome.viktorbarzin.me/api/snapshot
|
||||
-o ~/.cache/playwright-shared-storage-state.json
|
||||
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
|
||||
```
|
||||
|
||||
## Image pin
|
||||
|
||||
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
|
||||
`stacks/chrome-service/main.tf`) and the Python client
|
||||
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
|
||||
minor-versions**. Bump in lockstep — Playwright protocol changes between
|
||||
minors and the client cannot connect to a mismatched server.
|
||||
|
||||
The harvester + snapshot-server sidecar use
|
||||
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
|
||||
minor, with Python-side bindings pre-installed.
|
||||
|
||||
## Storage
|
||||
|
||||
- **`chrome-service-profile-encrypted`** (PVC, 2Gi → 10Gi autoresize,
|
||||
`proxmox-lvm-encrypted`) — Chromium user-data dir at
|
||||
`/profile/chromium-data` + snapshot at `/profile/snapshots/storage-state.json`.
|
||||
Encrypted because cookies/localStorage may include third-party auth tokens
|
||||
for sites callers drive.
|
||||
- **`chrome-service-backup-host`** (NFS, RWX) — destination for a 6-hourly
|
||||
CronJob that `tar -czf /backup/<YYYY_MM_DD_HH>.tar.gz -C /profile .`,
|
||||
retention 30 days.
|
||||
|
||||
## Auth + secrets
|
||||
|
||||
- Vault KV `secret/chrome-service.api_bearer_token` — 32-byte URL-safe
|
||||
random, rotated by hand:
|
||||
`vault kv put secret/chrome-service api_bearer_token=$(python3 -c 'import secrets; print(secrets.token_urlsafe(32))')`.
|
||||
- ESO syncs into namespace-local Secret `chrome-service-secrets`. The
|
||||
`snapshot-server` sidecar reads it via `secret_key_ref`.
|
||||
- f1-stream still imports the secret (via `chrome-service-client-secrets`)
|
||||
for parity, but the CDP endpoint no longer requires it for connection —
|
||||
NetworkPolicy is the gate.
|
||||
- Reloader (`reloader.stakater.com/auto = "true"`) cascades token rotation
|
||||
to the snapshot-server sidecar.
|
||||
- **Dev-box cache**: each dev box keeps a local copy at
|
||||
`~/.config/playwright/token` (chmod 600). Re-fetch from Vault after
|
||||
rotation: `vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token`.
|
||||
|
||||
## Network controls
|
||||
|
||||
- **`kubernetes_network_policy_v1.ws_ingress`** — three ingress rules:
|
||||
- **TCP/9222** (Chromium CDP): only namespaces labelled
|
||||
`chrome-service.viktorbarzin.me/client = "true"` (plus an explicit
|
||||
fallback for `f1-stream` by `kubernetes.io/metadata.name`, plus
|
||||
`chrome-service`'s own namespace for the harvester CronJob).
|
||||
- **TCP/6080** (noVNC HTTP+WS): only the `traefik` namespace.
|
||||
- **TCP/8088** (snapshot-server): only the `traefik` namespace
|
||||
(bearer-token check happens in `snapshot_server.py`).
|
||||
- **CDP port 9222** is internal-only (no ingress, no Cloudflare DNS).
|
||||
- **noVNC sidecar** (`forgejo.viktorbarzin.me/viktor/chrome-service-novnc`)
|
||||
exposes a live HTML5 view of the headed Chromium session via
|
||||
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
|
||||
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
|
||||
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
|
||||
Authentik-gated.
|
||||
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
|
||||
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
|
||||
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
|
||||
and is exposed at `chrome.viktorbarzin.me/api/snapshot` via a second
|
||||
`ingress_factory` call with `auth = "none"` (the bearer check is in
|
||||
the sidecar, not at the ingress layer).
|
||||
|
||||
## Adding a new in-cluster caller
|
||||
|
||||
See `stacks/chrome-service/README.md` for the recipe (label namespace,
|
||||
inject `CHROME_CDP_URL`, vendor `stealth.js`).
|
||||
|
||||
## Limits + risks
|
||||
|
||||
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM
|
||||
license check, device-fingerprint mismatch, hotlink protection that
|
||||
whitelists specific parent domains), the verifier returns
|
||||
`is_playable=False` and the extractor moves on. No user-visible
|
||||
breakage, just empty stream lists for that source.
|
||||
- **JWPlayer DRM error 102630** — observed with several hmembeds embeds
|
||||
even from the headed chrome-service. The license check bails because
|
||||
the request origin isn't on the embed's allowlist; this is upstream
|
||||
policy, not an infra defect.
|
||||
- **Single replica + RWO PVC** — the deployment uses `Recreate` strategy.
|
||||
Brief outage on rollout, ~30s for browser warmup.
|
||||
- **No `/metrics` endpoint** — the cluster's generic
|
||||
`KubePodCrashLooping` rule covers basic alerting. A Prometheus scrape
|
||||
exporter is day-2 work.
|
||||
- **Snapshot covers cookies + localStorage only** — Playwright's
|
||||
`storage_state()` API doesn't capture IndexedDB or sessionStorage.
|
||||
Sites that rely on those for auth won't warm via the snapshot.
|
||||
- **Snapshot freshness up to 1h stale** — if a site rotates session
|
||||
cookies more often than that, an on-demand refresh CLI is needed
|
||||
(deferred to follow-on).
|
||||
307
docs/architecture/ci-cd.md
Normal file
307
docs/architecture/ci-cd.md
Normal file
|
|
@ -0,0 +1,307 @@
|
|||
# CI/CD Pipeline
|
||||
|
||||
## Overview
|
||||
|
||||
The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
A[Git Push] --> B[GitHub Actions]
|
||||
B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
|
||||
C --> D[Push to DockerHub]
|
||||
D --> E[POST Woodpecker API]
|
||||
E --> F[Woodpecker Pipeline]
|
||||
F --> G[Vault K8s Auth<br/>SA JWT]
|
||||
G --> H[kubectl set image]
|
||||
H --> I[K8s Deployment]
|
||||
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
|
||||
|
||||
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
|
||||
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
|
||||
|
||||
style B fill:#2088ff
|
||||
style F fill:#4c9e47
|
||||
style K fill:#f39c12
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
|
||||
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
|
||||
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
|
||||
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
|
||||
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
|
||||
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
|
||||
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Build Flow (GitHub Actions)
|
||||
|
||||
1. **Trigger**: Git push to main/master branch
|
||||
2. **Build**: GHA builds Docker image for `linux/amd64` platform only
|
||||
3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
|
||||
- `:latest` tags are **never used** to prevent stale pull-through cache issues
|
||||
4. **Push**: Image pushed to DockerHub public registry
|
||||
5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
|
||||
|
||||
### Deploy Flow (Woodpecker CI)
|
||||
|
||||
1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
|
||||
2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
|
||||
3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
|
||||
4. **Notify**: Slack notification on success/failure
|
||||
|
||||
### Project Migration Status
|
||||
|
||||
**Migrated to GHA (8 projects)**:
|
||||
- Website
|
||||
- k8s-portal
|
||||
- claude-memory-mcp
|
||||
- apple-health-data
|
||||
- audiblez-web
|
||||
- plotting-book
|
||||
- insta2spotify
|
||||
- book-search (audiobook-search)
|
||||
|
||||
**Woodpecker-native owned-app builds** (build + push to the Forgejo private
|
||||
registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
|
||||
stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
|
||||
`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
|
||||
2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
|
||||
GHA-era Woodpecker repo (id 10) is deactivated.
|
||||
|
||||
**Woodpecker-only (infra + large apps)**:
|
||||
- `travel_blog`: 5.7GB content directory exceeds GHA limits
|
||||
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
|
||||
|
||||
### Woodpecker Pipeline Files
|
||||
|
||||
Each project contains:
|
||||
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
|
||||
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
|
||||
|
||||
### Woodpecker Repository IDs
|
||||
|
||||
Woodpecker API uses numeric IDs (not owner/name):
|
||||
|
||||
| Repo | ID |
|
||||
|------|------|
|
||||
| infra | 1 |
|
||||
| Website | 2 |
|
||||
| finance | 3 |
|
||||
| health | 4 |
|
||||
| travel_blog | 5 |
|
||||
| webhook-handler | 6 |
|
||||
| audiblez-web | 9 |
|
||||
| plotting-book | 43 |
|
||||
| claude-memory-mcp | 78 |
|
||||
| infra-onboarding | 79 |
|
||||
|
||||
### Image Registry Flow
|
||||
|
||||
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
|
||||
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
|
||||
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
|
||||
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
|
||||
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
|
||||
|
||||
### Infra Pipelines (Woodpecker-only)
|
||||
|
||||
| Pipeline | File | Purpose |
|
||||
|----------|------|---------|
|
||||
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
|
||||
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
|
||||
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
|
||||
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
|
||||
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
|
||||
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
|
||||
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
|
||||
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
|
||||
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
|
||||
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
|
||||
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
|
||||
|
||||
## Configuration
|
||||
|
||||
### GitHub Actions
|
||||
|
||||
**File**: `.github/workflows/build-and-deploy.yml`
|
||||
|
||||
```yaml
|
||||
name: Build and Deploy
|
||||
on:
|
||||
push:
|
||||
branches: [main, master]
|
||||
jobs:
|
||||
build:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Build Docker image
|
||||
run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
|
||||
- name: Push to DockerHub
|
||||
run: docker push viktorbarzin/app:${SHORT_SHA}
|
||||
- name: Trigger Woodpecker Deploy
|
||||
run: |
|
||||
curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
|
||||
-H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
|
||||
```
|
||||
|
||||
**Required GitHub Secrets**:
|
||||
- `DOCKERHUB_USERNAME`
|
||||
- `DOCKERHUB_TOKEN`
|
||||
- `WOODPECKER_TOKEN`
|
||||
|
||||
### Woodpecker Deploy Pipeline
|
||||
|
||||
**File**: `.woodpecker/deploy.yml`
|
||||
|
||||
```yaml
|
||||
when:
|
||||
event: [deployment]
|
||||
|
||||
steps:
|
||||
deploy:
|
||||
image: bitnami/kubectl:latest
|
||||
commands:
|
||||
- kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
|
||||
secrets: [k8s_token]
|
||||
|
||||
notify:
|
||||
image: plugins/slack
|
||||
settings:
|
||||
webhook: ${SLACK_WEBHOOK}
|
||||
when:
|
||||
status: [success, failure]
|
||||
```
|
||||
|
||||
**YAML Gotchas**:
|
||||
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
|
||||
- Use `bitnami/kubectl:latest` (not pinned versions)
|
||||
- Global secrets must be manually added to `secrets:` list in pipeline
|
||||
|
||||
### Vault Configuration
|
||||
|
||||
**K8s Auth for Woodpecker**:
|
||||
- Woodpecker pipelines authenticate using ServiceAccount JWT
|
||||
- Vault K8s auth mount validates JWT and issues token
|
||||
- Policies grant access to secrets and dynamic credentials
|
||||
|
||||
### CI/CD Secrets Sync
|
||||
|
||||
**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
|
||||
- Keeps Woodpecker global secrets in sync with Vault
|
||||
- Runs in `woodpecker` namespace
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why GitHub Actions + Woodpecker?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
|
||||
2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
|
||||
3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
|
||||
|
||||
**Benefits**:
|
||||
- Free compute for builds on public repos
|
||||
- Cluster access stays internal (Woodpecker has direct K8s access)
|
||||
- Separation of concerns: build vs deploy
|
||||
|
||||
### Why 8-Character SHA Tags (Not :latest)?
|
||||
|
||||
- Pull-through cache serves stale `:latest` tags indefinitely
|
||||
- SHA tags ensure every deployment pulls the correct image
|
||||
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
|
||||
|
||||
### Why Numeric Repo IDs for Woodpecker API?
|
||||
|
||||
- Woodpecker API requires numeric IDs (not owner/name slugs)
|
||||
- IDs are stable across repo renames
|
||||
- Must be manually looked up from Woodpecker UI or database
|
||||
|
||||
### Why linux/amd64 Only?
|
||||
|
||||
- Cluster runs on x86_64 nodes only
|
||||
- ARM builds would waste time and storage
|
||||
- Multi-arch images add complexity without benefit
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### GHA Build Fails: "denied: requested access to the resource is denied"
|
||||
|
||||
**Cause**: DockerHub credentials expired or incorrect
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Regenerate DockerHub token
|
||||
# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
|
||||
```
|
||||
|
||||
### Woodpecker Deploy Fails: "Unauthorized"
|
||||
|
||||
**Cause**: Vault K8s auth token expired or invalid
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Restart Woodpecker pipeline (token auto-renewed)
|
||||
# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
|
||||
```
|
||||
|
||||
### Image Pull Fails: "ErrImagePull"
|
||||
|
||||
**Cause**: Pull-through cache or registry credentials issue
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check pull-through cache is running
|
||||
curl http://10.0.20.10:5000/v2/_catalog
|
||||
|
||||
# Verify registry-credentials Secret exists in namespace
|
||||
kubectl get secret registry-credentials -n <namespace>
|
||||
|
||||
# Manually sync credentials if missing
|
||||
kubectl get secret registry-credentials -n default -o yaml | \
|
||||
sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
|
||||
```
|
||||
|
||||
### Woodpecker Pipeline: "YAML: did not find expected key"
|
||||
|
||||
**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
|
||||
|
||||
**Fix**: Quote the command:
|
||||
```yaml
|
||||
commands:
|
||||
- "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
|
||||
```
|
||||
|
||||
### travel_blog Build Times Out on GHA
|
||||
|
||||
**Cause**: 5.7GB content directory exceeds GHA disk/time limits
|
||||
|
||||
**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
|
||||
|
||||
### CI/CD Secrets Out of Sync
|
||||
|
||||
**Cause**: CronJob failed to sync Vault → Woodpecker
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check CronJob status
|
||||
kubectl get cronjob -n woodpecker
|
||||
|
||||
# Manually trigger sync
|
||||
kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Databases Architecture](./databases.md) — Database credentials via Vault
|
||||
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
|
||||
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
|
||||
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
|
||||
- Vault documentation: K8s auth configuration
|
||||
- Woodpecker documentation: API reference
|
||||
728
docs/architecture/compute.md
Normal file
728
docs/architecture/compute.md
Normal file
|
|
@ -0,0 +1,728 @@
|
|||
# Compute & Resource Management
|
||||
|
||||
## Overview
|
||||
|
||||
The infrastructure runs on a single Dell R730 server with Proxmox VE, hosting a 7-node Kubernetes cluster. Compute resources are managed through a combination of Vertical Pod Autoscaler (VPA) recommendations, tier-based LimitRange defaults, and ResourceQuota enforcement. The cluster employs a no-CPU-limits policy to avoid CFS throttling while using memory requests=limits for stability. GPU workloads run on a dedicated node with Tesla T4 passthrough.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Physical["Dell R730 Physical Host"]
|
||||
CPU["1x Xeon E5-2699 v4<br/>22c/44t<br/>CPU2 unpopulated"]
|
||||
RAM["272GB DDR4-2400 ECC"]
|
||||
GPU["NVIDIA Tesla T4<br/>PCIe 0000:06:00.0"]
|
||||
DISK["1.1TB SSD<br/>931GB SSD<br/>10.7TB HDD"]
|
||||
end
|
||||
|
||||
subgraph Proxmox["Proxmox VE"]
|
||||
direction TB
|
||||
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
|
||||
NODE1["VM 201: k8s-node1<br/>16c / 48GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
|
||||
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
|
||||
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
|
||||
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster v1.34.2"]
|
||||
direction TB
|
||||
|
||||
subgraph VPA["VPA (Goldilocks - Initial Mode)"]
|
||||
RECOMMEND["Quarterly Review:<br/>upperBound x1.2 (stable)<br/>upperBound x1.3 (GPU/volatile)"]
|
||||
end
|
||||
|
||||
subgraph LimitRange["LimitRange per Tier"]
|
||||
TIER0_LR["0-core: 512Mi-8Gi mem<br/>500m-4 cpu"]
|
||||
TIER1_LR["1-cluster: 512Mi-4Gi mem<br/>500m-2 cpu"]
|
||||
TIER2_LR["2-gpu: 2Gi-16Gi mem<br/>1-8 cpu"]
|
||||
TIER34_LR["3-edge/4-aux: 256Mi-4Gi mem<br/>250m-2 cpu"]
|
||||
end
|
||||
|
||||
subgraph ResourceQuota["ResourceQuota per Tier"]
|
||||
TIER0_RQ["0-core: 32 cpu / 64Gi mem / 100 pods"]
|
||||
TIER1_RQ["1-cluster: 16 cpu / 32Gi mem / 30 pods"]
|
||||
TIER2_RQ["2-gpu: 48 cpu / 96Gi mem / 40 pods"]
|
||||
TIER34_RQ["3-edge/4-aux: 8-16 cpu / 16-32Gi mem / 20-30 pods"]
|
||||
end
|
||||
end
|
||||
|
||||
Physical --> Proxmox
|
||||
GPU -.->|Passthrough| NODE1
|
||||
Proxmox --> K8s
|
||||
VPA --> LimitRange
|
||||
LimitRange --> ResourceQuota
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### Proxmox Host
|
||||
|
||||
| Component | Specification |
|
||||
|-----------|---------------|
|
||||
| Model | Dell PowerEdge R730 |
|
||||
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
|
||||
| Total Cores/Threads | 22 cores / 44 threads |
|
||||
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
|
||||
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
|
||||
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
|
||||
| Hypervisor | Proxmox VE |
|
||||
|
||||
### Kubernetes Nodes
|
||||
|
||||
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|
||||
|----|------|-------|-----|---------|------|--------|
|
||||
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
|
||||
| k8s-node1 | 201 | 16 | 48GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
|
||||
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
|
||||
|
||||
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
|
||||
|
||||
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
|
||||
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
|
||||
> provider rewrites every disk slot on update — even ones covered by
|
||||
> `lifecycle.ignore_changes` — and it doesn't refresh per-disk
|
||||
> `mbps_*_concurrent` fields back from live state. We hit both bugs
|
||||
> in production (id=539 iSCSI mangling 2026-04-02, and the 2026-05-26
|
||||
> import attempt that corrupted k8s-node2 + k8s-node3 .conf files;
|
||||
> recovered via `/mnt/backup/pve-config/etc-pve/nodes/pve/qemu-server/`
|
||||
> nightly backups). What stays in TF: the cloud-init templates
|
||||
> (`k8s-node-template`, `non-k8s-node-template`,
|
||||
> `docker-registry-template` in `stacks/infra/main.tf`) — a fresh VM
|
||||
> still clones the right template and runs the same bootstrap.
|
||||
>
|
||||
> Per-VM I/O caps (defense against sdc saturation by a single noisy
|
||||
> guest) are applied by `apply-mbps-caps.{sh,service,timer}` on the
|
||||
> PVE host (sources in `infra/scripts/`, install pattern per
|
||||
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
|
||||
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
|
||||
> set`, fresh clone) self-heals within the hour. Current caps:
|
||||
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
|
||||
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
|
||||
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
|
||||
>
|
||||
> Re-adoption into TF (via the `bpg/proxmox` provider, which models
|
||||
> dynamic disks correctly) is possible but not scheduled — the
|
||||
> cloud-init template above already captures the bootstrap-
|
||||
> reproducibility goal.
|
||||
|
||||
### GPU Passthrough
|
||||
|
||||
| Parameter | Value |
|
||||
|-----------|-------|
|
||||
| Device | NVIDIA Tesla T4 (16GB GDDR6) |
|
||||
| PCIe Address | 0000:06:00.0 |
|
||||
| Assigned VM | VMID 201 (k8s-node1) — physical location only, no Terraform pin |
|
||||
| Node Label | `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD) |
|
||||
| Node Taint | `nvidia.com/gpu=true:PreferNoSchedule` (applied by `null_resource.gpu_node_config` to every NFD-tagged GPU node) |
|
||||
| Driver | NVIDIA GPU Operator |
|
||||
| Resource Name | `nvidia.com/gpu` |
|
||||
|
||||
### Resource Management Stack
|
||||
|
||||
| Component | Version/Mode | Purpose |
|
||||
|-----------|--------------|---------|
|
||||
| VPA | Goldilocks "Initial" mode | Resource recommendation (not auto-scaling) |
|
||||
| Kyverno | Policy engine | Auto-generate LimitRange + ResourceQuota per tier |
|
||||
| PriorityClass | Per tier (200K-900K) | Pod preemption during resource pressure |
|
||||
| QoS Class | Guaranteed (0-2), Burstable (3-4) | Eviction order |
|
||||
|
||||
## How It Works
|
||||
|
||||
### CPU Resource Management
|
||||
|
||||
**Policy**: No CPU limits cluster-wide, only CPU requests.
|
||||
|
||||
**Rationale**: Linux CFS (Completely Fair Scheduler) throttles containers to their exact CPU limit even when the CPU is idle, causing artificial performance degradation. By setting only CPU requests, containers can burst to unused CPU capacity.
|
||||
|
||||
**Implementation**:
|
||||
- All pods set `resources.requests.cpu` (reserves capacity)
|
||||
- No pods set `resources.limits.cpu`
|
||||
- Scheduler uses CPU requests for bin-packing
|
||||
- Kernel CFS shares unused CPU proportionally by requests
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
resources:
|
||||
requests:
|
||||
cpu: "500m"
|
||||
# No limits.cpu - can burst to idle CPU
|
||||
```
|
||||
|
||||
### Memory Resource Management
|
||||
|
||||
**Policy**: Memory requests = limits for stability.
|
||||
|
||||
**Rationale**: Memory is not compressible like CPU. A pod that exceeds its memory request can be OOMKilled unpredictably. Setting requests=limits ensures:
|
||||
- Predictable memory allocation
|
||||
- QoS class "Guaranteed" (tiers 0-2) or "Burstable" (tiers 3-4)
|
||||
- No surprise OOMKills during memory pressure
|
||||
|
||||
**Implementation**:
|
||||
- Tier 0-2: `requests.memory = limits.memory` (Guaranteed QoS)
|
||||
- Tier 3-4: `requests.memory < limits.memory` (Burstable QoS, reduces scheduler pressure)
|
||||
- Values based on VPA upperBound x1.2 (stable) or x1.3 (GPU/volatile)
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
# Tier 0-2 (Guaranteed)
|
||||
resources:
|
||||
requests:
|
||||
memory: "2Gi"
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
|
||||
# Tier 3-4 (Burstable)
|
||||
resources:
|
||||
requests:
|
||||
memory: "512Mi"
|
||||
limits:
|
||||
memory: "1Gi"
|
||||
```
|
||||
|
||||
### Vertical Pod Autoscaler (VPA)
|
||||
|
||||
**Mode**: Goldilocks in "Initial" mode (recommend-only, not auto-scaling).
|
||||
|
||||
**Why not Auto mode?**
|
||||
- VPA Auto mode directly updates Deployment specs, creating drift from Terraform state
|
||||
- Terraform manages all resources declaratively, so VPA changes would be reverted
|
||||
- Quarterly review process maintains control and aligns with planned maintenance windows
|
||||
|
||||
**Workflow**:
|
||||
1. VPA monitors pod resource usage over time
|
||||
2. Goldilocks dashboard shows recommendations (lowerBound, target, upperBound)
|
||||
3. Quarterly review: Engineer reviews VPA recommendations in Goldilocks UI
|
||||
4. Apply sizing: Update Terraform with `memory: <upperBound> * 1.2` (stable) or `* 1.3` (GPU/volatile)
|
||||
5. Terragrunt apply updates Deployment specs
|
||||
6. Pods restart with new resource allocations
|
||||
|
||||
**Stability Multipliers**:
|
||||
- **x1.2**: Stable services (databases, monitoring, core services)
|
||||
- **x1.3**: GPU workloads or volatile services (user-facing apps, ML inference)
|
||||
|
||||
### Tier-Based LimitRange
|
||||
|
||||
Kyverno automatically creates a LimitRange in each namespace based on its tier prefix.
|
||||
|
||||
| Tier | Default Memory | Max Memory | Default CPU | Max CPU |
|
||||
|------|----------------|------------|-------------|---------|
|
||||
| 0-core | 512Mi | 8Gi | 500m | 4 |
|
||||
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
|
||||
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
|
||||
| 3-edge | 256Mi | 4Gi | 250m | 2 |
|
||||
| 4-aux | 256Mi | 4Gi | 250m | 2 |
|
||||
|
||||
**Purpose**:
|
||||
- Prevents pods without explicit resources from requesting unlimited resources
|
||||
- Sets sensible defaults for sidecars and init containers
|
||||
- Enforces maximum per-container limits
|
||||
|
||||
**Example**: A pod in `4-aux-vaultwarden` without explicit resources gets:
|
||||
```yaml
|
||||
resources:
|
||||
requests:
|
||||
memory: 256Mi
|
||||
cpu: 250m
|
||||
limits:
|
||||
memory: 4Gi
|
||||
cpu: 2 # (ignored due to no-CPU-limits policy)
|
||||
```
|
||||
|
||||
### Tier-Based ResourceQuota
|
||||
|
||||
Kyverno automatically creates a ResourceQuota in each namespace based on its tier.
|
||||
|
||||
| Tier | CPU Limit | Memory Limit | Max Pods |
|
||||
|------|-----------|--------------|----------|
|
||||
| 0-core | 32 | 64Gi | 100 |
|
||||
| 1-cluster | 16 | 32Gi | 30 |
|
||||
| 2-gpu | 48 | 96Gi | 40 |
|
||||
| 3-edge | 16 | 32Gi | 30 |
|
||||
| 4-aux | 8 | 16Gi | 20 |
|
||||
|
||||
**Purpose**:
|
||||
- Prevents a single namespace from monopolizing cluster resources
|
||||
- Enforces tier-appropriate resource allocation
|
||||
- Protects critical services from lower-tier resource exhaustion
|
||||
|
||||
**Quota Exhaustion**: If a namespace exceeds its quota, new pods are rejected with `Forbidden: exceeded quota`.
|
||||
|
||||
### QoS Classes and Eviction
|
||||
|
||||
Kubernetes assigns QoS classes based on resource configuration:
|
||||
|
||||
| QoS Class | Condition | Eviction Priority | Tiers |
|
||||
|-----------|-----------|-------------------|-------|
|
||||
| Guaranteed | requests = limits (both CPU & memory) | Last | 0-core, 1-cluster, 2-gpu |
|
||||
| Burstable | requests < limits | Middle | 3-edge, 4-aux |
|
||||
| BestEffort | No requests or limits | First | None (not used) |
|
||||
|
||||
**Eviction Order during Memory Pressure**:
|
||||
1. BestEffort pods (none in cluster)
|
||||
2. Burstable pods (tier 3-4), lowest priority first
|
||||
3. Guaranteed pods (tier 0-2), lowest priority first
|
||||
|
||||
**Priority Classes**:
|
||||
- 0-core: 900000
|
||||
- 1-cluster: 700000
|
||||
- 2-gpu: 500000
|
||||
- 3-edge: 300000
|
||||
- 4-aux: 200000
|
||||
|
||||
During resource pressure, tier 4 pods are evicted before tier 3, tier 3 before tier 2, etc.
|
||||
|
||||
### Democratic-CSI Sidecar Resources
|
||||
|
||||
**Problem**: Democratic-CSI injects 3-4 sidecar containers per pod with PVCs:
|
||||
- `csi-driver-registrar`
|
||||
- `csi-provisioner`
|
||||
- `csi-attacher`
|
||||
- `csi-resizer`
|
||||
|
||||
Without explicit resources, each defaults to LimitRange default (256Mi), consuming 768Mi-1Gi per pod.
|
||||
|
||||
**Solution**: Explicitly set sidecar resources in Terraform:
|
||||
```hcl
|
||||
resources {
|
||||
requests = {
|
||||
memory = "32Mi"
|
||||
cpu = "10m"
|
||||
}
|
||||
limits = {
|
||||
memory = "80Mi"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Result**: 17 CSI sidecars go from 4.3GB (17 * 256Mi) to 544Mi (17 * 32Mi), freeing 3.7GB.
|
||||
|
||||
### GPU Resource Management
|
||||
|
||||
**Node Selection**: GPU pods must:
|
||||
1. Tolerate `nvidia.com/gpu=true:PreferNoSchedule` taint
|
||||
2. Select `nvidia.com/gpu.present=true` label (auto-applied by gpu-feature-discovery wherever the card is)
|
||||
3. Request `nvidia.com/gpu: 1` resource
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
spec:
|
||||
tolerations:
|
||||
- key: nvidia.com/gpu
|
||||
operator: Equal
|
||||
value: "true"
|
||||
effect: NoSchedule
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.present: "true"
|
||||
containers:
|
||||
- name: app
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
```
|
||||
|
||||
**Portability**: No Terraform code references a specific hostname for
|
||||
GPU scheduling. If the GPU card is physically moved to a different
|
||||
node, gpu-feature-discovery moves the `nvidia.com/gpu.present=true`
|
||||
label with it, and `null_resource.gpu_node_config` re-applies the
|
||||
`nvidia.com/gpu=true:PreferNoSchedule` taint to the new host on the
|
||||
next apply (discovery keyed on
|
||||
`feature.node.kubernetes.io/pci-10de.present=true`).
|
||||
|
||||
**GPU Workloads** (time-sliced — node advertises `Tesla-T4-SHARED`,
|
||||
`sharing-strategy=time-slicing`, `nvidia.com/gpu.replicas=100`, so many pods
|
||||
share the single T4; request `nvidia.com/gpu: 1` for a slice, not the whole card):
|
||||
- immich-machine-learning (CLIP smart-search + facial recognition, CUDA)
|
||||
- immich-server (NVENC/NVDEC video transcoding — `ffmpeg.accel=nvenc` + `accelDecode=true`)
|
||||
- Frigate (object-detection inference)
|
||||
- llama-cpp / llama-swap (LLM inference)
|
||||
- nvidia-exporter + gpu-pod-exporter (DCGM metrics)
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `modules/namespace_config/` | Kyverno policies for LimitRange + ResourceQuota generation |
|
||||
| `modules/k8s_app/main.tf` | Default resource templates for apps |
|
||||
| `stacks/<service>/terragrunt.hcl` | Per-service resource overrides |
|
||||
| `modules/gpu_app/` | GPU-specific resource templates |
|
||||
|
||||
### Terraform Resource Configuration
|
||||
|
||||
**Standard App** (no PVC):
|
||||
```hcl
|
||||
module "app" {
|
||||
source = "../../modules/k8s_app"
|
||||
|
||||
resources = {
|
||||
requests = {
|
||||
memory = "1Gi" # VPA upperBound * 1.2
|
||||
cpu = "500m"
|
||||
}
|
||||
limits = {
|
||||
memory = "1Gi" # Same as request
|
||||
# No CPU limit
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**App with Democratic-CSI PVC**:
|
||||
```hcl
|
||||
module "app" {
|
||||
source = "../../modules/k8s_app"
|
||||
|
||||
resources = {
|
||||
requests = {
|
||||
memory = "2Gi"
|
||||
cpu = "500m"
|
||||
}
|
||||
limits = {
|
||||
memory = "2Gi"
|
||||
}
|
||||
}
|
||||
|
||||
sidecar_resources = {
|
||||
requests = {
|
||||
memory = "32Mi"
|
||||
cpu = "10m"
|
||||
}
|
||||
limits = {
|
||||
memory = "80Mi"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**GPU App**:
|
||||
```hcl
|
||||
module "gpu_app" {
|
||||
source = "../../modules/gpu_app"
|
||||
|
||||
gpu_count = 1
|
||||
|
||||
resources = {
|
||||
requests = {
|
||||
memory = "8Gi" # VPA upperBound * 1.3
|
||||
cpu = "2"
|
||||
}
|
||||
limits = {
|
||||
memory = "8Gi"
|
||||
nvidia.com/gpu = 1
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Kyverno Policies
|
||||
|
||||
**LimitRange Generation** (`modules/namespace_config/limitrange-policy.yaml`):
|
||||
```yaml
|
||||
apiVersion: kyverno.io/v1
|
||||
kind: ClusterPolicy
|
||||
metadata:
|
||||
name: generate-limitrange
|
||||
spec:
|
||||
rules:
|
||||
- name: generate-limitrange-0-core
|
||||
match:
|
||||
resources:
|
||||
kinds:
|
||||
- Namespace
|
||||
name: "0-core-*"
|
||||
generate:
|
||||
kind: LimitRange
|
||||
data:
|
||||
spec:
|
||||
limits:
|
||||
- default:
|
||||
memory: 512Mi
|
||||
cpu: 500m
|
||||
defaultRequest:
|
||||
memory: 512Mi
|
||||
cpu: 500m
|
||||
max:
|
||||
memory: 8Gi
|
||||
cpu: 4
|
||||
type: Container
|
||||
```
|
||||
|
||||
**ResourceQuota Generation** (`modules/namespace_config/resourcequota-policy.yaml`):
|
||||
```yaml
|
||||
apiVersion: kyverno.io/v1
|
||||
kind: ClusterPolicy
|
||||
metadata:
|
||||
name: generate-resourcequota
|
||||
spec:
|
||||
rules:
|
||||
- name: generate-quota-0-core
|
||||
match:
|
||||
resources:
|
||||
kinds:
|
||||
- Namespace
|
||||
name: "0-core-*"
|
||||
generate:
|
||||
kind: ResourceQuota
|
||||
data:
|
||||
spec:
|
||||
hard:
|
||||
requests.cpu: "32"
|
||||
requests.memory: 64Gi
|
||||
pods: "100"
|
||||
```
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why no CPU limits?
|
||||
|
||||
**Decision**: Set CPU requests but never set CPU limits.
|
||||
|
||||
**Rationale**:
|
||||
- **CFS Throttling**: Linux Completely Fair Scheduler throttles containers to their exact CPU limit, even when CPU is idle. This causes artificial performance degradation.
|
||||
- **Burstability**: Services can burst to unused CPU during low-load periods, improving response times.
|
||||
- **Memory-bound**: With 272GB physical host RAM (~160GB allocated to K8s VMs), memory is no longer the primary constraint. ~112GB headroom available for new VMs.
|
||||
|
||||
**Tradeoff**: A runaway process could monopolize CPU. Mitigated by CPU requests reserving capacity and PriorityClass preemption.
|
||||
|
||||
**Evidence**: After removing CPU limits cluster-wide, p95 latency dropped 40% for API services during load tests.
|
||||
|
||||
### Why Goldilocks in Initial mode instead of Auto?
|
||||
|
||||
**Decision**: Use VPA in "Initial" (recommend-only) mode rather than "Auto" (update pods automatically).
|
||||
|
||||
**Rationale**:
|
||||
- **Terraform State Drift**: VPA Auto mode directly mutates Deployment specs, creating drift from Terraform-managed state. Next Terraform apply reverts VPA changes.
|
||||
- **Declarative Workflow**: Terraform is the source of truth. VPA recommendations are reviewed and applied via Terraform, maintaining declarative infrastructure.
|
||||
- **Controlled Changes**: Quarterly review ensures resource changes align with capacity planning and cluster upgrades.
|
||||
- **Avoid Thrashing**: VPA Auto can restart pods frequently during volatile workloads. Manual application reduces churn.
|
||||
|
||||
**Tradeoff**: Requires quarterly manual review. Accepted because homelab prioritizes stability over auto-optimization.
|
||||
|
||||
### Why memory requests = limits for tiers 0-2?
|
||||
|
||||
**Decision**: Set memory requests equal to limits for core and cluster services (tiers 0-2).
|
||||
|
||||
**Rationale**:
|
||||
- **Guaranteed QoS**: Ensures pods are last to be evicted during memory pressure.
|
||||
- **Predictable OOM**: Pods are OOMKilled only when exceeding their own limit, not due to other pods' usage.
|
||||
- **Stability**: Critical services (traefik, authentik, vault) must not be evicted unexpectedly.
|
||||
|
||||
**Tradeoff**: Cannot burst above limit. Accepted because critical services are right-sized via VPA.
|
||||
|
||||
### Why Burstable QoS for tiers 3-4?
|
||||
|
||||
**Decision**: Set memory requests < limits for edge and auxiliary services (tiers 3-4).
|
||||
|
||||
**Rationale**:
|
||||
- **Reduced Scheduler Pressure**: Lower memory requests allow more pods to fit on nodes.
|
||||
- **Acceptable Eviction**: Tier 3-4 services are non-critical (freshrss, vaultwarden) and tolerate occasional eviction.
|
||||
- **Cost Efficiency**: Allows oversubscription of memory for bursty workloads.
|
||||
|
||||
**Tradeoff**: Pods may be evicted during memory pressure. Accepted because tier 3-4 services have PriorityClass 200K-300K.
|
||||
|
||||
### Why VPA upperBound * 1.2 (or 1.3)?
|
||||
|
||||
**Decision**: Set memory limits to VPA upperBound * 1.2 for stable services, * 1.3 for GPU/volatile services.
|
||||
|
||||
**Rationale**:
|
||||
- **Headroom**: VPA upperBound is the observed maximum usage. Adding 20-30% headroom prevents OOMKills during traffic spikes.
|
||||
- **Growth Buffer**: Services grow over time (more users, more data). Headroom delays the need for manual intervention.
|
||||
- **GPU Volatility**: GPU workloads (ML inference) have unpredictable memory usage. 30% headroom reduces OOMKills.
|
||||
|
||||
**Tradeoff**: Slightly higher memory allocation. Accepted because 272GB RAM provides ample capacity.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Pods stuck in Pending state
|
||||
|
||||
**Symptom**: Pod shows `status: Pending` with event `FailedScheduling`.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Common Causes**:
|
||||
|
||||
1. **ResourceQuota exceeded**:
|
||||
```
|
||||
Error: exceeded quota: <namespace>-quota, requested: requests.memory=2Gi, used: requests.memory=14Gi, limited: requests.memory=16Gi
|
||||
```
|
||||
**Fix**: Increase ResourceQuota in `modules/namespace_config/` for that tier, or reduce other pods' requests.
|
||||
|
||||
2. **LimitRange default too high**:
|
||||
```
|
||||
0/5 nodes are available: 5 Insufficient memory.
|
||||
```
|
||||
**Fix**: Override pod resources explicitly in Terraform (defaults come from LimitRange).
|
||||
|
||||
3. **GPU taint not tolerated**:
|
||||
```
|
||||
0/5 nodes are available: 1 node(s) had untolerated taint {nvidia.com/gpu: true}, 4 Insufficient nvidia.com/gpu.
|
||||
```
|
||||
**Fix**: Add toleration and nodeSelector for GPU pods.
|
||||
|
||||
4. **No nodes with GPU**:
|
||||
```
|
||||
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
|
||||
```
|
||||
**Fix**: Verify the GPU-carrying node is Ready and has the `nvidia.com/gpu.present=true` label. Check `kubectl get nodes -l nvidia.com/gpu.present=true` — if empty, gpu-feature-discovery hasn't labeled any node (operator not running, driver not loaded, or PCI passthrough broken).
|
||||
|
||||
### Pods OOMKilled repeatedly
|
||||
|
||||
**Symptom**: Pod shows `status: OOMKilled` in events, restarts frequently.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
kubectl top pod <pod-name> -n <namespace> # Current usage
|
||||
kubectl get limitrange -n <namespace> -o yaml # Check defaults
|
||||
```
|
||||
|
||||
**Common Causes**:
|
||||
|
||||
1. **Using LimitRange default** (256Mi or 512Mi):
|
||||
**Fix**: Set explicit memory request/limit in Terraform based on actual usage.
|
||||
|
||||
2. **Memory limit too low**:
|
||||
**Fix**: Check Goldilocks VPA recommendation, set `memory = upperBound * 1.2`.
|
||||
|
||||
3. **Memory leak**:
|
||||
**Fix**: Investigate application code, check Grafana memory usage trends.
|
||||
|
||||
### Democratic-CSI sidecars consuming excessive memory
|
||||
|
||||
**Symptom**: Pods with PVCs have 3-4 sidecar containers, each using 256Mi (LimitRange default).
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].name | contains("csi")) | {name: .metadata.name, namespace: .metadata.namespace}'
|
||||
kubectl top pod <pod-name> -n <namespace> --containers
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
Update Terraform to override sidecar resources:
|
||||
```hcl
|
||||
sidecar_resources = {
|
||||
requests = {
|
||||
memory = "32Mi"
|
||||
cpu = "10m"
|
||||
}
|
||||
limits = {
|
||||
memory = "80Mi"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Tier 3-4 pods evicted during resource pressure
|
||||
|
||||
**Symptom**: Lower-tier pods show `status: Evicted` with reason `The node was low on resource: memory`.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl get events --sort-by='.lastTimestamp' | grep Evicted
|
||||
kubectl top nodes # Check node memory usage
|
||||
```
|
||||
|
||||
**Expected Behavior**: This is normal. Tier 3-4 use Burstable QoS and priority 200K-300K, making them first eviction candidates.
|
||||
|
||||
**Fix**:
|
||||
- If evictions are frequent: Increase node memory or reduce tier 3-4 memory limits
|
||||
- If evicted service is critical: Promote to tier 1 or 2
|
||||
- If node is overloaded: Check for memory leaks in tier 0-2 services
|
||||
|
||||
### GPU pods not scheduling on GPU node
|
||||
|
||||
**Symptom**: GPU pod stuck in Pending with event `0/5 nodes are available: 1 node(s) had untolerated taint`.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl describe node k8s-node1 | grep Taints
|
||||
kubectl describe pod <pod-name> -n <namespace> | grep -A5 Tolerations
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
Add GPU toleration and selector to pod spec:
|
||||
```yaml
|
||||
spec:
|
||||
tolerations:
|
||||
- key: nvidia.com/gpu
|
||||
operator: Equal
|
||||
value: "true"
|
||||
effect: NoSchedule
|
||||
nodeSelector:
|
||||
nvidia.com/gpu.present: "true"
|
||||
containers:
|
||||
- name: app
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
```
|
||||
|
||||
### Node out of memory despite low pod usage
|
||||
|
||||
**Symptom**: Node shows memory pressure, but `kubectl top pods` shows low usage.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# SSH to node
|
||||
ssh k8s-node2
|
||||
free -h
|
||||
ps aux --sort=-%mem | head -20
|
||||
```
|
||||
|
||||
**Common Causes**:
|
||||
1. **Kernel memory**: Page cache, slab allocator not shown in `kubectl top`
|
||||
2. **System services**: kubelet, containerd, systemd-journald
|
||||
3. **Zombie containers**: Old containers not cleaned up
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Clear page cache (safe on production)
|
||||
echo 3 > /proc/sys/vm/drop_caches
|
||||
|
||||
# Cleanup stopped containers
|
||||
crictl rmp $(crictl ps -a --state Exited -q)
|
||||
|
||||
# Restart kubelet (forces cleanup)
|
||||
systemctl restart kubelet
|
||||
```
|
||||
|
||||
### VPA recommendations not appearing in Goldilocks
|
||||
|
||||
**Symptom**: Goldilocks dashboard shows no recommendations for a service.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl get vpa -n <namespace>
|
||||
kubectl describe vpa <vpa-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Common Causes**:
|
||||
1. **VPA not created**: Terraform module missing VPA resource
|
||||
2. **Insufficient data**: VPA needs 24h of metrics before recommending
|
||||
3. **VPA pod not running**: VPA controller/recommender crashed
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check VPA pods
|
||||
kubectl get pods -n kube-system | grep vpa
|
||||
|
||||
# Check VPA logs
|
||||
kubectl logs -n kube-system deployment/vpa-recommender
|
||||
|
||||
# Restart VPA if needed
|
||||
kubectl rollout restart -n kube-system deployment/vpa-recommender
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Overview](overview.md) - VM inventory and cluster architecture
|
||||
- [Multi-tenancy](multi-tenancy.md) - Tier system and namespace isolation
|
||||
- [Monitoring](monitoring.md) - Resource usage dashboards and Goldilocks UI
|
||||
- [Runbooks: Right-Sizing](../../runbooks/right-sizing.md) - Quarterly VPA review process
|
||||
- [Runbooks: GPU Troubleshooting](../../runbooks/gpu-troubleshooting.md)
|
||||
- [Runbooks: Node Maintenance](../../runbooks/node-maintenance.md)
|
||||
446
docs/architecture/databases.md
Normal file
446
docs/architecture/databases.md
Normal file
|
|
@ -0,0 +1,446 @@
|
|||
# Databases
|
||||
|
||||
## Overview
|
||||
|
||||
The cluster provides shared database services (PostgreSQL, MySQL, Redis) for multi-tenant workloads with automated credential rotation via Vault. PostgreSQL uses CloudNativePG (CNPG) with PgBouncer connection pooling, MySQL runs as an InnoDB Cluster with anti-affinity rules for stability, and Redis provides a shared cache layer. SQLite is used for per-app local storage with careful attention to filesystem compatibility.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Apps
|
||||
A1[trading-bot]
|
||||
A2[apple-health-data]
|
||||
A3[wrongmove]
|
||||
A4[claude-memory-mcp]
|
||||
end
|
||||
|
||||
subgraph PostgreSQL
|
||||
A1 --> PGB[PgBouncer<br/>3 replicas]
|
||||
A2 --> PGB
|
||||
A4 --> PGB
|
||||
PGB --> CNPG_RW[CNPG Primary<br/>pg-cluster-rw.dbaas]
|
||||
CNPG_RW --> CNPG_R1[CNPG Replica 1]
|
||||
end
|
||||
|
||||
subgraph MySQL
|
||||
A3 --> MYC[MySQL InnoDB Cluster<br/>3 instances]
|
||||
MYC --> LVM1[Proxmox-LVM Storage]
|
||||
MYC -.anti-affinity.-> NODE1[Exclude k8s-node1<br/>GPU node]
|
||||
end
|
||||
|
||||
subgraph Redis
|
||||
A1 --> RED[Redis<br/>redis.redis.svc.cluster.local]
|
||||
end
|
||||
|
||||
subgraph Vault
|
||||
V[Vault DB Engine]
|
||||
V -.7-day rotation.-> PGB
|
||||
V -.7-day rotation.-> MYC
|
||||
end
|
||||
|
||||
style CNPG_RW fill:#2088ff
|
||||
style PGB fill:#4c9e47
|
||||
style MYC fill:#f39c12
|
||||
style RED fill:#dc382d
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| PostgreSQL (CNPG) | CloudNativePG (PostGIS 16: `postgis:16`) | `dbaas` namespace | Primary/replica cluster, auto-failover |
|
||||
| PgBouncer | 3 replicas | `dbaas` namespace | Connection pooling for PostgreSQL |
|
||||
| MySQL InnoDB Cluster | 8.4.4 | `dbaas` namespace | Multi-master MySQL cluster |
|
||||
| Redis | Latest | `redis` namespace | Shared cache layer |
|
||||
| Vault DB Engine | - | `vault` namespace | Automated credential rotation |
|
||||
|
||||
### Database Endpoints
|
||||
|
||||
| Service | Endpoint | Notes |
|
||||
|---------|----------|-------|
|
||||
| PostgreSQL (primary) | `pg-cluster-rw.dbaas.svc.cluster.local` | Always use this via PgBouncer |
|
||||
| PgBouncer | `pgbouncer.dbaas.svc.cluster.local` | Connection pool (3 replicas) |
|
||||
| MySQL | `mysql.dbaas.svc.cluster.local` | InnoDB Cluster VIP |
|
||||
| Redis | `redis.redis.svc.cluster.local` | Shared instance |
|
||||
| PostgreSQL (compat) | `postgresql.dbaas.svc.cluster.local` | Compatibility service, selects CNPG primary |
|
||||
|
||||
## How It Works
|
||||
|
||||
### PostgreSQL (CNPG + PgBouncer)
|
||||
|
||||
1. **CNPG Cluster**: Manages PostgreSQL primary and replicas
|
||||
- Primary: `pg-cluster-rw.dbaas.svc.cluster.local`
|
||||
- Auto-failover on primary failure
|
||||
- Replicas for read scaling
|
||||
|
||||
2. **PgBouncer**: Connection pooling layer (3 replicas)
|
||||
- Apps connect to PgBouncer, not directly to PostgreSQL
|
||||
- Reduces connection overhead
|
||||
- Load balances across PgBouncer instances
|
||||
|
||||
3. **Credential Rotation**: Vault DB engine rotates credentials every 7 days
|
||||
- Apps fetch credentials from Vault on startup
|
||||
- Vault manages rotation lifecycle
|
||||
|
||||
**Used by**:
|
||||
- trading-bot
|
||||
- apple-health-data (health)
|
||||
- linkwarden
|
||||
- affine
|
||||
- woodpecker
|
||||
- claude-memory-mcp
|
||||
- tripit
|
||||
- 5 active PG roles
|
||||
|
||||
### MySQL InnoDB Cluster
|
||||
|
||||
1. **Cluster Topology**: 3 MySQL instances with auto-recovery
|
||||
- Multi-master replication
|
||||
- Automatic split-brain resolution
|
||||
|
||||
2. **Storage**: Proxmox-LVM persistent volumes
|
||||
- Thin-provisioned LVM on Proxmox hosts
|
||||
- Block-level storage with proper write guarantees
|
||||
|
||||
3. **Anti-Affinity**: Excludes k8s-node1 (GPU node)
|
||||
- Pods scheduled to node2, node3, node4, etc.
|
||||
- Keeps database workloads off the GPU-dedicated node
|
||||
|
||||
4. **Resource Allocation**: 2Gi request / 3Gi limit
|
||||
- Right-sized based on VPA recommendations
|
||||
|
||||
**Used by**:
|
||||
- wrongmove (realestate-crawler)
|
||||
- speedtest
|
||||
- codimd
|
||||
- nextcloud
|
||||
- shlink
|
||||
- grafana
|
||||
- technitium (DNS query logs via QueryLogsMySqlApp plugin)
|
||||
|
||||
### Redis
|
||||
|
||||
Single **standalone** instance shared by all consumers (Immich, Authentik, Nextcloud, Paperless, Dawarich Sidekiq, Celery apps, Traefik, etc.). Clients talk to `redis-master.redis.svc.cluster.local:6379`, which now selects the single redis pod directly. **No Sentinel, no HAProxy, no replicas** — reverted from 3-node HA on 2026-05-30 (see "Why standalone" below).
|
||||
|
||||
**Architecture**:
|
||||
|
||||
1 pod in StatefulSet `redis-v2` (`replicas=1`, `podManagementPolicy=Parallel` retained for STS-field immutability), running `redis` + `redis_exporter` containers on `docker.io/library/redis:8-alpine` (8.6.2). Data on a `proxmox-lvm-encrypted` PVC (`data-redis-v2-0`, 5Gi→20Gi autoresize).
|
||||
|
||||
- `maxmemory=640mb` (83% of the 768Mi pod limit), **`maxmemory-policy=volatile-lru`**. The instance is shared by two workload classes: CACHES (want LRU eviction of disposable keys) and QUEUES (Immich BullMQ `bull:*`, Celery `_kombu:*` — must never be evicted or jobs vanish). `volatile-lru` evicts only keys carrying a TTL (caches set them) and never touches TTL-less keys (queue jobs), serving both correctly in one instance. Backstop: alert `RedisMemoryPressure` at 80% — if it ever fills with non-volatile keys, writes error like `noeviction`.
|
||||
- Persistence: RDB (`save 900 1 / 300 100 / 60 10000`) + AOF `appendfsync=everysec`. `aof-load-corrupt-tail-max-size=1024` tolerates ≤1KB of AOF tail garbage from an unclean reboot instead of crashlooping. Disk-wear (sdb Samsung 850 EVO, 150 TBW): Redis contributes <1 GB/day cluster-wide → 40+ year runway.
|
||||
- Memory `requests=limits=768Mi`. BGSAVE + AOF-rewrite fork can double RSS via COW; `auto-aof-rewrite-percentage=200` + `auto-aof-rewrite-min-size=128mb` tune down rewrite frequency.
|
||||
- Service `redis-master` (name/DNS unchanged across the HA teardown so no consumer needed editing). Keel opt-out (`keel.sh/policy=never`, label + annotation) — a prior patch-bump to `:8.0.6-alpine` rejected the AOF config and crashed it.
|
||||
- Weekly RDB backup to NFS (`/srv/nfs/redis-backup/`, Sunday 03:00, 28-day retention, Pushgateway metrics).
|
||||
- Auth disabled — NetworkPolicy is the isolation layer. `requirepass` + creds rollout to all clients remains a planned follow-up.
|
||||
- **Downtime model**: a single instance means a pod restart (image bump, node drain, OOM) is a few-seconds cluster-wide Redis blip. Explicitly accepted (Viktor, 2026-05-30) as the price of eliminating the HA failure modes below. There is no PDB (a single-replica PDB would only block node drains).
|
||||
|
||||
**Observability**: `oliver006/redis_exporter:v1.62.0` sidecar on port 9121, auto-scraped. Alerts: `RedisDown`, `RedisMemoryPressure` (>80%), `RedisEvictions`, `RedisForkLatencyHigh`, `RedisAOFRewriteLong`, `RedisBackupStale`, `RedisBackupNeverSucceeded`. (`RedisReplicationLagHigh` + `RedisReplicasMissing` removed with the replicas.)
|
||||
|
||||
**Why standalone** — HA Redis caused more outages than it prevented in this homelab. Five incidents: (a) 2026-04-04 service selector routed writes to a replica → `READONLY`; (b) 2026-04-19 AM master OOMKilled during BGSAVE+PSYNC (256Mi too tight); (c) 2026-04-19 PM sentinel quorum drift (2 sentinels, no majority) routed writes to a slave; (d) 2026-04-22 five-factor flap cascade (soft anti-affinity co-located pods + aggressive sentinel/probe timing + HAProxy polling race); (e) **2026-05-30 split-brain** — `redis-v2-0` booted during a network partition, hit the init script's deterministic "pod-0 is bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected `redis-v2-2`; HAProxy's `expect rstring role:master` matched both and round-robined client connections across them, so Immich enqueued BullMQ jobs on one master while its workers blocked-popped on the other → every queue wedged, new-upload thumbnails 404'd cluster-wide. The 3-sentinel design (beads `code-v2b`) was built specifically to prevent split-brain after incident (c), yet the bootstrap fallback manufactured one anyway. Conclusion: for a homelab cache/broker, a single instance with a few-seconds restart blip is strictly simpler and more reliable than chasing Sentinel correctness. Mirrors the MySQL InnoDB-Cluster → standalone reversion (2026-04-16). Post-mortem: `docs/post-mortems/2026-05-30-redis-split-brain.md`.
|
||||
|
||||
### SQLite (Per-App)
|
||||
|
||||
**Apps using SQLite**:
|
||||
- headscale
|
||||
- vaultwarden
|
||||
- plotting-book
|
||||
- holiday-planner
|
||||
- priority-pass
|
||||
|
||||
**Critical**: SQLite on NFS is unreliable
|
||||
- NFS lacks proper `fsync()` support
|
||||
- Causes database corruption under load
|
||||
- **Solution**: Use Proxmox-LVM volumes for SQLite apps
|
||||
|
||||
### Vault Database Engine
|
||||
|
||||
**Rotation Schedule**: 7 days (604800s)
|
||||
|
||||
**PostgreSQL Rotation**:
|
||||
- health (apple-health-data)
|
||||
- linkwarden
|
||||
- affine
|
||||
- woodpecker
|
||||
- claude_memory
|
||||
- tripit (Vault static role `pg-tripit`)
|
||||
|
||||
**MySQL Rotation**:
|
||||
- speedtest
|
||||
- wrongmove
|
||||
- codimd
|
||||
- nextcloud
|
||||
- shlink
|
||||
- grafana
|
||||
- technitium (password synced to Technitium DNS app via CronJob every 6h)
|
||||
|
||||
**Excluded from Rotation**:
|
||||
- authentik (uses PgBouncer, incompatible)
|
||||
- crowdsec (Helm-baked credentials)
|
||||
- Root users (manual management)
|
||||
|
||||
**How Rotation Works**:
|
||||
1. Vault rotates the MySQL user's password (static role, 7-day period)
|
||||
2. ExternalSecrets Operator syncs new password to K8s Secret (15-min refresh)
|
||||
3. Apps read from K8s Secret via `secret_key_ref` env vars
|
||||
4. Special case: Technitium stores its MySQL connection in internal app config, so a CronJob pushes the rotated password to the Technitium API every 6 hours
|
||||
|
||||
## Configuration
|
||||
|
||||
### Terraform Shared Variables
|
||||
|
||||
Always use shared variables, never hardcode endpoints:
|
||||
|
||||
```hcl
|
||||
variable "postgresql_host" {
|
||||
default = "pgbouncer.dbaas.svc.cluster.local"
|
||||
}
|
||||
|
||||
variable "mysql_host" {
|
||||
default = "mysql.dbaas.svc.cluster.local"
|
||||
}
|
||||
|
||||
variable "redis_host" {
|
||||
default = "redis.redis.svc.cluster.local"
|
||||
}
|
||||
```
|
||||
|
||||
### Vault Paths
|
||||
|
||||
**PostgreSQL Dynamic Credentials**:
|
||||
```
|
||||
database/creds/postgres-<app>-role
|
||||
```
|
||||
|
||||
**MySQL Dynamic Credentials**:
|
||||
```
|
||||
database/creds/mysql-<app>-role
|
||||
```
|
||||
|
||||
**Static Credentials** (non-rotated):
|
||||
```
|
||||
secret/data/mysql/root
|
||||
secret/data/postgres/root
|
||||
```
|
||||
|
||||
### Version Pinning
|
||||
|
||||
**Diun Monitoring Disabled** for database images to prevent unwanted version bumps:
|
||||
- MySQL: pinned version in Terraform
|
||||
- PostgreSQL: pinned CNPG operator version
|
||||
- Redis: pinned image tag
|
||||
|
||||
**Rationale**: Database upgrades require careful planning and testing
|
||||
|
||||
### Example Terraform Stack (PostgreSQL)
|
||||
|
||||
```hcl
|
||||
resource "vault_database_secret_backend_role" "app" {
|
||||
backend = "database"
|
||||
name = "postgres-myapp-role"
|
||||
db_name = "postgres"
|
||||
creation_statements = [
|
||||
"CREATE USER \"{{name}}\" WITH PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
|
||||
"GRANT ALL PRIVILEGES ON DATABASE myapp TO \"{{name}}\";"
|
||||
]
|
||||
default_ttl = 604800 # 7 days
|
||||
max_ttl = 604800
|
||||
}
|
||||
|
||||
resource "kubernetes_secret" "db_creds" {
|
||||
metadata {
|
||||
name = "myapp-db"
|
||||
namespace = "default"
|
||||
}
|
||||
|
||||
data = {
|
||||
host = var.postgresql_host
|
||||
database = "myapp"
|
||||
# App fetches username/password from Vault at runtime
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why CNPG Instead of Postgres Operator?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Zalando Postgres Operator**: Mature but complex
|
||||
2. **Bitnami PostgreSQL Helm**: Simple but manual failover
|
||||
3. **CNPG (chosen)**: Kubernetes-native, auto-failover, active development
|
||||
|
||||
**Benefits**:
|
||||
- Native Kubernetes CRDs
|
||||
- Automatic failover and recovery
|
||||
- Active community and updates
|
||||
- Better resource efficiency than Zalando
|
||||
|
||||
### Why PgBouncer for PostgreSQL?
|
||||
|
||||
- Reduces connection overhead (apps create many connections)
|
||||
- Load balances across PgBouncer replicas
|
||||
- Essential for apps that don't implement connection pooling
|
||||
- Required for Vault DB engine compatibility with some apps
|
||||
|
||||
### Why MySQL InnoDB Cluster?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Single MySQL instance**: No HA
|
||||
2. **Galera Cluster**: Complex, split-brain issues
|
||||
3. **InnoDB Cluster (chosen)**: Built-in multi-master, auto-recovery
|
||||
|
||||
**Benefits**:
|
||||
- Native MySQL HA solution
|
||||
- Automatic split-brain resolution
|
||||
- Simpler than Galera
|
||||
|
||||
### Why Block Storage for Databases?
|
||||
|
||||
- NFS lacks proper `fsync()` support (causes SQLite corruption)
|
||||
- Proxmox-LVM provides block-level storage with proper write guarantees
|
||||
- Lower latency than NFS for database workloads
|
||||
|
||||
### Why 7-Day Credential Rotation?
|
||||
|
||||
- Balance between security (shorter is better) and operational overhead
|
||||
- 7 days allows ample time to debug issues before next rotation
|
||||
- Reduces rotation-related disruptions while maintaining security hygiene
|
||||
|
||||
### Why Shared Redis (Not Per-App)?
|
||||
|
||||
- Most apps use Redis for ephemeral data (caching, sessions)
|
||||
- Over-provisioning Redis wastes memory
|
||||
- Shared instance sufficient for current load
|
||||
- Can migrate to per-app if needed
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### PostgreSQL: "Too many connections"
|
||||
|
||||
**Cause**: Apps connecting directly to PostgreSQL instead of PgBouncer
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check PgBouncer is running
|
||||
kubectl get pods -n dbaas | grep pgbouncer
|
||||
|
||||
# Verify apps use pgbouncer.dbaas, not pg-cluster-rw
|
||||
kubectl get configmap <app-config> -o yaml | grep postgres
|
||||
```
|
||||
|
||||
### PostgreSQL: Primary Failover Not Working
|
||||
|
||||
**Cause**: CNPG controller not running or network partition
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check CNPG operator
|
||||
kubectl get pods -n cnpg-system
|
||||
|
||||
# Check cluster status
|
||||
kubectl get cluster -n dbaas
|
||||
|
||||
# Manually trigger failover (last resort)
|
||||
kubectl cnpg promote pg-cluster-2 -n dbaas
|
||||
```
|
||||
|
||||
### MySQL: Pod Stuck on Excluded Node
|
||||
|
||||
**Cause**: Anti-affinity rule not applied (should exclude k8s-node1)
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check pod affinity rules
|
||||
kubectl get pod <mysql-pod> -n dbaas -o yaml | grep -A 10 affinity
|
||||
|
||||
# Delete pod to reschedule
|
||||
kubectl delete pod <mysql-pod> -n dbaas
|
||||
```
|
||||
|
||||
### MySQL: Pod Scheduled on GPU Node
|
||||
|
||||
**Cause**: Anti-affinity rule not preventing scheduling on k8s-node1
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check pod affinity rules
|
||||
kubectl get pod <mysql-pod> -n dbaas -o yaml | grep -A 10 affinity
|
||||
|
||||
# Delete pod to reschedule away from node1
|
||||
kubectl delete pod <mysql-pod> -n dbaas
|
||||
```
|
||||
|
||||
### SQLite: Database Corruption
|
||||
|
||||
**Cause**: SQLite on NFS volume
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check volume type
|
||||
kubectl get pv | grep <app>
|
||||
|
||||
# If NFS, migrate to proxmox-lvm:
|
||||
# 1. Create proxmox-lvm PVC
|
||||
# 2. Backup SQLite database
|
||||
# 3. Restore to proxmox-lvm volume
|
||||
# 4. Update app to use new volume
|
||||
```
|
||||
|
||||
### Vault Rotation: "User already exists"
|
||||
|
||||
**Cause**: Previous rotation failed to clean up
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Connect to database
|
||||
kubectl exec -it <mysql-pod> -n dbaas -- mysql -u root -p
|
||||
|
||||
# List users
|
||||
SELECT user, host FROM mysql.user WHERE user LIKE 'v-root-%';
|
||||
|
||||
# Drop stale users
|
||||
DROP USER 'v-root-postgres-<hash>'@'%';
|
||||
|
||||
# Retry rotation
|
||||
vault read database/rotate-root/postgres
|
||||
```
|
||||
|
||||
### Redis: Out of Memory
|
||||
|
||||
**Cause**: No eviction policy configured
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Connect to Redis
|
||||
kubectl exec -it redis-0 -n redis -- redis-cli
|
||||
|
||||
# Set eviction policy
|
||||
CONFIG SET maxmemory-policy allkeys-lru
|
||||
|
||||
# Persist config
|
||||
CONFIG REWRITE
|
||||
```
|
||||
|
||||
### App Can't Connect: "Connection refused"
|
||||
|
||||
**Cause**: Service endpoint not reachable or PgBouncer not running
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check service endpoints
|
||||
kubectl get endpoints pgbouncer -n dbaas
|
||||
kubectl get endpoints postgresql -n dbaas
|
||||
|
||||
# Update app to use pgbouncer
|
||||
kubectl set env deployment/<app> DB_HOST=pgbouncer.dbaas.svc.cluster.local
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [CI/CD Pipeline](./ci-cd.md) — Database credentials in CI/CD
|
||||
- [Multi-Tenancy](./multi-tenancy.md) — Per-user database provisioning
|
||||
- Runbook: `../runbooks/database-failover.md` — Manual failover procedures
|
||||
- Runbook: `../runbooks/vault-rotation-troubleshooting.md` — Debug credential rotation
|
||||
- Vault documentation: Database secrets engine
|
||||
- CNPG documentation: Cluster configuration
|
||||
513
docs/architecture/dns.md
Normal file
513
docs/architecture/dns.md
Normal file
|
|
@ -0,0 +1,513 @@
|
|||
# DNS Architecture
|
||||
|
||||
Last updated: 2026-04-19 (WS C — NodeLocal DNSCache deployed; WS D — pfSense Unbound replaces dnsmasq; WS E — Kea multi-IP DHCP option 6 + TSIG-signed DDNS)
|
||||
|
||||
## Overview
|
||||
|
||||
DNS is served by a split architecture: **Technitium DNS** handles internal resolution (`.viktorbarzin.lan`) and recursive lookups, while **Cloudflare DNS** manages all public domains (`.viktorbarzin.me`). Kubernetes pods use **CoreDNS** which forwards to Technitium for internal zones. All three Technitium instances run on encrypted block storage with zone replication via AXFR every 30 minutes. A **NodeLocal DNSCache** DaemonSet runs on every node and transparently intercepts pod DNS traffic, caching responses locally so pods keep resolving even during CoreDNS, Technitium, or pfSense disruptions.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "External"
|
||||
Internet[Internet Clients]
|
||||
CF[Cloudflare DNS<br/>~50 domains<br/>viktorbarzin.me]
|
||||
CFTunnel[Cloudflared Tunnel<br/>3 replicas]
|
||||
end
|
||||
|
||||
subgraph "LAN (192.168.1.0/24)"
|
||||
LAN[LAN Clients<br/>WiFi / Wired]
|
||||
TPLINK[TP-Link AP<br/>Dumb AP only]
|
||||
end
|
||||
|
||||
subgraph "pfSense (10.0.20.1)"
|
||||
pf_unbound[Unbound<br/>Resolver<br/>auth-zone AXFR]
|
||||
pf_kea[Kea DHCP4<br/>3 subnets, 53 reservations]
|
||||
pf_ddns[Kea DHCP-DDNS<br/>RFC 2136]
|
||||
end
|
||||
|
||||
subgraph "Kubernetes Cluster"
|
||||
NodeLocalDNS[NodeLocal DNSCache<br/>DaemonSet, 7 nodes<br/>169.254.20.10 + 10.96.0.10]
|
||||
CoreDNS[CoreDNS<br/>kube-system<br/>.:53 + viktorbarzin.lan:53]
|
||||
KubeDNSUpstream[kube-dns-upstream<br/>ClusterIP, selects CoreDNS pods]
|
||||
|
||||
subgraph "Technitium HA (namespace: technitium)"
|
||||
Primary[Primary<br/>technitium]
|
||||
Secondary[Secondary<br/>technitium-secondary]
|
||||
Tertiary[Tertiary<br/>technitium-tertiary]
|
||||
end
|
||||
|
||||
LB_DNS[LoadBalancer<br/>10.0.20.201<br/>ETP=Local]
|
||||
ClusterIP[ClusterIP<br/>10.96.0.53<br/>pinned]
|
||||
|
||||
subgraph "Automation CronJobs"
|
||||
ZoneSync[zone-sync<br/>every 30min]
|
||||
SplitHorizon[split-horizon-sync<br/>every 6h]
|
||||
DNSOpt[dns-optimization<br/>every 6h]
|
||||
PassSync[password-sync<br/>every 6h]
|
||||
DNSSync[phpipam-dns-sync<br/>every 15min]
|
||||
end
|
||||
end
|
||||
|
||||
Internet -->|DNS query| CF
|
||||
CF -->|CNAME to tunnel| CFTunnel
|
||||
LAN -->|DNS query UDP 53| pf_unbound
|
||||
pf_kea -->|lease event| pf_ddns
|
||||
pf_ddns -->|A + PTR| LB_DNS
|
||||
|
||||
pf_unbound -->|AXFR viktorbarzin.lan| LB_DNS
|
||||
pf_unbound -->|public queries DoT :853| CF
|
||||
|
||||
NodeLocalDNS -->|cache miss| KubeDNSUpstream
|
||||
KubeDNSUpstream --> CoreDNS
|
||||
CoreDNS -->|.viktorbarzin.lan| ClusterIP
|
||||
CoreDNS -->|public queries| pf_unbound
|
||||
|
||||
LB_DNS --> Primary
|
||||
LB_DNS --> Secondary
|
||||
LB_DNS --> Tertiary
|
||||
ClusterIP --> Primary
|
||||
ClusterIP --> Secondary
|
||||
ClusterIP --> Tertiary
|
||||
|
||||
ZoneSync -->|AXFR| Primary
|
||||
ZoneSync -->|replicate| Secondary
|
||||
ZoneSync -->|replicate| Tertiary
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Location | Version | Purpose |
|
||||
|-----------|----------|---------|---------|
|
||||
| Technitium DNS | K8s namespace `technitium` | 14.3.0 | Primary internal DNS + recursive resolver |
|
||||
| CoreDNS | K8s `kube-system` | Cluster default | K8s service discovery + forwarding to Technitium |
|
||||
| NodeLocal DNSCache | K8s `kube-system` (DaemonSet) | `k8s-dns-node-cache:1.23.1` | Per-node DNS cache, transparent interception on 10.96.0.10 + 169.254.20.10. Insulates pods from CoreDNS/Technitium/pfSense disruption. |
|
||||
| Cloudflare DNS | SaaS | N/A | Public domain management (~50 domains) |
|
||||
| pfSense Unbound | 10.0.20.1 | pfSense 2.7.2 (Unbound 1.19) | DNS resolver on LAN/OPT1/WAN; AXFR-slaves `viktorbarzin.lan` from Technitium; DoT upstream to Cloudflare |
|
||||
| Kea DHCP-DDNS | 10.0.20.1 | pfSense 2.7.x | Automatic DNS registration on DHCP lease |
|
||||
| phpIPAM | K8s namespace `phpipam` | v1.7.0 | IPAM ↔ DNS bidirectional sync |
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
| Stack | Path | DNS Resources |
|
||||
|-------|------|---------------|
|
||||
| Technitium | `stacks/technitium/` | 3 deployments, services, PVCs, 4 CronJobs, CoreDNS ConfigMap |
|
||||
| NodeLocal DNSCache | `stacks/nodelocal-dns/` | DaemonSet (5 pods), ConfigMap, kube-dns-upstream Service, headless metrics Service |
|
||||
| Cloudflared | `stacks/cloudflared/` | Cloudflare DNS records (A, AAAA, CNAME, MX, TXT), tunnel config |
|
||||
| phpIPAM | `stacks/phpipam/` | dns-sync CronJob, pfsense-import CronJob |
|
||||
| pfSense | `stacks/pfsense/` | VM config only (Unbound config is managed out-of-band via pfSense web UI / direct config.xml edits; see `docs/runbooks/pfsense-unbound.md`) |
|
||||
|
||||
## DNS Resolution Paths
|
||||
|
||||
### K8s Pod → Internal Domain (.viktorbarzin.lan)
|
||||
|
||||
```
|
||||
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
|
||||
→ cache hit: serve locally (TTL 30s / stale up to 86400s via CoreDNS upstream)
|
||||
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
|
||||
→ CoreDNS: template matches 2+ labels before .viktorbarzin.lan → NXDOMAIN
|
||||
→ CoreDNS: forward to Technitium ClusterIP (10.96.0.53)
|
||||
→ Technitium resolves from viktorbarzin.lan zone
|
||||
```
|
||||
|
||||
The ndots:5 template in CoreDNS short-circuits queries like `www.cloudflare.com.viktorbarzin.lan` (caused by K8s search domain expansion) by returning NXDOMAIN for any query with 2+ labels before `.viktorbarzin.lan`. Only single-label queries (e.g., `idrac.viktorbarzin.lan`) reach Technitium.
|
||||
|
||||
### K8s Pod → Public Domain
|
||||
|
||||
```
|
||||
Pod → NodeLocal DNSCache (intercepts on kube-dns:10.96.0.10)
|
||||
→ cache hit: serve locally
|
||||
→ cache miss: forward to kube-dns-upstream (selects CoreDNS pods directly)
|
||||
→ CoreDNS: forward to pfSense (10.0.20.1), fallback 8.8.8.8, 1.1.1.1
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → local auth-zone (AXFR-cached from Technitium)
|
||||
- public → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
```
|
||||
|
||||
### LAN Client (192.168.1.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS=192.168.1.2 (pfSense WAN) from DHCP
|
||||
→ pfSense Unbound listens on 192.168.1.2:53 directly (no NAT rdr)
|
||||
- .viktorbarzin.lan → auth-zone (AXFR-cached from Technitium 10.0.20.201)
|
||||
Survives full Technitium/K8s outage — auth-zone keeps serving from
|
||||
/var/unbound/viktorbarzin.lan.zone with `fallback-enabled: yes`.
|
||||
- .viktorbarzin.me (non-proxied) and other public → DoT to Cloudflare
|
||||
(1.1.1.1 / 1.0.0.1 on port 853, SNI cloudflare-dns.com)
|
||||
```
|
||||
|
||||
**Trade-off vs. prior NAT rdr**: Split Horizon hairpin translation
|
||||
(`176.12.22.76 → 10.0.20.200` for 192.168.1.x clients) was only applied
|
||||
when queries reached Technitium via the NAT rdr. With Unbound answering
|
||||
on 192.168.1.2:53 directly, non-proxied `*.viktorbarzin.me` queries on the
|
||||
192.168.1.x LAN return the public IP, which the TP-Link AP can't hairpin.
|
||||
If hairpin is broken on LAN for a given non-proxied service, the fix is
|
||||
either (a) switch the service to proxied (via `dns_type = "proxied"`)
|
||||
or (b) add a local-data override on pfSense Unbound. The pre-Unbound
|
||||
state is documented in the `docs/runbooks/pfsense-unbound.md` rollback
|
||||
section.
|
||||
|
||||
### Management VLAN (10.0.10.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS from Kea DHCP → pfSense (10.0.10.1)
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → auth-zone (local)
|
||||
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
```
|
||||
|
||||
### K8s VLAN (10.0.20.x) → Any Domain
|
||||
|
||||
```
|
||||
Client gets DNS from Kea DHCP → pfSense (10.0.20.1)
|
||||
→ pfSense Unbound:
|
||||
- .viktorbarzin.lan → auth-zone (local)
|
||||
- other → DoT to Cloudflare (1.1.1.1 / 1.0.0.1 port 853)
|
||||
```
|
||||
|
||||
## Technitium DNS — Internal DNS Server
|
||||
|
||||
### Deployment Topology
|
||||
|
||||
Three independent Technitium instances, each with its own encrypted block storage PVC (`proxmox-lvm-encrypted`, 2Gi each):
|
||||
|
||||
| Instance | Deployment | PVC | Web Service | Role |
|
||||
|----------|-----------|-----|-------------|------|
|
||||
| Primary | `technitium` | `technitium-primary-config-encrypted` | `technitium-web:5380` | Authoritative primary, zone edits happen here |
|
||||
| Secondary | `technitium-secondary` | `technitium-secondary-config-encrypted` | `technitium-secondary-web:5380` | AXFR replica |
|
||||
| Tertiary | `technitium-tertiary` | `technitium-tertiary-config-encrypted` | `technitium-tertiary-web:5380` | AXFR replica |
|
||||
|
||||
All three pods share the `dns-server=true` label, so the DNS LoadBalancer (10.0.20.201) and ClusterIP (10.96.0.53) route queries to any healthy instance.
|
||||
|
||||
### High Availability
|
||||
|
||||
- **Pod anti-affinity**: `required` on `kubernetes.io/hostname` — all 3 pods run on different nodes
|
||||
- **PodDisruptionBudget**: `minAvailable=2` — at least 2 DNS pods survive voluntary disruptions
|
||||
- **Recreate strategy**: Each deployment uses `Recreate` (RWO block storage)
|
||||
- **Zone sync CronJob** (`technitium-zone-sync`, every 30min): Replicates all primary zones to secondary/tertiary via AXFR. Idempotent — skips existing zones, creates missing ones as Secondary type.
|
||||
|
||||
### Services
|
||||
|
||||
| Service | Type | IP | Selector | Purpose |
|
||||
|---------|------|-----|----------|---------|
|
||||
| `technitium-dns` | LoadBalancer | 10.0.20.201 | `dns-server=true` | External LAN access, `externalTrafficPolicy: Local` |
|
||||
| `technitium-dns-internal` | ClusterIP | 10.96.0.53 (pinned) | `dns-server=true` | CoreDNS forwarding, survives Service recreation |
|
||||
| `technitium-primary` | ClusterIP | auto | `app=technitium` | Zone transfers (AXFR) + API access to primary only |
|
||||
| `technitium-web` | ClusterIP | auto | `app=technitium` | Web UI (port 5380) + DoH (port 80) |
|
||||
| `technitium-secondary-web` | ClusterIP | auto | `app=technitium-secondary` | Secondary API access |
|
||||
| `technitium-tertiary-web` | ClusterIP | auto | `app=technitium-tertiary` | Tertiary API access |
|
||||
|
||||
### Zones
|
||||
|
||||
**Primary zones** (managed on primary, replicated to secondary/tertiary):
|
||||
|
||||
| Zone | Type | Records | Notes |
|
||||
|------|------|---------|-------|
|
||||
| `viktorbarzin.lan` | Primary | 30+ A/CNAME | Internal hosts (idrac, grafana, proxmox, vaultwarden, etc.) |
|
||||
| `10.0.10.in-addr.arpa` | Primary | PTR | Reverse DNS for management VLAN |
|
||||
| `20.0.10.in-addr.arpa` | Primary | PTR | Reverse DNS for K8s VLAN |
|
||||
| `1.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for LAN |
|
||||
| `2.3.10.in-addr.arpa` | Primary | PTR | Reverse DNS for VPN |
|
||||
| `0.168.192.in-addr.arpa` | Primary | PTR | Reverse DNS for Valchedrym site |
|
||||
| `emrsn.org` | Primary (stub) | — | Returns NXDOMAIN locally (avoids 27K+ daily corporate query floods) |
|
||||
|
||||
**Dynamic updates**: Enabled via `UseSpecifiedNetworkACL` from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2) **AND require a valid TSIG signature** on `viktorbarzin.lan`, `10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`. Policy: `updateSecurityPolicies = [{tsigKeyName: "kea-ddns", domain: "*.<zone>", allowedTypes: ["ANY"]}]`. Unsigned updates from the allowlisted pfSense source IPs are refused ("Dynamic Updates Security Policy"). TSIG key `kea-ddns` (HMAC-SHA256) present on primary/secondary/tertiary; secret in Vault `secret/viktor/kea_ddns_tsig_secret`. Applied 2026-04-19 (WS E, bd `code-o6j`).
|
||||
|
||||
### Resolver Settings
|
||||
|
||||
| Setting | Value | Rationale |
|
||||
|---------|-------|-----------|
|
||||
| Forwarders | Cloudflare DoH (1.1.1.1, 1.0.0.1) | Encrypted upstream DNS |
|
||||
| Cache max entries | 100K | Ample for homelab |
|
||||
| Cache min TTL | 60s | Reduces re-queries for short-TTL domains (e.g., headscale: 18s) |
|
||||
| Cache max TTL | 7 days | Long cache for stable records |
|
||||
| Serve stale | Enabled (3 days) | Resilience during upstream failures |
|
||||
|
||||
### Ad Blocking
|
||||
|
||||
Technitium runs built-in DNS blocking with:
|
||||
- **OISD Big List** (~486K domains)
|
||||
- **StevenBlack hosts list**
|
||||
|
||||
Blocking is enabled on all three instances (`DNS_SERVER_ENABLE_BLOCKING=true` on secondary/tertiary).
|
||||
|
||||
### Query Logging
|
||||
|
||||
| Backend | Status | Retention | Purpose |
|
||||
|---------|--------|-----------|---------|
|
||||
| MySQL (`technitium` DB) | Disabled | — | Legacy, disabled by password-sync CronJob |
|
||||
| PostgreSQL (`technitium` DB on CNPG) | Enabled | 90 days | Primary query log store |
|
||||
|
||||
Grafana dashboard (`grafana-technitium-dashboard` ConfigMap) visualizes query logs from the MySQL datasource. A Grafana datasource is auto-provisioned via sidecar.
|
||||
|
||||
### Web UI & Ingress
|
||||
|
||||
- **Web UI**: `technitium.viktorbarzin.me` (Authentik-protected via `ingress_factory`)
|
||||
- **DNS-over-HTTPS**: `dns.viktorbarzin.me` (separate ingress, port 80)
|
||||
- **Homepage widget**: Technitium widget showing totalQueries, totalCached, totalBlocked, totalRecursive
|
||||
|
||||
## Split Horizon (Hairpin NAT Fix)
|
||||
|
||||
### Problem
|
||||
|
||||
The TP-Link AP (dumb AP on 192.168.1.x) does not support hairpin NAT. LAN clients resolving non-proxied `*.viktorbarzin.me` domains get the public IP `176.12.22.76`, but can't reach it because the TP-Link won't route back to the local network.
|
||||
|
||||
### Solution
|
||||
|
||||
Technitium's **Split Horizon AddressTranslation** app post-processes DNS responses for 192.168.1.0/24 clients, translating the public IP to the internal Traefik LB IP:
|
||||
|
||||
```
|
||||
176.12.22.76 → 10.0.20.200
|
||||
```
|
||||
|
||||
**DNS Rebinding Protection** has `viktorbarzin.me` in `privateDomains` to allow the translated private IP without being stripped as a rebinding attack.
|
||||
|
||||
### Scope
|
||||
|
||||
- **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients
|
||||
- **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed)
|
||||
- **Not affected**: 10.0.x.x and K8s clients (reach public IP via pfSense outbound NAT normally)
|
||||
|
||||
Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h).
|
||||
|
||||
## NodeLocal DNSCache
|
||||
|
||||
A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both:
|
||||
|
||||
- **169.254.20.10** — the canonical link-local IP from the upstream docs
|
||||
- **10.96.0.10** — the `kube-dns` ClusterIP, so existing pods (which already use this as their nameserver) hit the on-node cache with no kubelet change
|
||||
|
||||
Cache misses go to a separate `kube-dns-upstream` ClusterIP service (not `kube-dns`, to avoid looping back to ourselves) that selects the CoreDNS pods directly via `k8s-app=kube-dns`.
|
||||
|
||||
Priority class is `system-node-critical`; tolerations are permissive (`operator: Exists`) so the DaemonSet runs on tainted master and other reserved nodes. Kyverno `dns_config` drift is suppressed via `ignore_changes` on the DaemonSet.
|
||||
|
||||
**Caching**: `cluster.local:53` caches 9984 success / 9984 denial entries with 30s/5s TTLs. Other zones cache 30s. If CoreDNS is killed, nodes keep answering cached names — verified on 2026-04-19 by deleting all three CoreDNS pods and running `dig @169.254.20.10 idrac.viktorbarzin.lan` + `dig @169.254.20.10 github.com` from a pod (both returned answers).
|
||||
|
||||
**Kubelet clusterDNS**: **Unchanged** — still `10.96.0.10`. NodeLocal DNSCache co-listens on that IP so traffic interception is transparent; switching kubelet to `169.254.20.10` would require a rolling reconfigure of every node and provides no additional cache benefit over transparent mode.
|
||||
|
||||
**Metrics**: A headless Service `node-local-dns` (ClusterIP `None`) exposes each pod on port `9253` for Prometheus scraping (annotated `prometheus.io/scrape=true`).
|
||||
|
||||
## CoreDNS Configuration
|
||||
|
||||
CoreDNS is managed via Terraform in `stacks/technitium/modules/technitium/` — the Corefile ConfigMap lives in `main.tf`, and scaling/PDB are in `coredns.tf` (a `kubernetes_deployment_v1_patch` against the kubeadm-managed Deployment).
|
||||
|
||||
```
|
||||
.:53 {
|
||||
errors / health / ready
|
||||
kubernetes cluster.local in-addr.arpa ip6.arpa # K8s service discovery
|
||||
prometheus :9153 # Metrics
|
||||
forward . 10.0.20.1 8.8.8.8 1.1.1.1 {
|
||||
policy sequential # try upstreams in order
|
||||
health_check 5s # mark unhealthy in 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache {
|
||||
success 10000 300 6
|
||||
denial 10000 300 60
|
||||
serve_stale 86400s # resilience during upstream outage
|
||||
}
|
||||
loop / reload / loadbalance
|
||||
}
|
||||
|
||||
viktorbarzin.lan:53 {
|
||||
template: .*\..*\.viktorbarzin\.lan\.$ → NXDOMAIN # ndots:5 junk filter
|
||||
forward . 10.96.0.53 { # Technitium ClusterIP
|
||||
health_check 5s
|
||||
max_fails 2
|
||||
}
|
||||
cache (success 10000 300, denial 10000 300, serve_stale 86400s)
|
||||
}
|
||||
```
|
||||
|
||||
**Scaling**: 3 replicas, `required` anti-affinity on `kubernetes.io/hostname` (spread across 3 distinct nodes). PodDisruptionBudget `coredns` with `minAvailable=2`.
|
||||
|
||||
**Kyverno ndots injection**: A Kyverno policy injects `ndots:2` on all pods cluster-wide to reduce search domain expansion noise. The template regex is a second layer of defense for any queries that still get expanded.
|
||||
|
||||
**Failover behaviour**: With `policy sequential` on the root forward block, CoreDNS tries pfSense first; if `health_check 5s` detects pfSense as down, it fails over to 8.8.8.8 then 1.1.1.1 within ~5s rather than timing out per-query. Combined with `serve_stale`, pods keep resolving cached names for up to 24h even with full upstream failure.
|
||||
|
||||
## Cloudflare DNS — External Domains
|
||||
|
||||
All public domains are under the `viktorbarzin.me` zone. DNS records are **auto-created per service** via the `ingress_factory` module's `dns_type` parameter. A small number of records (Helm-managed ingresses, special cases) remain centrally managed in `config.tfvars`.
|
||||
|
||||
### How DNS Records Are Created
|
||||
|
||||
```
|
||||
stacks/<service>/main.tf
|
||||
module "ingress" {
|
||||
source = ingress_factory
|
||||
dns_type = "proxied" # ← auto-creates Cloudflare DNS record
|
||||
}
|
||||
```
|
||||
|
||||
- **`dns_type = "proxied"`**: Creates CNAME → `{tunnel_id}.cfargotunnel.com` (Cloudflare CDN)
|
||||
- **`dns_type = "non-proxied"`**: Creates A → public IP + AAAA → IPv6
|
||||
- **`dns_type = "none"`** (default): No DNS record
|
||||
|
||||
The Cloudflare tunnel uses a **wildcard rule** (`*.viktorbarzin.me → Traefik`) — no per-hostname tunnel config needed. Traefik handles host-based routing via K8s Ingress resources.
|
||||
|
||||
### Record Types
|
||||
|
||||
| Type | Records | Target | Example |
|
||||
|------|---------|--------|---------|
|
||||
| Proxied CNAME | ~100 domains | `{tunnel_id}.cfargotunnel.com` | blog, hackmd, homepage, ntfy |
|
||||
| Non-proxied A | ~35 domains | `176.12.22.76` (public IP) | mail, headscale, immich |
|
||||
| Non-proxied AAAA | ~35 domains | IPv6 (HE tunnel) | Same as non-proxied A |
|
||||
| MX | 1 | `mail.viktorbarzin.me` | Inbound email |
|
||||
| TXT (SPF) | 1 | `v=spf1 include:mailgun.org -all` | Email authentication |
|
||||
| TXT (DKIM) | 4 | RSA keys (s1, mail, brevo1, brevo2) | Email signing |
|
||||
| TXT (DMARC) | 1 | `v=DMARC1; p=quarantine; pct=100` | Email policy |
|
||||
| TXT (MTA-STS) | 1 | `v=STSv1; id=20260412` | TLS enforcement |
|
||||
| TXT (TLSRPT) | 1 | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS reporting |
|
||||
| A (keyserver) | 1 | `130.162.165.220` (Oracle VPS) | PGP keyserver |
|
||||
|
||||
### Proxied vs Non-Proxied
|
||||
|
||||
- **Proxied (orange cloud)**: Traffic routes through Cloudflare CDN → Cloudflared tunnel → Traefik. Benefits: DDoS protection, caching, no public IP exposure.
|
||||
- **Non-proxied (grey cloud)**: DNS resolves directly to public IP. Required for services needing direct connections (mail, VPN, WebSocket-heavy apps).
|
||||
|
||||
### Zone Settings
|
||||
|
||||
- **HTTP/3 (QUIC)**: Enabled globally via `cloudflare_zone_settings_override`
|
||||
|
||||
## DHCP → DNS Auto-Registration
|
||||
|
||||
Devices get automatic DNS registration without manual intervention. See [networking.md § IPAM & DNS Auto-Registration](networking.md#ipam--dns-auto-registration) for the full data flow diagram.
|
||||
|
||||
Summary:
|
||||
1. **Kea DHCP** on pfSense assigns IP (53 reservations across 3 subnets). DHCP option 6 (DNS servers) is pushed with two IPs per internal subnet: internal resolver + AdGuard public fallback (`94.140.14.14`) — clients survive an internal DNS outage.
|
||||
2. **Kea DDNS** sends **TSIG-signed** RFC 2136 dynamic update to Technitium (A + PTR records) — immediate. Key `kea-ddns` (HMAC-SHA256); Technitium enforces both source-IP ACL and TSIG signature on `viktorbarzin.lan` + reverse zones.
|
||||
3. **phpipam-pfsense-import** CronJob (hourly) pulls Kea leases + ARP table into phpIPAM
|
||||
4. **phpipam-dns-sync** CronJob (15min) pushes named phpIPAM hosts → Technitium A + PTR, pulls Technitium PTR → phpIPAM hostnames
|
||||
|
||||
## Automation CronJobs
|
||||
|
||||
| CronJob | Schedule | Namespace | Purpose |
|
||||
|---------|----------|-----------|---------|
|
||||
| `technitium-zone-sync` | `*/30 * * * *` | technitium | AXFR replication to secondary/tertiary |
|
||||
| `technitium-password-sync` | `0 */6 * * *` | technitium | Vault-rotated MySQL password → Technitium config, configure PG logging |
|
||||
| `technitium-split-horizon-sync` | `15 */6 * * *` | technitium | Split Horizon + DNS Rebinding Protection on all 3 instances |
|
||||
| `technitium-dns-optimization` | `30 */6 * * *` | technitium | Min cache TTL 60s, emrsn.org stub zone |
|
||||
| `phpipam-dns-sync` | `*/15 * * * *` | phpipam | Bidirectional phpIPAM ↔ Technitium DNS sync |
|
||||
| `phpipam-pfsense-import` | `0 * * * *` | phpipam | Import Kea DHCP leases + ARP from pfSense |
|
||||
|
||||
### Password Rotation Flow
|
||||
|
||||
Vault's database engine rotates the Technitium MySQL password every 7 days. The flow:
|
||||
|
||||
```
|
||||
Vault DB engine rotates password
|
||||
→ ExternalSecret (refreshInterval=15m) pulls from static-creds/mysql-technitium
|
||||
→ K8s Secret technitium-db-creds updated
|
||||
→ CronJob technitium-password-sync (every 6h):
|
||||
1. Logs into Technitium API
|
||||
2. Disables MySQL query logging (migrated to PG)
|
||||
3. Checks PG plugin is loaded (warns if missing)
|
||||
4. Configures PG query logging (90-day retention)
|
||||
```
|
||||
|
||||
## Monitoring
|
||||
|
||||
| Metric Source | Dashboard | Alerts |
|
||||
|---------------|-----------|--------|
|
||||
| Technitium query logs (PostgreSQL) | Grafana `technitium-dns.json` | — |
|
||||
| CoreDNS Prometheus metrics (:9153) | Grafana CoreDNS dashboard | `CoreDNSErrors`, `CoreDNSForwardFailureRate` |
|
||||
| Technitium zone-sync CronJob (Pushgateway) | — | `TechnitiumZoneSyncFailed`, `TechnitiumZoneSyncStale`, `TechnitiumZoneCountMismatch` |
|
||||
| Technitium DNS pod availability | — | `TechnitiumDNSDown` |
|
||||
| `dns-anomaly-monitor` CronJob (Pushgateway) | — | `DNSQuerySpike`, `DNSQueryRateDropped`, `DNSHighErrorRate` |
|
||||
| Uptime Kuma | External monitors for all proxied domains | ExternalAccessDivergence (15min) |
|
||||
|
||||
### Metrics pushed by `technitium-zone-sync`
|
||||
|
||||
The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus Pushgateway under `job=technitium-zone-sync`:
|
||||
|
||||
| Metric | Labels | Meaning |
|
||||
|--------|--------|---------|
|
||||
| `technitium_zone_sync_status` | — | 0 = last run succeeded, 1 = at least one zone failed to create |
|
||||
| `technitium_zone_sync_failures` | — | Number of zones that failed to create this run |
|
||||
| `technitium_zone_sync_last_run` | — | Unix timestamp of last run (used by `TechnitiumZoneSyncStale`) |
|
||||
| `technitium_zone_count` | `instance=primary\|<replica-host>` | Zone count on each Technitium instance (drives `TechnitiumZoneCountMismatch`) |
|
||||
|
||||
### DNS alert rewrites
|
||||
|
||||
- `DNSQuerySpike` was previously broken: it compared current queries against `dns_anomaly_avg_queries`, which was computed from a per-pod `/tmp/dns_avg` file. Each CronJob run started with a fresh `/tmp`, so `NEW_AVG == TOTAL_QUERIES` every time and the spike condition could never fire. Rewritten to use `avg_over_time(dns_anomaly_total_queries[1h] offset 15m)` which compares against the actual 1h Prometheus history.
|
||||
- `DNSQueryRateDropped` (new): fires when query rate drops below 50% of 1h average — upstream clients may be failing to reach Technitium.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### DNS Not Resolving Internal Domains
|
||||
|
||||
1. Check NodeLocal DNSCache pods first — pod queries go through these: `kubectl -n kube-system get pod -l k8s-app=node-local-dns -o wide`
|
||||
2. Check Technitium pods: `kubectl get pod -n technitium`
|
||||
3. Check all 3 are healthy: `kubectl get pod -n technitium -l dns-server=true`
|
||||
4. Test via NodeLocal DNSCache from a pod: `kubectl exec -it <pod> -- dig @169.254.20.10 idrac.viktorbarzin.lan`
|
||||
5. Bypass NodeLocal DNSCache (test CoreDNS directly): `kubectl exec -it <pod> -- dig @<kube-dns-upstream-ClusterIP> idrac.viktorbarzin.lan` (`kubectl get svc -n kube-system kube-dns-upstream`)
|
||||
6. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns`
|
||||
7. Verify ClusterIP service: `kubectl get svc -n technitium technitium-dns-internal`
|
||||
|
||||
### LAN Clients Can't Resolve
|
||||
|
||||
1. Verify pfSense Unbound is running: `ssh admin@10.0.20.1 "sockstat -l -4 -p 53 | grep unbound"` — expect listeners on `192.168.1.2:53`, `10.0.10.1:53`, `10.0.20.1:53`, `127.0.0.1:53`
|
||||
2. Verify the auth-zone is loaded: `ssh admin@10.0.20.1 "unbound-control -c /var/unbound/unbound.conf list_auth_zones"` — expect `viktorbarzin.lan. serial N`
|
||||
3. Test from LAN: `dig @192.168.1.2 idrac.viktorbarzin.lan` (should return with `aa` flag)
|
||||
4. Test public upstream: `dig @192.168.1.2 example.com +dnssec` (should have `ad` flag — DoT via Cloudflare working)
|
||||
5. If auth-zone can't AXFR: check Technitium `viktorbarzin.lan` zone options → `zoneTransferNetworkACL` contains `10.0.20.1, 10.0.10.1, 192.168.1.2`
|
||||
6. See `docs/runbooks/pfsense-unbound.md` for full Unbound runbook and rollback instructions
|
||||
|
||||
### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails)
|
||||
|
||||
Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries
|
||||
directly instead of forwarding to Technitium, so the Technitium Split Horizon
|
||||
post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied
|
||||
services break hairpin on LAN clients again. Options:
|
||||
|
||||
1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent.
|
||||
2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver.
|
||||
3. **Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section.
|
||||
|
||||
K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` queries DO reach Technitium (e.g., from pods that query via CoreDNS → Technitium forwarding for `.viktorbarzin.me` via pfSense). Verify Technitium split-horizon app:
|
||||
|
||||
1. Verify Split Horizon app is installed on all instances
|
||||
2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync`
|
||||
3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium`
|
||||
4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.200 for 192.168.1.x source
|
||||
|
||||
### Zone Not Replicating to Secondary/Tertiary
|
||||
|
||||
1. Check zone-sync CronJob: `kubectl get cronjob -n technitium technitium-zone-sync`
|
||||
2. Check recent jobs: `kubectl get jobs -n technitium | grep zone-sync`
|
||||
3. Verify AXFR is enabled on primary: Check zone options → Zone Transfer = Allow
|
||||
4. Run sync manually: `kubectl create job --from=cronjob/technitium-zone-sync test-sync -n technitium`
|
||||
|
||||
### High NXDOMAIN Rate in Logs
|
||||
|
||||
Common causes:
|
||||
- **ndots:5 expansion**: Pods query `host.search.domain.viktorbarzin.lan` — mitigated by CoreDNS template + Kyverno ndots:2
|
||||
- **Corporate domains (emrsn.org)**: 27K+ daily queries — mitigated by stub zone returning NXDOMAIN locally
|
||||
- **Ad blocking**: Expected for blocked domains
|
||||
|
||||
### Adding a New DNS Record
|
||||
|
||||
For internal `.viktorbarzin.lan` records:
|
||||
1. Add host in phpIPAM web UI (`phpipam.viktorbarzin.me`) with hostname
|
||||
2. Wait 15 minutes for `phpipam-dns-sync` to push to Technitium
|
||||
3. Or add directly in Technitium web UI (`technitium.viktorbarzin.me`)
|
||||
|
||||
For external `.viktorbarzin.me` records:
|
||||
1. Add `dns_type = "proxied"` (or `"non-proxied"`) to the `ingress_factory` module call in the service stack
|
||||
2. Run `scripts/tg apply` on the service stack — DNS record is auto-created
|
||||
3. For non-standard records (MX, TXT), add a `cloudflare_record` resource in `stacks/cloudflared/modules/cloudflared/cloudflare.tf`
|
||||
|
||||
## Incident History
|
||||
|
||||
- **2026-04-14 (SEV1)**: NFS `fsid=0` caused Technitium primary data loss on restart. Fixed by migrating all 3 instances to `proxmox-lvm-encrypted`, adding zone-sync CronJob (30min AXFR). See [post-mortem](../post-mortems/2026-04-14-nfs-fsid0-dns-vault-outage.md).
|
||||
- **2026-04-19 (hardening, not outage)**: Workstream D — pfSense Unbound replaces dnsmasq as the pfSense DNS service. Unbound AXFR-slaves `viktorbarzin.lan` from Technitium so LAN-side resolution survives a full K8s outage. WAN NAT rdr `192.168.1.2:53 → 10.0.20.201` removed (Unbound listens on WAN directly). DoT upstream via Cloudflare. See `docs/runbooks/pfsense-unbound.md` and bd `code-k0d`.
|
||||
- **2026-04-19 (hardening, not outage)**: Workstream E — Kea DHCP now pushes TWO DNS IPs (internal + AdGuard public fallback `94.140.14.14`) via option 6 to the internal subnets (10.0.10/24, 10.0.20/24); 192.168.1/24 was already dual-IP (served by TP-Link). Kea DHCP-DDNS now TSIG-signs its RFC 2136 updates (key `kea-ddns`, HMAC-SHA256) and the Technitium zones require both source-IP ACL AND TSIG signature. See `docs/runbooks/pfsense-unbound.md` § "Kea DHCP-DDNS TSIG" and bd `code-o6j`.
|
||||
|
||||
## Related
|
||||
|
||||
- [Networking Architecture](networking.md) — VLAN topology, IPAM auto-registration, ingress flow, MetalLB
|
||||
- [Mailserver Architecture](mailserver.md) — DNS records for email (MX, SPF, DKIM, DMARC)
|
||||
- [Security Architecture](security.md) — Kyverno ndots policy
|
||||
- [Monitoring Architecture](monitoring.md) — CoreDNS metrics, Uptime Kuma external monitors
|
||||
- Runbook: `docs/runbooks/add-dns-record.md` (referenced but not yet created)
|
||||
116
docs/architecture/homepage.md
Normal file
116
docs/architecture/homepage.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
# Homepage Dashboard (home.viktorbarzin.me)
|
||||
|
||||
## Overview
|
||||
|
||||
The cluster uses [Homepage](https://gethomepage.dev/) as a service dashboard at `home.viktorbarzin.me`. It auto-discovers services via Kubernetes ingress annotations — no manual service list to maintain.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
Browser → Cloudflare → Traefik → nginx cache proxy → Homepage (port 3000)
|
||||
```
|
||||
|
||||
- **Homepage** (ghcr.io/gethomepage/homepage:v1.10.1) runs in namespace `homepage` with RBAC enabled for K8s API access
|
||||
- **nginx cache proxy** sits in front, caching `/api/` responses for 24h with stale-while-revalidate (prevents Homepage from hitting K8s API on every page load)
|
||||
- **Ingress** at `home.viktorbarzin.me` routes through the cache proxy
|
||||
|
||||
Stack: `stacks/homepage/main.tf`
|
||||
|
||||
## Service Auto-Discovery
|
||||
|
||||
Homepage discovers services from **ingress annotations** across all namespaces. The `ingress_factory` module automatically adds these annotations to every ingress it creates.
|
||||
|
||||
### How It Works
|
||||
|
||||
1. Homepage's ServiceAccount has cluster-wide RBAC to read ingresses
|
||||
2. On startup (and periodically), it scans all ingresses for `gethomepage.dev/*` annotations
|
||||
3. Services appear grouped and ordered by their annotation values
|
||||
|
||||
### Annotations
|
||||
|
||||
The `ingress_factory` module (`modules/kubernetes/ingress_factory/main.tf`) sets these defaults on every ingress:
|
||||
|
||||
| Annotation | Default Value | Purpose |
|
||||
|------------|---------------|---------|
|
||||
| `gethomepage.dev/enabled` | `"true"` | Show on dashboard (set `homepage_enabled = false` to hide) |
|
||||
| `gethomepage.dev/name` | Derived from ingress `name` (hyphens → spaces) | Display name |
|
||||
| `gethomepage.dev/group` | Auto-detected from namespace (see mapping below) | Dashboard section |
|
||||
| `gethomepage.dev/href` | `https://<host>.viktorbarzin.me` | Click-through URL |
|
||||
| `gethomepage.dev/icon` | `<name>.png` | Icon (from [Dashboard Icons](https://github.com/walkxcode/dashboard-icons)) |
|
||||
|
||||
### Overriding Defaults
|
||||
|
||||
Pass `extra_annotations` in the `ingress_factory` module call to override any default:
|
||||
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
namespace = "my-app"
|
||||
name = "my-app"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/name" = "My Custom Name"
|
||||
"gethomepage.dev/description" = "What this service does"
|
||||
"gethomepage.dev/icon" = "si-spotify" # Simple Icons prefix
|
||||
"gethomepage.dev/group" = "Media & Entertainment"
|
||||
"gethomepage.dev/pod-selector" = "" # Show pod status widget
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
To hide a service from the dashboard:
|
||||
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
# ...
|
||||
homepage_enabled = false
|
||||
}
|
||||
```
|
||||
|
||||
### Namespace → Group Mapping
|
||||
|
||||
The `ingress_factory` module auto-maps namespaces to dashboard groups:
|
||||
|
||||
| Namespace | Group |
|
||||
|-----------|-------|
|
||||
| monitoring, prometheus, technitium, traefik, metallb-system, dbaas, mailserver | Infrastructure |
|
||||
| authentik, crowdsec | Identity & Security |
|
||||
| woodpecker, forgejo | Development & CI |
|
||||
| immich, servarr, navidrome | Media & Entertainment |
|
||||
| frigate, home-assistant, reverse-proxy | Smart Home |
|
||||
| ollama | AI & Data |
|
||||
| nextcloud | Productivity |
|
||||
| n8n, changedetection | Automation |
|
||||
| finance | Finance & Personal |
|
||||
| homepage | Core Platform |
|
||||
| *(everything else)* | Other |
|
||||
|
||||
Override with `homepage_group` variable or `gethomepage.dev/group` annotation.
|
||||
|
||||
### Dashboard Layout
|
||||
|
||||
Groups are configured in `stacks/homepage/values.yaml` under `config.settings.layout`. Each group has a `style` (row) and `columns` count. To add a new group, add it to the layout config and apply.
|
||||
|
||||
### Adding a New Service
|
||||
|
||||
No action needed — just use the `ingress_factory` module. The service will appear automatically on the next Homepage refresh cycle. To customize:
|
||||
|
||||
1. Set `extra_annotations` with `gethomepage.dev/*` keys for custom name, description, icon
|
||||
2. Set `homepage_group` variable if the namespace auto-mapping doesn't fit
|
||||
3. Use `"gethomepage.dev/pod-selector" = ""` to show pod health status
|
||||
|
||||
### Icon Sources
|
||||
|
||||
Homepage supports multiple icon formats:
|
||||
- **Dashboard Icons**: `<name>.png` (e.g., `grafana.png`) — [browse available icons](https://github.com/walkxcode/dashboard-icons)
|
||||
- **Simple Icons**: `si-<name>` (e.g., `si-spotify`) — [browse at simpleicons.org](https://simpleicons.org)
|
||||
- **Material Design**: `mdi-<name>` (e.g., `mdi-home`)
|
||||
- **URL**: Full URL to any image
|
||||
|
||||
### Caching
|
||||
|
||||
The nginx cache proxy caches Homepage's `/api/` responses for 24h with background refresh. This means:
|
||||
- New services appear within seconds (Homepage refreshes its K8s scan periodically)
|
||||
- Widget data (pod status, resource usage) is cached but refreshes in the background
|
||||
- If Homepage restarts, cached data serves until it's back
|
||||
254
docs/architecture/incident-response.md
Normal file
254
docs/architecture/incident-response.md
Normal file
|
|
@ -0,0 +1,254 @@
|
|||
# Contributing to the Infrastructure
|
||||
|
||||
Welcome! This doc explains how to report issues, request features, and what happens behind the scenes.
|
||||
|
||||
## Quick Links
|
||||
|
||||
| What | Where |
|
||||
|------|-------|
|
||||
| Report an outage | [File an issue](https://github.com/ViktorBarzin/infra/issues/new?template=outage-report.yml) |
|
||||
| Request a feature | [File a request](https://github.com/ViktorBarzin/infra/issues/new?template=feature-request.yml) |
|
||||
| Check service status | [status.viktorbarzin.me](https://status.viktorbarzin.me) |
|
||||
| View past incidents | [Post-mortems](https://viktorbarzin.github.io/infra/post-mortems/) |
|
||||
| Uptime dashboard | [uptime.viktorbarzin.me](https://uptime.viktorbarzin.me) |
|
||||
| Grafana dashboards | [grafana.viktorbarzin.me](https://grafana.viktorbarzin.me) |
|
||||
|
||||
---
|
||||
|
||||
## Reporting an Outage
|
||||
|
||||
If something is broken, [file an outage report](https://github.com/ViktorBarzin/infra/issues/new?template=outage-report.yml). The form asks for:
|
||||
|
||||
- **Which service** is affected (dropdown)
|
||||
- **What you see** (error message, behavior)
|
||||
- **What kind of error** (502, timeout, auth, slow, etc.)
|
||||
- **When it started**
|
||||
- **Is it just you or others too?**
|
||||
|
||||
### What makes a good report
|
||||
|
||||
**Good:**
|
||||
> Nextcloud at nextcloud.viktorbarzin.me returns 502 Bad Gateway since ~14:00 UTC.
|
||||
> Other services seem fine. Tried incognito — same result.
|
||||
|
||||
**Also good (minimal):**
|
||||
> Home Assistant not loading since this morning
|
||||
|
||||
**Not helpful:**
|
||||
> Nothing works
|
||||
|
||||
### What happens after you report
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A["You file a GitHub Issue<br/>(outage-report template)"] --> B["GitHub Actions triggers<br/>(within seconds)"]
|
||||
B --> C{Are you a<br/>collaborator?}
|
||||
C -->|No| D["'Queued for review'<br/>comment added"]
|
||||
D --> E["Viktor reviews manually"]
|
||||
C -->|Yes| F["Automated agent<br/>starts investigating"]
|
||||
F --> G{Is the service<br/>actually down?}
|
||||
G -->|"Healthy"| H["Agent posts findings<br/>+ closes issue"]
|
||||
G -->|"Down"| I["Agent classifies severity<br/>(SEV1 / SEV2 / SEV3)"]
|
||||
I --> J{Can the agent<br/>fix it?}
|
||||
J -->|"Yes (confident)"| K["Agent applies fix<br/>+ posts resolution"]
|
||||
J -->|"No (complex)"| L["Agent escalates<br/>to Viktor"]
|
||||
K --> M["Post-mortem written<br/>+ published"]
|
||||
L --> N["Viktor investigates<br/>+ fixes manually"]
|
||||
N --> M
|
||||
M --> O["Status page updated<br/>at status.viktorbarzin.me"]
|
||||
|
||||
style A fill:#6366f1,color:#fff
|
||||
style F fill:#22c55e,color:#fff
|
||||
style K fill:#22c55e,color:#fff
|
||||
style L fill:#f59e0b,color:#000
|
||||
style M fill:#3b82f6,color:#fff
|
||||
```
|
||||
|
||||
### What to expect
|
||||
|
||||
| Scenario | Response time | Who handles it |
|
||||
|----------|--------------|----------------|
|
||||
| Service is actually healthy | ~5 minutes | Automated agent checks and closes |
|
||||
| Simple fix (pod restart, config) | ~10 minutes | Automated agent fixes and reports |
|
||||
| Complex issue (data, architecture) | ~30 min to acknowledge | Agent investigates, escalates to Viktor |
|
||||
| Non-collaborator report | Hours | Queued for manual review |
|
||||
|
||||
### After resolution
|
||||
|
||||
For SEV1 and SEV2 incidents, a **post-mortem** is automatically written documenting:
|
||||
- What happened and the timeline
|
||||
- Root cause analysis
|
||||
- What was done to prevent recurrence
|
||||
|
||||
Post-mortems are published at [viktorbarzin.github.io/infra/post-mortems](https://viktorbarzin.github.io/infra/post-mortems/).
|
||||
|
||||
---
|
||||
|
||||
## Requesting a Feature
|
||||
|
||||
Want a new service deployed, a config change, or a new monitor? [File a feature request](https://github.com/ViktorBarzin/infra/issues/new?template=feature-request.yml).
|
||||
|
||||
Just describe what you need — be specific.
|
||||
|
||||
### What happens after you request
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A["You file a GitHub Issue<br/>(feature-request template)"] --> B["GitHub Actions triggers"]
|
||||
B --> C{Are you a<br/>collaborator?}
|
||||
C -->|No| D["'Queued for review'<br/>comment added"]
|
||||
C -->|Yes| E["Automated agent<br/>assesses the request"]
|
||||
E --> F{Is it<br/>straightforward?}
|
||||
F -->|"Yes"| G["Agent implements it<br/>(Terraform + apply)"]
|
||||
G --> H["Agent comments<br/>what was done"]
|
||||
H --> I["Issue closed"]
|
||||
F -->|"No (complex)"| J["Agent posts assessment:<br/>what's needed, risks, effort"]
|
||||
J --> K["Escalated to Viktor<br/>for review"]
|
||||
|
||||
style A fill:#6366f1,color:#fff
|
||||
style G fill:#22c55e,color:#fff
|
||||
style K fill:#f59e0b,color:#000
|
||||
```
|
||||
|
||||
### Examples of what the agent can do automatically
|
||||
|
||||
- Add an Uptime Kuma monitor for a service
|
||||
- Deploy a known service (Helm chart or standard Terraform stack)
|
||||
- Change resource limits, replica counts
|
||||
- Add a DNS record
|
||||
- Configure an ingress route
|
||||
|
||||
### Examples of what gets escalated
|
||||
|
||||
- Deploy a completely new/unknown service
|
||||
- Architecture changes (HA, storage migration)
|
||||
- Changes to core platform (auth, DNS, ingress, databases)
|
||||
- Anything involving data migration or secrets
|
||||
|
||||
---
|
||||
|
||||
## Before Reporting — Self-Service Checks
|
||||
|
||||
| Symptom | Quick check |
|
||||
|---------|-------------|
|
||||
| Service returns 502/503 | Check [status page](https://status.viktorbarzin.me) — is the service shown as down? |
|
||||
| Can't login (SSO) | Try incognito window — might be cached auth cookie |
|
||||
| Slow performance | Check [Grafana](https://grafana.viktorbarzin.me) for node memory/CPU pressure |
|
||||
| DNS not resolving | Try `nslookup <domain> 10.0.20.201` — if that works, flush your DNS cache |
|
||||
| VPN not connecting | Check [Headscale admin](https://vpn.viktorbarzin.me) for your device status |
|
||||
|
||||
---
|
||||
|
||||
## Severity Levels
|
||||
|
||||
| Level | Definition | Examples | Response |
|
||||
|-------|-----------|----------|----------|
|
||||
| **SEV1** | Critical — multiple services down, data at risk, core infra outage | DNS down, auth broken, cluster node unreachable | Immediate automated investigation + escalation |
|
||||
| **SEV2** | Major — single important service down or significantly degraded | Nextcloud 502, Immich not loading, mail not sending | Automated investigation, fix if possible |
|
||||
| **SEV3** | Minor — limited impact, workaround available, cosmetic | Slow dashboard, one monitor flapping, non-critical CronJob failed | Noted, fixed when convenient |
|
||||
|
||||
---
|
||||
|
||||
## Status Page
|
||||
|
||||
The status page at [status.viktorbarzin.me](https://status.viktorbarzin.me) shows:
|
||||
|
||||
- **Live service status** — updated every 5 minutes from Uptime Kuma monitors
|
||||
- **Active incidents** — SEV-classified with timelines and affected services
|
||||
- **User reports** — issues filed by users, with error type and scope
|
||||
- **Recently resolved** — incidents closed in the last 7 days with postmortem links
|
||||
|
||||
The status page is hosted on GitHub Pages — it stays up even when the cluster is down.
|
||||
|
||||
---
|
||||
|
||||
## Architecture (Technical Details)
|
||||
|
||||
For contributors who want to understand how the automation works.
|
||||
|
||||
### End-to-End Flow
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph GitHub
|
||||
A[Issue Created] --> B[GHA Workflow]
|
||||
B --> C{Collaborator?}
|
||||
end
|
||||
|
||||
subgraph "Kubernetes Cluster"
|
||||
C -->|Yes| D[Woodpecker Pipeline]
|
||||
D --> E[Vault Auth<br/>K8s SA JWT]
|
||||
E --> F[Fetch API Token]
|
||||
end
|
||||
|
||||
subgraph "claude-agent-service (K8s)"
|
||||
F --> G[HTTP POST /execute]
|
||||
G --> H[issue-responder agent]
|
||||
H --> I[Investigate / Implement]
|
||||
I --> J[Comment on Issue]
|
||||
I --> K[Terraform Apply]
|
||||
I --> L[Post-Mortem Pipeline]
|
||||
end
|
||||
|
||||
subgraph "Post-Mortem Pipeline"
|
||||
L --> M[sev-triage<br/>haiku, ~60s]
|
||||
M --> N[Specialists<br/>3-5 agents parallel]
|
||||
N --> O[sev-historian<br/>cross-ref past incidents]
|
||||
O --> P[sev-report-writer<br/>write report + action items]
|
||||
P --> Q[postmortem-todo-resolver<br/>implement safe fixes]
|
||||
end
|
||||
|
||||
style B fill:#2088ff,color:#fff
|
||||
style D fill:#4c9e47,color:#fff
|
||||
style H fill:#6366f1,color:#fff
|
||||
style Q fill:#6366f1,color:#fff
|
||||
```
|
||||
|
||||
### Components
|
||||
|
||||
| Component | Location | Purpose |
|
||||
|-----------|----------|---------|
|
||||
| GHA Workflow | `.github/workflows/issue-automation.yml` | Triggers on issue creation, checks collaborator, POSTs to Woodpecker |
|
||||
| Woodpecker Pipeline | `.woodpecker/issue-automation.yml` | Authenticates to Vault, SSHes to DevVM, runs Claude agent |
|
||||
| Issue Responder | `.claude/agents/issue-responder.md` | Reads issue, classifies, investigates, fixes or escalates |
|
||||
| Post-Mortem Orchestrator | `.claude/agents/post-mortem.md` | 4-stage investigation pipeline |
|
||||
| SEV Triage | `.claude/agents/sev-triage.md` | Fast cluster scan + severity classification |
|
||||
| SEV Historian | `.claude/agents/sev-historian.md` | Cross-references past incidents |
|
||||
| SEV Report Writer | `.claude/agents/sev-report-writer.md` | Writes final postmortem + links to issue |
|
||||
| TODO Resolver | `.claude/agents/postmortem-todo-resolver.md` | Implements safe follow-up fixes |
|
||||
| Post-Mortem Skill | `.claude/skills/post-mortem/` | Manual `/post-mortem` command |
|
||||
| Cluster Health | `.claude/skills/cluster-health/` | Health check with auto-filing for SEV1/SEV2 |
|
||||
| Status Page CronJob | `stacks/status-page/main.tf` | Pushes status + incidents to GitHub Pages every 5 min |
|
||||
| Issue Templates | `.github/ISSUE_TEMPLATE/` | Structured forms for outage reports + feature requests |
|
||||
|
||||
### Safety Guardrails
|
||||
|
||||
The automated agent follows strict rules:
|
||||
|
||||
- **All changes go through Terraform** — never `kubectl apply` as final state
|
||||
- **`terraform plan` before every apply** — aborts if any resources would be destroyed
|
||||
- **Platform stacks are hands-off** — vault, dbaas, traefik, authentik, kyverno always escalate
|
||||
- **No data deletion** — never deletes PVCs, PVs, or user data
|
||||
- **Budget capped** — $10 max per issue, $5 per post-mortem run
|
||||
- **Complex = escalate** — if the agent isn't confident, it assigns to Viktor with findings
|
||||
|
||||
### Labels
|
||||
|
||||
| Label | Purpose |
|
||||
|-------|---------|
|
||||
| `user-report` | Auto-applied to outage reports |
|
||||
| `feature-request` | Auto-applied to feature requests |
|
||||
| `incident` | Confirmed incident (appears on status page) |
|
||||
| `sev1` / `sev2` / `sev3` | Severity classification |
|
||||
| `postmortem-required` | SEV needs a postmortem |
|
||||
| `postmortem-done` | Postmortem written and linked |
|
||||
| `needs-human` | Agent escalated — needs Viktor's attention |
|
||||
|
||||
### Commit Conventions
|
||||
|
||||
| Pattern | Used by |
|
||||
|---------|---------|
|
||||
| `feat: <desc> (fixes #N)` | Issue responder (feature implementations) |
|
||||
| `fix: <desc> (fixes #N)` | Issue responder (incident fixes) |
|
||||
| `fix(post-mortem): <action> [PM-YYYY-MM-DD]` | Post-mortem TODO resolver |
|
||||
| `docs: post-mortem for <date> <title> [ci skip]` | Post-mortem writer |
|
||||
141
docs/architecture/llama-cpp.md
Normal file
141
docs/architecture/llama-cpp.md
Normal file
|
|
@ -0,0 +1,141 @@
|
|||
# llama-cpp / llama-swap
|
||||
|
||||
## Overview
|
||||
|
||||
In-cluster, OpenAI-compatible vision-LLM endpoint. A single
|
||||
`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models
|
||||
served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on
|
||||
demand by `llama-swap`. One Service, one `/v1` endpoint, model
|
||||
selected by the request body `model` field.
|
||||
|
||||
Initial use case: vision-LLM benchmark on a curated Immich album,
|
||||
choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
|
||||
**Qwen3-VL-4B** for instagram-poster's candidate-scoring path.
|
||||
Future consumers (Home Assistant, agentic tooling) can hit the same
|
||||
endpoint via LiteLLM at the cluster gateway.
|
||||
|
||||
First benchmark run (2026-05-10): see
|
||||
`infra/docs/benchmarks/2026-05-10-vision-llm.md`. Verdict: **qwen3vl-4b**
|
||||
for the request path (3.55 s p50, 100% parse, decisive top-N
|
||||
distribution). qwen3vl-8b for caption polish on top picks.
|
||||
|
||||
## Why llama.cpp + llama-swap (not Ollama)
|
||||
|
||||
Verified across 7+7 research/challenger subagents (2026-05-10):
|
||||
|
||||
- **Broader OpenAI-compat surface** — `tool_choice`, `image_url`
|
||||
remote URLs, native bearer auth via `--api-key`, `/reranking`,
|
||||
Anthropic `/v1/messages` shim.
|
||||
- **Native observability** — `/metrics`, `/health` returns 503 during
|
||||
model load (proper K8s startup-probe semantics), `/slots` per-slot
|
||||
tracking. Ollama still has the `/metrics` issue
|
||||
[#3144](https://github.com/ollama/ollama/issues/3144) open.
|
||||
- **Stricter structured output** — native GBNF on `/completion`,
|
||||
JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`.
|
||||
- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports
|
||||
Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official
|
||||
`qwen3-vl` tag (community GGUFs broken — split-mmproj
|
||||
[#14575](https://github.com/ollama/ollama/issues/14575)) and the
|
||||
`openbmb/minicpm-v4.5` Ollama tag is 8 months stale.
|
||||
|
||||
Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and
|
||||
ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in)
|
||||
— the latter is mooted by fronting llama.cpp with **LiteLLM** at the
|
||||
gateway.
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Resource | Purpose |
|
||||
|-----------|----------|---------|
|
||||
| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses |
|
||||
| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) |
|
||||
| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080` → `llama-swap.llama-cpp.svc.cluster.local` |
|
||||
| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi |
|
||||
| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache |
|
||||
|
||||
## Storage
|
||||
|
||||
NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`).
|
||||
Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark
|
||||
run (<10%). The download Job warms the kernel page cache after pulling
|
||||
GGUFs so first inference reads from warm cache.
|
||||
|
||||
If steady-state cold-load latency becomes a problem, **Path B**: carve
|
||||
~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1,
|
||||
mount on-host, expose via a static `kubernetes_persistent_volume` with
|
||||
`local` source + node1 affinity. NVMe-class load times. Out of scope
|
||||
for the initial deployment.
|
||||
|
||||
## GPU allocation
|
||||
|
||||
The llama-swap pod requests `nvidia.com/gpu: 1`, but the T4 is
|
||||
**time-sliced** by the NVIDIA device plugin — several pods on k8s-node1
|
||||
each hold a `nvidia.com/gpu: 1` slice and run **concurrently**:
|
||||
`llama-swap`, `immich.immich-machine-learning`, `immich.immich-server`
|
||||
(NVENC transcode), and `frigate`. Time-slicing shares *compute* but
|
||||
**not memory** — the 16 GB VRAM is a single unpartitioned pool, so one
|
||||
greedy tenant can starve all the others.
|
||||
|
||||
This is a real failure mode, not theoretical: on 2026-06-02 immich-ml
|
||||
(running with `MACHINE_LEARNING_MODEL_TTL=0`, so nothing ever unloaded)
|
||||
let its onnxruntime CUDA arena balloon to 10.7 GB during an OCR-heavy
|
||||
library job and held it, leaving only ~2 GB free. llama-swap then
|
||||
couldn't allocate qwen3-8b (~4.5 GB) → `cudaMalloc` OOM → `llama-server`
|
||||
exited → 502s → recruiter-responder triage failed silently for ~5 h.
|
||||
Fix: immich `MODEL_TTL=600` so idle models unload and return VRAM. See
|
||||
`docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`.
|
||||
|
||||
Budget the T4 accordingly: with immich-ml idle (~2 GB CLIP) + frigate
|
||||
(~2 GB) there is ample room for an 8 B model. For a heavy benchmark you
|
||||
can still evict immich-ml entirely to guarantee headroom:
|
||||
|
||||
```bash
|
||||
kubectl scale -n immich deploy/immich-machine-learning --replicas=0
|
||||
# ... benchmark ...
|
||||
kubectl scale -n immich deploy/immich-machine-learning --replicas=1
|
||||
```
|
||||
|
||||
## Models served
|
||||
|
||||
| ID | HF repo | Quant | Ctx | mmproj |
|
||||
|----|---------|-------|-----|--------|
|
||||
| `qwen3-8b` | `Qwen/Qwen3-8B-GGUF` | Q4_K_M | 16384 | no (text-only) |
|
||||
| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
|
||||
| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
|
||||
| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
|
||||
|
||||
`qwen3-8b` (text-only) is the Tier-0 triage model for
|
||||
`recruiter-responder`; the `qwen3vl-*` / `minicpm-v` models serve the
|
||||
vision use cases.
|
||||
|
||||
llama.cpp build pinned via the `llama-swap:cuda` image (ships a
|
||||
recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
|
||||
[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
|
||||
mtmd Flash-Attention regression fix
|
||||
[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)).
|
||||
|
||||
## Endpoints
|
||||
|
||||
- `GET /v1/models` — list configured models
|
||||
- `POST /v1/chat/completions` — standard OpenAI chat (vision via
|
||||
`image_url` content parts, base64 or remote URL)
|
||||
- `POST /completion` — llama.cpp native completion (preferred for
|
||||
GBNF-constrained structured output to avoid 2026 regression magnet
|
||||
on `/v1/chat/completions`)
|
||||
- `GET /metrics` — Prometheus
|
||||
- `GET /health` — 200 once a model is fully loaded; 503 during load
|
||||
|
||||
## Known issues / decisions
|
||||
|
||||
- **Cluster-wide GPU contention** — the T4 is time-sliced across
|
||||
llama-swap, immich-ml, immich-server, and frigate; compute is shared
|
||||
but the 16 GB VRAM is **not** isolated, so any tenant can OOM the
|
||||
others (see "GPU allocation" + the 2026-06-02 post-mortem). No hard
|
||||
memory partitioning is wired in (T4 has no MIG; MPS memory limits are
|
||||
overkill). Mitigation is keeping each tenant's resident footprint
|
||||
bounded — for immich-ml that means `MACHINE_LEARNING_MODEL_TTL > 0`.
|
||||
- **Filename-agnostic config** — the download Job creates stable
|
||||
`model.gguf` / `mmproj.gguf` symlinks per model dir so the
|
||||
llama-swap config doesn't need to track exact HF filenames (which
|
||||
change between releases).
|
||||
- **TF schema** — `llama-cpp` (PG backend on dbaas).
|
||||
335
docs/architecture/mailserver.md
Normal file
335
docs/architecture/mailserver.md
Normal file
|
|
@ -0,0 +1,335 @@
|
|||
# Mail Server Architecture
|
||||
|
||||
Last updated: 2026-04-19 (code-yiu Phase 6: MetalLB LB retired; traffic now enters via pfSense HAProxy with PROXY v2)
|
||||
|
||||
## Overview
|
||||
|
||||
Self-hosted email for `viktorbarzin.me` using docker-mailserver 15.0.0 on Kubernetes. Inbound mail arrives directly via MX record to the home IP on port 25. Outbound mail relays through Brevo EU (`smtp-relay.brevo.com:587` — migrated from Mailgun on 2026-04-12; SPF record cut over on 2026-04-18). Roundcubemail provides webmail access. CrowdSec protects SMTP/IMAP from brute-force attacks using real client IPs: pfSense HAProxy injects the PROXY v2 header on each backend connection so the mailserver pod sees the true source IP despite kube-proxy SNAT. See [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md) for ops details.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
Two independent paths into the mailserver pod:
|
||||
|
||||
- **External** (MX traffic, webmail clients over WAN): Internet → pfSense → HAProxy → NodePort → **alt container ports** (2525/4465/5587/10993) that **require** PROXY v2 framing.
|
||||
- **Intra-cluster** (Roundcube, E2E probe): same pod, **stock container ports** (25/465/587/993), **no** PROXY framing.
|
||||
|
||||
One Deployment, one pod, two sets of Postfix `master.cf` services + Dovecot `inet_listener` blocks, two Kubernetes Services (`mailserver` ClusterIP + `mailserver-proxy` NodePort).
|
||||
|
||||
```mermaid
|
||||
flowchart TB
|
||||
%% External ingress path
|
||||
SENDER[Sending MTA<br/>arbitrary public IP] -->|MX lookup + SMTP<br/>:25| MX[mail.viktorbarzin.me<br/>A 176.12.22.76]
|
||||
MX --> PF[pfSense WAN<br/>vtnet0 192.168.1.2]
|
||||
PF -->|NAT rdr<br/>WAN:25/465/587/993<br/>→ 10.0.20.1:same| HAP
|
||||
HAP[pfSense HAProxy<br/>4 TCP frontends on 10.0.20.1<br/>send-proxy-v2 to backends]
|
||||
HAP -->|round-robin<br/>tcp-check inter 120s| KN{k8s worker<br/>node1..6}
|
||||
KN -->|NodePort 30125-30128<br/>ETP: Cluster → kube-proxy SNAT| PODEXT
|
||||
|
||||
%% Internal ingress path
|
||||
RC[Roundcubemail pod] -->|SMTP :587 + IMAP :993<br/>no PROXY| SVC[Service mailserver<br/>ClusterIP 10.103.108.x<br/>25/465/587/993]
|
||||
PROBE[email-roundtrip-monitor<br/>CronJob every 20m] -->|IMAP :993<br/>no PROXY| SVC
|
||||
SVC -->|kube-proxy routes| PODINT
|
||||
|
||||
%% The pod — two listener sets, one process tree
|
||||
subgraph POD["mailserver pod (docker-mailserver 15.0.0)"]
|
||||
direction LR
|
||||
PODEXT[Alt ports<br/>2525 / 4465 / 5587 / 10993<br/><b>PROXY v2 REQUIRED</b><br/>smtpd_upstream_proxy_protocol=haproxy<br/>haproxy = yes]
|
||||
PODINT[Stock ports<br/>25 / 465 / 587 / 993<br/>PROXY-free]
|
||||
PODEXT --> POSTFIX
|
||||
PODINT --> POSTFIX
|
||||
POSTFIX[Postfix<br/>postscreen + smtpd + cleanup + queue]
|
||||
POSTFIX --> RSPAMD[Rspamd<br/>spam + DKIM + DMARC]
|
||||
RSPAMD --> DOVECOT[Dovecot IMAP<br/>LMTP deliver]
|
||||
DOVECOT --> MAILBOX[(Maildir storage<br/>mailserver-data-encrypted PVC<br/>proxmox-lvm-encrypted LUKS2)]
|
||||
end
|
||||
|
||||
%% Outbound
|
||||
POSTFIX -->|queued mail<br/>SASL + TLS| BREVO[Brevo EU Relay<br/>smtp-relay.brevo.com:587<br/>300/day free tier]
|
||||
BREVO --> RECIPIENT[External Recipient]
|
||||
|
||||
%% Webmail HTTP path
|
||||
USER[User browser] -->|HTTPS| CF[Cloudflare proxy<br/>mail.viktorbarzin.me]
|
||||
CF --> TUNNEL[Cloudflared tunnel<br/>pfSense → Traefik]
|
||||
TUNNEL --> TRAEFIK[Traefik Ingress<br/>Authentik-protected]
|
||||
TRAEFIK --> RC
|
||||
|
||||
%% Security
|
||||
POSTFIX -.->|log stream<br/>real client IPs from PROXY v2| CSAGENT[CrowdSec Agent<br/>postfix + dovecot parsers]
|
||||
CSAGENT -.-> CSLAPI[CrowdSec LAPI]
|
||||
CSLAPI -.->|bouncer decisions<br/>ban external IPs| PF
|
||||
|
||||
%% Monitoring
|
||||
PROBE -.->|Brevo HTTP API<br/>triggers external delivery| MX
|
||||
PROBE -.->|Push on roundtrip success| PUSH[Pushgateway + Uptime Kuma]
|
||||
|
||||
classDef extPath fill:#ffedd5,stroke:#ea580c,stroke-width:2px
|
||||
classDef intPath fill:#dbeafe,stroke:#2563eb,stroke-width:2px
|
||||
classDef pod fill:#dcfce7,stroke:#15803d
|
||||
classDef sec fill:#fee2e2,stroke:#dc2626
|
||||
class SENDER,MX,PF,HAP,KN,PODEXT extPath
|
||||
class RC,PROBE,SVC,PODINT intPath
|
||||
class POSTFIX,RSPAMD,DOVECOT,MAILBOX pod
|
||||
class CSAGENT,CSLAPI sec
|
||||
```
|
||||
|
||||
### PROXY v2 sequence (external SMTP roundtrip)
|
||||
|
||||
Illustrates the wire-level sequence of a Brevo probe email arriving at our MX. Same sequence applies to any external sender.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
autonumber
|
||||
participant C as External MTA<br/>(e.g. Brevo 77.32.148.26)
|
||||
participant PF as pfSense WAN<br/>192.168.1.2:25
|
||||
participant HAP as pfSense HAProxy<br/>10.0.20.1:25
|
||||
participant N as k8s-node:30125<br/>ETP: Cluster
|
||||
participant P as Postfix postscreen<br/>pod:2525
|
||||
|
||||
C->>PF: TCP SYN dst=192.168.1.2:25
|
||||
PF->>HAP: NAT rdr rewrites dst → 10.0.20.1:25
|
||||
HAP->>N: TCP connect (src=10.0.20.1, dst=k8s-node:30125)
|
||||
Note over HAP,N: HAProxy opens a NEW TCP flow<br/>to the backend k8s node.
|
||||
HAP->>N: PROXY v2 header<br/>(source=77.32.148.26, dest=10.0.20.1)
|
||||
N->>P: kube-proxy SNAT src=k8s-node IP<br/>forwards PROXY header + payload to pod
|
||||
P->>P: Parse PROXY v2 header<br/>smtpd_client_addr := 77.32.148.26<br/>(despite kube-proxy SNAT on the wire)
|
||||
P-->>C: SMTP banner 220 mail.viktorbarzin.me
|
||||
C-->>P: EHLO / MAIL FROM / RCPT TO / DATA
|
||||
Note over P,C: Real client IP logged in maillog,<br/>fed to CrowdSec postfix parser.
|
||||
P->>P: → smtpd → Rspamd → Dovecot → mailbox
|
||||
```
|
||||
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| docker-mailserver | 15.0.0 | `mailserver` namespace | Postfix MTA + Dovecot IMAP + Rspamd (single container) |
|
||||
| Roundcubemail | 1.6.13-apache | `mailserver` namespace | Webmail UI (MySQL-backed) |
|
||||
| Rspamd | Built into docker-mailserver | — | Spam filtering, DKIM signing, DMARC verification |
|
||||
| pfSense HAProxy | 2.9-dev6 (`pfSense-pkg-haproxy-devel`) | pfSense VM | TCP reverse proxy injecting PROXY v2 for external mail |
|
||||
| Brevo EU (ex-Sendinblue) | SaaS | — | Outbound SMTP relay (300/day free) |
|
||||
|
||||
Dovecot exporter was retired in code-1ik (2026-04-19) — `viktorbarzin/dovecot_exporter` speaks the pre-2.3 `old_stats` FIFO protocol which docker-mailserver 15.0.0's Dovecot 2.3.19 no longer emits.
|
||||
|
||||
## Port mapping
|
||||
|
||||
The mailserver pod exposes **8 TCP listeners**: 4 stock + 4 alt. Two Kubernetes Services front them depending on whether the client can inject PROXY v2.
|
||||
|
||||
| Mail protocol | Service port | K8s Service | Container port | NodePort | PROXY v2? | Who uses this path |
|
||||
|---|---|---|---|---|---|---|
|
||||
| SMTP (plain + STARTTLS) | 25 | `mailserver` ClusterIP | 25 | — | ❌ stock | Intra-cluster only (not used — internal clients send via 587) |
|
||||
| SMTPS (implicit TLS) | 465 | `mailserver` ClusterIP | 465 | — | ❌ stock | Intra-cluster (Roundcube rarely uses this) |
|
||||
| Submission (STARTTLS) | 587 | `mailserver` ClusterIP | 587 | — | ❌ stock | **Roundcube pod** → mailserver.svc:587 |
|
||||
| IMAPS | 993 | `mailserver` ClusterIP | 993 | — | ❌ stock | **Roundcube pod** + E2E probe → mailserver.svc:993 |
|
||||
| SMTP | 25 | `mailserver-proxy` NodePort | 2525 | 30125 | ✅ required | External MX traffic via pfSense HAProxy |
|
||||
| SMTPS | 465 | `mailserver-proxy` NodePort | 4465 | 30126 | ✅ required | External SMTPS submission |
|
||||
| Submission | 587 | `mailserver-proxy` NodePort | 5587 | 30127 | ✅ required | External STARTTLS submission (mail clients over WAN) |
|
||||
| IMAPS | 993 | `mailserver-proxy` NodePort | 10993 | 30128 | ✅ required | External IMAPS (mail clients over WAN) |
|
||||
|
||||
The alt listeners are set up by:
|
||||
- **Postfix**: `user-patches.sh` (shipped via ConfigMap `mailserver-user-patches`) appends 3 entries to `master.cf` with `-o postscreen_upstream_proxy_protocol=haproxy` (for 2525) or `-o smtpd_upstream_proxy_protocol=haproxy` (for 4465/5587).
|
||||
- **Dovecot**: `dovecot.cf` ConfigMap adds a second `inet_listener` inside `service imap-login` with `haproxy = yes`, plus `haproxy_trusted_networks = 10.0.20.0/24` to allow PROXY headers from the k8s node subnet (post kube-proxy SNAT the source IP is always a node IP).
|
||||
|
||||
## Mail Flow
|
||||
|
||||
### Inbound
|
||||
```
|
||||
Internet → MX: mail.viktorbarzin.me (priority 1)
|
||||
→ A record: 176.12.22.76 (non-proxied Cloudflare DNS-only)
|
||||
→ pfSense NAT rdr: WAN:{25,465,587,993} → 10.0.20.1:{same}
|
||||
→ pfSense HAProxy (TCP mode, send-proxy-v2 on backend)
|
||||
→ k8s-node:{30125..30128} NodePort (mailserver-proxy, ETP: Cluster)
|
||||
→ kube-proxy → pod alt listener (2525/4465/5587/10993)
|
||||
→ Postfix postscreen / smtpd / Dovecot parses PROXY v2 header
|
||||
→ Rspamd (spam + DKIM + DMARC) → Dovecot → mailbox
|
||||
```
|
||||
|
||||
No backup MX. If the server is down, sender MTAs queue and retry for 4-5 days per SMTP standards (RFC 5321).
|
||||
|
||||
### Outbound
|
||||
```
|
||||
Postfix → relayhost [smtp-relay.brevo.com]:587 (SASL auth + TLS required)
|
||||
→ Brevo handles IP reputation, deliverability, bounce processing
|
||||
→ 300 emails/day free tier (migrated from Mailgun 100/day on 2026-04-12)
|
||||
```
|
||||
|
||||
### Webmail
|
||||
```
|
||||
https://mail.viktorbarzin.me → Traefik → Roundcubemail
|
||||
IMAP: ssl://mailserver:993 (internal K8s service)
|
||||
SMTP: tls://mailserver:587 (internal K8s service)
|
||||
DB: MySQL (mysql.dbaas.svc.cluster.local)
|
||||
```
|
||||
|
||||
## DNS Records
|
||||
|
||||
All managed in Terraform at `stacks/cloudflared/modules/cloudflared/cloudflare.tf`.
|
||||
|
||||
| Type | Name | Value | Purpose |
|
||||
|------|------|-------|---------|
|
||||
| MX | `viktorbarzin.me` | `mail.viktorbarzin.me` (pri 1) | Inbound mail routing |
|
||||
| A | `mail.viktorbarzin.me` | `176.12.22.76` (non-proxied) | Mail server IP |
|
||||
| AAAA | `mail.viktorbarzin.me` | `2001:470:6e:43d::2` | IPv6 (HE tunnel) |
|
||||
| TXT (SPF) | `viktorbarzin.me` | `v=spf1 include:spf.brevo.com ~all` | Authorize Brevo for outbound (soft-fail during cutover; was `include:mailgun.org -all` until 2026-04-18 Brevo migration) |
|
||||
| TXT (DKIM) | `s1._domainkey` | RSA 1024-bit key | Mailgun DKIM (roundtrip probe only — inbound testing still uses Mailgun API) |
|
||||
| TXT (DKIM) | `mail._domainkey` | RSA 2048-bit key | Rspamd self-hosted DKIM signing |
|
||||
| CNAME (DKIM) | `brevo1._domainkey` | b1.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
|
||||
| CNAME (DKIM) | `brevo2._domainkey` | b2.viktorbarzin-me.dkim.brevo.com | Brevo outbound DKIM (delegated) |
|
||||
| TXT | `viktorbarzin.me` | `brevo-code:a6ef1dd9...` | Brevo domain verification |
|
||||
| TXT (DMARC) | `_dmarc` | `p=quarantine; pct=100; rua=mailto:dmarc@viktorbarzin.me` | DMARC enforcement; aggregate reports land in-domain at `dmarc@viktorbarzin.me` (tracked under code-569; current live record still points at `e21c0ff8@dmarc.mailgun.org` pending cutover) |
|
||||
| TXT (MTA-STS) | `_mta-sts` | `v=STSv1; id=20260412` | TLS enforcement for inbound |
|
||||
| TXT (TLSRPT) | `_smtp._tls` | `v=TLSRPTv1; rua=mailto:postmaster@...` | TLS failure reporting |
|
||||
|
||||
### Known Limitation: PTR Mismatch
|
||||
|
||||
Reverse DNS for `176.12.22.76` returns `176-12-22-76.pon.spectrumnet.bg.` (ISP-assigned) instead of `mail.viktorbarzin.me`. This is ISP-controlled and cannot be changed on a residential connection. Most modern providers (Gmail, Outlook) rely on SPF/DKIM/DMARC rather than PTR, so impact is minimal.
|
||||
|
||||
## Security
|
||||
|
||||
### CrowdSec Integration
|
||||
- **Collections**: `crowdsecurity/postfix` + `crowdsecurity/dovecot` (installed)
|
||||
- **Log acquisition**: CrowdSec agents parse mailserver pod logs for brute-force patterns
|
||||
- **Real client IPs**: pfSense HAProxy injects PROXY v2 header on each backend connection; Postfix (`postscreen_upstream_proxy_protocol=haproxy` / `smtpd_upstream_proxy_protocol=haproxy` on alt ports) + Dovecot (`haproxy = yes` on alt IMAPS listener) parse it to recover the true source IP despite kube-proxy SNAT. Replaces the pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme (see code-yiu)
|
||||
- **Decisions**: CrowdSec bans/challenges attackers via firewall bouncer rules
|
||||
|
||||
### Fail2ban Disabled (CrowdSec is the Policy)
|
||||
|
||||
docker-mailserver ships Fail2ban, but it is explicitly disabled here: `ENABLE_FAIL2BAN = "0"` at [`stacks/mailserver/modules/mailserver/main.tf:68`](../../stacks/mailserver/modules/mailserver/main.tf). CrowdSec is the cluster-wide bouncer for SSH, HTTP, and SMTP/IMAP brute-force defence — it already parses the `postfix` and `dovecot` log streams via the collections listed above and applies decisions at the LB/firewall layer. Enabling Fail2ban in-pod would create a duplicate response path (two systems racing to ban the same IP from different enforcement points), add iptables churn inside the container, and fragment the audit trail across two decision stores. Decision (2026-04-18): keep it disabled; CrowdSec owns this policy.
|
||||
|
||||
### Rspamd
|
||||
- Spam filtering with phishing detection and Oletools
|
||||
- DKIM signing (selector `mail`, 2048-bit RSA)
|
||||
- DMARC verification on inbound mail
|
||||
- Auto-learns from Junk folder movements (`RSPAMD_LEARN=1`)
|
||||
- SRS (Sender Rewriting Scheme) enabled for forwarded mail
|
||||
|
||||
### Postfix Rate Limiting
|
||||
```
|
||||
smtpd_client_connection_rate_limit = 10 # per minute per client
|
||||
smtpd_client_message_rate_limit = 30 # per minute per client
|
||||
anvil_rate_time_unit = 60s
|
||||
```
|
||||
|
||||
### TLS
|
||||
- Wildcard Let's Encrypt cert (`*.viktorbarzin.me`) for SMTP STARTTLS and IMAPS
|
||||
- Renewed via Woodpecker CI cron pipeline (DNS-01 challenge via Cloudflare)
|
||||
- MTA-STS enforces TLS for inbound delivery
|
||||
|
||||
## Monitoring
|
||||
|
||||
### E2E Roundtrip Probe
|
||||
CronJob `email-roundtrip-monitor` (every 20 min, `*/20 * * * *`):
|
||||
1. Sends test email via **Brevo HTTP API** to `smoke-test@viktorbarzin.me` (Brevo delivers it to our MX over the public internet, exercising the full external-ingress path).
|
||||
2. Email hits WAN → pfSense HAProxy → k8s-node:30125 → pod :2525 postscreen (PROXY v2) → Postfix → catch-all delivers to `spam@` mailbox.
|
||||
3. Verifies delivery via IMAP — connects to `mailserver.mailserver.svc.cluster.local:993` (intra-cluster path, no PROXY), searches by UUID marker.
|
||||
4. Deletes test email, pushes metrics to Pushgateway + Uptime Kuma.
|
||||
|
||||
Push secrets (`BREVO_API_KEY`, `EMAIL_MONITOR_IMAP_PASSWORD`) come from ExternalSecret `mailserver-probe-secrets` (synced from Vault `secret/viktor` + `secret/platform.mailserver_accounts`) — see code-39v.
|
||||
|
||||
### Prometheus Alerts
|
||||
| Alert | Threshold | Severity |
|
||||
|-------|-----------|----------|
|
||||
| MailServerDown | No replicas for 5m | warning |
|
||||
| EmailRoundtripFailing | Probe failing for 30m | warning |
|
||||
| EmailRoundtripStale | No success in >80m (60m threshold + for:20m) | warning |
|
||||
| EmailRoundtripNeverRun | Metric absent for 40m | warning |
|
||||
|
||||
### Uptime Kuma Monitors
|
||||
- TCP SMTP on `176.12.22.76:25` — full external path (DNS → WAN → pfSense HAProxy → mailserver)
|
||||
- TCP `mailserver.svc:{587,993}` — intra-cluster ClusterIP path
|
||||
- TCP `10.0.20.1:{25,993}` — pfSense HAProxy health (post code-yiu Phase 6)
|
||||
- E2E Push monitor (receives push from `email-roundtrip-monitor` probe)
|
||||
|
||||
### Dovecot exporter — retired
|
||||
`viktorbarzin/dovecot_exporter` was removed in code-1ik (2026-04-19). It spoke the pre-2.3 `old_stats` FIFO protocol; Dovecot 2.3.19 (docker-mailserver 15.0.0) no longer emits that, so the scrape only ever returned `dovecot_up{scope="user"} 0`. If Dovecot metrics become valuable, reach for a 2.3+ compatible exporter (e.g. `jtackaberry/dovecot_exporter`) and re-add the scrape + alerts. The previously-created `mailserver-metrics` ClusterIP Service was also removed.
|
||||
|
||||
## Terraform
|
||||
|
||||
| Stack | Path | Resources |
|
||||
|-------|------|-----------|
|
||||
| Mailserver | `stacks/mailserver/` | Namespace, deployment, service, CronJob, PVCs |
|
||||
| DNS | `stacks/cloudflared/modules/cloudflared/cloudflare.tf` | MX, SPF, DKIM, DMARC, MTA-STS, TLSRPT records |
|
||||
| Monitoring | `stacks/monitoring/` | Prometheus alert rules |
|
||||
| CrowdSec | `stacks/crowdsec/` | Collections, log acquisition (already configured) |
|
||||
|
||||
### Secrets (Vault)
|
||||
| Path | Key | Purpose |
|
||||
|------|-----|---------|
|
||||
| `secret/platform` | `mailserver_accounts` | User credentials (JSON) |
|
||||
| `secret/platform` | `mailserver_aliases` | Postfix virtual aliases |
|
||||
| `secret/platform` | `mailserver_opendkim_key` | DKIM private key |
|
||||
| `secret/platform` | `mailserver_sasl_passwd` | Brevo relay credentials (`[smtp-relay.brevo.com]:587 <login>:<key>`) |
|
||||
| `secret/viktor` | `brevo_api_key` | Brevo API key — used by BOTH outbound SMTP SASL (postfix) AND the E2E roundtrip probe (sends external test mail via Brevo HTTP) |
|
||||
| `secret/viktor` | `mailgun_api_key` | Historical; no longer used by the probe post code-n5l/Phase-5 work. Kept for reference. |
|
||||
|
||||
## Storage
|
||||
|
||||
| PVC | Size | Storage Class | Purpose |
|
||||
|-----|------|---------------|---------|
|
||||
| `mailserver-data-encrypted` | 2Gi (auto-resize 5Gi) | `proxmox-lvm-encrypted` (LUKS2) | Maildir + Postfix queue + state + logs |
|
||||
| `roundcubemail-html-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube PHP code + user session data |
|
||||
| `roundcubemail-enigma-encrypted` | 1Gi | `proxmox-lvm-encrypted` | Roundcube Enigma (PGP) user keys |
|
||||
| `mailserver-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `mailserver-backup` CronJob destination (`/srv/nfs/mailserver-backup/<YYYY-WW>/`) |
|
||||
| `roundcube-backup-host` (RWX) | 10Gi | `nfs-truenas` (historical SC name, Proxmox host NFS) | `roundcube-backup` CronJob destination |
|
||||
|
||||
**Backup**: daily `mailserver-backup` + `roundcube-backup` CronJobs rsync data PVCs to NFS. NFS directory is picked up by the PVE host's inotify-driven `/usr/local/bin/offsite-sync-backup` which pushes to Synology (weekly). See [Storage & Backup Architecture](storage.md) for the 3-2-1 flow.
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### No Backup MX
|
||||
- **Alternatives considered**: ForwardEmail (free relay), Cloudflare Email Routing, Dynu Store/Forward
|
||||
- **Decision**: Direct MX only. ForwardEmail relay was evaluated (2026-04-12) and abandoned — its anti-spoofing enforcement rejects legitimate forwarded mail regardless of SPF configuration. Cloudflare Email Routing can't store-and-forward (pass-through proxy only). Dynu ($9.99/yr) is a viable future option.
|
||||
- **Tradeoff**: If server is down, mail delivery relies on sender MTA retry queues (4-5 days standard). No immediate forwarding to a backup address.
|
||||
|
||||
### Brevo for Outbound (migrated from Mailgun 2026-04-12)
|
||||
- **Decision**: All outbound relays through Brevo EU (ex-Sendinblue). 300 emails/day free tier (3x Mailgun's 100/day).
|
||||
- **Why migrated**: Mailgun's 100/day limit was too tight — the E2E probe uses ~72/day, leaving only 28 for real mail.
|
||||
- **DKIM**: Brevo uses delegated DKIM via CNAME (`brevo1._domainkey`, `brevo2._domainkey`). Mailgun's `s1._domainkey` retained for the roundtrip probe (still uses Mailgun API for inbound testing).
|
||||
- **Tradeoff**: Dependency on Brevo SaaS for outbound.
|
||||
|
||||
### Rspamd over SpamAssassin/OpenDKIM
|
||||
- **Decision**: Rspamd replaces both SpamAssassin and OpenDKIM in a single component
|
||||
- **Tradeoff**: Higher memory usage (~150-200MB) but simpler stack
|
||||
|
||||
### Client-IP Preservation (pfSense HAProxy + PROXY v2)
|
||||
- **Current (2026-04-19, bd code-yiu)**: pfSense HAProxy listens on `10.0.20.1:{25,465,587,993}`, forwards to k8s NodePort 30125-30128 with `send-proxy-v2` on each backend connection. The mailserver pod exposes parallel listeners (2525/4465/5587/10993) that REQUIRE the PROXY v2 header, while the stock ports 25/465/587/993 stay PROXY-free for intra-cluster traffic (Roundcube, probe). The mailserver Service is ClusterIP-only; ETP is no longer a concern for external traffic.
|
||||
- **Historical (2026-04-12 → 2026-04-19)**: Dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` — required pod/speaker colocation; kube-proxy preserved client IP only when pod was on the same node as the advertising speaker.
|
||||
- **Why switched**: ETP:Local made the mailserver's single replica drop inbound mail silently during pod reschedule (30-60s GARP flip). HAProxy with `send-proxy-v2` lets the pod reschedule to any node and recover IP-preservation through the header.
|
||||
- **Tradeoff**: pfSense now runs HAProxy (one more service in the firewall's responsibility); alt container ports + extra Service are ~80 lines of Terraform. The win is HA without IP-preservation compromise.
|
||||
- **Runbook**: [`runbooks/mailserver-pfsense-haproxy.md`](../runbooks/mailserver-pfsense-haproxy.md).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Inbound mail not arriving
|
||||
1. **DNS/MX**: `dig MX viktorbarzin.me +short` → should show `mail.viktorbarzin.me`
|
||||
2. **WAN reachability**: `nc -zw5 mail.viktorbarzin.me 25` from outside
|
||||
3. **pfSense NAT**: verify WAN:{25,465,587,993} rdr to `10.0.20.1` (HAProxy VIP). `ssh admin@10.0.20.1 'pfctl -sn' | grep '10.0.20.1'`
|
||||
4. **HAProxy health**: `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio"` — at least one backend in `srv_op_state=2` (UP) per pool
|
||||
5. **Container listener**: `kubectl exec -n mailserver -c docker-mailserver deployment/mailserver -- ss -ltn | grep -E ':(25|2525|465|4465|587|5587|993|10993)\b'` — 8 lines expected
|
||||
6. **Postfix queue + delivery**: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep -E 'from=|reject|smtpd-proxy'`
|
||||
7. **CrowdSec decisions**: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
|
||||
|
||||
### Outbound mail failing
|
||||
1. Check Brevo relay: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep relay` — should show `relay=smtp-relay.brevo.com`
|
||||
2. Check SASL credentials: `vault kv get -field=mailserver_sasl_passwd secret/platform` — should show `[smtp-relay.brevo.com]:587`
|
||||
3. Check Brevo dashboard for delivery status
|
||||
4. SASL auth failure → verify SMTP key (xsmtpsib-...) and login (a7e778001@smtp-brevo.com)
|
||||
|
||||
### E2E roundtrip probe failing
|
||||
1. Check CronJob: `kubectl get cronjob -n mailserver email-roundtrip-monitor`
|
||||
2. Check job logs: `kubectl logs -n mailserver -l job-name --tail=20`
|
||||
3. Check Mailgun rate limit (HTTP 429 errors mean too many API calls)
|
||||
4. Check IMAP login: verify `spam@viktorbarzin.me` password in Vault (`secret/platform` → `mailserver_accounts`)
|
||||
|
||||
### Spam/brute-force attacks
|
||||
1. Check CrowdSec decisions: `kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli decisions list`
|
||||
2. Check Postfix logs for auth failures: `kubectl logs -n mailserver deploy/mailserver -c docker-mailserver | grep 'authentication failed'`
|
||||
3. Verify real client IPs in logs (not 10.0.20.x node IPs)
|
||||
|
||||
## Related
|
||||
|
||||
- [Monitoring Architecture](monitoring.md) — alert definitions, Uptime Kuma
|
||||
- [Networking Architecture](networking.md) — MetalLB, pfSense NAT, Cloudflare DNS
|
||||
- [Security Architecture](security.md) — CrowdSec deployment
|
||||
- [Secrets Management](secrets.md) — Vault paths for mail credentials
|
||||
- [Mailserver Hardening Plan](../plans/2026-02-23-mailserver-hardening-plan.md) — historical
|
||||
397
docs/architecture/monitoring.md
Normal file
397
docs/architecture/monitoring.md
Normal file
|
|
@ -0,0 +1,397 @@
|
|||
# Monitoring & Alerting Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
The monitoring stack provides comprehensive observability for the home Kubernetes cluster through metrics collection (Prometheus), visualization (Grafana), log aggregation (Loki), alerting (Alertmanager), and uptime monitoring (Uptime Kuma). GPU metrics are collected via NVIDIA's dcgm-exporter. The system tracks infrastructure health, application performance, backup success, and resource utilization with intelligent alert inhibition to reduce noise during cascading failures.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Metric Sources"
|
||||
K8S[Kubernetes API Server]
|
||||
NODES[Node Exporters]
|
||||
PODS[Application Pods]
|
||||
GPU[NVIDIA GPU via dcgm-exporter]
|
||||
UPS[UPS Exporter]
|
||||
NFS[NFS Exporter]
|
||||
EMAIL[Email Roundtrip Probe<br/>CronJob every 10m]
|
||||
end
|
||||
|
||||
subgraph "Monitoring Stack (platform stack)"
|
||||
PROM[Prometheus<br/>Scrape & Store]
|
||||
LOKI[Loki<br/>Log Aggregation]
|
||||
AM[Alertmanager<br/>Alert Routing]
|
||||
GRAFANA[Grafana<br/>14+ Dashboards<br/>OIDC via Authentik]
|
||||
UPTIME[Uptime Kuma<br/>HTTP Monitors]
|
||||
end
|
||||
|
||||
subgraph "Alert Flow"
|
||||
INHIBIT[Inhibition Rules<br/>Node Down → Suppress Pod Alerts]
|
||||
NOTIFY[Notifications]
|
||||
end
|
||||
|
||||
K8S -->|ServiceMonitors| PROM
|
||||
NODES -->|Metrics| PROM
|
||||
PODS -->|Metrics| PROM
|
||||
PODS -->|Logs| LOKI
|
||||
GPU -->|GPU Metrics| PROM
|
||||
UPS -->|UPS Metrics| PROM
|
||||
NFS -->|NFS Metrics| PROM
|
||||
|
||||
PROM -->|Query| GRAFANA
|
||||
PROM -->|Alerts| AM
|
||||
LOKI -->|Query| GRAFANA
|
||||
|
||||
AM --> INHIBIT
|
||||
INHIBIT --> NOTIFY
|
||||
|
||||
EMAIL -->|Pushgateway| PROM
|
||||
EMAIL -.->|Push| UPTIME
|
||||
PODS -.->|HTTP Health| UPTIME
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
|
||||
| Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
|
||||
| Loki | **DEPLOYED 2026-05-18** (SingleBinary mode, 30d retention, 50Gi PVC on `proxmox-lvm`, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
|
||||
| Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
|
||||
| Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
|
||||
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
|
||||
| dcgm-exporter | Configurable resources | `stacks/monitoring/modules/monitoring/` | NVIDIA GPU metrics collection |
|
||||
| Email Roundtrip Probe | Python 3.12 | `stacks/mailserver/modules/mailserver/` | E2E email delivery verification via Mailgun API + IMAP |
|
||||
| Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
|
||||
| blackbox-exporter (Authentik walling-off guard) | `prom/blackbox-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` | Single-purpose blackbox-exporter. Its `http_no_authentik_redirect` module probes each must-stay-public carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff the response redirects to Authentik. Scraped by job `blackbox-authentik-walloff` (1m); feeds alert `AuthentikWallingOffPublicPath`. Target list = `local.authentik_walloff_targets` in the same file. |
|
||||
| snmp-exporter | `prom/snmp-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/snmp_exporter.tf` + `ups_snmp_values.yaml` | SNMP→Prometheus bridge. Modules in `ups_snmp_values.yaml`: `huawei` (UPS), `if_mib`/`ip_mib`, and **`dell_idrac`** (R730 iDRAC, merged from `prometheus_snmp_chart_values.yaml` 2026-06-05 + hand-added fan-RPM `coolingDeviceReading` / amperage location lookup). Scrape jobs: `snmp-ups` (30s, module=huawei), **`snmp-idrac` (1m, module=dell_idrac, auth=public_v2)** — the FAST primary source for R730 health/thermal/power/fan/voltage since the 2026-06-05 Redfish→SNMP migration (~3.7s/scrape vs Redfish ~18.5s). Relabels all metrics to `r730_idrac_<mibName>`. |
|
||||
| idrac-redfish-exporter | `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix` (mrlhansen/idrac_exporter, Keel-managed) | `stacks/monitoring/modules/monitoring/idrac.tf` | **Slow remnant** (10m scrape, job `redfish-idrac`) since the 2026-06-05 SNMP migration — was the sole iDRAC source at a 3m interval, demoted once SNMP took over the fast path. Trimmed to `system,sensors,power,storage,network,memory`. Serves only what SNMP can't (indicator LED, NIC link-speed Mbps, machine/BIOS info, per-drive storage table). **HA Sofia's R730 sensors moved off this exporter to a fast Prometheus SNMP query on 2026-06-05** (see the iDRAC subsection under "How It Works"), so the `sensors` collector here is now vestigial. |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Metrics Collection
|
||||
|
||||
Prometheus scrapes metrics from all cluster components and applications using ServiceMonitor CRDs and scrape configs. Every new service deployed to the cluster receives:
|
||||
1. A Prometheus scrape configuration (via ServiceMonitor or static config)
|
||||
2. An Uptime Kuma HTTP monitor for internal health checks
|
||||
3. An external HTTP monitor (auto-created by `external-monitor-sync` for all Cloudflare-proxied services)
|
||||
|
||||
### External Monitoring
|
||||
|
||||
The `external-monitor-sync` CronJob (every 10min, `stacks/uptime-kuma/`) ensures Uptime Kuma has `[External] <service>` monitors for externally-reachable ingresses. Discovery is **opt-OUT**: the script lists every ingress via the K8s API and creates a monitor for any host ending in `.viktorbarzin.me`, skipping only those annotated `uptime.viktorbarzin.me/external-monitor: "false"`. Both `ingress_factory` and the `reverse-proxy` factory emit that annotation when the caller sets `external_monitor = false`; leaving it null keeps the opt-in default (important for helm-provisioned ingresses that don't go through our factories). The legacy `cloudflare_proxied_names` ConfigMap is a fallback if the K8s API discovery fails.
|
||||
|
||||
These monitors test the full external access path (DNS → Cloudflare → Tunnel → Traefik → Service) from inside the cluster. The status-page-pusher groups them as "External Reachability" and pushes a `external_internal_divergence_count` metric to Pushgateway when services are externally down but internally up. Alert `ExternalAccessDivergence` fires after 15min of divergence.
|
||||
|
||||
Data flows from targets through Prometheus storage to Grafana dashboards. Applications emit logs to stdout/stderr which are aggregated by Loki and queryable through Grafana's log viewer.
|
||||
|
||||
### Cluster log aggregation (Alloy → Loki) + the "Cluster Logs" dashboard
|
||||
|
||||
Pod logs are tailed off the nodes' `/var/log/pods` by the **Grafana Alloy**
|
||||
DaemonSet (`alloy.yaml`) and shipped to Loki with labels `namespace` / `pod` /
|
||||
`container` / `app`; node + external-Pi system logs arrive as the `node-journal`
|
||||
and `rpi-sofia-journal` jobs (labels `node` / `unit` / `level`).
|
||||
|
||||
> **Gotcha (regression found + fixed 2026-06-05):** `loki.source.file` does
|
||||
> **not** expand globs. The pod-log pipeline must place a **`local.file_match`**
|
||||
> component between `discovery.relabel` (which writes the
|
||||
> `/var/log/pods/*<uid>/<container>/*.log` glob into `__path__`) and
|
||||
> `loki.source.file`. Without it, `loki.source.file` `stat()`s the literal `*`
|
||||
> path and ships **zero** pod logs — for a stretch only the journals reached
|
||||
> Loki. A `stage.cri {}` stage parses the containerd CRI wrapper so Loki stores
|
||||
> clean messages + real timestamps. If application logs ever vanish from Loki
|
||||
> again, check Alloy logs for `loki.source.file ... stat failed`. On first
|
||||
> discovery Alloy reads existing files from the start → a brief burst of
|
||||
> `entry too far behind` 400s from Loki (old lines rejected, recent accepted);
|
||||
> it self-settles. Alloy read-positions are ephemeral, so a pod restart repeats
|
||||
> the bounded catch-up read — watch sdc IO (the 2026-05-26 storm surface; mem
|
||||
> limits are the safeguard).
|
||||
|
||||
Search/observe everything via the **"Cluster Logs"** Grafana dashboard
|
||||
(`dashboards/cluster-logs.json`, *Logs* folder): `$namespace`/`$app`/`$pod`
|
||||
dropdowns + free-text regex `$search`, log-volume-by-namespace, error/warn rate,
|
||||
top namespaces/pods by errors, a live filterable logs panel, and a journals row.
|
||||
Error/warn panels use case-insensitive regex line-filters because pod logs carry
|
||||
no `level` stream label.
|
||||
|
||||
**Surfaced in ha-sofia** for Emo: two RESTful sensors
|
||||
(`/config/rest_resources/loki_cluster_{errors,warnings}.yaml`) query Loki for
|
||||
cluster error/warn line counts (5-min window) → `sensor.cluster_log_errors_5m` /
|
||||
`sensor.cluster_log_warnings_5m`, for a compact trend card on the Барзини status
|
||||
view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP
|
||||
`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`)
|
||||
because `loki.viktorbarzin.lan` has **no Technitium record yet** (the
|
||||
`technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins
|
||||
`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in
|
||||
Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the
|
||||
Sofia-Pi promtail can resolve it by name instead of pinning the LB IP.
|
||||
|
||||
### External host: rpi-sofia (Sofia Raspberry Pi)
|
||||
|
||||
`rpi-sofia` is a physical Raspberry Pi 3 at the Sofia home site (not in the cluster — it's the Frigate camera DNAT gateway + solar-inverter path + HA MQTT sensor publisher). It is monitored **off-box** into the cluster, set up 2026-06-05 after a ~5h hang whose cause couldn't be reconstructed because the Pi's *local* journal had silently stopped writing back in April (an aging 2017 SD card intermittently flips the rootfs read-only). Everything below ships telemetry to the cluster so the **next** failure is captured centrally, surviving the SD card.
|
||||
|
||||
**Metrics** — Prometheus static scrape job `rpi-sofia` → `rpi-sofia.viktorbarzin.lan:9100` (apt `prometheus-node-exporter`). A `vcgencmd` textfile collector on the Pi (`/usr/local/bin/rpi-throttle-textfile.sh` + a 1-min systemd timer) adds Pi-specific gauges node_exporter lacks: `rpi_under_voltage_now`/`_occurred`, `rpi_throttled_now`/`_occurred`, `rpi_soc_temp_celsius`, `rpi_core_volts`.
|
||||
|
||||
**Logs** — `promtail` v3.5.1 (armv7) on the Pi ships the **full systemd journal** to the cluster Loki via a LAN-gated ingress (`https://loki.viktorbarzin.lan/loki/api/v1/push`; see `loki_ingress.tf`, `auth = "none"` + `allow_local_access_only`). Stream selector: `{job="rpi-sofia-journal", host="rpi-sofia"}`, relabeled with `unit` and `level` (error/warning/notice/info). Coverage (~440 entries/hr):
|
||||
- **Kernel / non-unit messages** (the `unit=""` / `(none)` stream) — `dmesg`-level lines, i.e. the `mmc`/`EXT4-fs` read-only-remount and under-voltage kernel warnings that precede a hang. This is the primary forensic signal.
|
||||
- **All systemd units** — `prometheus-node-exporter`, `promtail`, `dnsmasq`, `cron`, `ssh`, `systemd-logind`, `avahi-daemon`, `rng-tools`, `vncserver-x11`, login `session-*.scope`, etc.
|
||||
|
||||
Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia-journal"} | level=~"error|warning"`, `{job="rpi-sofia-journal", unit="ssh.service"}`.
|
||||
|
||||
**Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
|
||||
|
||||
**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
|
||||
|
||||
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
|
||||
|
||||
> The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access.
|
||||
|
||||
### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05)
|
||||
|
||||
The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.
|
||||
|
||||
- **`snmp-idrac` (FAST, primary, 1m / 30s):** snmp-exporter `dell_idrac` module against `:161` (v2c, community `Public0` = `auth=public_v2`). ~3.7s/scrape. Serves all dynamic + health + alerting metrics: `r730_idrac_temperatureProbeReading` (tenths-°C, ÷10), `coolingDeviceReading` (fan RPM, label `coolingDeviceLocationName`), `amperageProbeReading{amperageProbeLocationName="System Board Pwr Consumption"}` (watts), `powerSupplyCurrentInputVoltage`, `globalSystemStatus`, `systemPowerState`, `powerSupplyStatus`, `physicalDiskComponentStatus`, `systemStateMemoryDeviceStatusCombined`, etc.
|
||||
- **`redfish-idrac` (SLOW remnant, 10m / 45s):** the old mrlhansen exporter, trimmed, kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS info, per-drive storage table). Its `sensors` collector is now **vestigial** (HA moved off it — see next bullet) and could be dropped.
|
||||
- **HA Sofia R730 sensors → Prometheus SNMP (2026-06-05):** ha-sofia's 7 REST sensors (`/config/rest_resources/idrac_redfish_exporter.yaml` — CPU/exhaust/inlet temp, power, 2× PSU voltage, fan speed) were re-pointed from the slow on-demand Redfish exporter (`scan_interval: 120`, ~16-22s/fetch, intermittent `unavailable` blips) to a **fast Prometheus query of the SNMP values** (`scan_interval: 30`, instant): `https://prometheus-query.viktorbarzin.lan/api/v1/query?query={__name__=~"r730_idrac_…"}`, one query → JSON, each sensor filters by metric+label (temps ÷10). The `prometheus-query.viktorbarzin.lan` ingress is **local-only, `auth=none`, path-scoped to `/api/v1/query`** (added in `prometheus.tf`) so HA can query the API without the Authentik gate on `prometheus.viktorbarzin.me`. Its Technitium CNAME (→ `ingress.viktorbarzin.lan`) was added **manually via the API** — like the other `.lan` exporter hosts it is NOT auto-synced (the `technitium-ingress-dns-sync` CronJob only creates `.me` records; same gap as the Loki-sensor follow-up noted above). HA-side file is auto-version-controlled by the ha-sofia HomeAssistantVersionControl add-on; pre-migration copy saved at `/config/idrac_redfish_exporter.bak-pre-snmp`.
|
||||
|
||||
**Gotchas:**
|
||||
- **Enum values differ from the old Redfish metrics.** DellStatus: `3 = OK` (was Redfish `1`); `systemPowerState`: `4 = on` (was `2`). All iDRAC alert exprs were rewritten accordingly (`!= 3`, `!= 4`).
|
||||
- The alert `iDRACSNMPMetricsMissing` was historically a misnomer (checked a Redfish metric); it now correctly probes `absent(r730_idrac_globalSystemStatus)`. `iDRACRedfishMetricsMissing` now probes `absent(r730_idrac_powerSupplyCurrentInputVoltage)`.
|
||||
- **SSD life % + SEL are genuine SNMP gaps but were already inert** (Redfish reported `0`/empty), so the SSD-wear alerts (kept on `r730_idrac_idrac_storage_drive_life_left_percent`) and the SEL dashboard panel are unchanged.
|
||||
- Why SNMP: the Redfish exporter (`metrics: all: true`) walked every subtree on each scrape — ~18.5s avg / 28s peak against the slow BMC — which forced the infrequent interval. SNMP is a single fast walk.
|
||||
|
||||
### Alert Cascade Inhibition
|
||||
|
||||
Alertmanager implements intelligent alert suppression to prevent alert storms during cascading failures:
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
NODE_DOWN[Node Down Alert] -->|Inhibits| POD_ALERTS[Pod Alerts on That Node]
|
||||
COMPLETED[Completed CronJob Pod] -->|Excluded from| POD_READY[Pod Not Ready Alerts]
|
||||
```
|
||||
|
||||
When a node goes down, all pod-level alerts for pods scheduled on that node are suppressed, reducing noise and focusing attention on the root cause.
|
||||
|
||||
### GPU Monitoring
|
||||
|
||||
NVIDIA GPU metrics are collected via dcgm-exporter with configurable resource limits (`dcgmExporter.resources`). Metrics include GPU utilization, memory usage, temperature, and power consumption.
|
||||
|
||||
### Database Version Pinning
|
||||
|
||||
MySQL, PostgreSQL, and Redis images have Diun monitoring disabled to prevent automatic version updates that could cause compatibility issues. Version upgrades are manual and coordinated.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Config Files
|
||||
|
||||
- **Monitoring Stack**: `stacks/platform/modules/monitoring/`
|
||||
- Prometheus scrape configs and recording rules
|
||||
- Grafana dashboard definitions
|
||||
- Alertmanager routing and inhibition rules
|
||||
- Uptime Kuma configuration
|
||||
|
||||
### Prometheus Scrape Configs
|
||||
|
||||
Every service must expose metrics and be registered in Prometheus via ServiceMonitor or static scrape config. Standard pattern:
|
||||
|
||||
```yaml
|
||||
apiVersion: monitoring.coreos.com/v1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: my-service
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
app: my-service
|
||||
endpoints:
|
||||
- port: metrics
|
||||
```
|
||||
|
||||
### Grafana Dashboards
|
||||
|
||||
14+ pre-configured dashboards covering:
|
||||
- Kubernetes API Server
|
||||
- CoreDNS
|
||||
- GPU metrics
|
||||
- UPS status
|
||||
- Node metrics
|
||||
- Pod resource usage
|
||||
- Application-specific metrics
|
||||
|
||||
### Alert Definitions
|
||||
|
||||
#### Infrastructure Alerts
|
||||
- **OOMKill**: Container killed due to out-of-memory
|
||||
- **PodReplicaMismatch**: Deployment/StatefulSet replica count doesn't match desired
|
||||
- **ClusterMemoryRequestsHigh**: Cluster memory requests >85%
|
||||
- **ContainerNearOOM**: Container using >85% of memory limit
|
||||
- **PodUnschedulable**: Pod cannot be scheduled due to resource constraints
|
||||
- **CPUTemp**: CPU temperature threshold exceeded
|
||||
- **SSDWrites**: Excessive SSD write volume
|
||||
- **NFSResponsiveness**: NFS mount latency issues
|
||||
- **UPSBattery**: UPS battery charge low
|
||||
|
||||
#### Application Alerts
|
||||
- **4xx/5xx Error Rates**: HTTP error rate threshold exceeded
|
||||
|
||||
#### Email Monitoring Alerts
|
||||
- **EmailRoundtripFailing**: E2E email probe returning failure for >30m
|
||||
- **EmailRoundtripStale**: No successful email round-trip in >80m (60m threshold + for:20m)
|
||||
- **EmailRoundtripNeverRun**: Email probe has never reported (40m)
|
||||
|
||||
#### Registry Integrity Alerts
|
||||
- **RegistryManifestIntegrityFailure**: Private registry serving 404 for manifests it advertises (orphan OCI-index children) — fires after 30m of `registry_manifest_integrity_failures > 0`. Remediation: rebuild affected image per `docs/runbooks/registry-rebuild-image.md`.
|
||||
- **RegistryIntegrityProbeStale**: Probe hasn't reported in >1h (CronJob broken)
|
||||
- **RegistryCatalogInaccessible**: Probe cannot fetch `/v2/_catalog` (auth failure or registry down)
|
||||
|
||||
#### Immich Smart Search Alerts
|
||||
- **ImmichSmartSearchSlow**: Representative context-search ANN query >1s for 15m. Root cause is almost always the `clip_index` (vchord, ~665MB) decaying out of PG `shared_buffers` — a cold list read is ~1.8s vs ~4ms warm. Remediation: confirm the `clip-index-prewarm` CronJob (immich ns, `*/5`) is succeeding; manual fix `kubectl exec -n immich -c immich-postgresql <pg-pod> -- psql -U postgres -d immich -c "SELECT pg_prewarm('clip_index')"`.
|
||||
- **ImmichClipIndexColdCache**: `clip_index` <50% resident in shared_buffers for 15m (leading indicator; same remediation).
|
||||
- **ImmichSearchProbeStale**: `immich-search-probe` hasn't reported in >30m (CronJob broken). Inhibits the two above so frozen Pushgateway gauges don't false-fire.
|
||||
|
||||
The Immich smart-search monitoring uses two CronJobs in the `immich` namespace (both `*/5`): `clip-index-prewarm` re-runs `pg_prewarm('clip_index')` to keep the vector index hot during runtime (the `postStart` prewarm only fires at pod start; `pg_prewarm.autoprewarm` only reloads at startup, so the index otherwise decays under job buffer-pressure), and `immich-search-probe` (postgres init-container measures a random-vector ANN latency + `pg_buffercache` residency → curl sidecar pushes `immich_smart_search_db_seconds` / `immich_clip_index_cached_pct` / `immich_smart_search_probe_success` / `immich_smart_search_probe_last_run_timestamp` to the Pushgateway). Also surfaced by cluster-health check #46 (`check_immich_search`). Note this is the **Postgres** half of smart-search warmth; the **ML model** half is kept warm by the separate `clip-keepalive` CronJob.
|
||||
|
||||
The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10 min) in the `mailserver` namespace that:
|
||||
1. Sends a test email via Mailgun HTTP API to `smoke-test@viktorbarzin.me`
|
||||
2. Email lands in the `spam@` catch-all mailbox via MX delivery
|
||||
3. Verifies delivery via IMAP (searches by UUID marker in subject)
|
||||
4. Deletes the test email immediately
|
||||
5. Pushes metrics (`email_roundtrip_success`, `email_roundtrip_duration_seconds`, `email_roundtrip_last_success_timestamp`) to Prometheus Pushgateway
|
||||
6. Pushes status to Uptime Kuma E2E Push monitor
|
||||
|
||||
Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.
|
||||
|
||||
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
|
||||
|
||||
| # | Source | Event | Severity |
|
||||
|---|---|---|---|
|
||||
| K2 | kube-audit | SA token used from outside cluster | critical |
|
||||
| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
|
||||
| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
|
||||
| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
|
||||
| K6 | kube-audit | Audit policy itself modified | critical |
|
||||
| K7 | kube-audit | New `*,*` ClusterRole created | warning |
|
||||
| K8 | kube-audit | Anonymous binding granted | critical |
|
||||
| K9 | kube-audit | `me@viktorbarzin.me` request from non-allowlist sourceIP | critical |
|
||||
| V1 | vault-audit | Root token created | critical |
|
||||
| V2 | vault-audit | Audit device disabled/modified | critical |
|
||||
| V3 | vault-audit | Seal status changed | critical |
|
||||
| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
|
||||
| V5 | vault-audit | Auth failure spike >10/min | warning |
|
||||
| V6 | vault-audit | Token with policies different from parent created | critical |
|
||||
| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires `x_forwarded_for_authorized_addrs`) | critical |
|
||||
| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |
|
||||
|
||||
K1 (cluster-admin grant) intentionally skipped — see security.md.
|
||||
|
||||
Allowlist source-IP CIDRs (used by K2, K9, V7, S1): `10.0.20.0/22`, `192.168.1.0/24`, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
|
||||
|
||||
IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-policy tuning. Retention: 90d for security streams.
|
||||
|
||||
##### Authentik walling-off guard — `AuthentikWallingOffPublicPath`
|
||||
|
||||
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
|
||||
|
||||
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
|
||||
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
|
||||
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
|
||||
|
||||
#### Backup Alerts
|
||||
- **PostgreSQLBackupStale**: >36h since last backup
|
||||
- **MySQLBackupStale**: >36h since last backup
|
||||
- **EtcdBackupStale**: >8d since last backup
|
||||
- **VaultBackupStale**: >8d since last backup
|
||||
- **VaultwardenBackupStale**: >8d since last backup
|
||||
- **RedisBackupStale**: >8d since last backup
|
||||
- **PrometheusBackupStale**: >32d since last backup
|
||||
- **VaultwardenIntegrityFail**: Backup integrity check failed
|
||||
|
||||
### Vault Paths
|
||||
|
||||
No direct Vault integration required for the monitoring stack (platform stack cannot depend on Vault due to circular dependency).
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Prometheus over alternatives (InfluxDB, Graphite)?
|
||||
- Native Kubernetes integration via ServiceMonitor CRDs
|
||||
- Pull-based model reduces application complexity (no push agents)
|
||||
- Powerful query language (PromQL) for alerting and visualization
|
||||
- Industry standard for cloud-native monitoring
|
||||
|
||||
### Why Grafana over Prometheus UI?
|
||||
- Superior visualization capabilities
|
||||
- OIDC authentication via Authentik for secure access
|
||||
- Multi-data-source support (Prometheus + Loki)
|
||||
- Rich dashboard ecosystem
|
||||
|
||||
### Why Loki for logs?
|
||||
- Designed for Kubernetes log aggregation
|
||||
- Cost-effective (indexes metadata, not full log content)
|
||||
- Tight Grafana integration
|
||||
- LogQL query language similar to PromQL
|
||||
|
||||
### Why Uptime Kuma?
|
||||
- Simple HTTP/TCP/Ping monitoring
|
||||
- Public status page for service availability
|
||||
- Lightweight compared to full APM solutions
|
||||
- Complements Prometheus for black-box monitoring
|
||||
|
||||
### Why alert inhibition?
|
||||
- Prevents alert fatigue during cascading failures
|
||||
- Root cause focus (fix the node, not 50 pods)
|
||||
- Reduces on-call noise
|
||||
|
||||
### Why exclude completed CronJob pods?
|
||||
- CronJobs naturally transition to Completed state
|
||||
- "Pod not ready" is expected and not actionable
|
||||
- Prevents false positive alerts
|
||||
|
||||
### Why disable Diun for databases?
|
||||
- Version upgrades require migration planning
|
||||
- Breaking schema changes need coordination
|
||||
- Manual upgrade testing prevents production issues
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Alert is firing but I don't see the issue
|
||||
|
||||
Check inhibition rules in Alertmanager. The alert may be suppressed due to a higher-level failure (e.g., node down suppressing pod alerts).
|
||||
|
||||
### Grafana dashboards show no data
|
||||
|
||||
1. Check Prometheus targets: `kubectl port-forward -n monitoring svc/prometheus 9090:9090` → `http://localhost:9090/targets`
|
||||
2. Verify ServiceMonitor is created: `kubectl get servicemonitor -A`
|
||||
3. Check Prometheus logs for scrape errors: `kubectl logs -n monitoring deployment/prometheus`
|
||||
|
||||
### Loki logs not appearing
|
||||
|
||||
1. Verify pod logs are going to stdout/stderr (not files)
|
||||
2. Check Loki is scraping pod logs: `kubectl logs -n monitoring deployment/loki`
|
||||
3. Ensure Grafana data source is configured correctly
|
||||
|
||||
### Backup alert firing but backup exists
|
||||
|
||||
1. Check backup timestamp in Prometheus: `backup_last_success_timestamp_seconds{job="my-backup"}`
|
||||
2. Verify backup job completed successfully: `kubectl logs -n backups cronjob/my-backup`
|
||||
3. Ensure backup job updates the Prometheus metric via pushgateway or ServiceMonitor
|
||||
|
||||
### GPU metrics not showing
|
||||
|
||||
1. Verify dcgm-exporter is running: `kubectl get pods -n monitoring -l app=dcgm-exporter`
|
||||
2. Check GPU node has NVIDIA drivers installed
|
||||
3. Verify dcgm-exporter has access to GPU: `kubectl logs -n monitoring deployment/dcgm-exporter`
|
||||
|
||||
### Uptime Kuma monitor shows down but service is healthy
|
||||
|
||||
1. Check network policies aren't blocking Uptime Kuma's pod
|
||||
2. Verify service endpoint is reachable from Uptime Kuma namespace
|
||||
3. Check Uptime Kuma logs: `kubectl logs -n monitoring deployment/uptime-kuma`
|
||||
|
||||
## Related
|
||||
|
||||
- [Secrets Management](./secrets.md) - OIDC authentication for Grafana via Authentik
|
||||
- [Backup & DR](./backup-dr.md) - Backup monitoring alerts
|
||||
- [Platform Stack](../../stacks/platform/README.md) - Monitoring stack deployment
|
||||
- [Vault Architecture](./vault.md) - No direct dependency but related to cluster observability
|
||||
557
docs/architecture/multi-tenancy.md
Normal file
557
docs/architecture/multi-tenancy.md
Normal file
|
|
@ -0,0 +1,557 @@
|
|||
# Multi-Tenancy
|
||||
|
||||
## Overview
|
||||
|
||||
The cluster implements namespace-based multi-tenancy where each user receives their own Kubernetes namespace(s), RBAC roles, resource quotas, and CI/CD access. Onboarding is Vault-driven: add user metadata to `secret/platform → k8s_users`, apply Terraform stacks, and all resources (namespace, policies, RBAC, DNS, TLS) are auto-generated. Users access the cluster via OIDC authentication through Authentik and can self-service via k8s-portal.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
A[Admin: Add to Authentik Groups] --> B[Admin: Add to Vault k8s_users]
|
||||
B --> C[Apply vault Stack]
|
||||
C --> D[Apply platform Stack]
|
||||
D --> E[Apply woodpecker Stack]
|
||||
|
||||
C --> C1[Create Namespace]
|
||||
C --> C2[Create Vault Policy<br/>namespace-owner-user]
|
||||
C --> C3[Create Vault Identity<br/>Entity + OIDC Alias]
|
||||
C --> C4[Create K8s Deployer Role<br/>Vault K8s Auth]
|
||||
|
||||
D --> D1[Create RBAC RoleBinding<br/>Namespace Admin]
|
||||
D --> D2[Create RBAC ClusterRoleBinding<br/>Cluster Read-Only]
|
||||
D --> D3[Create ResourceQuota]
|
||||
D --> D4[Create TLS Secret]
|
||||
D --> D5[Create Cloudflare DNS]
|
||||
|
||||
E --> E1[Grant Woodpecker Admin]
|
||||
|
||||
F[User: Run Setup Script] --> F1[Install kubectl, kubelogin,<br/>Vault CLI, Terraform]
|
||||
F1 --> F2[OIDC Login via Authentik]
|
||||
F2 --> G[kubectl Access]
|
||||
|
||||
style A fill:#e74c3c
|
||||
style B fill:#e74c3c
|
||||
style C fill:#2088ff
|
||||
style D fill:#2088ff
|
||||
style E fill:#2088ff
|
||||
style F fill:#27ae60
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| Authentik | Latest | `authentik` namespace | OIDC provider for K8s + Vault |
|
||||
| Vault | Latest | `vault` namespace | Identity source, policy engine |
|
||||
| k8s-portal | SvelteKit | `k8s-portal.viktorbarzin.me` | Self-service onboarding UI |
|
||||
| Terraform (vault stack) | - | `stacks/vault/` | Namespace, Vault resources |
|
||||
| Terraform (platform stack) | - | `stacks/platform/` | RBAC, quotas, DNS, TLS |
|
||||
| Terraform (woodpecker stack) | - | `stacks/woodpecker/` | CI/CD admin access |
|
||||
| Headscale | Latest | `headscale` namespace | VPN mesh network (user access) |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Namespace-Owner Model
|
||||
|
||||
Each user receives:
|
||||
1. **Kubernetes Namespace(s)**: Isolated workload environment
|
||||
2. **Vault Policy**: Read/write access to `secret/data/<namespace>/*`
|
||||
3. **RBAC Role**: Namespace admin (full control within namespace)
|
||||
4. **RBAC ClusterRole**: Cluster read-only (view cluster resources)
|
||||
5. **ResourceQuota**: CPU, memory, storage limits
|
||||
6. **TLS Secret**: Wildcard cert for `*.<namespace>.viktorbarzin.me`
|
||||
7. **DNS Records**: Cloudflare A/CNAME for user domains
|
||||
8. **Woodpecker Admin**: Access to create repos and pipelines
|
||||
|
||||
### Onboarding Flow (3 Steps, No Code Changes)
|
||||
|
||||
#### Step 1: Authentik
|
||||
|
||||
**Action**: Admin adds user to groups
|
||||
- `kubernetes-namespace-owners`
|
||||
- `Headscale Users`
|
||||
|
||||
**Result**: User can authenticate to Vault and K8s via OIDC
|
||||
|
||||
#### Step 2: Vault KV
|
||||
|
||||
**Action**: Admin adds JSON entry to `secret/platform → k8s_users`
|
||||
|
||||
**Example**:
|
||||
```json
|
||||
{
|
||||
"alice": {
|
||||
"role": "namespace-owner",
|
||||
"namespaces": ["alice-prod", "alice-dev"],
|
||||
"domains": ["alice.viktorbarzin.me", "app.alice.viktorbarzin.me"],
|
||||
"quota": {
|
||||
"cpu": "4",
|
||||
"memory": "8Gi",
|
||||
"storage": "20Gi"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Fields**:
|
||||
- `role`: Always `namespace-owner` for standard users
|
||||
- `namespaces`: List of K8s namespaces to create
|
||||
- `domains`: Cloudflare DNS records to create
|
||||
- `quota`: Per-namespace resource limits
|
||||
|
||||
#### Step 3: Apply Terraform Stacks
|
||||
|
||||
**Order matters** (dependencies):
|
||||
|
||||
1. **vault stack**:
|
||||
```bash
|
||||
cd stacks/vault
|
||||
terragrunt apply
|
||||
```
|
||||
- Creates namespaces
|
||||
- Creates Vault policy `namespace-owner-alice`
|
||||
- Creates Vault identity entity + OIDC alias
|
||||
- Creates K8s deployer role for Woodpecker CI
|
||||
|
||||
2. **platform stack**:
|
||||
```bash
|
||||
cd stacks/platform
|
||||
terragrunt apply
|
||||
```
|
||||
- Creates RBAC RoleBinding (namespace admin)
|
||||
- Creates RBAC ClusterRoleBinding (cluster read-only)
|
||||
- Creates ResourceQuota
|
||||
- Creates TLS Secret (wildcard cert from Let's Encrypt)
|
||||
- Creates Cloudflare DNS A/CNAME records
|
||||
|
||||
3. **woodpecker stack**:
|
||||
```bash
|
||||
cd stacks/woodpecker
|
||||
terragrunt apply
|
||||
```
|
||||
- Grants Woodpecker admin access for user's Forgejo repos
|
||||
|
||||
### Auto-Generated Resources Per User
|
||||
|
||||
| Resource | Name Pattern | Purpose |
|
||||
|----------|--------------|---------|
|
||||
| Namespace | `<username>-prod`, `<username>-dev` | Workload isolation |
|
||||
| Vault Policy | `namespace-owner-<username>` | Secret access control |
|
||||
| Vault Identity Entity | `<username>` | OIDC identity mapping |
|
||||
| Vault OIDC Alias | Authentik sub claim | Link OIDC to entity |
|
||||
| Vault K8s Role | `<namespace>-deployer` | Woodpecker CI access |
|
||||
| K8s Role | Auto-generated | Namespace admin permissions |
|
||||
| RoleBinding | `<username>-admin` | Bind user to namespace admin |
|
||||
| ClusterRoleBinding | `<username>-read-only` | Cluster-wide read access |
|
||||
| ResourceQuota | `<namespace>-quota` | CPU/memory/storage limits |
|
||||
| Secret | `tls-<namespace>` | Wildcard TLS cert |
|
||||
| Cloudflare DNS | A/CNAME records | Domain routing |
|
||||
|
||||
### User Setup (Self-Service)
|
||||
|
||||
**k8s-portal**: `k8s-portal.viktorbarzin.me`
|
||||
1. User logs in with Authentik
|
||||
2. Downloads setup script
|
||||
3. Runs script:
|
||||
```bash
|
||||
curl https://k8s-portal.viktorbarzin.me/setup.sh | bash
|
||||
```
|
||||
4. Script installs:
|
||||
- `kubectl`
|
||||
- `kubelogin` (OIDC plugin)
|
||||
- `vault` CLI
|
||||
- `terraform`
|
||||
- `terragrunt`
|
||||
5. User runs OIDC login:
|
||||
```bash
|
||||
kubectl oidc-login setup \
|
||||
--oidc-issuer-url=https://auth.viktorbarzin.me/application/o/kubernetes/ \
|
||||
--oidc-client-id=kubernetes
|
||||
```
|
||||
6. User can now run `kubectl` commands
|
||||
|
||||
### Web Dashboard (auto-login, no token paste)
|
||||
|
||||
Namespace-owners just log into `https://k8s.viktorbarzin.me` with their Authentik
|
||||
account and land straight in the dashboard scoped to their namespace — **no token
|
||||
to paste**. A token-injector (`stacks/k8s-dashboard/dashboard_injector.tf`) maps
|
||||
their Authentik identity (`X-authentik-username`) to their `dashboard-<user>` SA
|
||||
token (`admin` on their namespace + read-only on the namespace list & nodes
|
||||
only — they can't read other tenants' resources) and injects it as
|
||||
`Authorization: Bearer`. Forward-auth admits the `kubernetes-*` groups for this
|
||||
host (`stacks/authentik/admin-services-restriction.tf`).
|
||||
|
||||
> **Why not seamless OIDC SSO:** the intended oauth2-proxy OIDC path is built but
|
||||
> blocked — the apiserver rejects all Authentik OIDC tokens. The injector uses SA
|
||||
> tokens (which the apiserver accepts) keyed off the forward-auth identity. See
|
||||
> `docs/architecture/authentication.md` and
|
||||
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12.
|
||||
|
||||
### RBAC Groups
|
||||
|
||||
| Group | ClusterRole | Scope | Members |
|
||||
|-------|-------------|-------|---------|
|
||||
| `kubernetes-admins` | `cluster-admin` | Full cluster access | Viktor |
|
||||
| `kubernetes-power-users` | Custom | Elevated permissions | Senior users |
|
||||
| `kubernetes-namespace-owners` | `namespace-admin` + `view` | Namespace admin + cluster read | All users |
|
||||
|
||||
### User CI/CD (Woodpecker)
|
||||
|
||||
**Flow**:
|
||||
1. User creates repo in Forgejo
|
||||
2. Forgejo username **must match** Vault `k8s_users` key (e.g., `alice`)
|
||||
3. Woodpecker authenticates to Vault using K8s SA JWT
|
||||
4. Vault issues namespace-scoped deployer token
|
||||
5. Pipeline runs `kubectl` commands within user's namespace(s)
|
||||
|
||||
**Vault K8s Role** (auto-created per namespace):
|
||||
```hcl
|
||||
vault write auth/kubernetes/role/alice-prod-deployer \
|
||||
bound_service_account_names=woodpecker-deployer \
|
||||
bound_service_account_namespaces=woodpecker \
|
||||
policies=namespace-owner-alice \
|
||||
ttl=1h
|
||||
```
|
||||
|
||||
**Pipeline Example**:
|
||||
```yaml
|
||||
steps:
|
||||
deploy:
|
||||
image: bitnami/kubectl:latest
|
||||
commands:
|
||||
- kubectl apply -f k8s/ -n alice-prod
|
||||
secrets: [k8s_token]
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### Vault k8s_users Entry
|
||||
|
||||
**Path**: `secret/platform → k8s_users`
|
||||
|
||||
**Full Example**:
|
||||
```json
|
||||
{
|
||||
"alice": {
|
||||
"role": "namespace-owner",
|
||||
"namespaces": ["alice-prod", "alice-dev"],
|
||||
"domains": [
|
||||
"alice.viktorbarzin.me",
|
||||
"app.alice.viktorbarzin.me",
|
||||
"api.alice.viktorbarzin.me"
|
||||
],
|
||||
"quota": {
|
||||
"cpu": "4",
|
||||
"memory": "8Gi",
|
||||
"storage": "20Gi",
|
||||
"pods": "20"
|
||||
}
|
||||
},
|
||||
"bob": {
|
||||
"role": "namespace-owner",
|
||||
"namespaces": ["bob-staging"],
|
||||
"domains": ["bob.viktorbarzin.me"],
|
||||
"quota": {
|
||||
"cpu": "2",
|
||||
"memory": "4Gi",
|
||||
"storage": "10Gi"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Vault Policy Template
|
||||
|
||||
**Auto-generated per user**:
|
||||
|
||||
```hcl
|
||||
# Policy: namespace-owner-alice
|
||||
path "secret/data/alice-prod/*" {
|
||||
capabilities = ["create", "read", "update", "delete", "list"]
|
||||
}
|
||||
|
||||
path "secret/data/alice-dev/*" {
|
||||
capabilities = ["create", "read", "update", "delete", "list"]
|
||||
}
|
||||
|
||||
path "secret/metadata/alice-prod/*" {
|
||||
capabilities = ["list"]
|
||||
}
|
||||
|
||||
path "secret/metadata/alice-dev/*" {
|
||||
capabilities = ["list"]
|
||||
}
|
||||
```
|
||||
|
||||
### ResourceQuota Example
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ResourceQuota
|
||||
metadata:
|
||||
name: alice-prod-quota
|
||||
namespace: alice-prod
|
||||
spec:
|
||||
hard:
|
||||
requests.cpu: "4"
|
||||
requests.memory: "8Gi"
|
||||
persistentvolumeclaims: "10"
|
||||
requests.storage: "20Gi"
|
||||
pods: "20"
|
||||
```
|
||||
|
||||
### Factory Pattern for Multi-Instance Services
|
||||
|
||||
**Structure**:
|
||||
```
|
||||
stacks/
|
||||
actualbudget/
|
||||
main.tf # Shared configuration
|
||||
factory/
|
||||
main.tf # Per-user module
|
||||
```
|
||||
|
||||
**main.tf** (service definition):
|
||||
```hcl
|
||||
# Shared NFS export, Cloudflare routes, etc.
|
||||
```
|
||||
|
||||
**factory/main.tf** (per-user instance):
|
||||
```hcl
|
||||
module "alice" {
|
||||
source = "../"
|
||||
user = "alice"
|
||||
domain = "budget.alice.viktorbarzin.me"
|
||||
}
|
||||
|
||||
module "bob" {
|
||||
source = "../"
|
||||
user = "bob"
|
||||
domain = "budget.bob.viktorbarzin.me"
|
||||
}
|
||||
```
|
||||
|
||||
**To add user**:
|
||||
1. Export NFS share: `/mnt/data/<service>/<user>`
|
||||
2. Add Cloudflare route: `<user>.<service>.viktorbarzin.me`
|
||||
3. Add module block in `factory/main.tf`
|
||||
|
||||
**Examples**:
|
||||
- `actualbudget`: Personal budgeting app
|
||||
- `freedify`: Music streaming service
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Namespace-Per-User?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Shared namespace**: No isolation, quota enforcement difficult
|
||||
2. **Cluster-per-user**: Too expensive, management overhead
|
||||
3. **Namespace-per-user (chosen)**: Balance isolation, quotas, RBAC
|
||||
|
||||
**Benefits**:
|
||||
- Strong isolation (network policies, RBAC)
|
||||
- Easy quota enforcement (ResourceQuota)
|
||||
- Simple mental model (1 user = N namespaces)
|
||||
- Scales to hundreds of users
|
||||
|
||||
### Why Vault-Driven Onboarding?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Manual YAML**: Error-prone, no audit trail
|
||||
2. **CRD-based operator**: Complex, requires custom controller
|
||||
3. **Vault + Terraform (chosen)**: Single source of truth, auditable
|
||||
|
||||
**Benefits**:
|
||||
- Vault as identity source (integrates with OIDC)
|
||||
- Terraform for declarative infrastructure
|
||||
- Git-tracked changes (audit trail)
|
||||
- Secrets rotation built-in
|
||||
|
||||
### Why Factory Pattern for Multi-Instance Apps?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Helm chart per user**: Duplication, drift risk
|
||||
2. **Single shared instance**: No isolation, security risk
|
||||
3. **Factory module (chosen)**: DRY, scalable
|
||||
|
||||
**Benefits**:
|
||||
- No code duplication
|
||||
- Easy to add users (one module block)
|
||||
- Centralized updates (change `main.tf`, all instances update)
|
||||
|
||||
### Why OIDC Instead of Static Tokens?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Static ServiceAccount tokens**: Never expire, security risk
|
||||
2. **X.509 client certs**: Complex rotation
|
||||
3. **OIDC (chosen)**: Centralized auth, automatic rotation
|
||||
|
||||
**Benefits**:
|
||||
- Tokens auto-expire (1h for deployer, 24h for user)
|
||||
- Centralized user management (Authentik)
|
||||
- Integrates with Vault identity engine
|
||||
- Industry standard (OpenID Connect)
|
||||
|
||||
### Why ResourceQuota Over LimitRange?
|
||||
|
||||
- **ResourceQuota**: Total namespace consumption (e.g., max 8Gi memory)
|
||||
- **LimitRange**: Per-pod limits (e.g., max 2Gi per pod)
|
||||
|
||||
**Choice**: ResourceQuota only
|
||||
- Users manage their own pod limits
|
||||
- Quota prevents runaway consumption
|
||||
- Simpler mental model
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### User Can't Log In: "Unauthorized"
|
||||
|
||||
**Cause**: User not in Authentik `kubernetes-namespace-owners` group
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check user groups in Authentik UI
|
||||
# Add to kubernetes-namespace-owners group
|
||||
```
|
||||
|
||||
### User Has No Namespaces
|
||||
|
||||
**Cause**: `vault` stack not applied after adding to `k8s_users`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
cd stacks/vault
|
||||
terragrunt apply
|
||||
```
|
||||
|
||||
### User Can't Access Secrets in Vault
|
||||
|
||||
**Cause**: Vault policy not attached to identity entity
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check entity
|
||||
vault read identity/entity/name/alice
|
||||
|
||||
# Check policy exists
|
||||
vault policy read namespace-owner-alice
|
||||
|
||||
# Manually attach policy to entity
|
||||
vault write identity/entity/name/alice policies=namespace-owner-alice
|
||||
```
|
||||
|
||||
### Woodpecker Pipeline: "Forbidden"
|
||||
|
||||
**Cause**: Forgejo username doesn't match Vault `k8s_users` key
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Rename Forgejo user to match Vault key
|
||||
# OR update k8s_users key to match Forgejo username, then terragrunt apply
|
||||
```
|
||||
|
||||
### ResourceQuota: "Forbidden: exceeded quota"
|
||||
|
||||
**Cause**: User exceeded namespace quota
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check quota usage
|
||||
kubectl describe quota -n alice-prod
|
||||
|
||||
# User must delete resources or request quota increase
|
||||
# To increase: update k8s_users in Vault, apply platform stack
|
||||
```
|
||||
|
||||
### DNS Not Resolving
|
||||
|
||||
**Cause**: Cloudflare DNS not created by platform stack
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check domains in k8s_users
|
||||
vault kv get secret/platform | jq -r '.data.data.k8s_users.alice.domains'
|
||||
|
||||
# Apply platform stack
|
||||
cd stacks/platform
|
||||
terragrunt apply
|
||||
|
||||
# Verify in Cloudflare dashboard
|
||||
```
|
||||
|
||||
### TLS Secret Missing
|
||||
|
||||
**Cause**: cert-manager failed to issue certificate
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check cert-manager logs
|
||||
kubectl logs -n cert-manager deploy/cert-manager
|
||||
|
||||
# Check Certificate resource
|
||||
kubectl get certificate -n alice-prod
|
||||
|
||||
# Check CertificateRequest
|
||||
kubectl describe certificaterequest -n alice-prod
|
||||
|
||||
# If Let's Encrypt rate limited, wait 1 week or use staging
|
||||
```
|
||||
|
||||
### User Can't See Cluster Resources
|
||||
|
||||
**Cause**: ClusterRoleBinding not created
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check ClusterRoleBinding exists
|
||||
kubectl get clusterrolebinding | grep alice
|
||||
|
||||
# Apply platform stack
|
||||
cd stacks/platform
|
||||
terragrunt apply
|
||||
```
|
||||
|
||||
### Factory Pattern: New User Not Created
|
||||
|
||||
**Cause**: Module block not added to `factory/main.tf`
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Edit factory/main.tf
|
||||
cat >> stacks/actualbudget/factory/main.tf <<EOF
|
||||
module "charlie" {
|
||||
source = "../"
|
||||
user = "charlie"
|
||||
domain = "budget.charlie.viktorbarzin.me"
|
||||
}
|
||||
EOF
|
||||
|
||||
# Apply
|
||||
cd stacks/actualbudget/factory
|
||||
terragrunt apply
|
||||
```
|
||||
|
||||
## DevVM Workstation (Claude Code multi-user)
|
||||
|
||||
Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.10.10`, VMID 102) hosts per-user **Claude Code Workstations** behind `t3.viktorbarzin.me`. It reuses the same identity backbone — the Vault `k8s_users` map and Authentik — but adds a devvm-side layer. Authoritative design + phased plan: `docs/plans/2026-06-07-multi-user-workstation-{design,plan}.md` (PRD: ViktorBarzin/infra#9).
|
||||
|
||||
**Single source of truth:** `infra/scripts/workstation/roster.yaml` (`os_user → authentik_user / k8s_user / tier / namespaces`). `roster_engine.py` (pytest-covered pure core) derives desired state; `t3-provision-users` (hourly timer) applies it — **additive-only** for existing users (never strips a group, replaces a home, or re-locks an account). `/etc/ttyd-user-map` + `dispatch.json` are **generated** from the roster (do not hand-edit).
|
||||
|
||||
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
|
||||
|
||||
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked.
|
||||
|
||||
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Changes are ungated (push ≠ apply); the real boundary is apply-time (`scripts/tg apply` needs an admin Vault token + cluster RBAC).
|
||||
|
||||
**Status (2026-06-08):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, **per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), and the Authentik `T3 Users` edge gate (applied + verified)**. **Remaining (held / future):** the emo cutover to his own locked clone (Phase 5), the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
|
||||
|
||||
## Related
|
||||
|
||||
- [CI/CD Pipeline](./ci-cd.md) — Per-user Woodpecker pipelines
|
||||
- [Databases](./databases.md) — Vault DB engine for per-user databases
|
||||
- Runbook: `../runbooks/onboard-user.md` — Step-by-step onboarding guide
|
||||
- Runbook: `../runbooks/offboard-user.md` — Remove user and resources
|
||||
- k8s-portal documentation: Self-service UI
|
||||
- Vault documentation: Identity secrets engine
|
||||
544
docs/architecture/networking.md
Normal file
544
docs/architecture/networking.md
Normal file
|
|
@ -0,0 +1,544 @@
|
|||
# Networking Architecture
|
||||
|
||||
Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS TSIG-signed)
|
||||
|
||||
## Overview
|
||||
|
||||
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
Internet[Internet]
|
||||
CF[Cloudflare DNS<br/>~50 domains]
|
||||
CFD[Cloudflared Tunnel<br/>3 replicas]
|
||||
Traefik[Traefik Ingress<br/>3 replicas + PDB]
|
||||
|
||||
subgraph "Middleware Chain"
|
||||
CS[CrowdSec Bouncer<br/>fail-open]
|
||||
Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
|
||||
RL[Rate Limiter<br/>429 response]
|
||||
Retry[Retry<br/>2 attempts, 100ms]
|
||||
end
|
||||
|
||||
subgraph "Proxmox Host (eno1)"
|
||||
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
|
||||
vmbr1[vmbr1 Internal<br/>VLAN-aware]
|
||||
|
||||
subgraph "VLAN 10 - Management<br/>10.0.10.0/24"
|
||||
Proxmox[Proxmox Host<br/>10.0.10.1]
|
||||
DevVM[DevVM<br/>10.0.10.10]
|
||||
Registry[Registry VM<br/>10.0.20.10]
|
||||
end
|
||||
|
||||
subgraph "VLAN 20 - Kubernetes<br/>10.0.20.0/24"
|
||||
pfSense[pfSense<br/>10.0.20.1<br/>Gateway/NAT/DHCP]
|
||||
Tech[Technitium DNS<br/>10.0.20.201 LB / 10.96.0.53 ClusterIP<br/>viktorbarzin.lan]
|
||||
MLB[MetalLB Pool<br/>10.0.20.200-10.0.20.220]
|
||||
|
||||
subgraph "K8s Nodes"
|
||||
Master[k8s-master]
|
||||
Node1[k8s-node1]
|
||||
Node2[k8s-node2]
|
||||
Node3[k8s-node3]
|
||||
Node4[k8s-node4]
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
Service[Service]
|
||||
Pod[Pod]
|
||||
|
||||
Internet -->|DNS query| CF
|
||||
CF -->|CNAME to tunnel| CFD
|
||||
CFD --> Traefik
|
||||
Traefik --> CS
|
||||
CS --> Auth
|
||||
Auth --> RL
|
||||
RL --> Retry
|
||||
Retry --> Service
|
||||
Service --> Pod
|
||||
|
||||
vmbr0 -.physical link.- eno1
|
||||
vmbr0 --> vmbr1
|
||||
vmbr1 -.VLAN 10.- Proxmox
|
||||
vmbr1 -.VLAN 10.- DevVM
|
||||
vmbr1 -.VLAN 20.- pfSense
|
||||
vmbr1 -.VLAN 20.- Tech
|
||||
vmbr1 -.VLAN 20.- Master
|
||||
vmbr1 -.VLAN 20.- Node1
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version/Type | Location | Purpose |
|
||||
|-----------|-------------|----------|---------|
|
||||
| pfSense | 2.7.x | 10.0.20.1 | Gateway, NAT, firewall, Kea DHCP for all subnets, Kea DDNS |
|
||||
| phpIPAM | v1.7.0 | phpipam.viktorbarzin.me | IP address management, device inventory, DNS sync |
|
||||
| vmbr0 | Linux bridge | 192.168.1.127/24 | Physical bridge on eno1, uplink to LAN |
|
||||
| vmbr1 | Linux bridge (VLAN-aware) | Internal | VLAN trunk for VM isolation |
|
||||
| Technitium DNS | Container | 10.0.20.201 (LB) / 10.96.0.53 (ClusterIP) | Internal DNS (viktorbarzin.lan) + full recursive resolver |
|
||||
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
|
||||
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
|
||||
| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
|
||||
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer |
|
||||
| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
|
||||
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
|
||||
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
|
||||
|
||||
## IPAM & DNS Auto-Registration
|
||||
|
||||
Devices are automatically discovered, named, and registered in DNS without manual intervention.
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph "Device Connects"
|
||||
Device[New Device<br/>joins WiFi/wired]
|
||||
end
|
||||
|
||||
subgraph pfSense["pfSense (10.0.20.1)"]
|
||||
Kea[Kea DHCP4<br/>3 subnets<br/>42 reservations]
|
||||
DDNS[Kea DHCP-DDNS]
|
||||
ARP[ARP Table]
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes"]
|
||||
Import[CronJob<br/>pfsense-import<br/>hourly]
|
||||
Sync[CronJob<br/>dns-sync<br/>every 15min]
|
||||
IPAM[phpIPAM<br/>Web UI + API]
|
||||
MySQL[(MySQL<br/>InnoDB)]
|
||||
end
|
||||
|
||||
subgraph DNS["Technitium DNS"]
|
||||
Forward[viktorbarzin.lan<br/>A records]
|
||||
Reverse[*.in-addr.arpa<br/>PTR records]
|
||||
end
|
||||
|
||||
Device -->|DHCP request| Kea
|
||||
Kea -->|IP + hostname| Device
|
||||
Kea -->|lease event| DDNS
|
||||
DDNS -->|RFC 2136<br/>A + PTR| Forward
|
||||
DDNS -->|RFC 2136<br/>A + PTR| Reverse
|
||||
Device -.->|traffic| ARP
|
||||
|
||||
Import -->|SSH: Kea leases<br/>+ ARP table| pfSense
|
||||
Import -->|insert/update<br/>IP + MAC + hostname| MySQL
|
||||
IPAM --- MySQL
|
||||
Sync -->|push named hosts| Forward
|
||||
Sync -->|push named hosts| Reverse
|
||||
Sync -->|pull PTR hostnames<br/>for unnamed entries| MySQL
|
||||
```
|
||||
|
||||
### Data Flow
|
||||
|
||||
| Step | Trigger | Source | Destination | Data | Latency |
|
||||
|------|---------|--------|-------------|------|---------|
|
||||
| 1. DHCP lease | Device connects | Kea DHCP4 | Device | IP + gateway + DNS | Immediate |
|
||||
| 2. DNS registration | Lease granted | Kea DDNS | Technitium | A + PTR records | Immediate |
|
||||
| 3. Device import | CronJob (5min) | Kea leases + ARP | phpIPAM MySQL | IP + MAC + hostname | ≤5 min |
|
||||
| 4. DNS sync (push) | CronJob (15min) | phpIPAM MySQL | Technitium | A + PTR for named hosts | ≤15 min |
|
||||
| 5. DNS sync (pull) | CronJob (15min) | Technitium PTR | phpIPAM MySQL | Hostname for unnamed entries | ≤15 min |
|
||||
|
||||
### DHCP Coverage
|
||||
|
||||
| Subnet | DHCP Server | DNS option 6 | Reservations | DDNS | Notes |
|
||||
|--------|------------|--------------|--------------|------|-------|
|
||||
| 10.0.10.0/24 (Mgmt) | Kea on pfSense | `10.0.10.1, 94.140.14.14` | 3 (devvm, pxe, ha) | Yes (TSIG) | VMs with static MACs |
|
||||
| 10.0.20.0/24 (K8s) | Kea on pfSense | `10.0.20.1, 94.140.14.14` | 7 (master, nodes 1-5, registry) | Yes (TSIG) | K8s cluster nodes |
|
||||
| 192.168.1.0/24 (LAN) | **TP-Link AP** | `192.168.1.2, 94.140.14.14` | 42 (all home devices) | Yes | pfSense Kea WAN is disabled |
|
||||
| 10.3.2.0/24 (VPN) | Static | — | — | No | WireGuard peers |
|
||||
| 192.168.0.0/24 (Valchedrym) | OpenWRT | — | — | No | Remote site |
|
||||
| 192.168.8.0/24 (London) | GL-iNet | — | — | No | Remote site |
|
||||
|
||||
## How It Works
|
||||
|
||||
### VLAN Segmentation
|
||||
|
||||
The Proxmox host uses a dual-bridge architecture:
|
||||
- **vmbr0**: Physical bridge on interface `eno1`, connected to upstream LAN (192.168.1.0/24). Proxmox management IP is 192.168.1.127.
|
||||
- **vmbr1**: Internal VLAN-aware bridge, acts as a trunk carrying:
|
||||
- **VLAN 10 (Management)**: 10.0.10.0/24 — Proxmox, DevVM
|
||||
- **VLAN 20 (Kubernetes)**: 10.0.20.0/24 — All K8s nodes, services, MetalLB IPs
|
||||
|
||||
VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the upstream LAN via NAT.
|
||||
|
||||
### DNS Resolution
|
||||
|
||||
**Internal (Technitium)**:
|
||||
- K8s LoadBalancer at **10.0.20.201** (dedicated MetalLB IP), ClusterIP at **10.96.0.53**
|
||||
- Serves `.viktorbarzin.lan` zone with 30+ internal A/CNAME records
|
||||
- Also acts as full recursive resolver for public domains
|
||||
- `externalTrafficPolicy: Local` preserves client source IPs for query logging
|
||||
- HA: primary + secondary + tertiary pods with anti-affinity, PDB minAvailable=2
|
||||
|
||||
**LAN client DNS path (192.168.1.0/24)**:
|
||||
- TP-Link DHCP gives DNS=192.168.1.2 (pfSense WAN)
|
||||
- pfSense NAT redirect (`rdr`) forwards UDP 53 on WAN directly to Technitium (10.0.20.201)
|
||||
- Client source IPs are preserved (no SNAT on 192.168.1.x → 10.0.20.x path)
|
||||
- Technitium logs show real per-device IPs for analytics
|
||||
|
||||
**Split Horizon / Hairpin NAT fix (192.168.1.0/24 → *.viktorbarzin.me)**:
|
||||
- TP-Link router does NOT support hairpin NAT — LAN clients can't reach the public IP (176.12.22.76) for non-proxied domains
|
||||
- Technitium's Split Horizon `AddressTranslation` post-processor translates `176.12.22.76 → 10.0.20.203` (Traefik LB) in DNS responses for 192.168.1.0/24 clients (was `.200` until 2026-05-30 Traefik dedicated-IP move)
|
||||
- DNS Rebinding Protection has `viktorbarzin.me` in `privateDomains` to allow the translated private IP
|
||||
- Only affects non-proxied domains (ha-sofia, immich, headscale, etc.) — Cloudflare-proxied domains resolve to Cloudflare IPs and are unaffected
|
||||
- Other clients (10.0.x.x, K8s pods) are NOT translated — they reach the public IP via pfSense outbound NAT
|
||||
- Config synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h)
|
||||
- **Known mail-name collision**: the translation also sends `mail.viktorbarzin.me` (and `imap.`/`smtp.`) to `.203`, but Traefik does not listen on mail ports there. iOS Mail on Barzini WiFi silently hangs. Fix in flight: dedicated pfSense Virtual IP for the mail listener so DNS can point at a stable mail-only IP instead of relying on Traefik's LB IP.
|
||||
|
||||
**K8s cluster DNS path**:
|
||||
- CoreDNS forwards `.viktorbarzin.lan` to Technitium ClusterIP (10.96.0.53)
|
||||
- CoreDNS forwards public queries to pfSense (10.0.20.1), 8.8.8.8, 1.1.1.1
|
||||
- **In-cluster `forgejo.viktorbarzin.me` → Traefik ClusterIP**: a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`) keeps pod registry pulls/pushes/builds off the public-IP hairpin. The ETP=Local Traefik LB (`.203`) is not reliably hairpin-reachable from pods, and the public path (the bullet above) intermittently timed out **buildkit pushes** from Woodpecker build pods — which, unlike kubelet, do NOT use the per-node containerd Forgejo mirror. Resolving the Service by name auto-tracks the ClusterIP (no rot on a Traefik renumber); Traefik's `*.viktorbarzin.me` wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. (beads code-yh33)
|
||||
|
||||
**pfSense dnsmasq (DNS Forwarder)**:
|
||||
- Listens on LAN (10.0.10.1), OPT1 (10.0.20.1), localhost only — NOT on WAN (192.168.1.2)
|
||||
- Forwards `.viktorbarzin.lan` to Technitium (10.0.20.201), public queries to 1.1.1.1
|
||||
- Serves K8s VLAN clients and pfSense's own DNS needs
|
||||
- Aliases: `technitium_dns` (10.0.20.201), `k8s_shared_lb` (10.0.20.200)
|
||||
|
||||
**External (Cloudflare)**:
|
||||
- Manages ~50 public domains, all under `viktorbarzin.me`
|
||||
- **Proxied domains** (orange cloud, traffic via Cloudflare CDN):
|
||||
- blog, hackmd, privatebin, url, echo, f1tv, excalidraw, send, audiobookshelf, jsoncrack, ntfy, cyberchef, homepage, linkwarden, changedetection, tandoor, n8n, stirling-pdf, dashy, city-guesser, travel, netbox
|
||||
- **Non-proxied domains** (grey cloud, direct IP resolution):
|
||||
- mail, wg, headscale, immich, calibre, vaultwarden, and other services requiring direct connections
|
||||
- CNAME records for proxied domains point to Cloudflared tunnel FQDNs
|
||||
|
||||
### Ingress Flow
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant Cloudflare
|
||||
participant Cloudflared
|
||||
participant Traefik
|
||||
participant CrowdSec
|
||||
participant Authentik
|
||||
participant RateLimit
|
||||
participant Retry
|
||||
participant Service
|
||||
participant Pod
|
||||
|
||||
Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me
|
||||
Cloudflare->>Cloudflared: Forward via tunnel (QUIC)
|
||||
Cloudflared->>Traefik: HTTP to LoadBalancer IP
|
||||
Traefik->>CrowdSec: Apply bouncer middleware
|
||||
CrowdSec->>Authentik: If allowed, check auth (protected=true)
|
||||
Authentik->>RateLimit: If authenticated, check rate limit
|
||||
RateLimit->>Retry: If within limit, continue
|
||||
Retry->>Service: Forward to Service
|
||||
Service->>Pod: Route to backend Pod
|
||||
Pod-->>Service: Response
|
||||
Service-->>Retry: Response
|
||||
Retry-->>RateLimit: Response
|
||||
RateLimit-->>Authentik: Response (strip auth headers)
|
||||
Authentik-->>CrowdSec: Response
|
||||
CrowdSec-->>Traefik: Response
|
||||
Traefik-->>Cloudflared: Response
|
||||
Cloudflared-->>Cloudflare: Response via tunnel
|
||||
Cloudflare-->>Client: HTTPS response
|
||||
```
|
||||
|
||||
### Middleware Chain
|
||||
|
||||
Every ingress created by the `ingress_factory` module follows this chain:
|
||||
|
||||
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages.
|
||||
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
|
||||
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default limits are generous; services like Immich and Nextcloud have higher custom limits.
|
||||
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
|
||||
|
||||
Additional middleware:
|
||||
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
|
||||
- **HTTP/3 (QUIC)**: Enabled globally on Traefik.
|
||||
|
||||
### Entrypoint Transport Timeouts
|
||||
|
||||
The `websecure` entrypoint sets `respondingTimeouts` in `stacks/traefik/modules/traefik/main.tf`:
|
||||
|
||||
| Timeout | Value | Bounds |
|
||||
|---|---|---|
|
||||
| `readTimeout` | `3600s` | Total time to read one request incl. body → **max upload duration** |
|
||||
| `writeTimeout` | `0s` (disabled) | Total time to write the response → **max download duration (0 = unlimited)** |
|
||||
| `idleTimeout` | `600s` | Keep-alive idle between requests (does *not* apply to active transfers) |
|
||||
|
||||
**Gotcha — these are HARD caps on total duration, not idle timeouts** (unlike nginx `proxy_*_timeout`, which reset on every read). A finite `writeTimeout` truncates *any* download that runs longer than it, regardless of progress. A prior `writeTimeout=60s` silently cut large Immich video downloads at the 60s mark (HTTP/2 stream reset). `writeTimeout=0` (Traefik's default) is required for unlimited-size downloads — Immich's own Traefik reverse-proxy guidance assumes it and never sets `writeTimeout`. `readTimeout` is kept finite (not 0) because an unbounded request read is the slow-loris vector; 3600s passes multi-GB uploads while keeping a backstop (Immich has no resumable upload, so the window must exceed real upload times). Single-asset downloads (`GET /api/assets/{id}/original`) serve `206 Partial Content`, so they are also resumable on a dropped connection; on-the-fly ZIP "download all" is not (no stable byte offsets).
|
||||
|
||||
### MetalLB & Load Balancing
|
||||
|
||||
MetalLB v0.15.3 allocates IPs from `10.0.20.200-10.0.20.220` (21 IPs) in **Layer 2 mode**; **four are in use**. Most LoadBalancer services share **10.0.20.200** (`metallb.io/allow-shared-ip: shared`, `externalTrafficPolicy: Cluster`). **Three services hold dedicated IPs with `externalTrafficPolicy: Local`** to preserve the real client source IP (and, for Traefik, to make QUIC/HTTP3 work — a shared IP forbids the mixed ETP the UDP listener needs).
|
||||
|
||||
> **Why not consolidate to fewer IPs?** The three dedicated IPs can't be merged. MetalLB L2 only lets `ETP=Local` services share an IP if they have *identical pod selectors* (Traefik/KMS/Technitium don't), and a shared `ETP=Local` IP announces from a single node — blackholing any service whose pods aren't on it. Traefik additionally can never leave a dedicated IP (QUIC needs the UDP listener on its own ETP=Local IP). Merging would cost client-IP preservation or HA, so the 4-IP layout is deliberate — not sprawl. Full analysis: `docs/plans/2026-06-03-lb-ip-hygiene-design.md`.
|
||||
|
||||
| IP | ETP | Services (ns/name → ports) |
|
||||
|----|-----|----------------------------|
|
||||
| **10.0.20.200** (shared) | Cluster | dbaas/postgresql-lb→5432 · beads-server/dolt→3306 · coturn/coturn→3478 TCP+UDP, 49152-49252/UDP · headscale/headscale-server→41641/UDP, 3479/UDP · wireguard/wireguard→51820/UDP · servarr/qbittorrent-torrenting→50000 TCP+UDP · shadowsocks/shadowsocks→8388 TCP+UDP · tor-proxy/torrserver-bt→5665 TCP+UDP · xray/xray-reality→7443 |
|
||||
| **10.0.20.201** (dedicated) | Local | technitium/technitium-dns→53 UDP+TCP |
|
||||
| **10.0.20.202** (dedicated)¹ | Local | kms/windows-kms→1688 |
|
||||
| **10.0.20.203** (dedicated) | Local | traefik/traefik→80, 443, 443/UDP (HTTP/3), 10200 (piper), 10300 (whisper) |
|
||||
|
||||
**Mailserver does NOT use a LB IP** — inbound mail enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}` → NodePorts `30125-30128` (PROXY-v2; see "Mail Server" below). (Earlier revisions of this table wrongly listed mailserver on `.200` and KMS on `.200` — both corrected 2026-06-03.)
|
||||
|
||||
**pfSense aliases** map to these IPs: `k8s_shared_lb`→.200, `technitium_dns`→.201, `k8s_kms_lb`→.202, `traefik_lb`→.203 (plus a legacy `nginx`→.200 duplicate — cruft). NAT rules reference aliases, so repointing an alias cascades to its paired filter rule.
|
||||
|
||||
¹ **windows-kms is publicly WAN-exposed.** pfSense forwards WAN TCP/1688 → `k8s_kms_lb` (.202) so any internet host can activate. The matching filter rule rate-limits per source (`max-src-conn 50`, `max-src-conn-rate 10/60`, `overload <virusprot>`). See `docs/runbooks/kms-public-exposure.md`.
|
||||
|
||||
#### LB-IP renumber checklist
|
||||
|
||||
These IPs are referenced by consumers that do **not** auto-follow when an IP moves — the 2026-05-30 Traefik `.200→.203` move broke five of them (cloudflared 502, woodpecker forge API, containerd pulls, the `.lan` + `.me` zones). **Before moving any LB IP, update every consumer below.** Bootstrap-critical literals (containerd mirror, PG state, node DNS) deliberately stay IP literals (DNS chicken-and-egg) — this list is their single source of truth.
|
||||
|
||||
- **`.203` Traefik:** assigner `stacks/traefik/modules/traefik/main.tf` · split-horizon translation `stacks/technitium/modules/technitium/main.tf` (`externalToInternalTranslation`) · prometheus apex-alert summary `stacks/monitoring/.../prometheus_chart_values.tpl` · containerd Forgejo mirror `modules/create-template-vm/k8s-node-containerd-setup.sh` + `scripts/setup-forgejo-containerd-mirror.sh` (OOB, per node) · cloudflared origin (already IP-independent → `traefik.traefik.svc`) · woodpecker forge alias (now reads the Traefik **ClusterIP** dynamically — no literal) · pfSense NAT 80/443 → `traefik_lb`.
|
||||
- **`.201` Technitium:** assigner `stacks/technitium/modules/technitium/main.tf` · DNS records `config.tfvars` (ns1/ns2/`viktorbarzin.lan`, dnscrypt forwarder) · `modules/create-template-vm/cloud_init.yaml` FallbackDNS · `scripts/provision-k8s-worker` · pfSense NAT 53 (**literal `10.0.20.201`**, not the `technitium_dns` alias — known inconsistency).
|
||||
- **`.202` KMS:** assigner `stacks/kms/main.tf` · pfSense NAT 1688 → `k8s_kms_lb` · Cloudflare `vlmcs` public A → WAN → `.202`.
|
||||
- **`.200` shared:** the 9 assigners above · PG state backend `scripts/tg` + `scripts/migrate-state-to-pg` (`@10.0.20.200:5432`) · pfSense NAT (wireguard/shadowsocks/coturn/headscale-STUN/qbittorrent/xray) → `k8s_shared_lb`, outbound-NAT self rule, CrowdSec syslog `remoteserver .200:30514`.
|
||||
|
||||
Critical services are scaled to **3 replicas**:
|
||||
- Traefik (PDB: minAvailable=2)
|
||||
- Authentik (PDB: minAvailable=2)
|
||||
- CrowdSec LAPI
|
||||
- PgBouncer
|
||||
- Cloudflared
|
||||
|
||||
PodDisruptionBudgets ensure at least 2 replicas remain during node maintenance or disruptions.
|
||||
|
||||
### IPv6 Ingress (HE Tunnel + HAProxy Bridge)
|
||||
|
||||
Public IPv6 reaches the cluster over a **Hurricane Electric 6in4 tunnel** terminated on pfSense (`gif0`; tunnel endpoint `2001:470:6e:43d::2`, LAN prefix `2001:470:6f:43d::/64`). The apex `viktorbarzin.me AAAA` → `2001:470:6e:43d::2`.
|
||||
|
||||
pfSense cannot NAT IPv6→IPv4, so ingress is bridged by a **standalone HAProxy** on pfSense (a separate config/service — *not* the pfSense HAProxy package) that listens on the tunnel IPv6 and forwards to the IPv4 cluster LBs with **PROXY protocol v2 (`send-proxy-v2`)**, so real client IPv6 addresses propagate to CrowdSec instead of being masked as `10.0.20.1`:
|
||||
|
||||
| Listen `[2001:470:6e:43d::2]:` | → Backend (`send-proxy-v2`) | Purpose |
|
||||
|---|---|---|
|
||||
| 443, 80 | Traefik `10.0.20.203:443` / `:80` | Web apps |
|
||||
| 25, 465, 587, 993 | mail NodePorts `30125` / `30126` / `30127` / `30128` on .101-103 | SMTP / SMTPS / Submission / IMAPS |
|
||||
|
||||
The web path works because Traefik trusts PROXY-v2 **only from `10.0.20.1`** (`entryPoints.web/websecure.proxyProtocol.trustedIPs` in `stacks/traefik/.../main.tf`) — real IPv4 clients arrive via ETP=Local with their own source IP (never `10.0.20.1`), so they are unaffected. Mail backends hit the mailserver's PROXY-aware alt-listeners (same pattern as the IPv4 mail HAProxy — see `mailserver.md`).
|
||||
|
||||
**No QUIC over IPv6** — the bridge is TCP/h2 only; IPv4 carries QUIC/HTTP3.
|
||||
|
||||
The bridge's HAProxy uses `timeout client 1h` / `timeout server 1h`, which are **inactivity** timeouts (reset on every byte), *not* total-transfer caps — so steady large downloads/uploads over IPv6 are not limited by the bridge. The download-duration cap was solely Traefik's `writeTimeout` (see Entrypoint Transport Timeouts above), now `0`.
|
||||
|
||||
pfSense files (out-of-band, **not Terraform**):
|
||||
- `/usr/local/etc/ipv6-haproxy.cfg` — the 6-frontend bridge config above.
|
||||
- `/usr/local/etc/rc.d/ipv6proxy` — service wrapper (`service ipv6proxy {start,stop,status}`); `start` does a graceful `-sf` reload.
|
||||
- `/usr/local/etc/ipv6_proxy.sh` — boot entrypoint (config.xml `<shellcmd>`): patches pfSense nginx off `[::]:443/:80` (rebinds to LAN IPv6) to free the tunnel IPv6, then `service ipv6proxy onestart`.
|
||||
|
||||
**Gotcha:** the backends use **no health `check`** — a plain TCP check hits the PROXY-expecting listeners without a PROXY header and would false-mark them DOWN. This path previously used `socat` (functional, but masked every IPv6 client as `10.0.20.1`); replaced by HAProxy on 2026-05-30 for real client IPs.
|
||||
|
||||
### Container Registry Pull-Through Cache
|
||||
|
||||
**Location**: Registry VM at 10.0.20.10
|
||||
|
||||
Docker Hub and GitHub Container Registry (GHCR) are mirrored locally to avoid rate limits and improve pull performance:
|
||||
- **docker.io**: Port 5000
|
||||
- **ghcr.io**: Port 5010
|
||||
|
||||
Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cache transparently.
|
||||
|
||||
**Caveat**: The cache holds stale manifests for `:latest` tags, which can cause version skew. Always use **versioned tags** (e.g., `python:3.12.0` or `app:abc12345`) in production.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
| Stack | Path | Resources |
|
||||
|-------|------|-----------|
|
||||
| pfSense | `stacks/pfsense/` | VM + cloud-init config |
|
||||
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
|
||||
| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
|
||||
| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer |
|
||||
| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
|
||||
| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
|
||||
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config |
|
||||
| ingress_factory | `modules/ingress_factory/` | IngressRoute + middleware chain |
|
||||
|
||||
### Key Configuration Files
|
||||
|
||||
**pfSense**:
|
||||
- Config: Not Terraform-managed (pfSense web UI / config.xml)
|
||||
- DHCP: Kea DHCP4 on the two internal VLANs (VLAN 10 = 10.0.10.0/24, VLAN 20 = 10.0.20.0/24). WAN/192.168.1.0/24 is served by the TP-Link dumb AP — pfSense's Kea WAN subnet is disabled.
|
||||
- **DNS option 6** (per-subnet, WS E 2026-04-19):
|
||||
- 10.0.10.0/24 → `10.0.10.1, 94.140.14.14` (internal Unbound + AdGuard Home public fallback)
|
||||
- 10.0.20.0/24 → `10.0.20.1, 94.140.14.14`
|
||||
- 192.168.1.0/24 → `192.168.1.2, 94.140.14.14` (served by TP-Link, unchanged by WS E)
|
||||
- Rationale: clients survive an internal resolver outage by falling through to AdGuard (`94.140.14.14`) — confirmed via null-route drill on 2026-04-19.
|
||||
- 42 MAC→IP reservations for 192.168.1.0/24 (all known home devices)
|
||||
- DHCP DDNS: Kea DHCP-DDNS sends **TSIG-signed** RFC 2136 updates to Technitium (key `kea-ddns`, HMAC-SHA256; secret in Vault `secret/viktor/kea_ddns_tsig_secret`). Zone `viktorbarzin.lan` + reverse zones require both a pfSense-source IP AND a valid TSIG signature. Config: `/usr/local/etc/kea/kea-dhcp-ddns.conf` (hand-managed on pfSense; pre-WS-E backup at `kea-dhcp-ddns.conf.2026-04-19-pre-tsig`).
|
||||
- Firewall rules: Allow K8s egress, block inter-VLAN by default
|
||||
|
||||
**Technitium**:
|
||||
- Config: Stored on `proxmox-lvm-encrypted` PVCs (migrated from NFS 2026-04-14)
|
||||
- Zone file: `viktorbarzin.lan` (A records for all internal hosts)
|
||||
- Reverse zones: `10.0.10.in-addr.arpa`, `20.0.10.in-addr.arpa`, `1.168.192.in-addr.arpa`, `2.3.10.in-addr.arpa`, `0.168.192.in-addr.arpa`
|
||||
- Stub zone: `emrsn.org` (returns NXDOMAIN locally for corporate domain queries, avoids upstream forwarding)
|
||||
- Dynamic updates: Enabled (UseSpecifiedNetworkACL) from pfSense IPs (10.0.20.1, 10.0.10.1, 192.168.1.2)
|
||||
- Forwarders: Cloudflare DNS-over-HTTPS (1.1.1.1, 1.0.0.1)
|
||||
- Cache: 100K max entries, min TTL 60s, max TTL 7 days, serve stale enabled (3 days)
|
||||
- Query logging: PostgreSQL (`technitium` database on `pg-cluster-rw.dbaas.svc.cluster.local`)
|
||||
- Blocking: OISD Big List + StevenBlack hosts (~486K domains)
|
||||
- CronJobs: `technitium-password-sync` (6h, Vault password rotation), `technitium-split-horizon-sync` (6h, hairpin NAT fix), `technitium-dns-optimization` (6h, cache TTL + stub zones)
|
||||
|
||||
**phpIPAM (IP Address Management)**:
|
||||
- Stack: `stacks/phpipam/`
|
||||
- Web UI: `phpipam.viktorbarzin.me` (Authentik-protected)
|
||||
- Database: MySQL InnoDB cluster (`mysql.dbaas.svc.cluster.local`)
|
||||
- Device import: CronJob `phpipam-pfsense-import` hourly — queries Kea DHCP leases + pfSense ARP table via SSH (no active scanning)
|
||||
- DNS sync: CronJob `phpipam-dns-sync` every 15min — bidirectional sync between phpIPAM and Technitium DNS (push named hosts → A+PTR, pull DNS hostnames → unnamed phpIPAM entries)
|
||||
- Subnets tracked: 10.0.10.0/24, 10.0.20.0/24, 192.168.1.0/24, 10.3.2.0/24, 192.168.8.0/24, 192.168.0.0/24
|
||||
- API: REST API enabled (app `claude`, ssl_token auth), MCP server available for agent access
|
||||
|
||||
**Traefik Middleware**:
|
||||
- Helm values: `stacks/platform/traefik-values.yaml`
|
||||
- Middleware CRDs: Generated by `ingress_factory` module
|
||||
- HTTP/3 config: `experimental.http3.enabled=true`
|
||||
|
||||
**MetalLB**:
|
||||
- Helm values: `stacks/platform/metallb-values.yaml`
|
||||
- IPAddressPool CRD: `10.0.20.200-10.0.20.220`
|
||||
- All 11 LB services consolidated on `10.0.20.200` with `metallb.io/allow-shared-ip: shared`
|
||||
- Requires matching `externalTrafficPolicy` (all use `Cluster`) for IP sharing
|
||||
|
||||
**Vault Secrets**:
|
||||
- Cloudflare API token: `secret/viktor/cloudflare_api_token`
|
||||
- Authentik OIDC secrets: `secret/authentik`
|
||||
- CrowdSec LAPI key: `secret/crowdsec/lapi_key`
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Dual-Bridge VLAN Architecture?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Single flat network**: Simpler, but no isolation between management and workload traffic.
|
||||
2. **Routed network with physical VLANs**: Requires switch with VLAN support.
|
||||
|
||||
**Decision**: vmbr0 (physical) + vmbr1 (VLAN trunk) gives isolation without requiring managed switches. Management traffic (Proxmox, DevVM) stays on VLAN 10, K8s workloads stay on VLAN 20. Failures in K8s don't affect access to Proxmox or storage.
|
||||
|
||||
### Why Cloudflared Tunnel Instead of Port Forwarding?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Traditional port forwarding (80/443)**: Exposes public IP, requires firewall rules, DDoS risk.
|
||||
2. **VPN-only access**: Limits accessibility for public services like blog.
|
||||
|
||||
**Decision**: Cloudflared tunnel provides:
|
||||
- No public IP exposure
|
||||
- DDoS protection via Cloudflare
|
||||
- TLS termination at Cloudflare edge
|
||||
- Zero firewall configuration
|
||||
- Works behind CGNAT
|
||||
|
||||
### Why Split DNS (Technitium + Cloudflare)?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Cloudflare only**: Works but introduces external dependency for internal resolution.
|
||||
2. **Technitium only**: Can't handle public domains without zone delegation.
|
||||
|
||||
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
|
||||
|
||||
### Why Fail-Open on CrowdSec Bouncer?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic.
|
||||
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages.
|
||||
|
||||
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on.
|
||||
|
||||
### Why HTTP/3 (QUIC)?
|
||||
|
||||
**Benefit**: Reduces latency on lossy connections (mobile, Wi-Fi) and enables multiplexing without head-of-line blocking. Minimal overhead since Traefik handles it natively.
|
||||
|
||||
### Why Pull-Through Registry Cache?
|
||||
|
||||
**Problem**: Docker Hub rate limits (100 pulls/6h for anonymous, 200 pulls/6h for free accounts) caused CI/CD failures.
|
||||
|
||||
**Solution**: Local registry cache at 10.0.20.10 mirrors all pulls. Containerd transparently redirects requests. Zero application changes needed.
|
||||
|
||||
**Trade-off**: Stale `:latest` tags — requires discipline to use versioned tags (8-char git SHAs for app images).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Ingress Returns 502 Bad Gateway
|
||||
|
||||
**Symptoms**: Cloudflared tunnel is up, Traefik logs show `dial tcp: lookup <service> on 10.0.20.201:53: no such host`.
|
||||
|
||||
**Diagnosis**: DNS resolution failed. Check:
|
||||
1. Is Technitium pod running? `kubectl get pod -n technitium`
|
||||
2. Can nodes resolve the service? `kubectl exec -it <any-pod> -- nslookup <service>.viktorbarzin.lan`
|
||||
3. Is the Service correctly created? `kubectl get svc -n <namespace>`
|
||||
|
||||
**Fix**: If Technitium is down, restart it. If the Service is missing, check Terraform apply status.
|
||||
|
||||
### Traefik Shows "Service Unavailable" for All Requests
|
||||
|
||||
**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.
|
||||
|
||||
**Diagnosis**: Middleware chain is blocking traffic. Check:
|
||||
1. Authentik status: `kubectl get pod -n authentik`
|
||||
2. CrowdSec LAPI status: `kubectl get pod -n crowdsec`
|
||||
3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`
|
||||
|
||||
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.
|
||||
|
||||
### MetalLB Doesn't Assign IP to LoadBalancer Service
|
||||
|
||||
**Symptoms**: Service stays in `<pending>` state, no IP assigned.
|
||||
|
||||
**Diagnosis**: Check MetalLB logs: `kubectl logs -n metallb-system deploy/controller`
|
||||
|
||||
**Common causes**:
|
||||
1. **IP pool exhausted**: 21 IPs available (10.0.20.200-10.0.20.220), check `kubectl get svc -A | grep LoadBalancer`
|
||||
2. **Missing allow-shared-ip annotation**: Services must have `metallb.io/allow-shared-ip: shared` and `metallb.io/loadBalancerIPs: 10.0.20.200`
|
||||
3. **Mismatched externalTrafficPolicy**: All services sharing an IP must use the same ETP (currently `Cluster`). Error: "can't change sharing key"
|
||||
4. **MetalLB controller crash-looping**: Resource limits too low
|
||||
|
||||
**Fix**: If pool exhausted, either delete unused Services or expand the IPAddressPool CRD. For sharing key errors, ensure new services use `externalTrafficPolicy: Cluster` and both `metallb.io/` annotations.
|
||||
|
||||
### DNS Resolution Loops (Technitium → Cloudflare → Technitium)
|
||||
|
||||
**Symptoms**: Slow DNS responses, `dig` shows multiple CNAMEs in a loop.
|
||||
|
||||
**Diagnosis**: Misconfigured forwarder or zone overlap.
|
||||
|
||||
**Fix**: Ensure Technitium forwards all non-.lan queries to Cloudflare (1.1.1.1), and Cloudflare zones don't contain `.lan` records.
|
||||
|
||||
### Cloudflared Tunnel Disconnects Frequently
|
||||
|
||||
**Symptoms**: Intermittent 502 errors, Cloudflared logs show `connection lost, retrying`.
|
||||
|
||||
**Diagnosis**: Check:
|
||||
1. Network stability: `ping 1.1.1.1` from a K8s node
|
||||
2. Cloudflared resource limits: `kubectl top pod -n cloudflared`
|
||||
3. Cloudflare tunnel status in dashboard
|
||||
|
||||
**Fix**: If resource-limited, increase memory/CPU. If network-related, check pfSense logs for NAT table exhaustion or ISP issues.
|
||||
|
||||
### Rate Limiter Blocks Legitimate Traffic
|
||||
|
||||
**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads).
|
||||
|
||||
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
|
||||
|
||||
**Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min.
|
||||
|
||||
### Large Downloads or Uploads Truncate / Fail Partway
|
||||
|
||||
**Symptoms**: Large file transfers (e.g. Immich videos, Nextcloud sync) fail at a consistent wall-clock point regardless of file — a download stops at exactly N seconds × throughput bytes; an upload fails ~1 min in. Browser shows "network error"; `curl` exits 18/92 (truncated / HTTP/2 stream reset).
|
||||
|
||||
**Diagnosis**: Check the `websecure` entrypoint `respondingTimeouts` (see Entrypoint Transport Timeouts). These are **hard total-duration caps**, not idle timeouts — a finite `writeTimeout` cuts downloads, a finite `readTimeout` cuts uploads, both regardless of progress. Reproduce deterministically: `curl --limit-rate 6M` a file large enough to exceed the cap; it dies at the cap.
|
||||
|
||||
**Fix**: `writeTimeout=0` (unlimited downloads), `readTimeout` ≥ longest expected upload (currently `3600s`). Not Cloudflare (Immich is non-proxied) and not the pfSense IPv6 bridge (its 1h timeouts are inactivity-based).
|
||||
|
||||
## Related
|
||||
|
||||
- **Runbooks**:
|
||||
- `docs/runbooks/restart-traefik.md`
|
||||
- `docs/runbooks/reset-crowdsec-bans.md`
|
||||
- `docs/runbooks/add-dns-record.md`
|
||||
- **Architecture Docs**:
|
||||
- `docs/architecture/dns.md` — DNS architecture (Technitium, CoreDNS, Cloudflare, Split Horizon)
|
||||
- `docs/architecture/vpn.md` — VPN and remote access
|
||||
- `docs/architecture/storage.md` — NFS and iSCSI architecture (coming soon)
|
||||
- **Reference**:
|
||||
- `.claude/reference/service-catalog.md` — Full service inventory
|
||||
- `.claude/reference/proxmox-inventory.md` — VM and LXC details
|
||||
319
docs/architecture/overview.md
Normal file
319
docs/architecture/overview.md
Normal file
|
|
@ -0,0 +1,319 @@
|
|||
# Infrastructure Overview
|
||||
|
||||
## Overview
|
||||
|
||||
This homelab infrastructure runs a production-grade Kubernetes cluster on Proxmox, hosting 70+ services including web applications, databases, monitoring, security, and GPU-accelerated workloads. The entire infrastructure is managed declaratively using Terraform and Terragrunt, with automated CI/CD pipelines for continuous deployment. Services are organized into a five-tier system for resource isolation and priority-based scheduling.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Physical["Physical Hardware"]
|
||||
R730["Dell R730<br/>22c/44t Xeon E5-2699 v4<br/>~160GB RAM<br/>NVIDIA Tesla T4<br/>1.1TB + 931GB + 10.7TB"]
|
||||
end
|
||||
|
||||
subgraph Proxmox["Proxmox VE"]
|
||||
direction LR
|
||||
PF["pfSense<br/>101"]
|
||||
DEV["devvm<br/>102"]
|
||||
HA["home-assistant<br/>103"]
|
||||
MASTER["k8s-master<br/>200"]
|
||||
NODE1["k8s-node1<br/>201<br/>(GPU)"]
|
||||
NODE2["k8s-node2<br/>202"]
|
||||
NODE3["k8s-node3<br/>203"]
|
||||
NODE4["k8s-node4<br/>204"]
|
||||
REG["docker-registry<br/>220"]
|
||||
end
|
||||
|
||||
subgraph Network["Network Bridges"]
|
||||
VMBR0["vmbr0<br/>192.168.1.0/24<br/>Physical"]
|
||||
VMBR1_10["vmbr1:vlan10<br/>10.0.10.0/24<br/>Management"]
|
||||
VMBR1_20["vmbr1:vlan20<br/>10.0.20.0/24<br/>Kubernetes"]
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster v1.34.2"]
|
||||
direction TB
|
||||
TIER0["Tier 0: Core<br/>traefik, authentik, vault"]
|
||||
TIER1["Tier 1: Cluster<br/>prometheus, grafana, loki"]
|
||||
TIER2["Tier 2: GPU<br/>ollama, comfyui"]
|
||||
TIER3["Tier 3: Edge<br/>cloudflared, headscale"]
|
||||
TIER4["Tier 4: Auxiliary<br/>vaultwarden, immich"]
|
||||
end
|
||||
|
||||
R730 --> Proxmox
|
||||
|
||||
PF --> VMBR0
|
||||
PF --> VMBR1_10
|
||||
PF --> VMBR1_20
|
||||
HA --> VMBR0
|
||||
DEV --> VMBR1_10
|
||||
|
||||
MASTER --> VMBR1_20
|
||||
NODE1 --> VMBR1_20
|
||||
NODE2 --> VMBR1_20
|
||||
NODE3 --> VMBR1_20
|
||||
NODE4 --> VMBR1_20
|
||||
REG --> VMBR1_20
|
||||
|
||||
VMBR1_20 --> K8s
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### Hardware
|
||||
|
||||
| Component | Specification |
|
||||
|-----------|---------------|
|
||||
| Server | Dell PowerEdge R730 |
|
||||
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
|
||||
| RAM | ~160GB DDR4 ECC |
|
||||
| GPU | NVIDIA Tesla T4 (16GB, PCIe 0000:06:00.0) |
|
||||
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
|
||||
| Network | eno1 (physical), vmbr0 (physical bridge), vmbr1 (VLAN-aware internal) |
|
||||
|
||||
### Network Topology
|
||||
|
||||
| Network | VLAN | CIDR | Purpose |
|
||||
|---------|------|------|---------|
|
||||
| Physical | - | 192.168.1.0/24 | Physical devices, Proxmox host (192.168.1.127) |
|
||||
| Management | 10 | 10.0.10.0/24 | Infrastructure VMs, devvm |
|
||||
| Kubernetes | 20 | 10.0.20.0/24 | K8s cluster nodes and services |
|
||||
|
||||
### Virtual Machine Inventory
|
||||
|
||||
| VMID | Name | CPUs | RAM | Network | IP Address | Notes |
|
||||
|------|------|------|-----|---------|------------|-------|
|
||||
| 101 | pfsense | 8 | 16GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | - | Gateway/firewall routing between VLANs |
|
||||
| 102 | devvm | 16 | 8GB | vmbr1:vlan10 | - | Development VM |
|
||||
| 103 | home-assistant | 8 | 8GB | vmbr0 | - | Home Assistant Sofia instance |
|
||||
| 200 | k8s-master | 8 | 32GB | vmbr1:vlan20 | 10.0.20.100 | Kubernetes control plane |
|
||||
| 201 | k8s-node1 | 16 | 32GB | vmbr1:vlan20 | - | GPU worker node (Tesla T4 passthrough) |
|
||||
| 202 | k8s-node2 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
|
||||
| 203 | k8s-node3 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
|
||||
| 204 | k8s-node4 | 8 | 32GB | vmbr1:vlan20 | - | Worker node |
|
||||
| 220 | docker-registry | 4 | 4GB | vmbr1:vlan20 | 10.0.20.10 | Private Docker registry |
|
||||
| ~~9000~~ | ~~truenas~~ | — | — | — | ~~10.0.10.15~~ | **DECOMMISSIONED 2026-04-13** — NFS now served by Proxmox host (192.168.1.127). VM still exists in stopped state on PVE pending user decision on deletion. |
|
||||
|
||||
### Kubernetes Cluster
|
||||
|
||||
| Component | Details |
|
||||
|-----------|---------|
|
||||
| Version | v1.34.2 |
|
||||
| Nodes | 5 (1 control plane, 4 workers) |
|
||||
| CNI | Calico |
|
||||
| Storage | NFS (Proxmox host, nfs-csi) + Proxmox-LVM (Proxmox CSI) |
|
||||
| Ingress | Traefik v3 |
|
||||
| Total Services | 70+ services across 5 tiers |
|
||||
|
||||
### Service Tier System
|
||||
|
||||
The cluster uses a five-tier namespace system managed by Kyverno, which automatically generates LimitRange and ResourceQuota policies per tier:
|
||||
|
||||
| Tier | Namespace Pattern | Purpose | Priority Class |
|
||||
|------|-------------------|---------|----------------|
|
||||
| 0-core | `0-core-*` | Critical infrastructure (traefik, authentik, vault) | 900000 |
|
||||
| 1-cluster | `1-cluster-*` | Cluster services (prometheus, grafana, kyverno) | 700000 |
|
||||
| 2-gpu | `2-gpu-*` | GPU workloads (ollama, comfyui, stable-diffusion) | 500000 |
|
||||
| 3-edge | `3-edge-*` | Edge services (cloudflared, headscale, technitium) | 300000 |
|
||||
| 4-aux | `4-aux-*` | Auxiliary apps (vaultwarden, immich, freshrss) | 200000 |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Physical Layer
|
||||
|
||||
The infrastructure runs on a single Dell R730 server with a Xeon E5-2699 v4 CPU and ~160GB RAM. Proxmox VE provides hypervisor capabilities with hardware passthrough support for the Tesla T4 GPU. The physical network interface (eno1) bridges to vmbr0 for physical network access, while vmbr1 provides VLAN-aware internal networking.
|
||||
|
||||
### Network Layer
|
||||
|
||||
pfSense (VMID 101) acts as the central gateway and firewall, routing traffic between:
|
||||
- Physical network (192.168.1.0/24) via vmbr0
|
||||
- Management VLAN 10 (10.0.10.0/24) via vmbr1:vlan10
|
||||
- Kubernetes VLAN 20 (10.0.20.0/24) via vmbr1:vlan20
|
||||
|
||||
This three-tier network design isolates Kubernetes workloads from management infrastructure and provides controlled access to the physical network.
|
||||
|
||||
### Compute Layer
|
||||
|
||||
The Kubernetes cluster consists of 7 nodes:
|
||||
- **k8s-master (200)**: 8c/32GB control plane running kube-apiserver, etcd, controller-manager
|
||||
- **k8s-node1 (201)**: 16c/48GB GPU node with Tesla T4 passthrough, tainted for GPU workloads only
|
||||
- **k8s-node2-6 (202-206)**: 8c/32GB workers running general-purpose workloads
|
||||
|
||||
GPU passthrough on node1 uses PCIe device 0000:06:00.0. The NVIDIA GPU Operator's gpu-feature-discovery auto-labels whichever node carries the card with `nvidia.com/gpu.present=true`; `null_resource.gpu_node_config` taints the same set of nodes with `nvidia.com/gpu=true:PreferNoSchedule`. No hostname is hardcoded — moving the card to a different node requires no Terraform edits.
|
||||
|
||||
### Service Organization
|
||||
|
||||
Services are organized into 70+ individual Terraform stacks under `stacks/<service>/`. Each service belongs to a tier, which determines:
|
||||
- Resource limits and quotas
|
||||
- Scheduling priority (higher tier = preempts lower)
|
||||
- Default container resources
|
||||
- QoS class (Guaranteed for tiers 0-2, Burstable for 3-4)
|
||||
|
||||
Kyverno policies automatically inject namespace labels, LimitRange, ResourceQuota, and PriorityClass based on the namespace tier prefix.
|
||||
|
||||
### Key Services
|
||||
|
||||
**Critical Services (Tier 0-1)**:
|
||||
- **Traefik**: Ingress controller with automatic HTTPS (Let's Encrypt)
|
||||
- **Authentik**: SSO/OIDC provider for all services
|
||||
- **Vault**: Secrets management with auto-unseal
|
||||
- **Cloudflared**: Cloudflare Tunnel for external access
|
||||
- **Technitium**: Internal DNS server
|
||||
- **Headscale**: Tailscale-compatible mesh VPN control plane
|
||||
|
||||
**Storage & Security**:
|
||||
- **Proxmox NFS**: NFS storage served directly from Proxmox host (192.168.1.127) at `/srv/nfs` (HDD) and `/srv/nfs-ssd` (SSD)
|
||||
- **Proxmox CSI**: Block storage via LVM-thin hotplug for databases
|
||||
- **Vaultwarden**: Password manager
|
||||
- **Immich**: Photo management
|
||||
- **CrowdSec**: IPS/IDS with community threat intelligence
|
||||
- **Kyverno**: Policy engine for admission control
|
||||
|
||||
**Monitoring & Observability**:
|
||||
- **Prometheus**: Metrics collection
|
||||
- **Grafana**: Visualization and dashboards
|
||||
- **Loki**: Log aggregation
|
||||
- **Alertmanager**: Alert routing
|
||||
|
||||
**Application Services**: Woodpecker CI, Gitea, PostgreSQL, MySQL, Redis, Ollama, ComfyUI, Stable Diffusion, Freshrss, and 50+ more services.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `stacks/<service>/terragrunt.hcl` | Individual service configuration |
|
||||
| `modules/kubernetes/ingress_factory/` | Shared factory module: ingress + middleware chain + DNS + Uptime-Kuma monitor |
|
||||
| `modules/kubernetes/nfs_volume/` | Shared factory module: RWX NFS PV/PVC provisioning |
|
||||
| `base.hcl` | Global Terragrunt configuration |
|
||||
| `terraform.tfvars` | Global variables (git-ignored) |
|
||||
|
||||
### Terraform Organization
|
||||
|
||||
Each service lives in `stacks/<service>/` with its own Terragrunt configuration. Common patterns:
|
||||
- Most Stacks are **flat** — resources declared directly in the Stack's `.tf` files
|
||||
- Larger/older Stacks factor their implementation into a **stack-local module** at `stacks/<service>/modules/<service>/`
|
||||
- Shared, reused logic lives in **factory modules** under `modules/kubernetes/` — `ingress_factory`, `nfs_volume`, `anubis_instance`, `setup_tls_secret`
|
||||
- Shared dependencies via `dependency` blocks in terragrunt.hcl
|
||||
|
||||
### Vault Paths
|
||||
|
||||
Secrets are stored in HashiCorp Vault under `secret/`:
|
||||
- `secret/<service>/*` - Service-specific secrets
|
||||
- `secret/cloudflare` - Cloudflare API tokens
|
||||
- `secret/authentik` - OIDC client credentials
|
||||
- `secret/backup` - Backup encryption keys
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Proxmox over bare-metal Kubernetes?
|
||||
|
||||
**Decision**: Run Kubernetes inside Proxmox VMs rather than directly on bare metal.
|
||||
|
||||
**Rationale**:
|
||||
- **Flexibility**: Easy to snapshot, clone, and roll back VMs during upgrades
|
||||
- **Isolation**: Management network (devvm) separated from Kubernetes
|
||||
- **GPU passthrough**: Can dedicate GPU to a single node without tainting the entire host
|
||||
- **Multi-purpose**: Same physical host can run non-K8s VMs (pfSense, Home Assistant)
|
||||
|
||||
**Tradeoff**: Slight performance overhead from virtualization (acceptable for homelab).
|
||||
|
||||
### Why five-tier namespace system?
|
||||
|
||||
**Decision**: Organize services into 5 tiers with automatic LimitRange/ResourceQuota via Kyverno.
|
||||
|
||||
**Rationale**:
|
||||
- **Predictable scheduling**: Critical services (tier 0) always preempt auxiliary services (tier 4)
|
||||
- **Resource protection**: Prevents a single service from consuming all cluster resources
|
||||
- **Clear priorities**: Tier prefix makes service criticality obvious
|
||||
- **Automation**: Kyverno auto-generates policies, reducing manual configuration
|
||||
|
||||
**Tradeoff**: Adds namespace naming convention requirement.
|
||||
|
||||
### Why no CPU limits cluster-wide?
|
||||
|
||||
**Decision**: Set CPU requests but no CPU limits on containers.
|
||||
|
||||
**Rationale**:
|
||||
- **CFS throttling**: Linux CFS throttles containers to exact CPU limit even when CPU is idle, causing artificial slowdowns
|
||||
- **Burstability**: Services can burst to unused CPU during idle periods
|
||||
- **Memory is the constraint**: With ~160GB RAM across VMs, memory exhaustion occurs before CPU saturation
|
||||
|
||||
**Tradeoff**: A runaway process could monopolize CPU (mitigated by CPU requests reserving capacity).
|
||||
|
||||
### Why Goldilocks in Initial mode, not Auto?
|
||||
|
||||
**Decision**: Run VPA Goldilocks in "Initial" (recommend-only) mode instead of "Auto" (update pods).
|
||||
|
||||
**Rationale**:
|
||||
- **Terraform conflicts**: Auto mode directly modifies Deployment specs, creating drift from Terraform state
|
||||
- **Controlled changes**: Recommendations are reviewed and applied via Terraform, maintaining declarative workflow
|
||||
- **Quarterly review**: Right-sizing happens deliberately every quarter, not continuously
|
||||
|
||||
**Tradeoff**: Requires manual review of VPA recommendations.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Pods stuck in Pending state
|
||||
|
||||
**Symptom**: Pod shows `status: Pending` with event `FailedScheduling`.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
# Check events for:
|
||||
# - "Insufficient memory" → ResourceQuota exceeded
|
||||
# - "0/5 nodes available: 5 Insufficient memory" → LimitRange default too high
|
||||
# - "0/5 nodes available: 1 node(s) had untolerated taint" → GPU taint
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
- ResourceQuota exceeded: Increase quota in `modules/namespace_config/` for that tier
|
||||
- LimitRange too high: Override pod resources in Terraform
|
||||
- GPU taint: Add `tolerations` and `nodeSelector` for GPU pods
|
||||
|
||||
### OOMKilled pods
|
||||
|
||||
**Symptom**: Pod shows `status: OOMKilled` in events.
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
# Check LimitRange defaults:
|
||||
kubectl get limitrange -n <namespace> -o yaml
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
- If pod uses LimitRange default (256Mi or 512Mi): Set explicit memory request/limit in Terraform
|
||||
- If pod has explicit limit: Increase memory based on Goldilocks VPA recommendation (upperBound x1.2)
|
||||
|
||||
### Democratic-CSI sidecars consuming excessive memory
|
||||
|
||||
**Symptom**: Pods with PVCs have 3-4 sidecar containers each using 256Mi (LimitRange default).
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
kubectl get pods -A -o json | jq '.items[] | select(.spec.containers[].name | contains("csi")) | .metadata.name'
|
||||
```
|
||||
|
||||
**Fix**: Democratic-CSI sidecars need explicit resources (32-80Mi each). Update Terraform to override sidecar resources.
|
||||
|
||||
### Tier 3-4 pods evicted during resource pressure
|
||||
|
||||
**Symptom**: Lower-tier pods show `status: Evicted` with reason `The node was low on resource: memory`.
|
||||
|
||||
**Diagnosis**: This is expected behavior. Tier 3-4 use Burstable QoS (request < limit) and priority 200K-300K, making them first candidates for eviction.
|
||||
|
||||
**Fix**:
|
||||
- Increase node memory if evictions are frequent
|
||||
- Promote critical services to higher tier
|
||||
- Reduce memory limits on tier 4 services
|
||||
|
||||
## Related
|
||||
|
||||
- [Compute & Resource Management](compute.md) - Detailed resource management patterns
|
||||
- [Multi-tenancy](multi-tenancy.md) - Namespace isolation and tier system
|
||||
- [Monitoring](monitoring.md) - Resource usage dashboards
|
||||
- [Runbooks: Node Maintenance](../../runbooks/node-maintenance.md)
|
||||
- [Runbooks: Service Onboarding](../../runbooks/service-onboarding.md)
|
||||
408
docs/architecture/secrets.md
Normal file
408
docs/architecture/secrets.md
Normal file
|
|
@ -0,0 +1,408 @@
|
|||
# Secrets Management Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
Secrets management is centralized in HashiCorp Vault as the single source of truth for all API keys, tokens, passwords, SSH keys, and database credentials. External Secrets Operator (ESO) syncs secrets from Vault KV to Kubernetes Secrets. Vault's database engine handles automatic credential rotation for MySQL and PostgreSQL. CI/CD systems authenticate via Kubernetes service account tokens. Sealed Secrets provide user-managed encrypted secrets without Vault access. SOPS encrypts Terraform state files at rest.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Secret Sources"
|
||||
VAULT_KV[Vault KV<br/>secret/viktor<br/>135+ keys]
|
||||
VAULT_DB[Vault DB Engine<br/>7-day rotation]
|
||||
VAULT_K8S[Vault K8s Engine<br/>Dynamic SA tokens]
|
||||
USER[User-managed<br/>sealed-*.yaml]
|
||||
end
|
||||
|
||||
subgraph "Sync Layer"
|
||||
ESO[External Secrets Operator<br/>43 ExternalSecrets<br/>9 DB-creds ExternalSecrets]
|
||||
KUBESEAL[Sealed Secrets Controller]
|
||||
end
|
||||
|
||||
subgraph "Kubernetes Secrets"
|
||||
K8S_SECRET[K8s Secret]
|
||||
end
|
||||
|
||||
subgraph "Consumers"
|
||||
POD[Pod env/volume]
|
||||
TF_PLAN[Terraform plan-time<br/>data kubernetes_secret]
|
||||
CI[Woodpecker CI/CD<br/>K8s SA JWT auth]
|
||||
end
|
||||
|
||||
VAULT_KV -->|ClusterSecretStore: vault-kv| ESO
|
||||
VAULT_DB -->|ClusterSecretStore: vault-database| ESO
|
||||
ESO --> K8S_SECRET
|
||||
USER -->|kubeseal encrypt| KUBESEAL
|
||||
KUBESEAL --> K8S_SECRET
|
||||
|
||||
K8S_SECRET --> POD
|
||||
K8S_SECRET --> TF_PLAN
|
||||
|
||||
VAULT_K8S -->|JWT auth| CI
|
||||
```
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph "Database Credential Rotation"
|
||||
VAULT_ROOT[Vault Root Creds] --> VAULT_DB_ENGINE[Vault DB Engine]
|
||||
VAULT_DB_ENGINE -->|Create role| DB_ROLE[DB Role: 7-day TTL]
|
||||
DB_ROLE -->|ESO syncs| K8S_SECRET[K8s Secret]
|
||||
K8S_SECRET -->|App reads| APP[Application Pod]
|
||||
APP -->|Uses rotated creds| DATABASE[(MySQL/PostgreSQL)]
|
||||
VAULT_DB_ENGINE -->|Revokes expired| DB_ROLE
|
||||
end
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| HashiCorp Vault | Latest | `stacks/vault/` | Secret storage, dynamic credentials, rotation |
|
||||
| External Secrets Operator | v1beta1 API | `stacks/external-secrets/` | Sync Vault secrets to K8s Secrets (52 total ExternalSecrets) |
|
||||
| Sealed Secrets | Latest | `stacks/platform/` | User-managed encrypted secrets |
|
||||
| SOPS | Latest | `scripts/state-sync`, `scripts/tg` | Terraform state encryption (Vault Transit + age) |
|
||||
| Vault K8s Auth | Enabled | `stacks/vault/` | CI/CD authentication via service account tokens |
|
||||
| Vault DB Engine | Enabled | `stacks/vault/` | Dynamic DB credentials for 7 MySQL + 5 PostgreSQL databases |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Vault KV: Single Source of Truth
|
||||
|
||||
`secret/viktor` contains 135+ keys covering:
|
||||
- API keys for external services
|
||||
- Database root passwords
|
||||
- SSH private keys
|
||||
- OAuth/OIDC client secrets
|
||||
- Application configuration secrets
|
||||
- Encryption keys
|
||||
|
||||
Authentication: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault Terraform provider. On `devvm`, `~/.vault-token` instead holds a long-lived **periodic** admin token auto-renewed daily by a systemd user timer (no weekly re-login) — see the [vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
|
||||
|
||||
### External Secrets Operator (ESO)
|
||||
|
||||
ESO syncs secrets from Vault to Kubernetes using two ClusterSecretStores:
|
||||
|
||||
1. **vault-kv**: Reads from Vault KV (`secret/viktor`)
|
||||
2. **vault-database**: Reads dynamic credentials from Vault DB engine
|
||||
|
||||
**52 total ExternalSecrets**:
|
||||
- 43 standard ExternalSecrets (API keys, tokens, configs)
|
||||
- 9 DB-creds ExternalSecrets (rotated database credentials)
|
||||
|
||||
ESO creates/updates K8s Secrets automatically when Vault values change. Applications consume these secrets via environment variables or volume mounts.
|
||||
|
||||
### Plan-Time Secret Access Pattern
|
||||
|
||||
**Recommended pattern** (no Vault dependency at plan time):
|
||||
|
||||
1. Apply ExternalSecret to create K8s Secret
|
||||
2. Stack uses `data "kubernetes_secret"` to read ESO-created secret at plan time
|
||||
3. No direct Vault provider needed in consuming stack
|
||||
|
||||
**First-apply gotcha**: Must apply ExternalSecret resource first, then run full apply (two-stage).
|
||||
|
||||
**Legacy pattern** (14 hybrid stacks still use):
|
||||
- Direct `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs)
|
||||
- Platform stack has 48 plan-time Vault references (cannot migrate due to circular dependency)
|
||||
|
||||
### Database Credential Rotation
|
||||
|
||||
Vault DB engine provides automatic 7 days credential rotation for:
|
||||
|
||||
**MySQL databases** (7):
|
||||
- speedtest
|
||||
- wrongmove
|
||||
- codimd
|
||||
- nextcloud
|
||||
- shlink
|
||||
- grafana
|
||||
- technitium
|
||||
|
||||
**PostgreSQL databases** (5):
|
||||
- health
|
||||
- linkwarden
|
||||
- affine
|
||||
- woodpecker
|
||||
- claude_memory
|
||||
|
||||
**Excluded from rotation**:
|
||||
- authentik (uses PgBouncer, incompatible with rotation)
|
||||
- crowdsec (Helm chart bakes credentials at install time)
|
||||
- Root user accounts (used for Vault itself to create rotated users)
|
||||
|
||||
Workflow:
|
||||
1. Vault rotates the database user's password (static role, 7-day period)
|
||||
2. ExternalSecrets Operator syncs new password to K8s Secret (15-min refresh)
|
||||
3. Apps read from K8s Secret via `secret_key_ref` env vars
|
||||
4. Special case: Technitium uses a CronJob to push password to its app config via API
|
||||
|
||||
### Kubernetes Credential Management
|
||||
|
||||
Vault K8s secrets engine provides dynamic service account tokens:
|
||||
|
||||
**Roles**:
|
||||
- `dashboard-admin`: Full cluster access for K8s dashboard
|
||||
- `ci-deployer`: CI/CD deployment permissions
|
||||
- `openclaw`: Claude Code container permissions
|
||||
- `local-admin`: Local development cluster access
|
||||
|
||||
Usage:
|
||||
```bash
|
||||
vault write kubernetes/creds/ROLE kubernetes_namespace=NS
|
||||
```
|
||||
|
||||
Returns a time-limited service account token and kubeconfig.
|
||||
|
||||
### CI/CD Secrets
|
||||
|
||||
**Woodpecker CI authentication**:
|
||||
1. Woodpecker runner uses Kubernetes SA JWT
|
||||
2. JWT validated via Vault K8s auth method
|
||||
3. Woodpecker receives Vault token
|
||||
4. Accesses secrets from `secret/ci/global`
|
||||
|
||||
**Secret sync CronJob**:
|
||||
- Runs every 6h
|
||||
- Reads `secret/ci/global` from Vault
|
||||
- Pushes to Woodpecker API via HTTP
|
||||
- Ensures CI secrets stay synchronized
|
||||
|
||||
### Sealed Secrets (User-Managed)
|
||||
|
||||
For users without Vault access (or git-friendly secret storage):
|
||||
|
||||
1. User creates plain K8s Secret YAML
|
||||
2. Encrypts with `kubeseal` CLI → `sealed-*.yaml`
|
||||
3. Commits encrypted file to git
|
||||
4. In-cluster controller decrypts at apply time
|
||||
5. Terraform picks up via `fileset()` + `for_each` on `kubernetes_manifest`
|
||||
|
||||
Public key stored in cluster, private key only accessible to controller.
|
||||
|
||||
### SOPS (State Encryption)
|
||||
|
||||
Terraform state files encrypted at rest:
|
||||
- `.tfstate.enc` files in git
|
||||
- Vault Transit engine (primary) + age key (fallback)
|
||||
- Scripts: `scripts/state-sync` (encrypt/decrypt), `scripts/tg` (terragrunt wrapper)
|
||||
- State decrypted in-memory during plan/apply, re-encrypted before commit
|
||||
|
||||
### Complex Types in Vault
|
||||
|
||||
Maps and lists stored as JSON strings in Vault KV:
|
||||
|
||||
```hcl
|
||||
# In Vault: key = '{"endpoint": "https://...", "token": "..."}'
|
||||
# In Terraform:
|
||||
config = jsondecode(data.vault_kv_secret_v2.app.data["config"])
|
||||
```
|
||||
|
||||
Required because Vault KV only supports string values at leaf nodes.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Vault Paths
|
||||
|
||||
- **Main secrets**: `secret/viktor` (135+ keys)
|
||||
- **CI/CD secrets**: `secret/ci/global`
|
||||
- **Database engine**: `database/creds/ROLE` (dynamic)
|
||||
- **Kubernetes engine**: `kubernetes/creds/ROLE` (dynamic)
|
||||
|
||||
### External Secrets Stack
|
||||
|
||||
**Location**: `stacks/external-secrets/`
|
||||
|
||||
**ClusterSecretStores**:
|
||||
```yaml
|
||||
apiVersion: external-secrets.io/v1beta1
|
||||
kind: ClusterSecretStore
|
||||
metadata:
|
||||
name: vault-kv
|
||||
spec:
|
||||
provider:
|
||||
vault:
|
||||
server: "http://vault-active.vault.svc.cluster.local:8200"
|
||||
path: secret
|
||||
version: v2
|
||||
auth:
|
||||
kubernetes:
|
||||
mountPath: kubernetes
|
||||
role: eso
|
||||
```
|
||||
|
||||
**ExternalSecret example**:
|
||||
```yaml
|
||||
apiVersion: external-secrets.io/v1beta1
|
||||
kind: ExternalSecret
|
||||
metadata:
|
||||
name: my-app-secrets
|
||||
spec:
|
||||
refreshInterval: 1h
|
||||
secretStoreRef:
|
||||
name: vault-kv
|
||||
kind: ClusterSecretStore
|
||||
target:
|
||||
name: my-app-secrets
|
||||
data:
|
||||
- secretKey: API_KEY
|
||||
remoteRef:
|
||||
key: viktor
|
||||
property: my_app_api_key
|
||||
```
|
||||
|
||||
### Vault Backup
|
||||
|
||||
**CronJob**: `vault-raft-backup`
|
||||
- Uses manually-created `vault-root-token` K8s Secret
|
||||
- Cannot use ESO (circular dependency during restore)
|
||||
- Backs up Raft storage to S3-compatible backend
|
||||
|
||||
### Terraform Provider Auth
|
||||
|
||||
The provider reads `VAULT_ADDR` from env and the token from `~/.vault-token`.
|
||||
That file is populated by `vault login -method=oidc` (humans, ad-hoc) — except
|
||||
on `devvm`, where it holds a long-lived **periodic** admin token (`display_name
|
||||
token-devvm-wizard`, `period=768h`, `explicit_max_ttl=0`, policies
|
||||
`default`+`sops-admin`+`vault-admin`) that a systemd user timer renews daily, so
|
||||
no weekly re-login is needed. A drift guard refuses to renew if a stray
|
||||
`vault login` clobbers the file with a foreign token. Deploy + recovery:
|
||||
[vault-token-renew-devvm runbook](../runbooks/vault-token-renew-devvm.md).
|
||||
|
||||
```hcl
|
||||
provider "vault" {
|
||||
# Reads VAULT_ADDR from env
|
||||
# Reads token from ~/.vault-token
|
||||
}
|
||||
```
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Vault over alternatives (AWS Secrets Manager, K8s Secrets, env files)?
|
||||
|
||||
**Centralized management**: Single source of truth for all secrets across infrastructure, applications, and CI/CD.
|
||||
|
||||
**Dynamic credentials**: Database and Kubernetes credentials rotated automatically, reducing blast radius of credential leaks.
|
||||
|
||||
**Audit logging**: Every secret access logged for security compliance.
|
||||
|
||||
**OIDC integration**: Secure human authentication via Authentik SSO (no static tokens for humans).
|
||||
|
||||
**Encryption at rest**: Secrets encrypted in Vault's storage backend.
|
||||
|
||||
### Why ESO over direct Vault injection (vault-agent, CSI driver)?
|
||||
|
||||
**Terraform compatibility**: `data "kubernetes_secret"` allows plan-time access without Vault provider dependency.
|
||||
|
||||
**Simpler pod configuration**: No sidecar containers or init containers required.
|
||||
|
||||
**Declarative sync**: ExternalSecret CRD describes desired state, ESO handles synchronization.
|
||||
|
||||
**Namespace isolation**: Each namespace can have its own ExternalSecrets without cluster-admin access to Vault.
|
||||
|
||||
### Why Sealed Secrets for users?
|
||||
|
||||
**No Vault access needed**: Users can encrypt secrets without Vault credentials.
|
||||
|
||||
**Git-friendly**: Encrypted YAML files can be committed safely to version control.
|
||||
|
||||
**Self-service**: Users manage their own secrets without admin intervention.
|
||||
|
||||
**Cluster-scoped encryption**: Encrypted for specific cluster, can't be decrypted elsewhere.
|
||||
|
||||
### Why SOPS for Terraform state?
|
||||
|
||||
**State contains secrets**: Terraform state includes sensitive values (DB passwords, API keys).
|
||||
|
||||
**Vault Transit integration**: Centralized key management (same as other encryption).
|
||||
|
||||
**Age fallback**: Offline decryption possible if Vault unavailable.
|
||||
|
||||
**Transparent workflow**: `scripts/tg` wrapper handles encrypt/decrypt automatically.
|
||||
|
||||
### Why Vault DB engine over static credentials?
|
||||
|
||||
**Automatic rotation**: 7-day TTL reduces credential exposure window.
|
||||
|
||||
**Audit trail**: Every credential generation logged in Vault.
|
||||
|
||||
**Revocation**: Credentials automatically revoked at TTL expiration.
|
||||
|
||||
**Least privilege**: Each app gets unique credentials, not shared root password.
|
||||
|
||||
### Why exclude platform stack from Vault dependency?
|
||||
|
||||
**Circular dependency**: Vault runs on platform (storage, networking), platform can't wait for Vault.
|
||||
|
||||
**Bootstrap order**: Platform must deploy first, then Vault, then app stacks.
|
||||
|
||||
**Resilience**: Platform stack can be re-applied even if Vault is down.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### ExternalSecret shows "SecretSyncedError"
|
||||
|
||||
1. Check Vault auth: `kubectl logs -n external-secrets deployment/external-secrets`
|
||||
2. Verify Vault path exists: `vault kv get secret/viktor`
|
||||
3. Check RBAC: ESO service account needs Vault role binding
|
||||
4. Verify network: ESO pod can reach Vault service
|
||||
|
||||
### Rotated database credentials not working
|
||||
|
||||
1. Check Vault DB connection: `vault read database/config/my-db`
|
||||
2. Verify role TTL: `vault read database/roles/my-app`
|
||||
3. Check ESO refresh interval: ExternalSecret may not have synced yet
|
||||
4. Verify app is reading latest secret: `kubectl get secret my-db-creds -o yaml`
|
||||
|
||||
### Terraform plan fails with "secret not found"
|
||||
|
||||
First-apply issue:
|
||||
1. Apply ExternalSecret first: `terraform apply -target=kubernetes_manifest.external_secret`
|
||||
2. Wait for ESO to create K8s Secret: `kubectl wait --for=condition=Ready externalsecret/my-secret`
|
||||
3. Apply rest of stack: `terraform apply`
|
||||
|
||||
### CI/CD cannot access Vault
|
||||
|
||||
1. Check Woodpecker SA token: `kubectl get sa -n woodpecker woodpecker-runner -o yaml`
|
||||
2. Verify Vault K8s auth config: `vault read auth/kubernetes/config`
|
||||
3. Check Vault role binding: `vault read auth/kubernetes/role/ci-deployer`
|
||||
4. Review Vault audit logs: `vault audit list`
|
||||
|
||||
### Sealed Secret won't decrypt
|
||||
|
||||
1. Verify controller is running: `kubectl get pods -n kube-system -l app=sealed-secrets`
|
||||
2. Check encryption was for correct cluster: `kubeseal --fetch-cert` matches cert used for encryption
|
||||
3. Review controller logs: `kubectl logs -n kube-system deployment/sealed-secrets-controller`
|
||||
4. Ensure `sealed-*.yaml` hasn't been manually edited (breaks signature)
|
||||
|
||||
### SOPS state decryption fails
|
||||
|
||||
1. Check Vault access: `vault token lookup`
|
||||
2. Verify Transit engine: `vault read transit/keys/terraform-state`
|
||||
3. Check age key fallback: `~/.config/sops/age/keys.txt` exists
|
||||
4. Run manual decrypt: `scripts/state-sync decrypt path/to/state.tfstate.enc`
|
||||
|
||||
### Complex type (map/list) not parsing from Vault
|
||||
|
||||
Ensure value in Vault is valid JSON:
|
||||
```bash
|
||||
vault kv get -field=my_config secret/viktor | jq .
|
||||
```
|
||||
|
||||
If invalid JSON, update in Vault:
|
||||
```bash
|
||||
vault kv put secret/viktor my_config='{"key": "value"}'
|
||||
```
|
||||
|
||||
In Terraform:
|
||||
```hcl
|
||||
config = jsondecode(data.vault_kv_secret_v2.app.data["my_config"])
|
||||
```
|
||||
|
||||
## Related
|
||||
|
||||
- [Vault Deployment](../../stacks/vault/README.md) - Vault Terraform configuration
|
||||
- [External Secrets Stack](../../stacks/external-secrets/README.md) - ESO deployment and ExternalSecret definitions
|
||||
- [Backup & DR](./backup-dr.md) - Vault backup strategy
|
||||
- [Monitoring](./monitoring.md) - Grafana OIDC via Authentik (Vault-stored client secret)
|
||||
- [CI/CD Runbook](../runbooks/ci-cd.md) - Woodpecker Vault authentication
|
||||
517
docs/architecture/security.md
Normal file
517
docs/architecture/security.md
Normal file
|
|
@ -0,0 +1,517 @@
|
|||
# Security & L7 Protection
|
||||
|
||||
## Overview
|
||||
|
||||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
Internet[Internet]
|
||||
CF[Cloudflare WAF]
|
||||
Tunnel[Cloudflared Tunnel]
|
||||
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
|
||||
AntiAI[Anti-AI Check<br/>poison-fountain]
|
||||
ForwardAuth[Authentik ForwardAuth]
|
||||
RateLimit[Rate Limit Middleware]
|
||||
Retry[Retry Middleware<br/>2 attempts, 100ms]
|
||||
Backend[Backend Service]
|
||||
|
||||
LAPI[CrowdSec LAPI<br/>3 replicas]
|
||||
Agent[CrowdSec Agent]
|
||||
|
||||
Internet -->|1| CF
|
||||
CF -->|2| Tunnel
|
||||
Tunnel -->|3| CrowdSec
|
||||
CrowdSec -.->|Query| LAPI
|
||||
Agent -.->|Report| LAPI
|
||||
CrowdSec -->|4. Pass/Block| AntiAI
|
||||
AntiAI -->|5. Human/Bot| ForwardAuth
|
||||
ForwardAuth -->|6. Authenticated| RateLimit
|
||||
RateLimit -->|7. Under Limit| Retry
|
||||
Retry -->|8. Success/Retry| Backend
|
||||
|
||||
style CrowdSec fill:#f9f,stroke:#333
|
||||
style AntiAI fill:#ff9,stroke:#333
|
||||
style ForwardAuth fill:#9f9,stroke:#333
|
||||
style RateLimit fill:#99f,stroke:#333
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
|
||||
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
|
||||
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
|
||||
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
|
||||
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
|
||||
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
|
||||
| Traefik | Latest | `stacks/platform/` | Ingress controller with HTTP/3 (QUIC) |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Request Security Layers
|
||||
|
||||
Every incoming request passes through 6 security layers:
|
||||
|
||||
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
|
||||
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
|
||||
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
|
||||
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
|
||||
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
|
||||
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
|
||||
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
|
||||
|
||||
### CrowdSec Threat Intelligence
|
||||
|
||||
CrowdSec operates in a hub-and-agent model:
|
||||
|
||||
**LAPI (Local API)**:
|
||||
- 3 replicas for high availability
|
||||
- Aggregates threat intelligence from agent + community
|
||||
- Maintains ban list (IP reputation database)
|
||||
- Version pinned to prevent breaking changes
|
||||
|
||||
**Agent**:
|
||||
- Parses Traefik access logs
|
||||
- Detects attack scenarios (SQL injection, directory traversal, brute force)
|
||||
- Reports malicious IPs to LAPI
|
||||
- Shares threat intel with CrowdSec community (anonymized)
|
||||
|
||||
**Traefik Bouncer Plugin**:
|
||||
- Integrated as Traefik middleware
|
||||
- Queries LAPI for IP reputation on each request
|
||||
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
|
||||
- Blocks IPs on ban list, allows others
|
||||
|
||||
**Metabase** (disabled by default):
|
||||
- Dashboard for CrowdSec analytics
|
||||
- CPU-intensive, only enable when investigating incidents
|
||||
|
||||
### Kyverno Policy Engine
|
||||
|
||||
Kyverno enforces cluster-wide policies via admission webhooks. All policies use `failurePolicy=Ignore` to prevent blocking cluster operations.
|
||||
|
||||
#### 5-Tier Resource Governance
|
||||
|
||||
Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-generates:
|
||||
|
||||
- **LimitRange** - Per-container CPU/memory limits
|
||||
- **ResourceQuota** - Namespace-wide resource caps
|
||||
|
||||
| Tier | CPU Limit/Container | Memory Limit/Container | Namespace CPU Quota | Namespace Memory Quota |
|
||||
|------|---------------------|------------------------|---------------------|------------------------|
|
||||
| 0 | 100m | 128Mi | 500m | 512Mi |
|
||||
| 1 | 250m | 256Mi | 1000m | 1Gi |
|
||||
| 2 | 500m | 512Mi | 2000m | 2Gi |
|
||||
| 3 | 1000m | 1Gi | 4000m | 4Gi |
|
||||
| 4 | 2000m | 2Gi | 8000m | 8Gi |
|
||||
|
||||
This prevents resource exhaustion and enforces governance without manual quota management.
|
||||
|
||||
#### Security Policies
|
||||
|
||||
**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
|
||||
|
||||
**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
|
||||
|
||||
| Policy | Purpose | Current | Planned (wave 1) |
|
||||
|--------|---------|---------|------------------|
|
||||
| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
|
||||
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
|
||||
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
|
||||
| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
|
||||
|
||||
Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.
|
||||
|
||||
#### Operational Policies
|
||||
|
||||
| Policy | Purpose | Mode |
|
||||
|--------|---------|------|
|
||||
| `inject-priority-class-from-tier` | Set pod priorityClass based on namespace tier | Enforce (CREATE only) |
|
||||
| `inject-ndots` | Set DNS `ndots:2` for faster lookups | Enforce |
|
||||
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
|
||||
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
|
||||
|
||||
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||||
|
||||
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
|
||||
|
||||
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
|
||||
|
||||
#### Layer 1: Bot Blocking (ForwardAuth)
|
||||
|
||||
- `ai-bot-block` middleware forward-auths to the `bot-block-proxy` openresty
|
||||
service (`stacks/traefik/modules/traefik/main.tf`) — the bot-check hop before
|
||||
the backend.
|
||||
- **Currently a no-op (allow-all).** `poison-fountain` is intentionally scaled
|
||||
to 0 (clears the ExternalAccessDivergence alert), so `bot-block-proxy`
|
||||
short-circuits `/auth` to `return 200 "allowed"` instead of proxying to an
|
||||
absent upstream. Same effective behaviour as the previous `proxy_pass` +
|
||||
`error_page 5xx=200` fail-open, minus the ~51k/hr upstream-connect error logs
|
||||
and per-request connect latency it generated (cleaned up 2026-06-05, found via
|
||||
Loki). The Deployment carries `configmap.reloader.stakater.com/reload` so
|
||||
config changes actually reload openresty (it does not hot-reload on its own).
|
||||
- **To re-enable real bot-blocking**: restore the `upstream poison_fountain` +
|
||||
`proxy_pass http://poison_fountain;` block in the `bot-block-proxy-config`
|
||||
ConfigMap (git history) and scale `poison-fountain` up. It then forward-auths
|
||||
bot checks (User-Agent / patterns) and tarpits known AI scrapers, fail-open if
|
||||
poison-fountain is down.
|
||||
|
||||
#### Layer 2: X-Robots-Tag Header
|
||||
|
||||
- HTTP response header: `X-Robots-Tag: noai, noindex, nofollow`
|
||||
- Instructs compliant bots to skip content
|
||||
- Lightweight, no performance impact
|
||||
|
||||
#### ~~Layer 3: Trap Links~~ (REMOVED)
|
||||
|
||||
Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap links broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
|
||||
|
||||
#### Layer 3 (formerly 4): Tarpit / Poison Content
|
||||
|
||||
- `poison-fountain` exists as a standalone service at `poison.viktorbarzin.me` but the serving Deployment is **scaled to 0** (replicas=0); only its 6-hourly content-fetch CronJob runs. The tarpit is therefore dormant until re-enabled.
|
||||
- When running: serves AI bots extremely slowly (~50 bytes / 0.5s tarpit drip)
|
||||
- CronJob every 6 hours generates fake content
|
||||
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly would get tarpitted and poisoned
|
||||
|
||||
**Implementation**: See `stacks/poison-fountain/` and `stacks/traefik/modules/traefik/{middleware.tf,main.tf}` (traefik moved from the platform stack to its own `traefik` stack)
|
||||
|
||||
### Audit Logging & Anomaly Detection (Wave 1)
|
||||
|
||||
Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
|
||||
|
||||
| Item | State |
|
||||
|---|---|
|
||||
| W1.2 Vault `file` audit device | **LIVE** — `vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
|
||||
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
|
||||
| W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
|
||||
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
|
||||
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
|
||||
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
|
||||
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
|
||||
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
|
||||
| W1.7 NetworkPolicy phased enforce | **PARTIAL ANALYSIS** — first observation snapshot at `docs/architecture/wave1-egress-observation-2026-05-22.md` (36 source namespaces seen so far, 29 thin-profile candidates). Recommend continuing observation through 2026-05-29 (full week) before any enforce flip. Pilot enforce target: `recruiter-responder` (2 destinations only). `servarr` stays in Log+Allow indefinitely (BitTorrent P2P incompatible with static enforce). |
|
||||
|
||||
The block below documents the locked design.
|
||||
|
||||
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
|
||||
|
||||
#### Detection sources
|
||||
|
||||
| Source | Mechanism | Ships via | Loki job label |
|
||||
|---|---|---|---|
|
||||
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
|
||||
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
|
||||
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
|
||||
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
|
||||
|
||||
#### Alert rules (16 total)
|
||||
|
||||
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
|
||||
|
||||
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
|
||||
|
||||
| # | Event | Severity |
|
||||
|---|---|---|
|
||||
| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
|
||||
| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
|
||||
| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
|
||||
| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
|
||||
| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
|
||||
| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
|
||||
| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
|
||||
| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
|
||||
|
||||
**Vault audit (V1-V7):**
|
||||
|
||||
| # | Event | Severity |
|
||||
|---|---|---|
|
||||
| V1 | Root token created | critical |
|
||||
| V2 | Audit device disabled or modified | critical |
|
||||
| V3 | Seal status changed (`sys/seal` write) | critical |
|
||||
| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
|
||||
| V5 | Authentication failure spike >10/min on any auth method | warning |
|
||||
| V6 | Token created with policies different from parent (privilege escalation) | critical |
|
||||
| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
|
||||
|
||||
**Host (S1):**
|
||||
|
||||
| # | Event | Severity |
|
||||
|---|---|---|
|
||||
| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
|
||||
|
||||
#### Allowlist — "expected source IPs" for K2, K9, V7, S1
|
||||
|
||||
| CIDR | Source |
|
||||
|---|---|
|
||||
| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
|
||||
| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
|
||||
| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
|
||||
| K8s service CIDR | Service-to-apiserver traffic |
|
||||
| Headscale tailnet | VPN-connected devices |
|
||||
|
||||
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
|
||||
|
||||
#### Why no canary tokens
|
||||
|
||||
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
|
||||
|
||||
#### Why no K1 (cluster-admin grant detection)
|
||||
|
||||
Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
|
||||
|
||||
#### IOPS / disk-wear
|
||||
|
||||
Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
|
||||
|
||||
### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
|
||||
|
||||
Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
|
||||
|
||||
**Approach (γ): cluster-wide observe-then-enforce.**
|
||||
|
||||
1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
|
||||
2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
|
||||
3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
|
||||
|
||||
**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
|
||||
|
||||
**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
|
||||
|
||||
**Known risks:**
|
||||
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
|
||||
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
|
||||
|
||||
### TLS & HTTP/3
|
||||
|
||||
**Traefik** handles TLS termination:
|
||||
- HTTP/3 (QUIC) enabled for performance
|
||||
- Automatic HTTP → HTTPS redirect
|
||||
- cert-manager/certbot manages certificate lifecycle
|
||||
- Let's Encrypt integration for automatic renewal
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
**Per-source IP limits**:
|
||||
- Default: 100 requests/minute
|
||||
- Returns **429 Too Many Requests** (not 503)
|
||||
- Higher limits for upload-heavy services:
|
||||
- Immich: 500 req/min (photo uploads)
|
||||
- Nextcloud: 300 req/min (file sync)
|
||||
|
||||
**Retry Middleware**:
|
||||
- 2 attempts max
|
||||
- 100ms delay between retries
|
||||
- Applied after rate limiting
|
||||
- Handles transient backend errors
|
||||
|
||||
### Fallback Proxies
|
||||
|
||||
**Authentik Fallback**:
|
||||
- If Authentik down, falls back to basicAuth
|
||||
- Prevents total service outage during IdP maintenance
|
||||
- Temporary credentials stored in Vault
|
||||
|
||||
**Poison-Fountain Fallback**:
|
||||
- If anti-AI service down, allows all traffic
|
||||
- Fail-open prevents blocking legitimate users
|
||||
- Monitors for service health, auto-recovers
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Config Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
|
||||
| `stacks/kyverno/` | Kyverno deployment + policies |
|
||||
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
|
||||
| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
|
||||
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
|
||||
|
||||
### Vault Paths
|
||||
|
||||
- **CrowdSec API key**: `secret/crowdsec/api-key` - LAPI authentication
|
||||
- **BasicAuth fallback**: `secret/authentik/fallback-creds` - Emergency auth
|
||||
- **TLS certificates**: `secret/tls/` - Certificate private keys
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
- `stacks/crowdsec/` - CrowdSec infrastructure
|
||||
- `stacks/kyverno/` - Policy engine
|
||||
- `stacks/poison-fountain/` - Anti-AI defense
|
||||
- `stacks/platform/` - Traefik + middleware
|
||||
|
||||
### Per-Service Security Config
|
||||
|
||||
```hcl
|
||||
module "myapp_ingress" {
|
||||
source = "./modules/ingress_factory"
|
||||
|
||||
name = "myapp"
|
||||
host = "myapp.viktorbarzin.me"
|
||||
|
||||
# Security toggles
|
||||
protected = true # Enable ForwardAuth
|
||||
anti_ai_scraping = false # Disable anti-AI (e.g., for public API)
|
||||
rate_limit = 200 # Custom rate limit (req/min)
|
||||
}
|
||||
```
|
||||
|
||||
### Kyverno Policy Example
|
||||
|
||||
```yaml
|
||||
apiVersion: kyverno.io/v1
|
||||
kind: ClusterPolicy
|
||||
metadata:
|
||||
name: inject-ndots
|
||||
spec:
|
||||
background: false
|
||||
rules:
|
||||
- name: inject-ndots
|
||||
match:
|
||||
resources:
|
||||
kinds:
|
||||
- Pod
|
||||
mutate:
|
||||
patchStrategicMerge:
|
||||
spec:
|
||||
dnsConfig:
|
||||
options:
|
||||
- name: ndots
|
||||
value: "2"
|
||||
```
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why CrowdSec over ModSecurity?
|
||||
|
||||
- **Community threat intelligence**: Shared ban lists, crowdsourced attack detection
|
||||
- **Easier management**: YAML scenarios vs complex ModSecurity rules
|
||||
- **Better performance**: Lightweight Go agent vs resource-heavy Apache module
|
||||
- **Active development**: More frequent updates, responsive community
|
||||
|
||||
### Why Audit-Only Security Policies?
|
||||
|
||||
- **Gradual rollout**: Identify violations without breaking existing workloads
|
||||
- **Risk reduction**: Prevents policy bugs from blocking critical deployments
|
||||
- **Better observability**: Collect violation metrics before enforcing
|
||||
- **Selective enforcement**: Move to enforce mode per-policy after validation
|
||||
|
||||
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
|
||||
|
||||
- **Defense in depth**: Each layer catches different bot types
|
||||
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
|
||||
- **Persistent bots**: Tarpit makes scraping uneconomical
|
||||
- **Poison content**: Degrades training data for bots that reach poison-fountain
|
||||
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
|
||||
|
||||
### Why Fail-Open Mode?
|
||||
|
||||
- **Availability over security**: Homelab prioritizes uptime
|
||||
- **Graceful degradation**: Single component failure doesn't cascade
|
||||
- **Manual intervention**: Security incidents are rare, can handle manually
|
||||
- **Layer redundancy**: If one layer fails, others still protect
|
||||
|
||||
### Why Pin CrowdSec/Kyverno Versions?
|
||||
|
||||
- **Breaking changes**: Both projects had breaking config changes in past
|
||||
- **Controlled upgrades**: Test in staging before upgrading production
|
||||
- **Stability**: Prevents auto-upgrade during outages
|
||||
- **Rollback**: Easy to revert if upgrade causes issues
|
||||
|
||||
### Why HTTP/3 (QUIC)?
|
||||
|
||||
- **Performance**: Lower latency, better mobile performance
|
||||
- **Connection migration**: Survives IP changes (mobile networks)
|
||||
- **0-RTT**: Faster TLS handshake for repeat visitors
|
||||
- **Future-proof**: Industry moving to HTTP/3
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### CrowdSec Blocking Legitimate IP
|
||||
|
||||
**Problem**: Legitimate user IP on ban list.
|
||||
|
||||
**Fix**:
|
||||
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
|
||||
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
|
||||
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
|
||||
|
||||
### Kyverno Policy Blocking Deployment
|
||||
|
||||
**Problem**: Pod creation fails with policy violation.
|
||||
|
||||
**Fix**:
|
||||
1. Check policy reports: `kubectl get policyreport -A`
|
||||
2. Verify `failurePolicy=Ignore` is set (should never block)
|
||||
3. If blocking, temporarily disable policy: `kubectl annotate clusterpolicy <policy> kyverno.io/exclude=true`
|
||||
4. Investigate root cause, fix workload or update policy
|
||||
|
||||
### Anti-AI Service Down, Traffic Blocked
|
||||
|
||||
**Problem**: anti-AI ForwardAuth (`ai-bot-block`) blocks traffic. With `bot-block-proxy` as a no-op `return 200` (poison-fountain scaled to 0) this should not happen; if it does, `bot-block-proxy` itself is unreachable (Traefik ForwardAuth fails **closed** when the auth server is down).
|
||||
|
||||
**Fix**:
|
||||
1. Check `bot-block-proxy` pods are Ready: `kubectl get pods -n traefik -l app=bot-block-proxy` (2 replicas; critical-path forward-auth target).
|
||||
2. Inspect/restart: `kubectl rollout restart deployment/bot-block-proxy -n traefik`. Config lives in the `bot-block-proxy-config` ConfigMap (`stacks/traefik/modules/traefik/main.tf`); changes auto-reload via the `configmap.reloader.stakater.com/reload` annotation.
|
||||
3. Temporary disable: Set `anti_ai_scraping = false` in `ingress_factory` for affected services.
|
||||
|
||||
### Rate Limit Too Aggressive
|
||||
|
||||
**Problem**: Legitimate users getting 429 errors.
|
||||
|
||||
**Fix**:
|
||||
1. Check Traefik logs for rate limit hits: `kubectl logs -n traefik -l app=traefik | grep 429`
|
||||
2. Increase limit in `ingress_factory`: `rate_limit = 300`
|
||||
3. Apply: `terraform apply`
|
||||
|
||||
### HTTP/3 Not Working
|
||||
|
||||
**Problem**: Browser shows HTTP/2, not HTTP/3.
|
||||
|
||||
**Fix**:
|
||||
1. Verify Traefik HTTP/3 enabled: `kubectl get cm traefik-config -o yaml | grep http3`
|
||||
2. Check UDP port 443 accessible: `nc -u <public-ip> 443`
|
||||
3. Browser support: Use Chrome/Firefox dev tools, check Protocol column
|
||||
|
||||
### TLS Certificate Expired
|
||||
|
||||
**Problem**: Browser shows certificate expired.
|
||||
|
||||
**Fix**:
|
||||
1. Check cert-manager: `kubectl get certificate -A`
|
||||
2. Force renewal: `kubectl delete secret <tls-secret> -n <namespace>`
|
||||
3. cert-manager will auto-renew within 5 minutes
|
||||
4. If fails, check Let's Encrypt rate limits
|
||||
|
||||
### Traefik Retry Loop
|
||||
|
||||
**Problem**: Backend logs show duplicate requests.
|
||||
|
||||
**Fix**:
|
||||
1. Check retry middleware config: Should be 2 attempts max
|
||||
2. Verify backend isn't returning transient errors: Check for 5xx responses
|
||||
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
|
||||
|
||||
### Poison Content Not Serving (Updated 2026-04-17)
|
||||
|
||||
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
|
||||
|
||||
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
|
||||
|
||||
**Fix**:
|
||||
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
|
||||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
|
||||
3. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
|
||||
|
||||
## Related
|
||||
|
||||
- [Authentication & Authorization](./authentication.md) - Authentik, OIDC, ForwardAuth
|
||||
- [Networking](./networking.md) - Ingress, DNS, load balancing
|
||||
- [Monitoring](./monitoring.md) - Prometheus, Grafana, alerting
|
||||
- [CrowdSec Runbook](../runbooks/crowdsec.md) - CrowdSec operations
|
||||
- [Kyverno Policy Management](../runbooks/kyverno.md) - Policy authoring and troubleshooting
|
||||
381
docs/architecture/storage.md
Normal file
381
docs/architecture/storage.md
Normal file
|
|
@ -0,0 +1,381 @@
|
|||
# Storage Architecture
|
||||
|
||||
Last updated: 2026-05-24
|
||||
|
||||
## Overview
|
||||
|
||||
The cluster uses two storage backends: **Proxmox CSI** for database block storage and **Proxmox NFS** for application data.
|
||||
|
||||
**Block storage (Proxmox CSI)**: ~69 PVCs for databases and stateful apps use two StorageClasses provisioned from the same `local-lvm` thin pool (sdc, 10.7TB RAID1 HDD):
|
||||
- **`proxmox-lvm`**: Unencrypted block storage for non-sensitive workloads (~26 PVCs)
|
||||
- **`proxmox-lvm-encrypted`**: LUKS2-encrypted block storage for all sensitive data (~43 PVCs) — databases, auth, email, password managers, git repos, health data, etc. Uses Argon2id key derivation with passphrase from Vault KV.
|
||||
- **Both StorageClasses use `reclaimPolicy: Retain`.** Deleting a PVC frees the SCSI-LUN slot (the volume is detached) but **retains the underlying LV** for data safety — the PV goes `Released` and the LV (plus its daily `lvm-pvc-snapshot` snapshots) lingers on the thin pool. ~63 such orphan Released PVs exist as of 2026-06-05; batch orphan-LV reclaim is tracked in beads `code-dfjn`. The slot is freed regardless — orphans consume thin-pool space, not LUN slots.
|
||||
|
||||
All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on 2026-04-15. This eliminates the previous double-CoW (ZFS + LVM-thin) path and ensures data-at-rest encryption.
|
||||
|
||||
**NFS storage (Proxmox host)**: ~100 NFS shares for media libraries (Immich, audiobookshelf, servarr, navidrome), backup targets (`*-backup/` directories), and app data are served directly from the Proxmox host at `192.168.1.127`. Two NFS export roots exist:
|
||||
- **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets
|
||||
- **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML)
|
||||
|
||||
Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster.
|
||||
|
||||
**Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc).
|
||||
|
||||
**History (2026-04-02)**: iSCSI block volumes migrated from democratic-csi (TrueNAS iSCSI → ZFS → LVM-thin) to Proxmox CSI (direct LVM-thin hotplug). democratic-csi iSCSI driver removed.
|
||||
|
||||
**History (2026-04-13)**: TrueNAS (VM 9000, 10.0.10.15) fully decommissioned. NFS storage migrated to the Proxmox host (192.168.1.127). ZFS datasets under `/mnt/main/` and `/mnt/ssd/` moved to ext4 LVs at `/srv/nfs/` and `/srv/nfs-ssd/`. Legacy PVs referencing `/mnt/main/` paths still work (bind-mounted or symlinked on the Proxmox host); new PVs use `/srv/nfs/` and `/srv/nfs-ssd/`. TrueNAS VM still exists in stopped state on PVE pending user decision on deletion.
|
||||
|
||||
**History (2026-06-05) — Wave 2 NFS migration + strategy decision**: Decided to **keep proxmox-csi and harden it** (option ① — keeps PVC mobility, £0, no new hardware) rather than re-architect to TopoLVM (pins PVCs to a node) or Longhorn (2× write-amplification on the single shared sdc HDD). See `docs/plans/2026-06-05-block-storage-harden-nfs-design.md`. Migrated 5 non-DB, embedded-DB-free workloads off block to NFS to relieve the per-VM LUN cap: **tandoor** (media, PG-backed), **speedtest** (config, MySQL), **hackmd** (image uploads, MySQL — dropped LUKS for low-sensitivity images), **changedetection** (JSON datastore), **send** (upload blobs, Redis). Freed 5 SCSI-LUN slots (4 on the then-hot node6, 21→16). Each followed the scale-0 → busybox mover (`cp -a`) → swap `claim_name` → delete block PVC pattern. (Phase-1 follow-on 2026-06-05: insta2spotify also migrated — note its reschedule re-pulled a 3.26 GB image, a ~6 min blip; large-image services incur a pull-delay when a migration moves the pod to a fresh node.)
|
||||
|
||||
**The "harden" half is now SHIPPED (2026-06-05):**
|
||||
- **Orphan cleanup** — removed 67 `Released` proxmox PVs + 475 orphan LVs/snapshots (VG `pve` 997 → ~410 LVs; thin pool freed). 1 LV left (`f127a41c`, stuck-open stale qemu fd — harmless, clears on node reboot; do not force `dmsetup remove`).
|
||||
- **Ghost-loop prevention** — `csi-ghost-reconcile` CronJob (`stacks/proxmox-csi/ghost-reconcile.tf`, every 15 min) compares each worker VM's real scsi disks (Proxmox API, scoped CSI token) against k8s VolumeAttachments and safely detaches ghosts (`PUT .../config delete=scsiN`); detection mirrors check #47, with a 60 s re-confirm + per-run cap-5. Verified live (66 VAs, 0 ghosts). This closes the doom loop by construction — **beads `code-dfjn` can be retired.**
|
||||
- **Cap deliberately kept at 28** (NOT lowered to 24): the labeler value (`stacks/proxmox-csi/.../main.tf` `node_labels`) was raised 24→28 per the 2026-05-25 eviction-cascade post-mortem; lowering it would reverse that fix. With auto-reconcile keeping drift at 0, the 28 cap is safe.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph Proxmox["Proxmox Host (192.168.1.127)"]
|
||||
sdc["sdc: 10.7TB RAID1 HDD<br/>VG pve, LV data (thin pool)<br/>~67 proxmox-lvm PVCs<br/>~28 proxmox-lvm-encrypted PVCs"]
|
||||
sda["sda: 1.1TB RAID1 SAS<br/>VG backup, LV data (ext4)<br/>/mnt/backup"]
|
||||
NFS_HDD["LV pve/nfs-data (4TB ext4)<br/>/srv/nfs<br/>~100 NFS shares<br/>Media + backup targets"]
|
||||
NFS_SSD["LV ssd/nfs-ssd-data (100GB ext4)<br/>/srv/nfs-ssd<br/>High-performance data<br/>(Immich ML)"]
|
||||
NFS_Exports["NFS Exports<br/>managed by /etc/exports"]
|
||||
NFS_HDD --> NFS_Exports
|
||||
NFS_SSD --> NFS_Exports
|
||||
end
|
||||
|
||||
subgraph K8s["Kubernetes Cluster"]
|
||||
CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"]
|
||||
CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"]
|
||||
|
||||
NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"]
|
||||
Block_PV["Block PersistentVolumes<br/>RWO, ~67 PVCs (unencrypted)"]
|
||||
Enc_PV["Encrypted Block PVs<br/>RWO, ~28 PVCs (LUKS2)"]
|
||||
|
||||
Pods["Application Pods"]
|
||||
DBPods["Database Pods<br/>PostgreSQL CNPG<br/>MySQL InnoDB"]
|
||||
end
|
||||
|
||||
NFS_Exports -->|NFS mount| CSI_NFS
|
||||
sdc -->|LVM-thin hotplug| CSI_PVE
|
||||
|
||||
CSI_NFS --> NFS_PV
|
||||
CSI_PVE --> Block_PV
|
||||
CSI_PVE --> Enc_PV
|
||||
|
||||
NFS_PV --> Pods
|
||||
Block_PV --> Pods
|
||||
Enc_PV --> DBPods
|
||||
|
||||
style Proxmox fill:#e1f5ff
|
||||
style K8s fill:#fff4e1
|
||||
style NFS_HDD fill:#c8e6c9
|
||||
style NFS_SSD fill:#ffe0b2
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version/Config | Location | Purpose |
|
||||
|-----------|---------------|----------|---------|
|
||||
| **Proxmox CSI plugin** | Helm chart | Namespace: proxmox-csi | Block storage via LVM-thin hotplug |
|
||||
| **StorageClass `proxmox-lvm`** | RWO, WaitForFirstConsumer | Cluster-wide | Non-sensitive stateful apps |
|
||||
| **StorageClass `proxmox-lvm-encrypted`** | RWO, WaitForFirstConsumer, LUKS2 | Cluster-wide | **All sensitive data** (databases, auth, email, passwords, git) |
|
||||
| Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services |
|
||||
| Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) |
|
||||
| nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver |
|
||||
| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host |
|
||||
| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. |
|
||||
| TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory |
|
||||
| ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. |
|
||||
| ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) |
|
||||
| ~~StorageClass `iscsi-truenas`~~ | **REMOVED** | Was cluster-wide | Replaced by `proxmox-lvm` |
|
||||
|
||||
## How It Works
|
||||
|
||||
### NFS Storage Flow
|
||||
|
||||
1. **Directory creation**: NFS share directories are created under `/srv/nfs/<service>` (HDD) or `/srv/nfs-ssd/<service>` (SSD) on the Proxmox host
|
||||
2. **Export configuration**: `/etc/exports` on the Proxmox host lists per-directory NFS exports
|
||||
3. **Terraform module**: Stacks use `modules/kubernetes/nfs_volume/` to declaratively create static PV + PVC pairs:
|
||||
```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "immich-data"
|
||||
namespace = kubernetes_namespace.immich.metadata[0].name
|
||||
nfs_server = var.nfs_server # 192.168.1.127
|
||||
nfs_path = "/srv/nfs/immich"
|
||||
}
|
||||
```
|
||||
4. **Pod mount**: Applications reference PVCs in their deployment specs
|
||||
5. **Mount options**: All NFS mounts use `soft,timeo=30,retrans=3` (set in StorageClass) to prevent indefinite hangs
|
||||
|
||||
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`.
|
||||
|
||||
**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs.
|
||||
|
||||
### Block Storage Flow (Proxmox CSI) — NEW
|
||||
|
||||
1. **PVC creation**: Pod requests a PVC with `storageClass: proxmox-lvm`
|
||||
2. **CSI provisioning**: Proxmox CSI plugin calls the Proxmox API to create a thin LV in the `local-lvm` storage
|
||||
3. **SCSI hotplug**: The thin LV is hotplugged as a VirtIO-SCSI disk directly into the K8s node VM
|
||||
4. **Filesystem**: CSI formats the disk as ext4 and mounts it into the pod
|
||||
5. **Exclusive access**: RWO only — disk is attached to one VM at a time
|
||||
6. **Topology**: Nodes are labeled with `topology.kubernetes.io/region=pve` and `zone=pve` for scheduling
|
||||
|
||||
**Key advantage**: Single CoW layer (LVM-thin only). No ZFS, no iSCSI network hop, no double-CoW corruption.
|
||||
|
||||
**Proxmox API token**: `csi@pve!csi-token` with CSI role (`VM.Audit VM.Config.Disk Datastore.Allocate Datastore.AllocateSpace Datastore.Audit`). Stored in Vault at `secret/viktor`.
|
||||
|
||||
### Encrypted Block Storage Flow (proxmox-lvm-encrypted) — 2026-04-15
|
||||
|
||||
1. **PVC creation**: Pod requests a PVC with `storageClass: proxmox-lvm-encrypted`
|
||||
2. **CSI provisioning**: Same as `proxmox-lvm` — thin LV created in `local-lvm`
|
||||
3. **LUKS encryption**: CSI node plugin reads the encryption passphrase from K8s Secret `proxmox-csi-encryption` (namespace `kube-system`), formats the disk with LUKS2 (Argon2id key derivation), then creates ext4 on top
|
||||
4. **Transparent mounting**: Application sees a normal ext4 filesystem — encryption/decryption is handled by dm-crypt in the kernel
|
||||
5. **Passphrase management**: ExternalSecret syncs passphrase from Vault KV (`secret/viktor/proxmox_csi_encryption_passphrase`) → K8s Secret. Backup key at `/root/.luks-backup-key` on PVE host.
|
||||
|
||||
**Services on encrypted storage (2026-04-15 migration):**
|
||||
vaultwarden, dbaas (mysql+pg+pgadmin), mailserver, nextcloud, forgejo, matrix, n8n, affine, health, hackmd, redis, headscale, frigate, meshcentral, technitium, actualbudget, grampsweb, owntracks, wealthfolio, monitoring (alertmanager)
|
||||
|
||||
**Services migrated later** (post-audit catch-up): paperless-ngx (2026-04-25 — sensitive document scans had been left on plain `proxmox-lvm` by an abandoned attempt; rsync swap cleaned up the orphan and re-did via Terraform). Vault raft cluster (2026-04-25 — all 3 voters migrated from `nfs-proxmox` to `proxmox-lvm-encrypted` after the 2026-04-22 raft-leader-deadlock post-mortem found NFS fsync semantics incompatible with raft consensus log; rolled non-leader-first with force-finalize on the pvc-protection finalizer to avoid pod-recreating on the old PVCs).
|
||||
|
||||
**CSI node plugin memory**: Requires 1280Mi limit for LUKS2 Argon2id key derivation (~1GiB). Set via `node.plugin.resources` in Helm values (not `node.resources`).
|
||||
|
||||
**Terraform stack**: `stacks/proxmox-csi/` manages both StorageClasses, the ExternalSecret, and CSI plugin resources.
|
||||
|
||||
### iSCSI Storage Flow (DEPRECATED — replaced 2026-04-02)
|
||||
|
||||
> **This section is historical.** All iSCSI PVCs have been migrated to Proxmox CSI (`proxmox-lvm`). The democratic-csi iSCSI driver is pending removal.
|
||||
|
||||
1. ~~Zvol creation: democratic-csi creates ZFS zvols under `main/iscsi/<pvc-name>` via SSH commands~~
|
||||
2. ~~Target setup: TrueNAS iSCSI service exposes zvols as iSCSI LUNs~~
|
||||
3. ~~Initiator connection: K8s nodes connect via open-iscsi~~
|
||||
|
||||
### SQLite on NFS — Why It Fails
|
||||
|
||||
SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantics break this:
|
||||
- Soft mount returns success even if data is still in client cache
|
||||
- Network blips during fsync → incomplete writes → corruption
|
||||
- WAL mode helps but doesn't eliminate the race
|
||||
|
||||
**Solution**: Use Proxmox CSI (`proxmox-lvm`) for any SQLite database (Vaultwarden, plotting-book) or local disk (ephemeral).
|
||||
|
||||
### ~~Democratic-CSI Sidecar Resources~~ (HISTORICAL — democratic-csi removed)
|
||||
|
||||
> Democratic-csi has been removed along with TrueNAS decommissioning (2026-04). This section is kept for historical reference only.
|
||||
|
||||
### Per-VM SCSI-LUN cap (29 block PVCs per K8s node)
|
||||
|
||||
**The proxmox-csi-plugin hardcodes a per-VM LUN ceiling at 29.** The plugin
|
||||
scans `scsi1..scsi29` for a free slot when attaching a PVC
|
||||
(`pkg/csi/utils.go:394`: `for lun = 1; lun < 30; lun++`); when the loop exits
|
||||
without a hit, ControllerPublishVolume returns
|
||||
`Internal desc = no free lun found`. `CSINode.allocatable.count` is advertised
|
||||
as `28` for every worker — derived from this plugin limit, NOT from Proxmox or
|
||||
QEMU constraints.
|
||||
|
||||
What this means in practice:
|
||||
- Each K8s node VM can hold at most 29 block PVCs simultaneously (scsi0 is the
|
||||
OS disk).
|
||||
- Switching `scsihw` from `virtio-scsi-pci` to `virtio-scsi-single` gains
|
||||
per-disk iothread isolation but **zero additional capacity** — the cap lives
|
||||
in the CSI plugin, not the QEMU device topology. Proxmox itself allows
|
||||
`scsi0..scsi30` (31 slots, `$MAX_SCSI_DISKS = 31` in
|
||||
`/usr/share/perl5/PVE/QemuServer/Drive.pm`).
|
||||
- NFS PVCs (`nfs.csi.k8s.io`) are kernel NFS mounts and do not count against
|
||||
the SCSI cap. Moving non-DB workloads (config-only, static content,
|
||||
regenerable cache, pure upload buckets) to NFS is the simplest relief.
|
||||
- Symptom when the cap is hit: pods stuck `ContainerCreating` with
|
||||
`FailedAttachVolume … no free lun found` event, and the proxmox-csi
|
||||
controller hot-loops `ControllerPublishVolume` against the saturated VM.
|
||||
|
||||
Levers (in order of leverage-per-effort):
|
||||
1. **Migrate non-DB workloads off block** to NFS. Pre-flight every candidate
|
||||
for embedded DBs (SQLite/LevelDB/RocksDB/H2/BoltDB) — they corrupt on NFS
|
||||
due to lock semantics. Wave 1 (2026-05-26) moved 5 services
|
||||
(excalidraw, resume, whisper, onlyoffice, f1-stream). Wave 2 (2026-06-05)
|
||||
moved 5 more (tandoor, speedtest, hackmd, changedetection, send — see
|
||||
History "2026-06-05"). Pre-flighted-and-rejected (stay on block): plotting-book
|
||||
(SQLite+WAL), stirling-pdf (H2), navidrome/ntfy/uptime-kuma/vaultwarden/
|
||||
freshrss/actualbudget/openclaw (SQLite), rybbit (ClickHouse). **This is the
|
||||
chosen long-term strategy (option ①)** — keep proxmox-csi's mobility, shrink
|
||||
the block footprint, prevent the ghost loop (`code-dfjn`); not TopoLVM/Longhorn.
|
||||
2. **Add another K8s worker VM** — each new worker brings up to 29 fresh
|
||||
slots; the most durable answer if PVC count keeps growing.
|
||||
3. **Patch+fork `sergelogvinov/proxmox-csi-plugin`** to bump the loop bound
|
||||
from `< 30` to `< 31` (matches Proxmox `MAX_SCSI_DISKS`). +1 slot per VM.
|
||||
File upstream PR. Self-maintained image until merged.
|
||||
|
||||
## Configuration
|
||||
|
||||
### Key Files
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| `/etc/exports` (on Proxmox host) | NFS export configuration for all service shares |
|
||||
| `stacks/proxmox-csi/` | Terraform stack for Proxmox CSI plugin + StorageClass |
|
||||
| `stacks/nfs-csi/` | NFS CSI driver + StorageClasses (`nfs-proxmox` + legacy `nfs-truenas`) |
|
||||
| `modules/kubernetes/nfs_volume/` | Reusable module for static NFS PV/PVC creation |
|
||||
| `config.tfvars` | Variable `nfs_server = "192.168.1.127"` shared by all stacks |
|
||||
|
||||
### Vault Paths
|
||||
|
||||
| Path | Contents |
|
||||
|------|----------|
|
||||
| `secret/viktor/proxmox_csi_encryption_passphrase` | LUKS2 encryption passphrase for `proxmox-lvm-encrypted` StorageClass |
|
||||
| ~~`secret/viktor/truenas_ssh_key`~~ | **REMOVED** — was SSH key for democratic-csi SSH driver (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_root_password`~~ | **REMOVED** — was TrueNAS root password (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_api_key`~~ | **REMOVED** — was TrueNAS API key (TrueNAS decommissioned 2026-04-13) |
|
||||
| ~~`secret/viktor/truenas_ssh_private_key`~~ | **REMOVED** — was TrueNAS SSH private key (TrueNAS decommissioned 2026-04-13) |
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
- **`stacks/proxmox-csi/`**: Deploys Proxmox CSI plugin + `proxmox-lvm` and `proxmox-lvm-encrypted` StorageClasses + ExternalSecret for encryption passphrase + node topology labels
|
||||
- **`stacks/nfs-csi/`**: Deploys NFS CSI driver + StorageClasses for Proxmox NFS
|
||||
- All application stacks reference NFS volumes via `module "nfs_<name>"` calls
|
||||
- Database PVCs use `storageClass: proxmox-lvm` (CNPG, MySQL Helm VCT, Redis Helm, standalone PVCs)
|
||||
|
||||
### NFS Export Management
|
||||
|
||||
NFS exports are NOT managed by Terraform. To add a new service:
|
||||
|
||||
1. SSH to Proxmox host: `ssh root@192.168.1.127`
|
||||
2. Create the directory: `mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>`
|
||||
3. Edit `/etc/exports` — add the export entry
|
||||
4. Reload exports: `exportfs -ra`
|
||||
5. Verify: `showmount -e 192.168.1.127`
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why NFS for Most Workloads?
|
||||
|
||||
- **Simplicity**: No volume provisioning delays, instant mounts
|
||||
- **RWX support**: Multiple pods can share one volume (Nextcloud, Immich)
|
||||
- **Good enough**: For SQLite on NFS specifically, we accept the risk for low-value data (logs, caches) but mandate proxmox-lvm for critical DBs
|
||||
|
||||
### Why Proxmox CSI for Databases? (formerly iSCSI)
|
||||
|
||||
- **ACID guarantees**: Block device + local filesystem = real fsync
|
||||
- **Performance**: No NFS protocol overhead for random I/O, no network hop (LVM-thin hotplug direct to VM)
|
||||
- **Tested**: PostgreSQL CNPG and MySQL InnoDB Cluster both run on proxmox-lvm, zero corruption
|
||||
- **Single CoW layer**: LVM-thin only, no ZFS double-CoW issues
|
||||
|
||||
### Why Soft Mount for NFS?
|
||||
|
||||
Hard mounts with default `timeo=600` (10 minutes) cause:
|
||||
- 10-minute pod startup delays if NFS server is unreachable
|
||||
- `kubectl delete pod` hangs for 10 minutes
|
||||
- Kernel task hangs blocking node operations
|
||||
|
||||
Soft mount (`soft,timeo=30,retrans=3`) trades availability for responsiveness:
|
||||
- Max 90s hang (30s × 3 retries)
|
||||
- Operations return EIO after timeout → app can handle error
|
||||
- Acceptable for non-critical data paths
|
||||
|
||||
**Critical paths**: Databases use proxmox-lvm (not NFS), so soft mount never affects data integrity.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### NFS Mount Hangs
|
||||
|
||||
**Symptom**: Pod stuck in `ContainerCreating`, `df -h` hangs on NFS mount
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# On K8s node
|
||||
mount | grep nfs
|
||||
showmount -e 192.168.1.127
|
||||
|
||||
# Check NFS server (Proxmox host)
|
||||
ssh root@192.168.1.127
|
||||
ls -la /srv/nfs/<service>
|
||||
cat /etc/exports | grep <service>
|
||||
```
|
||||
|
||||
**Fix**:
|
||||
1. Verify directory exists: `ls /srv/nfs/<service>` (or `/srv/nfs-ssd/<service>`)
|
||||
2. Verify export: `grep <service> /etc/exports`
|
||||
3. If missing: add to `/etc/exports` and run `exportfs -ra`
|
||||
4. Restart NFS server: `systemctl restart nfs-server`
|
||||
|
||||
### ~~iSCSI Session Drops~~ (HISTORICAL — iSCSI removed)
|
||||
|
||||
> iSCSI was replaced by Proxmox CSI (2026-04-02) and TrueNAS has been decommissioned. This section is kept for historical reference only.
|
||||
|
||||
### SQLite Corruption on NFS
|
||||
|
||||
**Symptom**: `database disk image is malformed`, checksum errors
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# In pod
|
||||
sqlite3 /data/db.sqlite "PRAGMA integrity_check;"
|
||||
```
|
||||
|
||||
**Fix**: Migrate to proxmox-lvm
|
||||
1. Create proxmox-lvm PVC in Terraform stack
|
||||
2. Restore from backup to new volume
|
||||
3. Update deployment to use new PVC
|
||||
4. Delete old NFS PVC
|
||||
|
||||
### Slow NFS Performance
|
||||
|
||||
**Symptom**: High latency on file operations, `iostat` shows NFS wait times
|
||||
|
||||
**Diagnosis**:
|
||||
```bash
|
||||
# On Proxmox host
|
||||
ssh root@192.168.1.127
|
||||
iostat -x 5
|
||||
lvs --reportformat json pve/nfs-data ssd/nfs-ssd-data
|
||||
|
||||
# On K8s node
|
||||
nfsiostat 5
|
||||
```
|
||||
|
||||
**Optimization**:
|
||||
1. Move hot data to SSD NFS: relocate from `/srv/nfs/<service>` to `/srv/nfs-ssd/<service>` and update PV path
|
||||
2. Tune NFS mount: add `rsize=1048576,wsize=1048576` to StorageClass `mountOptions`
|
||||
|
||||
## Nextcloud as PVE-NFS browser
|
||||
|
||||
Both NFS export roots are mounted into the Nextcloud server pod — `/srv/nfs` at `/mnt/pve-nfs` and `/srv/nfs-ssd` at `/mnt/pve-nfs-ssd` — via standard NFS PVs (`nfs_volume` module). No host-level Unix user/group setup; Nextcloud is the sole household-facing surface.
|
||||
|
||||
**ACL model — two patterns:**
|
||||
|
||||
- **Root browser mounts** (`PVE NFS Pool`, `PVE NFS-SSD Pool`): scoped to NC group `admin`. Used by Viktor for ad-hoc browsing of any cluster NFS state. Other users never see these mounts.
|
||||
- **Per-archive mounts** (e.g. `/anca-elements` → `/mnt/pve-nfs/anca-elements`): one NC External mount per archive, `applicable_users` set to the archive owners. Users see only the mounts assigned to them. Write/delete access is implicit at the OS level (NC pod writes via `no_root_squash`); deny semantics come from mount visibility — if the mount is not in your list, you cannot reach the path.
|
||||
|
||||
**Why mount-level ACL, not Files Access Control**: NC 30/31's workflow engine check classes are `FileName` (basename), `FileMimeType`, `FileSize`, `FileSystemTags`, and `UserGroupMembership`. There is no `FilePath` and no `UserId` check class. Per-(directory, user) rules are not expressible via FAC. Mount-level ACL via `occ files_external:applicable` is the supported primitive and maps cleanly onto the model.
|
||||
|
||||
**Manifest**: `kubernetes_config_map_v1.nextcloud_external_storage_manifest` in `stacks/nextcloud/external_storage.tf`. Mount entries reference NC usernames (`admin`, `anca`, `emo` — not display names; admin is Viktor). JSON shape:
|
||||
```json
|
||||
{
|
||||
"rootMounts": [
|
||||
{ "mountPoint": "/PVE NFS Pool", "dataDir": "/mnt/pve-nfs", "applicableGroup": "admin", "enableSharing": true },
|
||||
{ "mountPoint": "/PVE NFS-SSD Pool", "dataDir": "/mnt/pve-nfs-ssd", "applicableGroup": "admin", "enableSharing": true }
|
||||
],
|
||||
"archiveMounts": [
|
||||
{ "mountPoint": "/anca-elements", "dataDir": "/mnt/pve-nfs/anca-elements", "applicableUsers": ["anca", "admin"], "applicableGroups": [], "enableSharing": false }
|
||||
]
|
||||
}
|
||||
```
|
||||
A one-shot K8s bootstrap Job applies the manifest idempotently on every `tg apply` via `occ files_external:*`, `occ files_external:applicable`, and `occ files_external:option`. `enableSharing: true` lets admin re-share a subfolder of the mount with another NC user/group/public link; default is `false` (NC's local-backend default).
|
||||
|
||||
**Adding a new archive**: drop the directory under `/srv/nfs/<name>/` on PVE, append an `archiveMounts` entry to the manifest, then `scripts/tg apply` the nextcloud stack. See `docs/runbooks/nextcloud-add-archive.md` for the full step-by-step.
|
||||
|
||||
**Trade-off**: a compromised NC admin account has destructive reach over the cluster NFS roots (admin sees the root browser mounts). Accepted — Viktor's account is the single high-value target either way. No lateral movement to databases or block PVCs via this path (those are not NFS).
|
||||
|
||||
**Backup**: Synology retains a frozen copy of each archive (3-2-1 coverage); the existing `offsite-sync-backup` pipeline provides nightly delta sync from `/srv/nfs/<archive>` → Synology `nfs/`.
|
||||
|
||||
## Related
|
||||
|
||||
- **Runbooks**:
|
||||
- `docs/runbooks/restore-postgresql.md`
|
||||
- `docs/runbooks/restore-mysql.md`
|
||||
- `docs/runbooks/recover-nfs-mount.md`
|
||||
- `docs/runbooks/nextcloud-add-archive.md`
|
||||
- **Architecture**: `docs/architecture/backup-dr.md` (backup strategy using LVM snapshots and Proxmox host scripts)
|
||||
- **Reference**: `.claude/reference/service-catalog.md` (which services use NFS vs proxmox-lvm)
|
||||
445
docs/architecture/vpn.md
Normal file
445
docs/architecture/vpn.md
Normal file
|
|
@ -0,0 +1,445 @@
|
|||
# VPN & Remote Access Architecture
|
||||
|
||||
Last updated: 2026-04-10
|
||||
|
||||
## Overview
|
||||
|
||||
Remote access to the homelab is provided through a hybrid VPN architecture: WireGuard site-to-site tunnels connect physical locations (Sofia, London, Valchedrym), while Headscale (self-hosted Tailscale control server) provides mesh overlay networking for roaming clients. Split DNS architecture ensures resilience: AdGuard serves as the global DNS resolver for all VPN clients, while Technitium handles internal `.lan` domains. This design prevents tunnel dependency for public DNS resolution — if the Cloudflared tunnel goes down, clients can still access the internet.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
### VPN Topology
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Site-to-Site WireGuard (Hub-and-Spoke)"
|
||||
Sofia[Sofia pfSense<br/>10.3.2.1<br/>tun_wg0]
|
||||
London[London GL-iNet Flint 2<br/>10.3.2.6<br/>192.168.8.0/24]
|
||||
Valchedrym[Valchedrym OpenWRT<br/>10.3.2.5<br/>192.168.0.0/24]
|
||||
|
||||
Sofia ---|WireGuard Tunnel| London
|
||||
Sofia ---|WireGuard Tunnel| Valchedrym
|
||||
end
|
||||
|
||||
subgraph "Headscale Mesh Overlay"
|
||||
HS[Headscale<br/>headscale.viktorbarzin.me<br/>K8s Service]
|
||||
Authentik[Authentik OIDC<br/>SSO Login]
|
||||
DERP[DERP Relay<br/>Region 999<br/>Embedded in Headscale]
|
||||
|
||||
subgraph "Clients"
|
||||
Laptop[MacBook<br/>Tailscale Client]
|
||||
Phone[iPhone<br/>Tailscale Client]
|
||||
Remote[Remote VM<br/>Tailscale Client]
|
||||
end
|
||||
|
||||
HS --> Authentik
|
||||
HS --> DERP
|
||||
Laptop -.mesh.- Phone
|
||||
Laptop -.mesh.- Remote
|
||||
Phone -.mesh.- Remote
|
||||
Laptop --> HS
|
||||
Phone --> HS
|
||||
Remote --> HS
|
||||
|
||||
Laptop -.relay fallback.- DERP
|
||||
Phone -.relay fallback.- DERP
|
||||
end
|
||||
|
||||
Sofia --> HS
|
||||
```
|
||||
|
||||
### DNS Resolution Flow
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client as VPN Client
|
||||
participant AdGuard as AdGuard DNS<br/>(Global)
|
||||
participant Technitium as Technitium DNS<br/>(Internal .lan)
|
||||
participant Cloudflare as Cloudflare DNS<br/>(Public Domains)
|
||||
|
||||
Note over Client: Query: immich.viktorbarzin.me
|
||||
Client->>AdGuard: DNS query
|
||||
AdGuard->>Cloudflare: Forward (not .lan)
|
||||
Cloudflare-->>AdGuard: A record (Cloudflare IP)
|
||||
AdGuard-->>Client: Response
|
||||
|
||||
Note over Client: Query: nextcloud.viktorbarzin.lan
|
||||
Client->>AdGuard: DNS query
|
||||
AdGuard->>Technitium: Forward (.lan domain)
|
||||
Technitium-->>AdGuard: A record (10.0.20.200)
|
||||
AdGuard-->>Client: Response
|
||||
|
||||
Note over Client,Technitium: If Cloudflared tunnel is down:
|
||||
Client->>AdGuard: DNS query (google.com)
|
||||
AdGuard->>Cloudflare: Forward (public DNS works)
|
||||
Cloudflare-->>AdGuard: A record
|
||||
AdGuard-->>Client: Response (no tunnel dependency)
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version/Type | Location | Purpose |
|
||||
|-----------|-------------|----------|---------|
|
||||
| WireGuard | Built-in (pfSense/OpenWRT) | Sofia (pfSense), London (GL-iNet Flint 2), Valchedrym (OpenWRT) | Site-to-site encrypted tunnels (hub-and-spoke) |
|
||||
| Headscale | v0.23.x (container) | K8s (headscale.viktorbarzin.me) | Tailscale control server, mesh coordinator |
|
||||
| Tailscale | Client v1.x | User devices | Mesh VPN client |
|
||||
| Authentik | OIDC provider | K8s | SSO authentication for Headscale |
|
||||
| DERP Relay | Embedded in Headscale | K8s (region 999) | Relay for NAT traversal |
|
||||
| AdGuard DNS | Container | K8s | Global DNS resolver with ad-blocking |
|
||||
| Technitium DNS | Container | K8s (10.0.20.201) | Internal .lan domain resolver |
|
||||
|
||||
## How It Works
|
||||
|
||||
### WireGuard Site-to-Site
|
||||
|
||||
Three physical locations are permanently connected via WireGuard in a **hub-and-spoke** topology with Sofia as the hub. A single WireGuard interface (`tun_wg0`) on pfSense carries both peers on the `10.3.2.0/24` tunnel subnet:
|
||||
|
||||
- **Sofia** (hub): `10.3.2.1` — pfSense, K8s cluster on `10.0.20.0/24`, management on `10.0.10.0/24`, LAN on `192.168.1.0/24`
|
||||
- **London** (spoke): `10.3.2.6` — GL-iNet Flint 2 (GL-MT6000), LAN `192.168.8.0/24`, guest `192.168.9.0/24`
|
||||
- **Valchedrym** (spoke): `10.3.2.5` — OpenWRT router, LAN `192.168.0.0/24`
|
||||
|
||||
Routes are configured as static routes on pfSense. London and Valchedrym route Sofia-bound traffic through their WireGuard tunnels. London ↔ Valchedrym traffic transits through Sofia (no direct tunnel).
|
||||
|
||||
**Use cases**:
|
||||
- Replication of Vault data between Sofia and London
|
||||
- Offsite database replicas
|
||||
- Accessing Proxmox hosts across locations
|
||||
|
||||
### Headscale Mesh Overlay
|
||||
|
||||
Headscale is a self-hosted alternative to Tailscale's commercial control plane. It provides:
|
||||
- **Mesh networking**: Clients establish direct WireGuard connections to each other (peer-to-peer).
|
||||
- **NAT traversal**: DERP relays provide connectivity when direct connections fail.
|
||||
- **OIDC authentication**: Users log in via Authentik, no pre-shared keys.
|
||||
- **ACL policies**: Fine-grained control over which clients can reach which destinations.
|
||||
|
||||
**Client onboarding**:
|
||||
1. User installs Tailscale client (official macOS/iOS/Android app)
|
||||
2. Runs: `tailscale login --login-server https://headscale.viktorbarzin.me`
|
||||
3. Browser opens to Authentik SSO login
|
||||
4. After successful login, Tailscale presents a registration URL
|
||||
5. Admin approves the device via `headscale nodes register --user <username> --key <key>`
|
||||
6. Client is added to the mesh, receives IP in 100.64.0.0/10 range
|
||||
|
||||
**Connectivity test**: `ping 10.0.20.100` (Sofia K8s API server) verifies full access to the homelab network.
|
||||
|
||||
### DERP Relay for NAT Traversal
|
||||
|
||||
**Problem**: Symmetric NAT or restrictive firewalls prevent direct WireGuard connections between clients.
|
||||
|
||||
**Solution**: Headscale runs an embedded DERP relay server (region 999, named "Home DERP"). DERP is Tailscale's NAT traversal protocol, implemented as an HTTPS-based relay.
|
||||
|
||||
**How it works**:
|
||||
1. Clients attempt direct WireGuard connection via STUN/ICE.
|
||||
2. If direct connection fails, both clients connect to the DERP relay via HTTPS.
|
||||
3. Traffic is encrypted end-to-end with WireGuard, DERP only relays packets.
|
||||
4. No additional ports needed — DERP uses the same HTTPS ingress as Headscale (443).
|
||||
|
||||
**Performance**: DERP adds latency (extra hop through Sofia K8s cluster), but ensures connectivity in all scenarios.
|
||||
|
||||
### Split DNS Architecture
|
||||
|
||||
**Design goal**: Prevent tunnel dependency for public DNS resolution. If the Headscale tunnel or Cloudflared tunnel fails, clients must still resolve public domains.
|
||||
|
||||
**Implementation**:
|
||||
- **AdGuard DNS**: Global recursive resolver, serves all VPN clients. Includes ad-blocking and malicious domain filtering.
|
||||
- **Technitium DNS**: Internal authoritative server for `.viktorbarzin.lan` domains.
|
||||
|
||||
**Resolution flow**:
|
||||
1. Client queries AdGuard for any domain.
|
||||
2. If domain ends in `.lan`, AdGuard forwards to Technitium (10.0.20.201).
|
||||
3. For all other domains, AdGuard resolves directly via upstream (Cloudflare 1.1.1.1).
|
||||
4. AdGuard caches responses, reducing load on Technitium and upstream.
|
||||
|
||||
**Resilience**: Even if the tunnel to Sofia is down, clients can still resolve `google.com`, `github.com`, etc., because AdGuard talks directly to Cloudflare. Only `.lan` domains become unavailable.
|
||||
|
||||
### Access Control (Authentik Groups)
|
||||
|
||||
**Headscale Users** group in Authentik controls VPN access. Membership is invitation-only:
|
||||
1. Admin creates user in Authentik.
|
||||
2. Admin adds user to "Headscale Users" group.
|
||||
3. User logs in via OIDC during `tailscale login`.
|
||||
4. Headscale verifies group membership via OIDC claims.
|
||||
|
||||
Removing a user from the group revokes VPN access on next re-authentication (every 30 days).
|
||||
|
||||
## Configuration
|
||||
|
||||
### Terraform Stacks
|
||||
|
||||
| Stack | Path | Resources |
|
||||
|-------|------|-----------|
|
||||
| Headscale | `stacks/headscale/` | Deployment, Service, Ingress, ConfigMap |
|
||||
| AdGuard | `stacks/adguard/` | Deployment, Service, PVC |
|
||||
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
|
||||
| pfSense (Sofia) | Not in Terraform | WireGuard tunnel configs (managed via pfSense UI) |
|
||||
|
||||
### Headscale Configuration
|
||||
|
||||
**ConfigMap**: `stacks/headscale/main.tf`
|
||||
```yaml
|
||||
server_url: https://headscale.viktorbarzin.me
|
||||
listen_addr: 0.0.0.0:8080
|
||||
metrics_listen_addr: 0.0.0.0:9090
|
||||
|
||||
oidc:
|
||||
issuer: https://authentik.viktorbarzin.me/application/o/headscale/
|
||||
client_id: <redacted>
|
||||
client_secret: <from Vault>
|
||||
scope: ["openid", "profile", "email", "groups"]
|
||||
allowed_groups: ["Headscale Users"]
|
||||
|
||||
derp:
|
||||
server:
|
||||
enabled: true
|
||||
region_id: 999
|
||||
region_code: "home"
|
||||
region_name: "Home DERP"
|
||||
stun_listen_addr: "0.0.0.0:3478"
|
||||
urls:
|
||||
- https://controlplane.tailscale.com/derpmap/default
|
||||
auto_update_enabled: true
|
||||
update_frequency: 24h
|
||||
|
||||
ip_prefixes:
|
||||
- 100.64.0.0/10
|
||||
|
||||
dns_config:
|
||||
nameservers:
|
||||
- 10.0.20.102 # AdGuard DNS
|
||||
domains:
|
||||
- viktorbarzin.lan
|
||||
magic_dns: true
|
||||
```
|
||||
|
||||
**Secrets (Vault)**:
|
||||
- `secret/headscale/oidc_client_secret`
|
||||
|
||||
**Ingress**: Standard `ingress_factory` with `protected = false` (OIDC is handled by Headscale itself).
|
||||
|
||||
### AdGuard Configuration
|
||||
|
||||
**Upstream DNS servers**:
|
||||
- Cloudflare: `1.1.1.1`, `1.0.0.1`
|
||||
- Google: `8.8.8.8`, `8.8.4.4`
|
||||
|
||||
**Conditional forwarding**:
|
||||
- `viktorbarzin.lan` → `10.0.20.201` (Technitium)
|
||||
|
||||
**Ad-blocking lists**:
|
||||
- AdGuard DNS filter
|
||||
- OISD full list
|
||||
- Developer Dan's ads and tracking list
|
||||
|
||||
**Custom rules**: Block telemetry for Windows, macOS, and smart TVs.
|
||||
|
||||
### WireGuard (pfSense — Hub)
|
||||
|
||||
**Single interface `tun_wg0`** (OPT2) with two peers on subnet `10.3.2.0/24`. Listens on `*:51821` for both IPv4 and IPv6. IPv6 access via HE tunnel (`gif0`, `2001:470:6e:43d::2`) requires a `pass in` pf rule on the `HE_IPv6` interface (interface name `opt3` in config.xml):
|
||||
|
||||
**Peer: London Flint 2**:
|
||||
- WireGuard IP: `10.3.2.6`
|
||||
- Remote endpoint: `vpn.viktorbarzin.me:51821` (dual-stack: A=176.12.22.76, AAAA=2001:470:6e:43d::2)
|
||||
- Allowed IPs: `192.168.8.0/24, 192.168.9.0/24, 192.168.10.0/24, 10.3.2.6/32`
|
||||
- Keepalive: 25 seconds (configured on London side)
|
||||
|
||||
**Peer: Valchedrym**:
|
||||
- WireGuard IP: `10.3.2.5`
|
||||
- Remote endpoint: `85.130.41.28:51820`
|
||||
- Allowed IPs: `10.3.2.5/32, 192.168.0.0/24`
|
||||
- Keepalive: none (should be added)
|
||||
|
||||
**Static routes on pfSense**:
|
||||
- `192.168.0.0/24` → gateway `valchedrym` (10.3.2.5)
|
||||
- `192.168.8.0/24` → gateway `london_flint_2` (10.3.2.6)
|
||||
- `192.168.9.0/24` → gateway `london_flint_2` (10.3.2.6)
|
||||
- `192.168.10.0/24` → gateway `london_flint_2` (10.3.2.6)
|
||||
|
||||
**Note**: WireGuard on pfSense is NOT managed by Terraform — configured via pfSense UI/shell.
|
||||
|
||||
### WireGuard (London — GL-iNet Flint 2)
|
||||
|
||||
- Interface: `wgclient1` (proto `wgclient`, config `peer_855`)
|
||||
- Local IP: `10.3.2.6/32`
|
||||
- Remote endpoint: `vpn.viktorbarzin.me:51821` (dual-stack — resolves to IPv4 or IPv6)
|
||||
- Allowed IPs: `10.0.0.0/8, 192.168.1.0/24, 192.168.0.0/24`
|
||||
- Keepalive: 25 seconds
|
||||
- Policy routing: GL-iNet marks traffic via iptables mangle → routing table 1001 (ipset `dst_net10`)
|
||||
- Persistence: `/etc/firewall.user` injects LOCAL_POLICY mangle rule (GL-iNet's `gl-tertf` creates TUNNEL10_ROUTE_POLICY but not the LOCAL_POLICY rule for router-originated traffic)
|
||||
|
||||
**GL-iNet AllowedIPs format**: UCI `list allowed_ips` entries are concatenated by the `wgclient` protocol handler. Use a **single comma-separated entry** (`'10.0.0.0/8,192.168.1.0/24,192.168.0.0/24'`), NOT multiple list entries. Multiple entries cause a parse error like `10.0.0.0/8192.168.1.0/24` (no separator).
|
||||
|
||||
**DNS**: AdGuardHome runs on the router. Upstream DNS should NOT include `1.1.1.1` — it creates conntrack conflicts with ICMP and GL-iNet's `carrier-monitor` health check floods Cloudflare, triggering ICMP rate limits. Use `9.9.9.9`, `8.8.4.4` instead. Health check IPs (`glconfig.general.track_ip`) should use `1.0.0.1` not `1.1.1.1`.
|
||||
|
||||
### WireGuard (Valchedrym — OpenWRT)
|
||||
|
||||
- WireGuard IP: `10.3.2.5`
|
||||
- Remote endpoint: Sofia public IP
|
||||
- LAN: `192.168.0.0/24`
|
||||
|
||||
### Vault Secrets
|
||||
|
||||
- Headscale OIDC client secret: `secret/headscale/oidc_client_secret`
|
||||
- WireGuard private keys: `secret/pfsense/wg_privkey_london`, `secret/pfsense/wg_privkey_valchedrym`
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why Headscale Instead of Plain WireGuard?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **WireGuard with static configs**: Requires manual key distribution, complex peer management.
|
||||
2. **OpenVPN**: Slower, more overhead, less mobile-friendly.
|
||||
3. **Commercial Tailscale**: SaaS, not self-hosted, less control over data.
|
||||
|
||||
**Decision**: Headscale provides:
|
||||
- **Mesh networking**: Clients connect directly, not through a central server.
|
||||
- **OIDC authentication**: No pre-shared keys, integrates with existing SSO.
|
||||
- **Easy onboarding**: Users install official Tailscale app, no custom configs.
|
||||
- **Self-hosted**: Full control over control plane and data.
|
||||
|
||||
**Trade-off**: More complex setup than plain WireGuard, but operational benefits outweigh initial complexity.
|
||||
|
||||
### Why Split DNS (AdGuard + Technitium)?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Single DNS server (Technitium only)**: Requires forwarding all public domains to upstream, creating single point of failure.
|
||||
2. **Cloudflare only**: Fast, but no internal `.lan` domain support without zone delegation.
|
||||
3. **Tailscale MagicDNS only**: Depends on Headscale control plane, fails if control plane is down.
|
||||
|
||||
**Decision**: Split DNS architecture provides:
|
||||
- **Resilience**: If Headscale tunnel fails, public DNS still works via AdGuard → Cloudflare.
|
||||
- **Ad-blocking**: AdGuard filters ads and malicious domains for all VPN clients.
|
||||
- **Internal domains**: Technitium authoritatively serves `.lan`, no external dependency.
|
||||
|
||||
**Key benefit**: Zero tunnel dependency for public DNS. Users can browse the internet even if the homelab is completely offline.
|
||||
|
||||
### Why Embedded DERP Relay?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **External DERP relays only (Tailscale's public relays)**: Free, but adds latency and exposes traffic metadata to Tailscale.
|
||||
2. **No DERP, direct connections only**: Fails for symmetric NAT clients (mobile networks).
|
||||
|
||||
**Decision**: Embedded DERP (region 999) provides:
|
||||
- **Privacy**: All relay traffic stays within the homelab.
|
||||
- **Reliability**: Not dependent on Tailscale's public infrastructure.
|
||||
- **No extra ports**: DERP uses HTTPS (443), same as Headscale API.
|
||||
|
||||
**Trade-off**: Adds CPU/memory overhead to Headscale pod, but minimal compared to benefits.
|
||||
|
||||
### Why OIDC Authentication Instead of Pre-Authorized Keys?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Pre-authorized keys**: Headscale generates keys, admin shares with users.
|
||||
2. **Shared secret**: Single password for all users.
|
||||
|
||||
**Decision**: OIDC via Authentik provides:
|
||||
- **Centralized access control**: Add/remove users in one place.
|
||||
- **Audit trail**: Authentik logs all login attempts.
|
||||
- **Group-based authorization**: Only "Headscale Users" group can access VPN.
|
||||
- **SSO integration**: Users already have accounts in Authentik for other services.
|
||||
|
||||
**Key workflow**: Admin invites user → user logs in via Authentik → admin approves device → access granted. No key exchange needed.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Headscale Login Fails (OIDC Error)
|
||||
|
||||
**Symptoms**: `tailscale login --login-server` opens browser, but after Authentik login, shows "OIDC error: invalid state".
|
||||
|
||||
**Diagnosis**: Check Headscale logs: `kubectl logs -n headscale deploy/headscale`
|
||||
|
||||
**Common causes**:
|
||||
1. **Client clock skew**: OIDC tokens have short validity (5 minutes). Ensure client's system time is accurate.
|
||||
2. **Callback URL mismatch**: Authentik application must have `https://headscale.viktorbarzin.me/oidc/callback` in Redirect URIs.
|
||||
3. **Group membership**: User is not in "Headscale Users" group in Authentik.
|
||||
|
||||
**Fix**: Sync system clock, verify Authentik application config, add user to group.
|
||||
|
||||
### Direct Connection Fails, Traffic Goes via DERP
|
||||
|
||||
**Symptoms**: `tailscale status` shows `relay "home"` instead of direct connection. Higher latency.
|
||||
|
||||
**Diagnosis**: Check DERP usage: `tailscale netcheck`
|
||||
|
||||
**Common causes**:
|
||||
1. **Symmetric NAT**: Mobile networks or restrictive corporate firewalls block UDP hole-punching.
|
||||
2. **Firewall blocking WireGuard**: Port 51820 UDP blocked on one or both clients.
|
||||
3. **STUN failure**: Can't determine external IP and port.
|
||||
|
||||
**Fix**: This is expected behavior in many environments. DERP relay ensures connectivity. If latency is unacceptable, use site-to-site WireGuard instead.
|
||||
|
||||
### Can't Resolve .lan Domains from VPN
|
||||
|
||||
**Symptoms**: `nslookup nextcloud.viktorbarzin.lan` returns `NXDOMAIN`.
|
||||
|
||||
**Diagnosis**: Check DNS chain: Client → AdGuard → Technitium.
|
||||
|
||||
**Steps**:
|
||||
1. Verify AdGuard is running: `kubectl get pod -n adguard`
|
||||
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup nextcloud.viktorbarzin.lan <adguard-ip>`
|
||||
3. Check Technitium: `nslookup nextcloud.viktorbarzin.lan 10.0.20.201`
|
||||
|
||||
**Common causes**:
|
||||
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.
|
||||
2. **Technitium down**: Pod crash-looping or PVC corrupted.
|
||||
3. **DNS propagation delay**: Technitium zone update not yet applied.
|
||||
|
||||
**Fix**: Verify conditional forwarding in AdGuard UI. Restart Technitium if needed. Check zone file in Technitium UI.
|
||||
|
||||
### VPN Client Can't Reach K8s Services
|
||||
|
||||
**Symptoms**: Can `ping 10.0.20.1` (pfSense), but `curl https://immich.viktorbarzin.me` times out.
|
||||
|
||||
**Diagnosis**: Check connectivity at each layer:
|
||||
1. **DNS**: Does `nslookup immich.viktorbarzin.me` return correct IP?
|
||||
2. **Routing**: Can client reach MetalLB IP? `ping <loadbalancer-ip>`
|
||||
3. **Firewall**: Is pfSense blocking traffic from VPN subnet?
|
||||
|
||||
**Common causes**:
|
||||
1. **Split DNS working too well**: Client resolves to Cloudflare IP instead of internal LAN IP. Expected for proxied domains — use direct domain (e.g., `immich-direct.viktorbarzin.me`).
|
||||
2. **ACL policy**: Headscale ACL blocks client from accessing certain subnets.
|
||||
3. **pfSense NAT rule missing**: Traffic from VPN subnet not routed to VLAN 20.
|
||||
|
||||
**Fix**: For proxied domains, use non-proxied DNS names. Check Headscale ACL policy. Verify pfSense NAT rules.
|
||||
|
||||
### DERP Relay Returns 502 Bad Gateway
|
||||
|
||||
**Symptoms**: Tailscale clients can't connect, DERP shows offline in `tailscale netcheck`.
|
||||
|
||||
**Diagnosis**: Check Headscale ingress: `kubectl get ingress -n headscale`
|
||||
|
||||
**Common causes**:
|
||||
1. **Traefik middleware blocking DERP traffic**: Forward-auth interferes with WebSocket upgrade.
|
||||
2. **Headscale pod not ready**: Liveness probe failing.
|
||||
3. **Cloudflared tunnel issue**: DERP uses WebSockets, which require HTTP/1.1 upgrade support.
|
||||
|
||||
**Fix**: Ensure Headscale ingress has `protected = false` (no forward-auth). Check Headscale pod readiness. Verify Cloudflared supports WebSocket upgrades.
|
||||
|
||||
### WireGuard Site-to-Site Tunnel Disconnects
|
||||
|
||||
**Symptoms**: Can't reach services in London from Sofia. `ping 192.168.8.1` fails.
|
||||
|
||||
**Diagnosis**: Check pfSense WireGuard status via `pfsense.py wireguard` or Dashboard → VPN → WireGuard → Status
|
||||
|
||||
**Common causes**:
|
||||
1. **AllowedIPs parse error on GL-iNet**: If `wg show wgclient1` shows no peers and interface is DOWN with `qdisc noop`, check `/etc/config/wireguard` peer config. AllowedIPs must be a single comma-separated entry, not multiple `list` entries (see London section above).
|
||||
2. **IPv6 endpoint resolution**: If IPv4 is down, DNS resolves to IPv6 (AAAA record). Ensure the pfSense `HE_IPv6` (gif0) interface has a `pass in` rule for UDP 51821.
|
||||
3. **Keepalive packets dropped**: Firewall or ISP blocking UDP 51821.
|
||||
4. **Public IP changed**: Dynamic IP on remote site changed, config still has old IP.
|
||||
5. **GL-iNet policy routing lost**: After firewall reload, check if `TUNNEL10_ROUTE_POLICY` and `LOCAL_POLICY` mangle rules exist. If not, run `/etc/init.d/firewall restart` and check `/etc/firewall.user` execution.
|
||||
6. **Kill switch active**: If WG interface is DOWN, table 1001 only has blackhole routes → all marked traffic dropped → IPv4 internet broken.
|
||||
|
||||
**Fix**: Check `wg show wgclient1` on London router. If no peers, fix AllowedIPs format and `ifdown/ifup wgclient1`. Verify handshake with `ping 10.3.2.1`.
|
||||
|
||||
## Related
|
||||
|
||||
- **Runbooks**:
|
||||
- `docs/runbooks/add-headscale-user.md`
|
||||
- `docs/runbooks/reset-derp-relay.md`
|
||||
- `docs/runbooks/update-wireguard-peer.md`
|
||||
- **Architecture Docs**:
|
||||
- `docs/architecture/networking.md` — Core network architecture
|
||||
- `docs/architecture/dns.md` — Full DNS architecture (coming soon)
|
||||
- **Reference**:
|
||||
- `.claude/reference/authentik-state.md` — OIDC application configs
|
||||
- `.claude/reference/service-catalog.md` — Full service inventory
|
||||
141
docs/architecture/wave1-egress-observation-2026-05-22.md
Normal file
141
docs/architecture/wave1-egress-observation-2026-05-22.md
Normal file
|
|
@ -0,0 +1,141 @@
|
|||
# Wave 1 W1.6/W1.7 — Egress Observation Snapshot (2026-05-22)
|
||||
|
||||
First analysis pass over the Calico GNP `wave1-egress-observe-tier34` data
|
||||
captured in Loki via `{job="node-journal"} |~ "calico-packet"`.
|
||||
|
||||
**Data scope:** ~10000 flow log lines pulled from Loki over ~6h+24h windows.
|
||||
Loki caps queries at 5000 records so longer windows are sample-capped.
|
||||
|
||||
**Coverage:** 36 source namespaces observed making egress (out of 82 selected
|
||||
by `tier in {3-edge, 4-aux}`). Namespaces missing from data are either idle,
|
||||
scaled to 0, or producing only intra-namespace traffic (which Calico Log
|
||||
captures from-workload but most pods in those namespaces talk locally).
|
||||
|
||||
## Egress fan-out per namespace
|
||||
|
||||
| Namespace | dests | pod-ns | svc | external |
|
||||
|---|---:|---:|---:|---:|
|
||||
| affine | 3 | 2 | 1 | 0 |
|
||||
| beads-server | 4 | 3 | 1 | 0 |
|
||||
| cyberchef | 2 | 1 | 1 | 0 |
|
||||
| dawarich | 3 | 2 | 1 | 0 |
|
||||
| default | 1 | 0 | 0 | 1 |
|
||||
| ebooks | 3 | 2 | 1 | 0 |
|
||||
| f1-stream | 16 | 2 | 1 | 13 |
|
||||
| forgejo | 2 | 1 | 1 | 0 |
|
||||
| hackmd | 2 | 1 | 1 | 0 |
|
||||
| homepage | 2 | 1 | 1 | 0 |
|
||||
| isponsorblocktv | 2 | 0 | 1 | 1 |
|
||||
| jsoncrack | 2 | 1 | 1 | 0 |
|
||||
| kms | 2 | 1 | 1 | 0 |
|
||||
| mailserver | 2 | 0 | 1 | 1 |
|
||||
| meshcentral | 2 | 2 | 0 | 0 |
|
||||
| n8n | 2 | 1 | 1 | 0 |
|
||||
| nextcloud | 5 | 2 | 1 | 2 |
|
||||
| onlyoffice | 2 | 1 | 1 | 0 |
|
||||
| openclaw | 18 | 4 | 1 | 13 |
|
||||
| paperless-ngx | 3 | 2 | 1 | 0 |
|
||||
| phpipam | 3 | 2 | 1 | 0 |
|
||||
| poison-fountain | 2 | 1 | 1 | 0 |
|
||||
| postiz | 9 | 8 | 1 | 0 |
|
||||
| realestate-crawler | 2 | 1 | 1 | 0 |
|
||||
| recruiter-responder | 2 | 0 | 1 | 1 |
|
||||
| rybbit | 2 | 1 | 1 | 0 |
|
||||
| send | 2 | 1 | 1 | 0 |
|
||||
| servarr | 134 | 2 | 2 | 130 |
|
||||
| speedtest | 2 | 1 | 1 | 0 |
|
||||
| status-page | 10 | 2 | 1 | 7 |
|
||||
| tandoor | 2 | 1 | 1 | 0 |
|
||||
| technitium | 5 | 2 | 1 | 2 |
|
||||
| trading-bot | 5 | 2 | 1 | 2 |
|
||||
| url | 2 | 1 | 1 | 0 |
|
||||
| website | 2 | 1 | 1 | 0 |
|
||||
| woodpecker | 8 | 2 | 1 | 5 |
|
||||
|
||||
## Common patterns
|
||||
|
||||
**Universal baseline** (every observed namespace makes these):
|
||||
- `kube-system/kube-dns` UDP/53 — DNS resolution
|
||||
- Often `dbaas` TCP/3306 (MySQL) or TCP/5432 (Postgres)
|
||||
- Often `redis` TCP/6379
|
||||
|
||||
**Per-namespace specifics** (the part that varies):
|
||||
- External HTTPS to specific IPs (CDNs, APIs)
|
||||
- Internal pod-to-pod for service-specific clients
|
||||
|
||||
## W1.7 rollout candidates (sorted by simplicity)
|
||||
|
||||
**Tier A — trivial egress (recommend first wave):**
|
||||
|
||||
`recruiter-responder` has the simplest profile of all observed:
|
||||
- `kube-system/kube-dns` :53/UDP
|
||||
- `99.83.136.103` :443/TCP (Telegram API)
|
||||
|
||||
That's it. Two destinations. Perfect first enforce candidate.
|
||||
|
||||
**Tier B — small egress (≤3 external + ≤5 internal, 29 namespaces):**
|
||||
|
||||
affine, beads-server, cyberchef, dawarich, ebooks, forgejo, hackmd, homepage,
|
||||
isponsorblocktv, jsoncrack, kms, mailserver, meshcentral, n8n, nextcloud,
|
||||
onlyoffice, paperless-ngx, phpipam, poison-fountain, realestate-crawler,
|
||||
rybbit, send, speedtest, tandoor, technitium, trading-bot, url, website.
|
||||
|
||||
These can be enforce'd in batches of 3-5/day after the recruiter-responder
|
||||
pilot proves out.
|
||||
|
||||
**Tier C — moderate egress (5–18 external):**
|
||||
|
||||
f1-stream (13 ext), openclaw (13 ext), woodpecker (5 ext), status-page (7 ext).
|
||||
Need per-IP allowlist or domain-based selectors.
|
||||
|
||||
**Tier D — broad egress (do NOT enforce statically):**
|
||||
|
||||
`servarr` has 130+ external IPs because it runs BitTorrent peer-to-peer.
|
||||
Static IP enforcement won't work; either leave in Log+Allow mode permanently
|
||||
or use a port-only allowlist (TCP+UDP 6881+random high ports outbound).
|
||||
|
||||
## Important caveats before flipping to enforce
|
||||
|
||||
1. **Observation horizon is too short.** Only ~6h of dense data and ~24h
|
||||
total. CronJobs that run weekly, periodic Vault token rotations (7d),
|
||||
external service maintenance windows, Keel auto-rollouts pulling new
|
||||
image versions — all missed. Recommend collecting **at least 7 days**
|
||||
before declaring an allowlist complete.
|
||||
|
||||
2. **`servarr`** is fundamentally incompatible with static enforce — keep
|
||||
in Log+Allow (or explicit deny for known-bad CIDRs only).
|
||||
|
||||
3. **External IPs are dynamic.** Cloudflare-fronted services rotate IPs.
|
||||
The recruiter-responder external IP `99.83.136.103` is one of Telegram's
|
||||
API endpoints — Telegram has a CIDR range. Allowing single IPs will break
|
||||
when DNS resolves to a different IP. Prefer Calico's `domains:` selector
|
||||
(Calico OSS supports DNS-based egress allowlists via `dns_policy_resolver`)
|
||||
OR allow the full Cloudflare/AWS CIDR range OR use a per-app egress
|
||||
gateway.
|
||||
|
||||
4. **The observation didn't capture intra-namespace traffic** by design —
|
||||
the Calico Log rule fires on egress from workload endpoint, but
|
||||
pod-to-same-namespace-pod traffic on the same node may bypass the
|
||||
filter chain (varies). Real-world testing needed after enforce flip.
|
||||
|
||||
## Suggested next-session sequencing
|
||||
|
||||
1. **Continue observation for at least 7 days** before any enforce flip.
|
||||
Compare data on 2026-05-29 vs today; if no new destinations show up,
|
||||
the allowlist is stable.
|
||||
2. **First enforce: recruiter-responder.** GNP with allowlist =
|
||||
{kube-dns, telegram CIDR, vault svc, eso svc}. Watch for breakage.
|
||||
3. **Tier B batch rollout** at 3-5 namespaces/day per Keel-style phased
|
||||
rollout pattern (memory id=1972).
|
||||
4. **Tier C requires per-namespace investigation** — what are those
|
||||
external IPs? Map to known services first.
|
||||
5. **servarr stays in Log+Allow** indefinitely (or migrate to dedicated
|
||||
egress proxy).
|
||||
|
||||
## Source data location
|
||||
|
||||
- Loki LogQL: `{job="node-journal"} |~ "calico-packet"`
|
||||
- Pod IP → namespace map at observation time saved at
|
||||
`/tmp/pod-ip-map.txt` on the analysis host (ephemeral).
|
||||
- Analysis scripts: `/tmp/analyze_flows2.py`, `/tmp/build_allowlist.py`.
|
||||
- Tracked under beads `code-8ywc` (W1.7).
|
||||
Loading…
Add table
Add a link
Reference in a new issue