fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions

docs/plans/2026-02-22-anti-ai-scraping-design.md (Normal file, 123 lines)
@@ -0,0 +1,123 @@
# Anti-AI Scraping System Design

> **Status (Updated 2026-04-17):** Partially superseded. Layer 3 (trap links via rewrite-body plugin) removed due to Traefik v3.6.12 Yaegi plugin incompatibility. The `strip-accept-encoding` and `anti-ai-trap-links` middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (`infra/stacks/rybbit/worker/`). Active layers, numbered as in the architecture diagram: 1 (bot-block), 2 (headers), 4 (poison content), 5 (tarpit).

## Problem

AI scrapers crawl public web services to harvest training data. We want to:

1. Block known AI crawlers outright
2. Poison the data that unknown scrapers collect
3. Waste scraper resources with slow responses and infinite crawl loops

## Architecture

Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):

```
Internet -> Cloudflare -> Traefik
                            |
                            +-- Layer 1: ForwardAuth -> block known AI User-Agents (403)
                            |
                            +-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
                            |
                            +-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026; Yaegi bugs in Traefik v3.6.12)
                            |
                            +-- Layer 4: Poison service -> serve cached Poison Fountain data
                            |
                            +-- Layer 5: Tarpit -> slow-drip responses + infinite crawl loop
```

## Components

### 1. poison-fountain service (new Kubernetes deployment)

A Python service with three responsibilities:

**ForwardAuth endpoint (`GET /auth`)**:
- Reads `X-Forwarded-For` and `User-Agent` from request headers
- Checks User-Agent against list of known AI bot strings
- Returns 403 for matches, 200 for legitimate users
- Blocked bots: GPTBot, ChatGPT-User, ClaudeBot, Claude-Web, CCBot, Bytespider, Google-Extended, Applebot-Extended, anthropic-ai, cohere-ai, Diffbot, FacebookBot, PerplexityBot, YouBot, Meta-ExternalAgent, PetalBot, Amazonbot, AI2Bot, Omgilibot, img2dataset

**Poison content endpoint (`GET /article/<slug>`)**:
- Serves cached poisoned content from NFS
- Wraps raw Poison Fountain data in realistic HTML templates (title, headings, paragraphs)
- Each response includes 10+ links to other poison pages (infinite crawl loop)
- Uses chunked transfer encoding to drip-feed content at ~100 bytes/second (tarpit)
- Response size: 50-100KB per page

**Health endpoint (`GET /healthz`)**:
- Returns 200 OK for Kubernetes probes

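The ForwardAuth decision described above comes down to a case-insensitive substring match on the `User-Agent` header. The following is a minimal sketch under stated assumptions: `is_ai_bot` and `forward_auth_status` are hypothetical helper names, and the pattern tuple is only a subset of the blocked bots listed above.

```python
# Hypothetical helpers illustrating the ForwardAuth check; the names are
# illustrative and the patterns are a subset of the documented bot list.
AI_BOT_PATTERNS = ("gptbot", "claudebot", "ccbot", "bytespider", "perplexitybot")


def is_ai_bot(user_agent):
    """Return True if the User-Agent contains a known AI bot substring."""
    ua = (user_agent or "").lower()
    return any(pattern in ua for pattern in AI_BOT_PATTERNS)


def forward_auth_status(user_agent):
    """Map a User-Agent to the status code the /auth endpoint would return."""
    return 403 if is_ai_bot(user_agent) else 200
```

A missing header is treated as an empty string, so requests without a `User-Agent` pass through with 200 rather than being blocked.
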
### 2. poison-fountain-fetcher CronJob

- Runs every 6 hours
- Fetches gzip content from `https://rnsaffn.com/poison2/`
- Decompresses and stores to NFS at `/mnt/main/poison-fountain/cache/`
- Maintains a pool of ~50 cached poison documents
- Falls back to locally generated Markov-chain nonsense if Poison Fountain is unreachable

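The Markov-chain fallback mentioned above could look roughly like the sketch below. This is an illustration only: `build_chain`, `generate`, and `SEED_TEXT` are hypothetical names, the seed text is made up, and the `server.py` in the implementation plan uses plain random word sampling instead of a chain.

```python
import random
from collections import defaultdict

# Made-up seed text for illustration; a real deployment would use a larger corpus.
SEED_TEXT = (
    "the neural framework implements distributed processing with "
    "the neural architecture trained on the corpus"
)


def build_chain(words):
    """Map each word to the list of words observed immediately after it."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, length, rng):
    """Walk the chain for `length` words, restarting at a random state on dead ends."""
    states = sorted(chain)
    word = rng.choice(states)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(states)
        out.append(word)
    return " ".join(out)


chain = build_chain(SEED_TEXT.split())
text = generate(chain, 50, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the output reproducible for a given seed, which helps when checking the generator in isolation.
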
### 3. Traefik middleware additions

All defined in `stacks/platform/modules/traefik/middleware.tf`:

**`ai-bot-block` (ForwardAuth)**:
- ForwardAuth to `http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth`
- Trust forwarded headers from Traefik
- Added to all public services via ingress_factory

**`anti-ai-headers` (Headers)**:
- Sets `X-Robots-Tag: noai, noimageai` on all responses
- Added to all public services via ingress_factory

**`anti-ai-trap-links` (rewrite-body plugin)** (REMOVED, updated 2026-04-17):
- Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
- The companion `strip-accept-encoding` middleware was also removed (only existed for rewrite-body)
- Trap link injection is no longer active; poison-fountain still serves tarpit content standalone

### 4. Trap subdomain: poison.viktorbarzin.me

- Cloudflare DNS record (non-proxied, direct to cluster)
- IngressRoute routing all paths to poison-fountain service
- NO rate limiting on this route (let scrapers consume all they want)
- NO CrowdSec on this route (don't block scrapers here)
- Serves poisoned content with tarpit slow-drip

### 5. ingress_factory changes

New variables:
- `anti_ai_scraping` (bool, default: true) - enable all anti-AI layers
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`
- Services can opt out with `anti_ai_scraping = false`

## Human User Protection

| Concern | Protection |
|---------|-----------|
| Hidden links visible | CSS `position:absolute;left:-9999px;height:0;overflow:hidden` + `aria-hidden="true"` |
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |

## File Locations

| Component | Path |
|-----------|------|
| Poison service stack | `stacks/poison-fountain/main.tf` |
| Poison service code | `stacks/poison-fountain/app/` |
| Middleware definitions | `stacks/platform/modules/traefik/middleware.tf` |
| ingress_factory changes | `modules/kubernetes/ingress_factory/main.tf` |
| Cloudflare DNS | `terraform.tfvars` (cloudflare_non_proxied_names) |
| NFS cache | `/mnt/main/poison-fountain/cache/` |

## Deployment Order

1. Add Cloudflare DNS record for `poison.viktorbarzin.me`
2. Create NFS export for `/mnt/main/poison-fountain`
3. Add Traefik middlewares (ai-bot-block, anti-ai-headers, anti-ai-trap-links)
4. Update ingress_factory with anti_ai_scraping variable
5. Deploy poison-fountain service + CronJob
6. Apply platform stack (Traefik + Cloudflare changes)
7. Apply poison-fountain stack
8. Apply all other stacks to pick up new ingress_factory defaults

docs/plans/2026-02-22-anti-ai-scraping-plan.md (Normal file, 915 lines)
@@ -0,0 +1,915 @@
# Anti-AI Scraping System Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Deploy a 5-layer anti-AI scraping system that blocks known bots, injects hidden trap links into all HTML responses, serves poisoned content from Poison Fountain, and tarpits scrapers with slow-drip responses.

**Architecture:** A lightweight Python service handles bot detection (ForwardAuth) and poison content serving (tarpit). Traefik middlewares inject anti-AI headers and hidden trap links into all public service responses via ingress_factory defaults. A CronJob refreshes cached poison content from rnsaffn.com.

**Tech Stack:** Python 3 (stdlib http.server), Terraform/Terragrunt, Traefik middleware CRDs, Kubernetes CronJob

---

### Task 1: Create the Python poison service code

**Files:**
- Create: `stacks/poison-fountain/app/server.py`
- Create: `stacks/poison-fountain/app/fetch-poison.sh`

**Step 1: Create the service directory**

```bash
mkdir -p stacks/poison-fountain/app
```

**Step 2: Write `stacks/poison-fountain/app/server.py`**

```python
"""Poison Fountain service.

Endpoints:
  GET /auth       - ForwardAuth: block known AI bot User-Agents (403) or pass (200)
  GET /article/*  - Serve cached poisoned content with tarpit slow-drip
  GET /healthz    - Health check for Kubernetes probes
  GET /*          - Catch-all: serve poison for any path (scrapers explore randomly)
"""

import http.server
import os
import glob
import random
import time
import hashlib
import sys

LISTEN_PORT = int(os.environ.get("PORT", "8080"))
CACHE_DIR = os.environ.get("CACHE_DIR", "/data/cache")
DRIP_BYTES = int(os.environ.get("DRIP_BYTES", "50"))
DRIP_DELAY = float(os.environ.get("DRIP_DELAY", "0.5"))
TRAP_LINK_COUNT = int(os.environ.get("TRAP_LINK_COUNT", "20"))
POISON_DOMAIN = os.environ.get("POISON_DOMAIN", "poison.viktorbarzin.me")

AI_BOT_PATTERNS = [
    "gptbot", "chatgpt-user", "claudebot", "claude-web", "ccbot",
    "bytespider", "google-extended", "applebot-extended",
    "anthropic-ai", "cohere-ai", "diffbot", "facebookbot",
    "perplexitybot", "youbot", "meta-externalagent", "petalbot",
    "amazonbot", "ai2bot", "omgilibot", "img2dataset",
    "omgili", "commoncrawl", "ia_archiver", "scrapy",
    "semrushbot", "ahrefsbot", "dotbot", "mj12bot",
    "seekport", "blexbot", "dataforseo", "serpstatbot",
]

FALLBACK_WORDS = [
    "the", "quantum", "neural", "framework", "implements", "distributed",
    "processing", "with", "advanced", "recursive", "algorithms", "for",
    "optimal", "convergence", "in", "multi-dimensional", "space",
    "utilizing", "transformer", "architecture", "trained", "on",
    "large-scale", "corpus", "data", "achieving", "state-of-the-art",
    "performance", "across", "benchmark", "tasks", "including",
    "natural", "language", "understanding", "generation", "and",
    "cross-lingual", "transfer", "learning", "capabilities",
]


def generate_slug():
    return hashlib.md5(str(random.random()).encode()).hexdigest()[:16]


def generate_trap_links(count):
    titles = [
        "Research Archive", "Training Corpus", "Dataset Export",
        "NLP Benchmark Results", "Web Crawl Index", "Text Corpus",
        "Machine Learning Data", "Evaluation Dataset", "Model Weights",
        "Annotation Guidelines", "Parallel Corpus", "Knowledge Base",
        "Document Collection", "Reference Data", "Taxonomy Index",
        "Classification Labels", "Entity Database", "Relation Extraction",
        "Sentiment Annotations", "Summarization Corpus", "QA Dataset",
        "Dialogue Transcripts", "Code Documentation", "API Reference",
    ]
    links = []
    for _ in range(count):
        slug = generate_slug()
        title = random.choice(titles)
        links.append(f'<a href="https://{POISON_DOMAIN}/article/{slug}">{title}</a>')
    return "\n".join(links)


def get_poison_content():
    cache_files = glob.glob(os.path.join(CACHE_DIR, "*.txt"))
    if cache_files:
        try:
            with open(random.choice(cache_files), "r", errors="replace") as f:
                return f.read()
        except Exception:
            pass
    # Cache empty or unreadable: fall back to random filler text
    return " ".join(random.choices(FALLBACK_WORDS, k=500))


class PoisonHandler(http.server.BaseHTTPRequestHandler):
    # Chunked transfer encoding is only valid in HTTP/1.1 responses
    protocol_version = "HTTP/1.1"
    server_version = "Apache/2.4.52"
    sys_version = ""

    def log_message(self, fmt, *args):
        sys.stderr.write(f"[{self.log_date_time_string()}] {fmt % args}\n")

    def do_GET(self):
        if self.path == "/healthz":
            self._respond(200, "ok")
            return

        if self.path == "/auth":
            self._handle_auth()
            return

        # Everything else gets poison
        self._serve_poison()

    def _handle_auth(self):
        ua = (self.headers.get("User-Agent") or "").lower()
        for pattern in AI_BOT_PATTERNS:
            if pattern in ua:
                self.log_message("BLOCKED AI bot: %s (matched: %s)", ua, pattern)
                self._respond(403, "Forbidden")
                return
        self._respond(200, "OK")

    def _respond(self, code, body):
        payload = body.encode()
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        # Keep-alive connections need an explicit length for non-chunked bodies
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def _serve_poison(self):
        content = get_poison_content()
        trap_links = generate_trap_links(TRAP_LINK_COUNT)

        html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Research Data Archive</title>
</head>
<body>
<main>
<article>
<h1>Research Data Collection</h1>
<div class="content">
<p>{content}</p>
</div>
</article>
<nav>
<h2>Related Research</h2>
{trap_links}
</nav>
</main>
</body>
</html>"""

        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()

        # Drip the page out as small chunks: hex size line, payload, CRLF
        for i in range(0, len(html), DRIP_BYTES):
            chunk = html[i : i + DRIP_BYTES].encode("utf-8")
            try:
                self.wfile.write(f"{len(chunk):x}\r\n".encode())
                self.wfile.write(chunk)
                self.wfile.write(b"\r\n")
                self.wfile.flush()
                time.sleep(DRIP_DELAY)
            except (BrokenPipeError, ConnectionResetError):
                return

        # Zero-length chunk terminates the chunked body
        try:
            self.wfile.write(b"0\r\n\r\n")
            self.wfile.flush()
        except (BrokenPipeError, ConnectionResetError):
            pass


if __name__ == "__main__":
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Threaded so a slow-dripping tarpit response cannot block /auth or /healthz
    server = http.server.ThreadingHTTPServer(("0.0.0.0", LISTEN_PORT), PoisonHandler)
    print(f"Poison Fountain service listening on :{LISTEN_PORT}", flush=True)
    server.serve_forever()
```

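With the defaults in `server.py` (`DRIP_BYTES=50`, `DRIP_DELAY=0.5`), the tarpit emits roughly 100 bytes per second, so the time a scraper spends on one page can be estimated from the page size. The sketch below works through that arithmetic and shows the chunk framing the handler writes. `drip_seconds` and `chunk_frame` are hypothetical helper names, and the estimate assumes ASCII content so that characters and bytes coincide.

```python
import math


def drip_seconds(page_bytes, drip_bytes=50, drip_delay=0.5):
    """Estimate the total sleep time spent dripping one page."""
    chunks = math.ceil(page_bytes / drip_bytes)
    return chunks * drip_delay


def chunk_frame(payload):
    """Frame one HTTP/1.1 chunk: hex length, CRLF, payload, CRLF."""
    return f"{len(payload):x}\r\n".encode() + payload + b"\r\n"


small_page = drip_seconds(50 * 1024)   # 1024 chunks * 0.5 s = 512.0 s
large_page = drip_seconds(100 * 1024)  # 2048 chunks * 0.5 s = 1024.0 s
frame = chunk_frame(b"hello")          # b"5\r\nhello\r\n"
```

Under these assumptions, a 50-100KB page holds a connection open for roughly 8.5 to 17 minutes.
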
**Step 3: Write `stacks/poison-fountain/app/fetch-poison.sh`**

```bash
#!/bin/sh
set -e

CACHE_DIR="${CACHE_DIR:-/data/cache}"
POISON_URL="${POISON_URL:-https://rnsaffn.com/poison2/}"
FETCH_COUNT="${FETCH_COUNT:-50}"
MAX_CACHE_FILES="${MAX_CACHE_FILES:-100}"

mkdir -p "$CACHE_DIR"

echo "Fetching $FETCH_COUNT poison documents from $POISON_URL"

fetched=0
for i in $(seq 1 "$FETCH_COUNT"); do
    OUTPUT="$CACHE_DIR/poison_$(date +%s)_${i}.txt"
    if curl -sS --compressed -o "$OUTPUT" -m 30 "$POISON_URL" 2>/dev/null; then
        # Verify file is non-empty
        if [ -s "$OUTPUT" ]; then
            fetched=$((fetched + 1))
            echo "  [$i/$FETCH_COUNT] OK"
        else
            rm -f "$OUTPUT"
            echo "  [$i/$FETCH_COUNT] Empty response, skipped"
        fi
    else
        rm -f "$OUTPUT"
        echo "  [$i/$FETCH_COUNT] Fetch failed, skipped"
    fi
    sleep 2
done

# Clean up oldest files if cache exceeds limit
total=$(find "$CACHE_DIR" -name '*.txt' -type f | wc -l)
if [ "$total" -gt "$MAX_CACHE_FILES" ]; then
    excess=$((total - MAX_CACHE_FILES))
    # ls -tr lists oldest first and is portable to BusyBox, whose find
    # does not implement GNU find's -printf
    ls -1tr "$CACHE_DIR"/*.txt | head -n "$excess" | xargs rm -f
    echo "Cleaned $excess old cache files"
fi

echo "Done: fetched $fetched new documents, $(find "$CACHE_DIR" -name '*.txt' -type f | wc -l) total cached"
```

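The clean-up step at the end of the script deletes the oldest files once the cache holds more than `MAX_CACHE_FILES` entries. For readers who find the shell pipeline hard to follow, the same policy can be sketched in Python. `prune_cache` is a hypothetical name and is not part of the plan's files.

```python
from pathlib import Path


def prune_cache(cache_dir, max_files):
    """Delete the oldest *.txt files so at most max_files remain.

    Returns the number of files removed.
    """
    # Oldest first, by modification time
    files = sorted(Path(cache_dir).glob("*.txt"), key=lambda p: p.stat().st_mtime)
    excess = max(0, len(files) - max_files)
    for path in files[:excess]:
        path.unlink()
    return excess
```

Sorting by modification time rather than by name keeps the behaviour independent of the timestamp embedded in the file names.
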
**Step 4: Verify files exist**

```bash
ls -la stacks/poison-fountain/app/
```

Expected: `server.py` and `fetch-poison.sh` listed.

**Step 5: Commit**

```bash
git add stacks/poison-fountain/app/
git commit -m "[ci skip] Add poison fountain Python service and fetcher script"
```

---

### Task 2: Set up NFS export and DNS record

**Files:**
- Modify: `secrets/nfs_directories.txt` (add `poison-fountain` and `poison-fountain/cache` lines, keep sorted)
- Modify: `terraform.tfvars` (add `poison` to `cloudflare_non_proxied_names`)

**Step 1: Add NFS directory**

Add `poison-fountain` and `poison-fountain/cache` to `secrets/nfs_directories.txt`, keeping alphabetical order. Insert after `plotting-book` entries.

**Step 2: Run NFS export script**

```bash
cd secrets && bash nfs_exports.sh
```

Verify the export was created successfully.

**Step 3: Add Cloudflare DNS record**

In `terraform.tfvars`, find the `cloudflare_non_proxied_names` list and add `"poison"` to it (alphabetical position after `"plotting-book"`).

**Step 4: Commit**

```bash
git add secrets/nfs_directories.txt terraform.tfvars
git commit -m "[ci skip] Add NFS export and DNS record for poison-fountain"
```

---

### Task 3: Add Traefik middleware CRDs

**Files:**
- Modify: `stacks/platform/modules/traefik/middleware.tf` (append 3 new middleware resources)

**Step 1: Add `ai-bot-block` ForwardAuth middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# ForwardAuth middleware to block known AI bot User-Agents
resource "kubernetes_manifest" "middleware_ai_bot_block" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "ai-bot-block"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      forwardAuth = {
        address            = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"
        trustForwardHeader = true
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 2: Add `anti-ai-headers` middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# X-Robots-Tag header to discourage compliant AI crawlers
resource "kubernetes_manifest" "middleware_anti_ai_headers" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "anti-ai-headers"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      headers = {
        customResponseHeaders = {
          "X-Robots-Tag" = "noai, noimageai"
        }
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 3: Add `anti-ai-trap-links` rewrite-body middleware**

Append to the end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# Inject hidden trap links before </body> to catch AI scrapers
# Links are CSS-hidden and aria-hidden so humans never see them
resource "kubernetes_manifest" "middleware_anti_ai_trap_links" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata = {
      name      = "anti-ai-trap-links"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      plugin = {
        rewrite-body = {
          rewrites = [{
            regex       = "</body>"
            replacement = "<div style=\"position:absolute;left:-9999px;height:0;overflow:hidden\" aria-hidden=\"true\"><a href=\"https://poison.viktorbarzin.me/article/training-data-2024-research-corpus\">Research Archive</a><a href=\"https://poison.viktorbarzin.me/article/dataset-export-machine-learning-v3\">Dataset Export</a><a href=\"https://poison.viktorbarzin.me/article/nlp-benchmark-evaluation-results\">Benchmark Results</a><a href=\"https://poison.viktorbarzin.me/article/web-crawl-index-2024-archive\">Web Index</a><a href=\"https://poison.viktorbarzin.me/article/text-corpus-english-dump\">Text Corpus</a></div></body>"
          }]
          monitoring = {
            types = ["text/html"]
          }
        }
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 4: Verify syntax**

```bash
cd stacks/platform && terraform fmt -check modules/traefik/middleware.tf || terraform fmt modules/traefik/middleware.tf
```

**Step 5: Commit**

```bash
git add stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] Add anti-AI scraping Traefik middlewares (ForwardAuth, headers, trap links)"
```

---

### Task 4: Update ingress_factory to apply anti-AI middlewares by default

**Files:**
- Modify: `modules/kubernetes/ingress_factory/main.tf` (add variable + middleware references)

**Step 1: Add `anti_ai_scraping` variable**

In `modules/kubernetes/ingress_factory/main.tf`, add after the `skip_default_rate_limit` variable (around line 73):

```hcl
variable "anti_ai_scraping" {
  type    = bool
  default = true
}
```

**Step 2: Add middlewares to the chain**

In the `kubernetes_ingress_v1` resource's `router.middlewares` annotation (around line 108-117), add 4 new lines for the anti-AI middlewares. The updated `concat` list should include:

```hcl
var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
```

Insert these after the existing `crowdsec` line (line 111) and before the `protected` line (line 112). The full `concat` array becomes:

```hcl
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
  var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
  var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null,
  var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
  var.anti_ai_scraping ? "traefik-ai-bot-block@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-anti-ai-headers@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-strip-accept-encoding@kubernetescrd" : null,
  var.anti_ai_scraping ? "traefik-anti-ai-trap-links@kubernetescrd" : null,
  var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
  var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
  var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
  var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
  var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
], var.extra_middlewares)))
```

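The `join(",", compact(concat(...)))` expression builds the annotation by listing every candidate middleware, dropping the ones whose condition yields `null`, and joining the rest with commas. The Python sketch below mirrors that pattern for a reduced set of flags. `middleware_chain` is a hypothetical name, and the CSP, local-only, Rybbit, and superseded trap-link entries are left out for brevity.

```python
def middleware_chain(
    anti_ai_scraping=True,
    skip_default_rate_limit=False,
    exclude_crowdsec=False,
    protected=False,
    extra_middlewares=(),
):
    """Build the comma-separated middleware annotation value."""
    candidates = [
        None if skip_default_rate_limit else "traefik-rate-limit@kubernetescrd",
        None if exclude_crowdsec else "traefik-crowdsec@kubernetescrd",
        "traefik-ai-bot-block@kubernetescrd" if anti_ai_scraping else None,
        "traefik-anti-ai-headers@kubernetescrd" if anti_ai_scraping else None,
        "traefik-authentik-forward-auth@kubernetescrd" if protected else None,
    ]
    # Like Terraform's compact(): drop null and empty entries after concatenation
    return ",".join(m for m in candidates + list(extra_middlewares) if m)


default_chain = middleware_chain()
trap_chain = middleware_chain(
    anti_ai_scraping=False, skip_default_rate_limit=True, exclude_crowdsec=True
)
```

With every default middleware switched off, as the poison subdomain does in Task 5, the sketch yields an empty string.
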
**Step 3: Format**

```bash
terraform fmt modules/kubernetes/ingress_factory/main.tf
```

**Step 4: Commit**

```bash
git add modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] Add anti_ai_scraping option to ingress_factory (default: true)"
```

---

### Task 5: Create the poison-fountain Terraform stack

**Files:**
- Create: `stacks/poison-fountain/terragrunt.hcl`
- Create: `stacks/poison-fountain/main.tf`
- Create: `stacks/poison-fountain/secrets` (symlink)

**Step 1: Create terragrunt.hcl**

Write `stacks/poison-fountain/terragrunt.hcl`:

```hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "platform" {
  config_path  = "../platform"
  skip_outputs = true
}
```

**Step 2: Create secrets symlink**

```bash
ln -s ../../secrets stacks/poison-fountain/secrets
```

**Step 3: Write `stacks/poison-fountain/main.tf`**

```hcl
variable "tls_secret_name" { type = string }

locals {
  tiers = {
    core    = "0-core"
    cluster = "1-cluster"
    gpu     = "2-gpu"
    edge    = "3-edge"
    aux     = "4-aux"
  }
}

resource "kubernetes_namespace" "poison_fountain" {
  metadata {
    name = "poison-fountain"
    labels = {
      "istio-injection" = "disabled"
      tier              = local.tiers.aux
    }
  }
}

module "tls_secret" {
  source          = "../../modules/kubernetes/setup_tls_secret"
  namespace       = kubernetes_namespace.poison_fountain.metadata[0].name
  tls_secret_name = var.tls_secret_name
}

# ConfigMap for the Python service code
resource "kubernetes_config_map" "poison_fountain_code" {
  metadata {
    name      = "poison-fountain-code"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }

  data = {
    "server.py" = file("${path.module}/app/server.py")
  }
}

# ConfigMap for the fetcher script
resource "kubernetes_config_map" "poison_fountain_fetcher" {
  metadata {
    name      = "poison-fountain-fetcher"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }

  data = {
    "fetch-poison.sh" = file("${path.module}/app/fetch-poison.sh")
  }
}

# Main service deployment
resource "kubernetes_deployment" "poison_fountain" {
  metadata {
    name      = "poison-fountain"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
    labels = {
      app  = "poison-fountain"
      tier = local.tiers.aux
    }
  }

  spec {
    replicas = 1
    strategy {
      type = "Recreate"
    }
    selector {
      match_labels = {
        app = "poison-fountain"
      }
    }
    template {
      metadata {
        labels = {
          app = "poison-fountain"
        }
      }
      spec {
        container {
          name    = "poison-fountain"
          image   = "python:3.12-slim"
          command = ["python", "/app/server.py"]

          port {
            container_port = 8080
          }

          env {
            name  = "CACHE_DIR"
            value = "/data/cache"
          }
          env {
            name  = "DRIP_BYTES"
            value = "50"
          }
          env {
            name  = "DRIP_DELAY"
            value = "0.5"
          }
          env {
            name  = "POISON_DOMAIN"
            value = "poison.viktorbarzin.me"
          }

          volume_mount {
            name       = "code"
            mount_path = "/app"
            read_only  = true
          }
          volume_mount {
            name       = "data"
            mount_path = "/data"
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 5
            period_seconds        = 30
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 3
            period_seconds        = 10
          }

          resources {
            requests = {
              cpu    = "10m"
              memory = "32Mi"
            }
            limits = {
              cpu    = "100m"
              memory = "128Mi"
            }
          }
        }

        volume {
          name = "code"
          config_map {
            name = kubernetes_config_map.poison_fountain_code.metadata[0].name
          }
        }
        volume {
          name = "data"
          nfs {
            server = "10.0.10.15"
            path   = "/mnt/main/poison-fountain"
          }
        }
      }
    }
  }
}

# Internal service (for ForwardAuth from Traefik)
resource "kubernetes_service" "poison_fountain" {
  metadata {
    name      = "poison-fountain"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
    labels = {
      app = "poison-fountain"
    }
  }

  spec {
    selector = {
      app = "poison-fountain"
    }
    port {
      name        = "http"
      port        = 8080
      target_port = 8080
    }
  }
}

# Public ingress for the poison trap subdomain
# Deliberately NO rate limiting, NO CrowdSec, NO anti-AI (we WANT scrapers here)
module "ingress" {
  source                  = "../../modules/kubernetes/ingress_factory"
  namespace               = kubernetes_namespace.poison_fountain.metadata[0].name
  name                    = "poison-fountain"
  host                    = "poison"
  port                    = 8080
  tls_secret_name         = var.tls_secret_name
  skip_default_rate_limit = true
  exclude_crowdsec        = true
  anti_ai_scraping        = false
}

# CronJob to fetch and cache poisoned content from Poison Fountain
resource "kubernetes_cron_job_v1" "poison_fetcher" {
  metadata {
    name      = "poison-fountain-fetcher"
    namespace = kubernetes_namespace.poison_fountain.metadata[0].name
  }

  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    concurrency_policy            = "Forbid"

    job_template {
      metadata {
        name = "poison-fountain-fetcher"
      }
      spec {
        template {
          metadata {
            name = "poison-fountain-fetcher"
          }
          spec {
            container {
              name    = "fetcher"
              image   = "curlimages/curl:latest"
              command = ["sh", "/scripts/fetch-poison.sh"]

              env {
                name  = "CACHE_DIR"
                value = "/data/cache"
              }
              env {
                name  = "POISON_URL"
                value = "https://rnsaffn.com/poison2/"
              }
              env {
                name  = "FETCH_COUNT"
                value = "50"
              }

              volume_mount {
                name       = "scripts"
                mount_path = "/scripts"
                read_only  = true
              }
              volume_mount {
                name       = "data"
                mount_path = "/data"
              }
            }

            volume {
              name = "scripts"
              config_map {
                name         = kubernetes_config_map.poison_fountain_fetcher.metadata[0].name
                default_mode = "0755"
              }
            }
            volume {
              name = "data"
              nfs {
                server = "10.0.10.15"
                path   = "/mnt/main/poison-fountain"
              }
            }

            restart_policy = "Never"
          }
        }
      }
    }
  }
}
```

**Step 4: Format and validate**

```bash
terraform fmt stacks/poison-fountain/main.tf
cd stacks/poison-fountain && terragrunt validate --non-interactive
```

**Step 5: Commit**

```bash
git add stacks/poison-fountain/
git commit -m "[ci skip] Add poison-fountain Terraform stack (deployment, service, ingress, CronJob)"
```

---

### Task 6: Deploy the platform stack (Traefik middlewares + DNS)
|
||||
|
||||
**Step 1: Plan**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | tail -40
|
||||
```
|
||||
|
||||
Expected: New resources for the 3 middleware CRDs + Cloudflare DNS record for `poison`. Changes to existing ingress resources (new middleware annotations).
|
||||
|
||||
Review the plan output carefully. The key additions should be:
|
||||
- `kubernetes_manifest.middleware_ai_bot_block`
|
||||
- `kubernetes_manifest.middleware_anti_ai_headers`
|
||||
- `kubernetes_manifest.middleware_anti_ai_trap_links`
|
||||
- Cloudflare DNS record for `poison`
|
||||
- Modified ingress annotations on all services in the platform stack
|
||||
|
||||
**Step 2: Apply**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt apply --non-interactive 2>&1 | tail -40
|
||||
```
|
||||
|
||||
**Step 3: Verify middlewares exist**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get middlewares.traefik.io -n traefik | grep -E "ai-bot-block|anti-ai"
|
||||
```
|
||||
|
||||
Expected: 3 middleware resources listed.
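The count can also be checked mechanically. The sketch below is a minimal example and not part of the original plan: `count_ai_middlewares` is a hypothetical helper that reads `kubectl get` output on stdin and counts rows whose name starts with `ai-bot-block` or `anti-ai`, mirroring the `grep -E` pattern used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count rows on stdin whose first column (the resource
# name) starts with "ai-bot-block" or "anti-ai".
count_ai_middlewares() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable under "set -e" while still printing 0.
  grep -cE '^(ai-bot-block|anti-ai)' || true
}

# Usage (requires cluster access, so it is left commented out):
#   kubectl --kubeconfig "$(pwd)/config" get middlewares.traefik.io -n traefik --no-headers \
#     | count_ai_middlewares
```

A printed value other than the expected count suggests the platform apply did not create every middleware.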
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Deploy the poison-fountain stack
|
||||
|
||||
**Step 1: Plan**
|
||||
|
||||
```bash
|
||||
cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30
|
||||
```
|
||||
|
||||
Expected: New namespace, configmaps, deployment, service, ingress, CronJob.
|
||||
|
||||
**Step 2: Apply**
|
||||
|
||||
```bash
|
||||
cd stacks/poison-fountain && terragrunt apply --non-interactive 2>&1 | tail -30
|
||||
```
|
||||
|
||||
**Step 3: Monitor pod startup**
|
||||
|
||||
Spawn a background agent to watch the pod come up:
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -w
|
||||
```
|
||||
|
||||
Expected: Pod reaches `Running` state with `1/1` ready.
|
||||
|
||||
**Step 4: Trigger the first poison cache fetch**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config create job --from=cronjob/poison-fountain-fetcher poison-fetch-initial -n poison-fountain
|
||||
```
|
||||
|
||||
Watch the job complete:
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config logs -n poison-fountain -l job-name=poison-fetch-initial -f
|
||||
```
|
||||
|
||||
Expected: Fetched N poison documents.
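Instead of following the logs by eye, the job can be awaited explicitly. This is a minimal sketch, assuming the kubeconfig is supplied through `$KUBECONFIG`; `wait_for_job` and the `KUBECTL` override are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: block until a Job reports the Complete condition.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
wait_for_job() {
  local job=$1 namespace=$2 timeout=${3:-300s}
  "${KUBECTL:-kubectl}" wait --for=condition=complete "job/${job}" \
    -n "${namespace}" --timeout="${timeout}"
}

# Usage (needs cluster access):
#   wait_for_job poison-fetch-initial poison-fountain 600s
```

Note that a failed job never reaches the Complete condition, so the call only returns once the timeout expires in that case.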
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Verify the full system
|
||||
|
||||
**Step 1: Verify ForwardAuth blocks AI bots**
|
||||
|
||||
```bash
|
||||
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: GPTBot/1.0" https://echo.viktorbarzin.me/
|
||||
```
|
||||
|
||||
Expected: `403`
|
||||
|
||||
**Step 2: Verify legitimate users pass through**
|
||||
|
||||
```bash
|
||||
curl -s -o /dev/null -w "%{http_code}" -H "User-Agent: Mozilla/5.0" https://echo.viktorbarzin.me/
|
||||
```
|
||||
|
||||
Expected: `200`
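Steps 1 and 2 can be folded into one repeatable check. The helpers below are a sketch under assumptions: `probe_status` and `expect_status` are hypothetical names, and the usage lines reuse the URL and User-Agent strings from the steps above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for Steps 1 and 2.

# Fetch only the HTTP status code for a URL with a given User-Agent.
probe_status() {
  local user_agent=$1 url=$2
  curl -s -o /dev/null -w '%{http_code}' -H "User-Agent: ${user_agent}" "${url}"
}

# Compare an observed status code with the expected one.
expect_status() {
  local label=$1 expected=$2 actual=$3
  if [ "${actual}" = "${expected}" ]; then
    echo "PASS ${label}: ${actual}"
  else
    echo "FAIL ${label}: expected ${expected}, got ${actual}"
    return 1
  fi
}

# Usage (performs real requests, so it is left commented out):
#   expect_status "GPTBot"  403 "$(probe_status 'GPTBot/1.0' https://echo.viktorbarzin.me/)"
#   expect_status "browser" 200 "$(probe_status 'Mozilla/5.0' https://echo.viktorbarzin.me/)"
```

Keeping the comparison separate from the request makes it easy to add further User-Agent strings later without touching the pass/fail logic.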
|
||||
|
||||
**Step 3: Verify X-Robots-Tag header**
|
||||
|
||||
```bash
|
||||
curl -sI https://echo.viktorbarzin.me/ 2>/dev/null | grep -i x-robots-tag
|
||||
```
|
||||
|
||||
Expected: `X-Robots-Tag: noai, noimageai`
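HTTP/2 responses usually report header names in lowercase and every header line ends in CRLF, which can trip up exact string comparisons. The following sketch, with the hypothetical helper name `header_value`, extracts just the value so it can be compared directly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of one response header, matching the
# header name case-insensitively and dropping the trailing carriage return.
header_value() {
  local name=$1
  awk -v want="${name}" '
    {
      line = $0
      sub(/\r$/, "", line)            # HTTP header lines end in CRLF
      sep = index(line, ":")
      if (sep == 0) next              # status line or blank line
      if (tolower(substr(line, 1, sep - 1)) != tolower(want)) next
      value = substr(line, sep + 1)
      sub(/^[ \t]+/, "", value)
      print value
    }'
}

# Usage (performs a real request, so it is left commented out):
#   curl -sI https://echo.viktorbarzin.me/ | header_value X-Robots-Tag
```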
|
||||
|
||||
**Step 4: Verify hidden trap links in HTML**
|
||||
|
||||
```bash
|
||||
curl -s https://echo.viktorbarzin.me/ | grep -o "poison.viktorbarzin.me"
|
||||
```
|
||||
|
||||
Expected: Multiple matches (trap links injected before `</body>`).
|
||||
|
||||
**Step 5: Verify poison service serves content with tarpit**
|
||||
|
||||
```bash
|
||||
timeout 10 curl -s -H "User-Agent: Mozilla/5.0" https://poison.viktorbarzin.me/article/test 2>/dev/null | head -5
|
||||
```
|
||||
|
||||
Expected: HTML content starting to arrive slowly (only a few lines in 10 seconds due to tarpit).
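"Slowly" can be made measurable by comparing the observed transfer rate with a ceiling. This is only a sketch: `is_tarpitted` is a hypothetical helper, and the 200 bytes per second threshold in the usage comment is an assumption, since the plan does not state the tarpit's configured rate.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the observed transfer rate is at or below
# a threshold in bytes per second. Exit status 2 signals an invalid duration.
is_tarpitted() {
  local bytes=$1 seconds=$2 max_bps=$3
  awk -v b="${bytes}" -v s="${seconds}" -v m="${max_bps}" \
    'BEGIN { if (s <= 0) exit 2; exit ((b / s) <= m ? 0 : 1) }'
}

# Usage (performs a real request, so it is left commented out):
#   read -r size elapsed < <(curl -s -o /dev/null --max-time 10 \
#     -w '%{size_download} %{time_total}\n' https://poison.viktorbarzin.me/article/test)
#   is_tarpitted "$size" "$elapsed" 200 && echo "tarpit active"
```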
|
||||
|
||||
**Step 6: Run cluster health check**
|
||||
|
||||
```bash
|
||||
bash scripts/cluster_healthcheck.sh --quiet
|
||||
```
|
||||
|
||||
Expected: No new WARN/FAIL related to poison-fountain.
|
||||
|
||||
**Step 7: Commit all applied state**
|
||||
|
||||
```bash
|
||||
git add -A && git status
|
||||
```
|
||||
|
||||
Review for any uncommitted changes and commit if needed.
|
||||
29
docs/plans/2026-02-22-node-drift-quick-wins-design.md
Normal file
|
|
@ -0,0 +1,29 @@
|
|||
# Node Configuration Drift Quick Wins — Design
|
||||
|
||||
**Date**: 2026-02-22
|
||||
**Status**: Approved
|
||||
**Context**: From Talos Linux evaluation — these close 95% of the drift gap without changing the OS
|
||||
|
||||
## Quick Win 1: Add GPU Label to Terraform
|
||||
|
||||
**File**: `stacks/platform/modules/nvidia/main.tf`
|
||||
|
||||
Extend the existing `null_resource.gpu_node_taint` to also apply the `gpu=true` label. Rename to `gpu_node_config`. Both commands are idempotent (`--overwrite` for taint, label is a no-op if already set).
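The combined step might look like the sketch below. It is illustrative only: the function name `configure_gpu_node`, the node name `k8s-node1`, and the taint key `nvidia.com/gpu=true:NoSchedule` are assumptions borrowed from the Talos evaluation, and `--overwrite` is applied to both commands here as a design choice so that a changed value is also corrected on re-run.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the combined taint + label step. Check the node name
# and taint key against the real null_resource before reuse.
configure_gpu_node() {
  local node=$1
  # --overwrite makes both commands safe to re-run.
  "${KUBECTL:-kubectl}" taint nodes "${node}" nvidia.com/gpu=true:NoSchedule --overwrite
  "${KUBECTL:-kubectl}" label nodes "${node}" gpu=true --overwrite
}

# Usage: preview with KUBECTL=echo, then run for real.
#   KUBECTL=echo configure_gpu_node k8s-node1
```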
|
||||
|
||||
## Quick Win 2: Improve API Server OIDC/Audit Idempotency
|
||||
|
||||
**Files**: `stacks/platform/modules/rbac/apiserver-oidc.tf`, `audit-policy.tf`
|
||||
|
||||
Current grep-before-sed checks prevent duplicate entries but don't handle value changes. Improve the OIDC check to compare the actual issuer URL value, not just the flag name. Audit policy file is always re-uploaded (good), manifest edit is skipped if already configured (acceptable).
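A value-aware check could distinguish three states instead of two. The sketch below assumes the flag appears in the kubeadm static pod manifest as a `- --oidc-issuer-url=<url>` list entry; `oidc_flag_state` is a hypothetical helper name, not existing code in the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether the API server manifest already carries
# the desired OIDC issuer URL. Prints one of: missing, current, stale.
oidc_flag_state() {
  local manifest=$1 want=$2 current
  # Extract the value of the first --oidc-issuer-url entry, if any.
  current=$(sed -n 's/^[[:space:]]*-[[:space:]]*--oidc-issuer-url=//p' "${manifest}" | head -n 1)
  if [ -z "${current}" ]; then
    echo missing     # flag absent: insert it
  elif [ "${current}" = "${want}" ]; then
    echo current     # flag present with the right value: nothing to do
  else
    echo stale       # flag present with another value: rewrite it
  fi
}

# Usage on the control-plane node (path is the kubeadm default):
#   oidc_flag_state /etc/kubernetes/manifests/kube-apiserver.yaml \
#     https://authentik.viktorbarzin.me/application/o/kubernetes/
```

The `stale` state is the case the current grep-only check misses.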
|
||||
|
||||
## Quick Win 3: Enable Node-Exporter via Prometheus Helm Chart
|
||||
|
||||
**File**: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`
|
||||
|
||||
Uncomment `prometheus-node-exporter: enabled: true`. Delete `playbooks/deploy_node_exporter.yaml` (unused, superseded by DaemonSet).
|
||||
|
||||
## Quick Win 4: Document Node Rebuild Procedure
|
||||
|
||||
**File**: `.claude/CLAUDE.md`
|
||||
|
||||
Add a "Node Rebuild Procedure" section documenting the full sequence: VM creation from template → cloud-init → kubeadm join → verify mirrors/labels/taints.
|
||||
272
docs/plans/2026-02-22-talos-linux-migration-evaluation.md
Normal file
|
|
@ -0,0 +1,272 @@
|
|||
# Talos Linux Migration Evaluation
|
||||
|
||||
**Date**: 2026-02-22
|
||||
**Status**: Parked (evaluating ROI)
|
||||
**Decision**: Not yet decided — saved for future reference
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The Kubernetes cluster nodes (Ubuntu 24.04) are configured through a mix of:
|
||||
- Cloud-init (packages, repos, containerd, kubelet, kubeadm join)
|
||||
- Terraform `null_resource` with SSH (containerd mirrors, API server OIDC, audit policy, GPU taint)
|
||||
- Ansible playbook (node exporter — optional)
|
||||
- DaemonSets (sysctl inotify limits)
|
||||
- Manual steps (GPU label, node upgrades, containerd mirror fixes)
|
||||
|
||||
This creates a drift surface and makes full from-scratch reprovisioning non-trivial.
|
||||
|
||||
**Goals:**
|
||||
1. Prevent configuration drift — ensure nodes match what's declared in code
|
||||
2. Single-command bootstrap — recover from complete node/cluster failure
|
||||
3. Everything managed as code in the infra repository
|
||||
|
||||
## Options Evaluated
|
||||
|
||||
### Option 1: Chef on Ubuntu — Rejected
|
||||
|
||||
- Chef is effectively dead (Progress acquisition, shrinking ecosystem)
|
||||
- Adds Ruby DSL, Chef server/zero, cookbook management — a parallel config system
|
||||
- Drift detection is reactive (periodic convergence), not preventive
|
||||
- Doesn't simplify the provisioning chain, just replaces SSH commands with recipes
|
||||
|
||||
### Option 2: NixOS — Not pursued
|
||||
|
||||
- Strongest drift guarantees (entire OS derived from Nix expression)
|
||||
- Steep learning curve (functional language, unhelpful error messages)
|
||||
- NVIDIA + containerd + K8s on NixOS is a niche combination
|
||||
- Proxmox cloud-init integration less mature than Ubuntu
|
||||
- Significant migration effort for marginal benefit over Talos
|
||||
|
||||
### Option 3: Talos Linux — Preferred candidate (if migrating)
|
||||
|
||||
Purpose-built immutable K8s OS. No SSH, no shell, no package manager. Entire node config is a single YAML document applied via gRPC API. Read-only filesystem makes drift structurally impossible.
|
||||
|
||||
### Option 4: Improve current setup — Low-cost alternative
|
||||
|
||||
Consolidate existing `null_resource` SSH blocks, fix the GPU label gap, and accept the small drift surface. See "Quick Wins" section below.
|
||||
|
||||
## Talos Linux — Detailed Assessment
|
||||
|
||||
### What Maps Cleanly
|
||||
|
||||
| Current (Ubuntu) | Talos Equivalent | Complexity |
|---|---|---|
| cloud_init.yaml packages | Eliminated (no packages needed) | None |
| containerd registry mirrors | `machine.registries.mirrors` in machine config | Simple |
| `kubeadm join` | Talos manages K8s lifecycle natively | Simple |
| sysctl DaemonSet (inotify) | `machine.sysctls` in machine config | Simple |
| API server OIDC flags (SSH+sed) | `cluster.apiServer.extraArgs` | Simple |
| Audit policy (SSH+sed) | `cluster.apiServer.extraArgs` + `extraVolumes` | Simple |
| GPU label (manual) | `machine.nodeLabels` | Simple |
| GPU taint (null_resource) | `machine.nodeTaints` or machine config | Simple |
| Static IPs | `machine.network.interfaces` | Simple |
| QEMU guest agent | `qemu-guest-agent` system extension | Simple |
|
||||
|
||||
### What Has Friction
|
||||
|
||||
| Component | Issue | Severity |
|---|---|---|
| NFS volumes | `nfs-utils` extension is "contrib" tier (community-maintained) | Medium |
| NVIDIA GPU | Extensions must version-lock to Talos release; Tesla T4 needs open kernel modules | Medium |
| No SSH | Debugging via `talosctl` only (dmesg, logs, dashboard, pcap) | Low-Medium |
| Not kubeadm | Cannot in-place migrate; must build parallel cluster | High (one-time) |
| Proxmox templates | Different provisioning model (ISO boot vs cloud-init clone) | Medium |
| No arbitrary packages | No tcpdump, htop, vim on nodes; use talosctl equivalents or debug containers | Low |
|
||||
|
||||
### Terraform Integration
|
||||
|
||||
Official provider: `siderolabs/talos` v0.10.1
|
||||
|
||||
```hcl
|
||||
# Key resources:
|
||||
# - talos_machine_secrets — cluster-wide secrets (generated once)
|
||||
# - talos_machine_configuration — per-node machine config (data source)
|
||||
# - talos_machine_configuration_apply — apply config to a node
|
||||
# - talos_machine_bootstrap — bootstrap control plane (once)
|
||||
# - talos_cluster_kubeconfig — retrieve kubeconfig
|
||||
```
|
||||
|
||||
Would fit as `stacks/talos/` alongside existing `stacks/infra/`.
|
||||
|
||||
### Example Machine Configs
|
||||
|
||||
#### Worker node (e.g., k8s-node2)
|
||||
|
||||
```yaml
version: v1alpha1
machine:
  type: worker
  network:
    hostname: k8s-node2
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.20.102/24
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.20.1
    nameservers:
      - 10.0.20.201 # Technitium
      - 1.1.1.1
  registries:
    mirrors:
      docker.io:
        endpoints: ["http://10.0.20.10:5000"]
      ghcr.io:
        endpoints: ["http://10.0.20.10:5010"]
      quay.io:
        endpoints: ["http://10.0.20.10:5020"]
      registry.k8s.io:
        endpoints: ["http://10.0.20.10:5030"]
      reg.kyverno.io:
        endpoints: ["http://10.0.20.10:5040"]
  sysctls:
    fs.inotify.max_user_watches: "1048576"
    fs.inotify.max_user_instances: "8192"
    net.ipv4.ip_forward: "1"
  kubelet:
    extraConfig:
      serializeImagePulls: false
      maxParallelImagePulls: 50
  install:
    disk: /dev/sda
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
cluster:
  controlPlane:
    endpoint: https://10.0.20.100:6443
```
|
||||
|
||||
#### GPU node (k8s-node1) — additional config
|
||||
|
||||
```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  nodeLabels:
    gpu: "true"
  nodeTaints:
    nvidia.com/gpu: "true:NoSchedule"
  install:
    extensions:
      - image: ghcr.io/siderolabs/nfs-utils:v2.7.2
      - image: ghcr.io/siderolabs/qemu-guest-agent:v10.2.0
      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:550.x-v1.9.5
      - image: ghcr.io/siderolabs/nvidia-container-toolkit:550.x-v1.17.x
```
|
||||
|
||||
#### Control plane (k8s-master) — OIDC + audit
|
||||
|
||||
```yaml
cluster:
  apiServer:
    extraArgs:
      oidc-issuer-url: https://authentik.viktorbarzin.me/application/o/kubernetes/
      oidc-client-id: kubernetes
      oidc-username-claim: email
      oidc-groups-claim: groups
      audit-policy-file: /etc/kubernetes/policies/audit-policy.yaml
      audit-log-path: /var/log/kubernetes/audit.log
      audit-log-maxage: "7"
      audit-log-maxbackup: "3"
      audit-log-maxsize: "100"
    extraVolumes:
      - hostPath: /etc/kubernetes/policies
        mountPath: /etc/kubernetes/policies
        readOnly: true
      - hostPath: /var/log/kubernetes
        mountPath: /var/log/kubernetes
```
|
||||
|
||||
### Migration Path (if proceeding)
|
||||
|
||||
This is NOT an in-place migration. Talos replaces kubeadm entirely.
|
||||
|
||||
1. **Build Talos machine configs** in the repo (YAML per node, templated via Terraform)
|
||||
2. **Create `stacks/talos/` stack** — Proxmox VM creation + Talos provider resources
|
||||
3. **Download Talos ISO** with extensions (nfs-utils, qemu-guest-agent, nvidia) from Image Factory
|
||||
4. **Stand up parallel cluster** — new Talos VMs on unused IPs (Proxmox has ~46GB RAM headroom)
|
||||
5. **Bootstrap control plane** via `talosctl bootstrap`
|
||||
6. **Point existing Terraform service stacks** at new cluster kubeconfig
|
||||
7. **Apply all service stacks** — NFS-backed services point at same data, no data migration
|
||||
8. **Validate everything works** — run cluster healthcheck, test all services
|
||||
9. **Tear down old Ubuntu VMs**
|
||||
10. **Reassign IPs** if desired (reconfigure Talos nodes to use original IPs)
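Steps 4 and 5 of the list above roughly correspond to the `talosctl` sequence sketched below. This is an outline under assumptions, not a tested runbook: the function name `bootstrap_talos`, the config file names, and the example IPs are placeholders, and setting `TALOSCTL=echo` only prints the commands.

```bash
#!/usr/bin/env bash
# Hypothetical outline for one control-plane node and one worker.
# File names and IPs are placeholders; TALOSCTL=echo previews the commands.
bootstrap_talos() {
  local cp_ip=$1 worker_ip=$2
  local t="${TALOSCTL:-talosctl}"
  # Push machine configs to nodes booted from the ISO (maintenance mode).
  "$t" apply-config --insecure --nodes "${cp_ip}" --file controlplane.yaml
  "$t" apply-config --insecure --nodes "${worker_ip}" --file worker.yaml
  # Bootstrap etcd exactly once, on a single control-plane node.
  "$t" bootstrap --nodes "${cp_ip}" --endpoints "${cp_ip}"
  # Fetch a kubeconfig for the new cluster.
  "$t" kubeconfig --nodes "${cp_ip}" --endpoints "${cp_ip}"
}

# Usage: preview first, using unused IPs for the parallel cluster.
#   TALOSCTL=echo bootstrap_talos 10.0.20.110 10.0.20.112
```

If the Terraform provider is used instead, the `talos_machine_configuration_apply` and `talos_machine_bootstrap` resources listed earlier cover the same steps declaratively.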
|
||||
|
||||
### What Gets Eliminated
|
||||
|
||||
If migrated, these files/patterns become unnecessary:
|
||||
- `modules/create-template-vm/cloud_init.yaml`
|
||||
- `modules/create-template-vm/` (entire module)
|
||||
- `modules/create-vm/` (replaced by Talos provider)
|
||||
- `scripts/setup_containerd_mirrors.sh`
|
||||
- `stacks/platform/modules/rbac/apiserver-oidc.tf` (SSH+sed block)
|
||||
- `stacks/platform/modules/rbac/audit-policy.tf` (SSH+sed block)
|
||||
- `stacks/platform/modules/monitoring/loki.tf` sysctl-inotify DaemonSet
|
||||
- `playbooks/deploy_node_exporter.yaml`
|
||||
- `null_resource.gpu_node_taint` in nvidia module
|
||||
- The undocumented GPU label manual step
|
||||
|
||||
## ROI Analysis
|
||||
|
||||
### Costs
|
||||
|
||||
| Cost | Estimate |
|---|---|
| Learn Talos + talosctl workflow | Significant (new paradigm, no SSH) |
| Build Talos Terraform stack | Medium (new stack, provider, machine configs) |
| Build custom Talos ISO with extensions | Low (Image Factory makes this easy) |
| Parallel cluster setup + validation | Medium-High (must test every service) |
| NVIDIA driver testing on Talos | Medium (version-locking, open kernel modules) |
| Loss of SSH node access | Ongoing (workflow change) |
| Ongoing: Talos upgrades require extension version alignment | Low-Medium |
|
||||
|
||||
### Benefits
|
||||
|
||||
| Benefit | Value |
|---|---|
| Zero configuration drift (structural guarantee) | High (but current drift risk is actually low) |
| Single-command node rebuild | High |
| Eliminates ~10 files/patterns of provisioning code | Medium |
| Atomic OS upgrades with rollback | Medium |
| Declarative API server config (no SSH+sed) | Medium |
| GPU label/taint properly codified | Low (could fix this today in 5 minutes) |
| Immutable, minimal attack surface | Low-Medium (nodes aren't internet-exposed) |
|
||||
|
||||
### Honest Assessment
|
||||
|
||||
The current drift surface is small and well-understood. The highest-risk items are:
|
||||
1. **API server OIDC/audit config** — SSH+sed is fragile but rarely changes
|
||||
2. **containerd mirrors** — baked into template, stable once set
|
||||
3. **GPU label** — missing from code but trivially fixable
|
||||
|
||||
Most node config only runs at provisioning time (cloud-init) and doesn't drift because nobody SSHes into nodes to change things in practice.
|
||||
|
||||
**Talos solves a real problem, but the problem isn't causing real pain today.** The migration cost is high relative to the current risk. It would make sense to revisit if:
|
||||
- Adding more nodes frequently (scaling the cluster)
|
||||
- Experiencing actual drift incidents
|
||||
- Rebuilding the cluster for other reasons (K8s major version upgrade, hardware change)
|
||||
- The current SSH+sed patterns break and need rework anyway
|
||||
|
||||
## Quick Wins (Do Instead / Do Now)
|
||||
|
||||
These close most of the drift gap without changing the OS:
|
||||
|
||||
1. **Add GPU label to Terraform** — `kubectl label` in existing nvidia `null_resource` or a `kubernetes_labels` resource
|
||||
2. **Make API server OIDC config idempotent** — improve the grep-before-sed checks
|
||||
3. **Move node-exporter to K8s DaemonSet** — instead of Ansible playbook on host
|
||||
4. **Document the full node rebuild procedure** — cloud-init template → clone → join → verify
|
||||
|
||||
## References
|
||||
|
||||
- Talos docs: https://docs.siderolabs.com/talos/v1.9/
|
||||
- Talos Proxmox guide: https://docs.siderolabs.com/talos/v1.9/platform-specific-installations/virtualized-platforms/proxmox/
|
||||
- Talos NVIDIA GPU: https://docs.siderolabs.com/talos/v1.9/configure-your-talos-cluster/hardware-and-drivers/nvidia-gpu
|
||||
- Talos Terraform provider: https://registry.terraform.io/providers/siderolabs/talos/latest (v0.10.1)
|
||||
- Talos system extensions: https://github.com/siderolabs/extensions
|
||||
- Talos Image Factory: https://factory.talos.dev/
|
||||
63
docs/plans/2026-02-23-mailserver-hardening-design.md
Normal file
|
|
@ -0,0 +1,63 @@
|
|||
# Mail Server Lightweight Hardening Design
|
||||
|
||||
**Date**: 2026-02-23
|
||||
**Scope**: Security, reliability, and hygiene improvements to the docker-mailserver stack
|
||||
**Status**: Completed. ForwardEmail relay removed 2026-04-12 — MX now direct to mail.viktorbarzin.me on dedicated MetalLB IP with CrowdSec protection.
|
||||
|
||||
## Current State
|
||||
|
||||
- docker-mailserver 15.0.0 on K8s (single replica, Recreate strategy)
|
||||
- Roundcubemail webmail (MySQL-backed, debug logging on, unpinned :latest tag)
|
||||
- Outbound relay via Mailgun, inbound MX via ForwardEmail
|
||||
- OpenDKIM for DKIM signing, no spam filtering (SpamAssassin/ClamAV/Amavis disabled)
|
||||
- DMARC policy `none` (monitoring only)
|
||||
- No brute-force protection, no mailserver-down alert
|
||||
- Dovecot exporter sidecar (unpinned), stale SendGrid DNS records
|
||||
|
||||
## Changes
|
||||
|
||||
### 1. Enable Rspamd (replace OpenDKIM as DKIM signer)
|
||||
|
||||
Add to `mailserver_env_config`:
|
||||
- `ENABLE_RSPAMD = "1"` (spam filtering, DKIM verification, phishing detection, Oletools)
|
||||
- `ENABLE_OPENDKIM = "0"` (Rspamd handles DKIM signing natively)
|
||||
- `RSPAMD_LEARN = "1"` (learn from Junk folder movements)
|
||||
|
||||
Existing OpenDKIM key mounts stay — Rspamd reads them from the same paths.
|
||||
Resource impact: ~150-200MB additional RAM.
|
||||
|
||||
### 2. DMARC DNS enforcement
|
||||
|
||||
Update `_dmarc` TXT record: `p=none` -> `p=quarantine`. Can tighten to `p=reject` after validation.
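After the record changes, the published policy can be confirmed from DNS. The sketch below is a minimal example; `dmarc_policy` is a hypothetical helper that extracts only the `p=` tag (not `sp=` or `pct=`) from a TXT record string.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of the p= tag from a DMARC TXT record
# passed as the first argument. Surrounding quotes from dig are stripped.
dmarc_policy() {
  printf '%s\n' "$1" \
    | tr -d '"' \
    | tr ';' '\n' \
    | sed -n 's/^[[:space:]]*p=\([a-z]*\).*$/\1/p' \
    | head -n 1
}

# Usage (performs a DNS lookup, so it is left commented out):
#   dmarc_policy "$(dig +short TXT _dmarc.viktorbarzin.me)"
```

Splitting on `;` before matching is what keeps `sp=quarantine` and `pct=100` from being mistaken for the main policy tag.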
|
||||
|
||||
### 3. Postfix rate limiting
|
||||
|
||||
Add to `postfix_cf`:
|
||||
```
|
||||
smtpd_client_connection_rate_limit = 10
|
||||
smtpd_client_message_rate_limit = 30
|
||||
anvil_rate_time_unit = 60s
|
||||
```
|
||||
|
||||
Service already uses `externalTrafficPolicy: Local`, so real client IPs are visible to Postfix.
|
||||
ForwardEmail IPs on port 25 are subject to the same limits, but 10 connections per minute is generous.
|
||||
|
||||
### 4. Prometheus alert
|
||||
|
||||
Uncomment the existing mailserver-down alert in `prometheus_chart_values.tpl`.
|
||||
|
||||
### 5. Roundcubemail cleanup
|
||||
|
||||
- Pin image: `roundcube/roundcubemail:latest` -> `roundcube/roundcubemail:1.6-apache`
|
||||
- Disable debug: `ROUNDCUBEMAIL_SMTP_DEBUG = "false"`, `ROUNDCUBEMAIL_DEBUG_LEVEL = "1"`
|
||||
|
||||
### 6. SendGrid DNS cleanup
|
||||
|
||||
Remove stale CNAME records: `em7107`, `s1._domainkey`, `s2._domainkey`.
|
||||
|
||||
## Not Changing
|
||||
|
||||
- Roundcubemail stays (user preference)
|
||||
- ForwardEmail/Mailgun relay stays (practical dependency)
|
||||
- ClamAV stays disabled (Rspamd Oletools covers malicious attachments)
|
||||
- Single replica (HA email requires significant additional complexity)
|
||||
317
docs/plans/2026-02-23-mailserver-hardening-plan.md
Normal file
|
|
@ -0,0 +1,317 @@
|
|||
# Mail Server Lightweight Hardening Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Harden the mail server with spam filtering (Rspamd), DMARC enforcement, rate limiting, monitoring alerts, and hygiene cleanup.
|
||||
|
||||
**Status**: Completed. ForwardEmail references in this plan are historical — relay removed 2026-04-12. MX points directly to mail.viktorbarzin.me.
|
||||
|
||||
**Architecture:** All changes are to the existing docker-mailserver 15.0.0 deployment managed by Terraform. Rspamd replaces OpenDKIM for DKIM signing and adds spam filtering. DMARC moves from `none` to `quarantine` in Cloudflare DNS. Postfix gets rate-limiting parameters. Prometheus gets a mailserver-down alert. Roundcubemail debug logging is disabled and image pinned.
|
||||
|
||||
**Tech Stack:** Terraform/HCL, docker-mailserver, Rspamd, Cloudflare DNS, Prometheus
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Enable Rspamd and disable OpenDKIM
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/mailserver/main.tf:39-62` (env ConfigMap)
|
||||
|
||||
**Step 1: Add Rspamd env vars to the ConfigMap**
|
||||
|
||||
In `stacks/platform/modules/mailserver/main.tf`, in the `kubernetes_config_map.mailserver_env_config` resource `data` block, add these entries and modify existing ones:
|
||||
|
||||
```hcl
|
||||
data = {
|
||||
DMS_DEBUG = "0"
|
||||
ENABLE_CLAMAV = "0"
|
||||
ENABLE_AMAVIS = "0"
|
||||
ENABLE_FAIL2BAN = "0"
|
||||
ENABLE_FETCHMAIL = "0"
|
||||
ENABLE_POSTGREY = "0"
|
||||
ENABLE_SASLAUTHD = "0"
|
||||
ENABLE_SPAMASSASSIN = "0"
|
||||
ENABLE_SRS = "1"
|
||||
ENABLE_RSPAMD = "1"
|
||||
ENABLE_OPENDKIM = "0"
|
||||
ENABLE_OPENDMARC = "0"
|
||||
RSPAMD_LEARN = "1"
|
||||
FETCHMAIL_POLL = "120"
|
||||
ONE_DIR = "1"
|
||||
OVERRIDE_HOSTNAME = "mail.viktorbarzin.me"
|
||||
POSTFIX_MESSAGE_SIZE_LIMIT = 1024 * 1024 * 200 # 200 MB
|
||||
POSTFIX_REJECT_UNKNOWN_CLIENT_HOSTNAME = "1"
|
||||
DEFAULT_RELAY_HOST = "[smtp.eu.mailgun.org]:587"
|
||||
SPOOF_PROTECTION = "1"
|
||||
SSL_TYPE = "manual"
|
||||
SSL_CERT_PATH = "/tmp/ssl/tls.crt"
|
||||
SSL_KEY_PATH = "/tmp/ssl/tls.key"
|
||||
}
|
||||
```
|
||||
|
||||
The key additions are: `ENABLE_RSPAMD = "1"`, `ENABLE_OPENDKIM = "0"`, `ENABLE_OPENDMARC = "0"`, `RSPAMD_LEARN = "1"`.
|
||||
|
||||
**Note:** The existing OpenDKIM volume mounts (KeyTable, SigningTable, TrustedHosts, opendkim keys) should stay mounted. docker-mailserver's Rspamd integration reads the DKIM key from the same path (`/tmp/docker-mailserver/opendkim/keys/`) to configure Rspamd's DKIM signing module automatically.
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/mailserver/main.tf
|
||||
git commit -m "[ci skip] mailserver: enable Rspamd, disable OpenDKIM"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Add Postfix rate limiting
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/mailserver/variables.tf:3-22` (postfix_cf variable)
|
||||
|
||||
**Step 1: Add rate limiting parameters to postfix_cf**
|
||||
|
||||
In `stacks/platform/modules/mailserver/variables.tf`, append these lines to the `postfix_cf` default value, before the `EOT`:
|
||||
|
||||
```
|
||||
smtpd_client_connection_rate_limit = 10
|
||||
smtpd_client_message_rate_limit = 30
|
||||
anvil_rate_time_unit = 60s
|
||||
```
|
||||
|
||||
The full `postfix_cf` variable should become:
|
||||
|
||||
```hcl
|
||||
variable "postfix_cf" {
|
||||
default = <<EOT
|
||||
#relayhost = [smtp.sendgrid.net]:587
|
||||
relayhost = [smtp.eu.mailgun.org]:587
|
||||
smtp_sasl_auth_enable = yes
|
||||
smtp_sasl_password_maps = hash:/etc/postfix/sasl/passwd
|
||||
smtp_sasl_security_options = noanonymous
|
||||
smtp_sasl_tls_security_options = noanonymous
|
||||
smtp_tls_security_level = encrypt
|
||||
smtpd_tls_cert_file=/tmp/ssl/tls.crt
|
||||
smtpd_tls_key_file=/tmp/ssl/tls.key
|
||||
smtpd_use_tls=yes
|
||||
header_size_limit = 4096000
|
||||
|
||||
# Debug mail tls
|
||||
smtpd_tls_loglevel = 1
|
||||
|
||||
# Rate limiting (brute-force protection)
|
||||
smtpd_client_connection_rate_limit = 10
|
||||
smtpd_client_message_rate_limit = 30
|
||||
anvil_rate_time_unit = 60s
|
||||
EOT
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/mailserver/variables.tf
|
||||
git commit -m "[ci skip] mailserver: add Postfix rate limiting"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Update DMARC DNS record to quarantine
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/cloudflared/cloudflare.tf:132-138` (DMARC record)
|
||||
- Modify: `terraform.tfvars:85` (bind zone DMARC record)
|
||||
|
||||
**Step 1: Update Cloudflare DMARC record**
|
||||
|
||||
In `stacks/platform/modules/cloudflared/cloudflare.tf`, change the `cloudflare_record.mail_dmarc` content from `p=none` to `p=quarantine` and `sp=none` to `sp=quarantine`:
|
||||
|
||||
```hcl
|
||||
resource "cloudflare_record" "mail_dmarc" {
|
||||
content = "\"v=DMARC1; p=quarantine; pct=100; fo=1; ri=3600; sp=quarantine; adkim=r; aspf=r; rua=mailto:e21c0ff8@dmarc.mailgun.org,mailto:adb84997@inbox.ondmarc.com; ruf=mailto:e21c0ff8@dmarc.mailgun.org,mailto:adb84997@inbox.ondmarc.com,mailto:postmaster@viktorbarzin.me;\""
|
||||
name = "_dmarc.viktorbarzin.me"
|
||||
proxied = false
|
||||
ttl = 1
|
||||
type = "TXT"
|
||||
priority = 1
|
||||
zone_id = var.cloudflare_zone_id
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Update bind zone DMARC record**
|
||||
|
||||
In `terraform.tfvars` line 85, update the DMARC record:
|
||||
|
||||
```
|
||||
_dmarc IN TXT "v=DMARC1; p=quarantine; ruf=mailto:postmaster@viktorbarzin.me; adkim=r; aspf=r; pct=100; sp=quarantine"
|
||||
```
|
||||
|
||||
**Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/cloudflared/cloudflare.tf terraform.tfvars
|
||||
git commit -m "[ci skip] mailserver: tighten DMARC policy to quarantine"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Enable Prometheus mailserver-down alert
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl:544-550`
|
||||
|
||||
**Step 1: Uncomment the mailserver alert**
|
||||
|
||||
In `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`, replace lines 544-550:
|
||||
|
||||
From:
|
||||
```yaml
|
||||
# - alert: Mail server has no replicas available
|
||||
# expr: (kube_deployment_status_replicas_available{namespace="mailserver"} or on() vector(0)) < 1
|
||||
# for: 10m
|
||||
# labels:
|
||||
# severity: page
|
||||
# annotations:
|
||||
# summary: Mail server has no available replicas. This means mail may not be received.
|
||||
```
|
||||
|
||||
To:
|
||||
```yaml
          - alert: Mail server has no replicas available
            expr: (kube_deployment_status_replicas_available{namespace="mailserver"} or on() vector(0)) < 1
            for: 5m
            labels:
              severity: page
            annotations:
              summary: Mail server has no available replicas. This means mail may not be received.
```
|
||||
|
||||
Note: reduced `for` from 10m to 5m and fixed indentation to match the surrounding YAML (10 spaces).
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/monitoring/prometheus_chart_values.tpl
|
||||
git commit -m "[ci skip] monitoring: enable mailserver-down Prometheus alert"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Pin Roundcubemail image and disable debug logging
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/mailserver/roundcubemail.tf:60,113-119`
|
||||
|
||||
**Step 1: Pin the image tag**
|
||||
|
||||
In `stacks/platform/modules/mailserver/roundcubemail.tf` line 60, change:
|
||||
|
||||
```hcl
|
||||
image = "roundcube/roundcubemail:latest"
|
||||
```
|
||||
|
||||
To:
|
||||
|
||||
```hcl
|
||||
image = "roundcube/roundcubemail:1.6-apache"
|
||||
```
|
||||
|
||||
**Step 2: Disable debug logging**
|
||||
|
||||
In the same file, change the debug env vars:
|
||||
|
||||
```hcl
|
||||
env {
|
||||
name = "ROUNDCUBEMAIL_SMTP_DEBUG"
|
||||
value = "false"
|
||||
}
|
||||
env {
|
||||
name = "ROUNDCUBEMAIL_DEBUG_LEVEL"
|
||||
value = "1"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/mailserver/roundcubemail.tf
|
||||
git commit -m "[ci skip] roundcubemail: pin to 1.6-apache, disable debug logging"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Clean up stale SendGrid DNS records
|
||||
|
||||
**Files:**
|
||||
- Modify: `terraform.tfvars:88-90`
|
||||
|
||||
**Step 1: Remove SendGrid CNAME records from bind zone**
|
||||
|
||||
In `terraform.tfvars`, remove lines 88-90:
|
||||
|
||||
```
|
||||
em7107 IN CNAME u31127144.wl145.sendgrid.net.
|
||||
s1._domainkey IN CNAME s1.domainkey.u31127144.wl145.sendgrid.net.
|
||||
s2._domainkey IN CNAME s2.domainkey.u31127144.wl145.sendgrid.net.
|
||||
```
|
||||
|
||||
These are stale remnants from a previous SendGrid relay setup. They are not in the Cloudflare Terraform config, so they may also need manual removal from Cloudflare if they were created outside Terraform.
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add terraform.tfvars
|
||||
git commit -m "[ci skip] dns: remove stale SendGrid CNAME records"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Apply changes
|
||||
|
||||
**Step 1: Apply the platform stack**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt apply --non-interactive
|
||||
```
|
||||
|
||||
This deploys: Rspamd enablement, Postfix rate limiting, DMARC DNS update, Prometheus alert, Roundcubemail changes.
|
||||
|
||||
**Step 2: Verify the mailserver pod restarts with Rspamd**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n mailserver
|
||||
```
|
||||
|
||||
Wait for the pod to be Running. Then verify Rspamd is active:
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config exec -n mailserver deploy/mailserver -c docker-mailserver -- pgrep -a rspamd
|
||||
```
|
||||
|
||||
Should show rspamd processes running.
|
||||
|
||||
**Step 3: Verify Postfix rate limiting is applied**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config exec -n mailserver deploy/mailserver -c docker-mailserver -- postconf smtpd_client_connection_rate_limit
|
||||
```
|
||||
|
||||
Expected: `smtpd_client_connection_rate_limit = 10`
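All three rate-limit settings can be verified in one pass. The sketch below assumes `postconf` prints one `name = value` line per requested parameter; `check_rate_limits` is a hypothetical helper whose expected values are copied from Task 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name = value" lines (postconf output) on stdin and
# report any of the three rate-limit settings that differ from the plan.
check_rate_limits() {
  awk -F' = ' '
    BEGIN {
      want["smtpd_client_connection_rate_limit"] = "10"
      want["smtpd_client_message_rate_limit"]    = "30"
      want["anvil_rate_time_unit"]               = "60s"
    }
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) { print "MISMATCH " $1 ": " $2; bad = 1 }
    }
    END {
      for (k in want) if (!(k in seen)) { print "MISSING " k; bad = 1 }
      exit bad
    }'
}

# Usage (needs cluster access, so it is left commented out):
#   kubectl --kubeconfig "$(pwd)/config" exec -n mailserver deploy/mailserver \
#     -c docker-mailserver -- postconf smtpd_client_connection_rate_limit \
#     smtpd_client_message_rate_limit anvil_rate_time_unit | check_rate_limits
```

The function prints nothing and exits 0 when all three values match, which makes it convenient to chain with `&&`.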
|
||||
|
||||
**Step 4: Verify DKIM signing still works with Rspamd**
|
||||
|
||||
Send a test email and check DKIM signature in the headers, or check Rspamd logs:
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config logs -n mailserver deploy/mailserver -c docker-mailserver --tail=50 | grep -i dkim
|
||||
```
|
||||
|
||||
**Step 5: Verify Roundcubemail is running with pinned image**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get deploy -n mailserver roundcubemail -o jsonpath='{.spec.template.spec.containers[0].image}'
|
||||
```
|
||||
|
||||
Expected: `roundcube/roundcubemail:1.6-apache`
|
||||
|
||||
**Step 6: Verify Prometheus alert is active**
|
||||
|
||||
Check in Grafana/Prometheus UI that the "Mail server has no replicas available" alert rule exists and is in `inactive` state (meaning the mailserver is healthy).
|
||||
71
docs/plans/2026-02-28-ci-build-caching-design.md
Normal file
|
|
@ -0,0 +1,71 @@
|
|||
# CI Build Caching Design
|
||||
|
||||
**Date**: 2026-02-28
|
||||
**Status**: Approved
|
||||
|
||||
## Problem
|
||||
|
||||
Woodpecker CI Docker image builds (build-cli, f1-stream, and future pipelines) rebuild everything from scratch on every push. No BuildKit layer caching is configured, so dependency installation steps (npm install, pip install, go build) re-execute even when requirements haven't changed.
|
||||
|
||||
## Decision
|
||||
|
||||
Extend the existing Docker Compose registry stack on `10.0.20.10` with a private R/W registry for BuildKit layer caching and image storage. Configure Woodpecker pipelines to use registry-based BuildKit cache and dual-push to both local and Docker Hub.
|
||||
|
||||
## Design
|
||||
|
||||
### 1. Private Registry Service
|
||||
|
||||
Add `registry-private` to the existing Docker Compose stack at `modules/docker-registry/docker-compose.yml`:
|
||||
|
||||
- **Port**: 5050 (via nginx, consistent with existing 50xx pattern)
|
||||
- **Storage**: `/opt/registry/data/private`, 100GiB limit
|
||||
- **Config**: Standard `registry:2` without `proxy` section (enables R/W)
|
||||
- **Auth**: None (internal network only, `10.0.20.0/24`)
|
||||
- **Nginx**: New upstream + server block on port 5050. Unlike the read-only proxy servers, this must allow PUT/POST/PATCH for image pushes.
|
||||
|
||||
### 2. DNS
|
||||
|
||||
Add Technitium A record: `registry.viktorbarzin.lan` → `10.0.20.10`
|
||||
|
||||
### 3. Woodpecker Pipeline Changes
|
||||
|
||||
For each Docker image build pipeline, update the `plugin-docker-buildx` step:
|
||||
|
||||
```yaml
|
||||
settings:
|
||||
# BuildKit registry cache
|
||||
cache_from: type=registry,ref=registry.viktorbarzin.lan:5050/<repo>:buildcache
|
||||
cache_to: type=registry,ref=registry.viktorbarzin.lan:5050/<repo>:buildcache,mode=max
|
||||
# Dual push: Docker Hub + local
|
||||
tags:
|
||||
- latest
|
||||
- registry.viktorbarzin.lan:5050/<repo>:latest
|
||||
# Allow HTTP registry
|
||||
buildkit_config: |
|
||||
[registry."registry.viktorbarzin.lan:5050"]
|
||||
http = true
|
||||
insecure = true
|
||||
```
|
||||
|
||||
`mode=max` caches all intermediate layers, not just final image layers. This is critical for multi-stage builds (f1-stream has Node + Python stages).
|
||||
|
||||
### 4. No Containerd Changes
|
||||
|
||||
K8s pods continue pulling from Docker Hub via the existing pull-through cache on `10.0.20.10:5000`. The private registry is only used by Woodpecker for build caching and as a backup image store.
|
||||
|
||||
### 5. Cleanup
|
||||
|
||||
Extend `modules/docker-registry/cleanup-tags.sh` to also prune the private registry, keeping the N most recent tags per image.
|
||||
|
||||
## Expected Impact
|
||||
|
||||
- **First build**: Same speed (cold cache), layers stored in local registry
|
||||
- **Subsequent builds (unchanged requirements)**: BuildKit pulls cached layers over LAN. Only `COPY . .` and final build steps re-execute. Expected 50-80% build time reduction for typical dependency-heavy builds.
|
||||
- **Storage**: Build cache layers consume space on the VM. 100GiB limit with cleanup keeps this bounded.
|
||||
|
||||
## What's NOT In Scope
|
||||
|
||||
- Main terragrunt-apply pipeline (`default.yml`) — not a Docker image build
|
||||
- Dependency caching (npm node_modules, Go modules, pip packages) — not needed since BuildKit layer caching covers this
|
||||
- Containerd config changes on K8s nodes
|
||||
- Migrating pull-through caches to K8s
|
||||
507
docs/plans/2026-02-28-ci-build-caching-plan.md
Normal file
|
|
@ -0,0 +1,507 @@
|
|||
# CI Build Caching Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Speed up Woodpecker CI Docker image builds by adding BuildKit layer caching via a local private registry, with dual-push to Docker Hub and local.
|
||||
|
||||
**Architecture:** Extend the existing Docker Compose registry stack on `10.0.20.10` with a new R/W `registry-private` service (port 5050). Configure Woodpecker `plugin-docker-buildx` pipelines with `cache_from`/`cache_to` pointing to `registry.viktorbarzin.lan:5050`. Push final images to both Docker Hub and local registry. All changes persisted in Terraform via `stacks/infra/main.tf` cloud-init provisioning.
|
||||
|
||||
**Tech Stack:** Docker Registry v2, nginx, Docker Compose, Woodpecker CI, BuildKit, Technitium DNS, Terraform
|
||||
|
||||
**Design doc:** `docs/plans/2026-02-28-ci-build-caching-design.md`
|
||||
|
||||
**Key context:** The registry VM at `10.0.20.10` is fully managed via Terraform in `stacks/infra/main.tf`. Config files live in `modules/docker-registry/` and are read by Terraform via `file()` and `templatefile()`, then base64-encoded into cloud-init `provision_cmds`. Changes to config files require updating both the files AND the cloud-init provisioning in `stacks/infra/main.tf`. Since the VM is already running, we also SCP updated files to the live VM for immediate effect.
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Create private registry config file
|
||||
|
||||
**Files:**
|
||||
- Create: `modules/docker-registry/config-private.yml`
|
||||
|
||||
**Step 1: Create the config file**
|
||||
|
||||
This is a standard `registry:2` config WITHOUT the `proxy` section (which is what makes it R/W instead of read-only pull-through). Based on the existing `config.yaml` but with 100GiB storage and no proxy/auth.
|
||||
|
||||
```yaml
|
||||
version: 0.1
|
||||
log:
|
||||
fields:
|
||||
service: registry-private
|
||||
storage:
|
||||
cache:
|
||||
blobdescriptor: inmemory
|
||||
filesystem:
|
||||
rootdirectory: /var/lib/registry
|
||||
maxsize: 100GiB
|
||||
delete:
|
||||
enabled: true
|
||||
maintenance:
|
||||
uploadpurging:
|
||||
enabled: true
|
||||
age: 168h
|
||||
interval: 4h
|
||||
dryrun: false
|
||||
http:
|
||||
addr: :5000
|
||||
headers:
|
||||
X-Content-Type-Options: [nosniff]
|
||||
health:
|
||||
storagedriver:
|
||||
enabled: true
|
||||
interval: 10s
|
||||
threshold: 3
|
||||
```
|
||||
|
||||
Key differences from the proxy configs:
|
||||
- No `proxy` section → allows pushes
|
||||
- `maxsize: 100GiB` (user requested 100GB)
|
||||
- `uploadpurging.age: 168h` (7 days, since build cache layers are re-pushed frequently)
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add modules/docker-registry/config-private.yml
|
||||
git commit -m "[ci skip] add private R/W registry config for CI build caching"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Add registry-private service to Docker Compose
|
||||
|
||||
**Files:**
|
||||
- Modify: `modules/docker-registry/docker-compose.yml`
|
||||
|
||||
**Step 1: Add the registry-private service**
|
||||
|
||||
Add this service block after `registry-kyverno` (before `nginx`):
|
||||
|
||||
```yaml
|
||||
registry-private:
|
||||
image: registry:2
|
||||
container_name: registry-private
|
||||
restart: always
|
||||
volumes:
|
||||
- /opt/registry/data/private:/var/lib/registry
|
||||
- /opt/registry/config-private.yml:/etc/docker/registry/config.yml:ro
|
||||
networks:
|
||||
- registry
|
||||
healthcheck:
|
||||
test: ["CMD", "sh", "-c", "wget -qO- http://localhost:5000/v2/ >/dev/null 2>&1"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 10s
|
||||
```
|
||||
|
||||
**Step 2: Add port 5050 to the nginx service**
|
||||
|
||||
In the `nginx` service `ports` list, add:
|
||||
|
||||
```yaml
|
||||
- "5050:5050"
|
||||
```
|
||||
|
||||
**Step 3: Add registry-private to nginx depends_on**
|
||||
|
||||
```yaml
|
||||
registry-private:
|
||||
condition: service_healthy
|
||||
```
|
||||
|
||||
**Step 4: Commit**
|
||||
|
||||
```bash
|
||||
git add modules/docker-registry/docker-compose.yml
|
||||
git commit -m "[ci skip] add registry-private service to Docker Compose stack"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Add nginx upstream and server block for private registry
|
||||
|
||||
**Files:**
|
||||
- Modify: `modules/docker-registry/nginx_registry.conf`
|
||||
|
||||
**Step 1: Add upstream block**
|
||||
|
||||
After the existing `upstream kyverno` block, add:
|
||||
|
||||
```nginx
|
||||
upstream private {
|
||||
server registry-private:5000;
|
||||
keepalive 32;
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Add server block**
|
||||
|
||||
After the last server block (kyverno on port 5040), add:
|
||||
|
||||
```nginx
|
||||
# --- Private R/W Registry (port 5050) ---
|
||||
|
||||
server {
|
||||
listen 5050;
|
||||
server_name _;
|
||||
|
||||
client_max_body_size 0;
|
||||
proxy_request_buffering off;
|
||||
proxy_buffering off;
|
||||
chunked_transfer_encoding on;
|
||||
|
||||
location /v2/ {
|
||||
proxy_pass http://private;
|
||||
proxy_http_version 1.1;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header Connection "";
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
|
||||
proxy_read_timeout 900;
|
||||
proxy_send_timeout 900;
|
||||
}
|
||||
|
||||
location / {
|
||||
return 200 'ok';
|
||||
add_header Content-Type text/plain;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Key differences from the read-only proxy server blocks:
|
||||
- **No `proxy_cache`** directives — caching uploads would corrupt pushes
|
||||
- **`proxy_buffering off`** — important for large layer uploads
|
||||
- **`chunked_transfer_encoding on`** — Docker push uses chunked uploads
|
||||
- **`X-Real-IP` / `X-Forwarded-For`** headers — useful for debugging
|
||||
|
||||
**Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add modules/docker-registry/nginx_registry.conf
|
||||
git commit -m "[ci skip] add nginx upstream and server block for private registry on port 5050"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Update Terraform provisioning for the private registry
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/infra/main.tf` (lines 119-274, the docker-registry-template and docker-registry-vm modules)
|
||||
|
||||
**Step 1: Add private registry data directory to `provision_cmds`**
|
||||
|
||||
In the `mkdir` command at line 152, append the private registry directory. Change:
|
||||
|
||||
```hcl
|
||||
"mkdir -p /opt/registry/data/dockerhub /opt/registry/data/ghcr /opt/registry/data/quay /opt/registry/data/k8s /opt/registry/data/kyverno",
|
||||
```
|
||||
|
||||
to:
|
||||
|
||||
```hcl
|
||||
"mkdir -p /opt/registry/data/dockerhub /opt/registry/data/ghcr /opt/registry/data/quay /opt/registry/data/k8s /opt/registry/data/kyverno /opt/registry/data/private",
|
||||
```
|
||||
|
||||
**Step 2: Add config-private.yml deployment command**
|
||||
|
||||
After the kyverno config block (line 203), add:
|
||||
|
||||
```hcl
|
||||
# Write private R/W registry config (no proxy = accepts pushes)
|
||||
format("echo %s | base64 -d > /opt/registry/config-private.yml",
|
||||
base64encode(file("${path.root}/../../modules/docker-registry/config-private.yml"))
|
||||
),
|
||||
```
|
||||
|
||||
**Step 3: Add garbage collection cron for private registry**
|
||||
|
||||
After the kyverno garbage collection cron (line 239), add:
|
||||
|
||||
```hcl
|
||||
"( crontab -l 2>/dev/null; echo '25 3 * * 0 /usr/bin/docker exec registry-private registry garbage-collect -m /etc/docker/registry/config.yml >> /var/log/registry-gc.log 2>&1' ) | crontab -",
|
||||
```
|
||||
|
||||
This follows the existing staggered pattern (each registry offset by 5 minutes).
|
||||
|
||||
**Step 4: Update the VM module comment block**
|
||||
|
||||
At lines 266-273, update the port documentation comment to include port 5050:
|
||||
|
||||
```hcl
|
||||
# All ports go through nginx for request serialization (proxy_cache_lock):
|
||||
# 5000 -> nginx -> registry-dockerhub (docker.io proxy)
|
||||
# 5001 -> registry-dockerhub direct (Prometheus metrics)
|
||||
# 5010 -> nginx -> registry-ghcr (ghcr.io proxy)
|
||||
# 5020 -> nginx -> registry-quay (quay.io proxy)
|
||||
# 5030 -> nginx -> registry-k8s (registry.k8s.io proxy)
|
||||
# 5040 -> nginx -> registry-kyverno (reg.kyverno.io proxy)
|
||||
# 5050 -> nginx -> registry-private (R/W registry for CI build cache)
|
||||
# 8080 -> registry-ui (joxit/docker-registry-ui)
|
||||
```
|
||||
|
||||
**Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/infra/main.tf
|
||||
git commit -m "[ci skip] add private registry to Terraform cloud-init provisioning"
|
||||
```
|
||||
|
||||
**Note:** This updates the cloud-init template only. The running VM does not pick up these changes, because cloud-init provisioning applies only when a VM is freshly created from the template. Task 5 deploys the files to the running VM via SCP, which keeps the live VM and the Terraform-managed config in sync.
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Deploy to the running registry VM
|
||||
|
||||
Since the registry VM is already running (cloud-init only runs on first boot), we deploy the updated files directly via SSH/SCP for immediate effect.
|
||||
|
||||
**Step 1: SSH to the registry VM and create the storage directory**
|
||||
|
||||
```bash
|
||||
ssh root@10.0.20.10 "mkdir -p /opt/registry/data/private"
|
||||
```
|
||||
|
||||
**Step 2: Copy updated files to the VM**
|
||||
|
||||
```bash
|
||||
scp modules/docker-registry/docker-compose.yml root@10.0.20.10:/opt/registry/docker-compose.yml
|
||||
scp modules/docker-registry/config-private.yml root@10.0.20.10:/opt/registry/config-private.yml
|
||||
scp modules/docker-registry/nginx_registry.conf root@10.0.20.10:/opt/registry/nginx.conf
|
||||
```
|
||||
|
||||
Note: The nginx config is stored as `/opt/registry/nginx.conf` on the VM (the docker-compose mounts it as `nginx.conf`).
|
||||
|
||||
**Step 3: Restart the Docker Compose stack**
|
||||
|
||||
```bash
|
||||
ssh root@10.0.20.10 "cd /opt/registry && docker compose up -d"
|
||||
```
|
||||
|
||||
This creates the new `registry-private` container and recreates the nginx container, since its port list and `depends_on` changed.
|
||||
|
||||
**Step 4: Add garbage collection cron on the running VM**
|
||||
|
||||
```bash
|
||||
ssh root@10.0.20.10 '( crontab -l 2>/dev/null; echo "25 3 * * 0 /usr/bin/docker exec registry-private registry garbage-collect -m /etc/docker/registry/config.yml >> /var/log/registry-gc.log 2>&1" ) | crontab -'
|
||||
```
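Note that the append above adds a duplicate entry if it is run twice. A minimal sketch of an idempotent variant, assuming an exact whole-line match is enough to detect an existing entry; `merge_cron_line` is a hypothetical helper, not something in the repo:

```bash
# Hypothetical helper: reads an existing crontab on stdin and prints it
# with $1 appended only if that exact line is not already present.
merge_cron_line() (
  entry=$1
  current=$(cat)
  # Print the existing crontab unchanged (skip if it was empty).
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  # -F: fixed string, -x: whole-line match, -q: quiet.
  if ! printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
    printf '%s\n' "$entry"
  fi
)
```

On the VM this would be used as `crontab -l 2>/dev/null | merge_cron_line "$ENTRY" | crontab -`, with `ENTRY` set to the garbage-collection line from this step.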
|
||||
|
||||
**Step 5: Verify the private registry is accessible**
|
||||
|
||||
```bash
|
||||
curl -s http://10.0.20.10:5050/v2/
|
||||
# Expected: {} (empty JSON object = registry is up)
|
||||
|
||||
curl -s http://10.0.20.10:5050/v2/_catalog
|
||||
# Expected: {"repositories":[]} (empty, no images pushed yet)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Add DNS record for registry.viktorbarzin.lan
|
||||
|
||||
**Step 1: Add A record via Technitium API**
|
||||
|
||||
```bash
|
||||
# Technitium DNS API endpoint (web UI is on port 5380)
|
||||
# Get API token from tfvars (technitium_password)
|
||||
curl -s "http://10.0.20.204:5380/api/zones/records/add?token=<TECHNITIUM_TOKEN>&domain=registry.viktorbarzin.lan&zone=viktorbarzin.lan&type=A&ipAddress=10.0.20.10&overwrite=true"
|
||||
```
|
||||
|
||||
Alternatively, add via Technitium web UI at `https://technitium.viktorbarzin.me`:
|
||||
- Zone: `viktorbarzin.lan`
|
||||
- Record: `registry` → A → `10.0.20.10`
|
||||
|
||||
**Step 2: Verify DNS resolution from a K8s pod**
|
||||
|
||||
```bash
|
||||
kubectl run -it --rm dns-test --image=alpine --restart=Never -- nslookup registry.viktorbarzin.lan
|
||||
# Expected: Address: 10.0.20.10
|
||||
```
|
||||
|
||||
**Step 3: Verify registry is accessible via DNS name**
|
||||
|
||||
```bash
|
||||
curl -s http://registry.viktorbarzin.lan:5050/v2/
|
||||
# Expected: {}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: Update build-cli.yml pipeline with BuildKit caching
|
||||
|
||||
**Files:**
|
||||
- Modify: `.woodpecker/build-cli.yml`
|
||||
|
||||
**Step 1: Update the pipeline**
|
||||
|
||||
Replace the entire file content with:
|
||||
|
||||
```yaml
|
||||
when:
|
||||
event: push
|
||||
|
||||
clone:
|
||||
git:
|
||||
image: woodpeckerci/plugin-git
|
||||
settings:
|
||||
attempts: 5
|
||||
backoff: 10s
|
||||
|
||||
steps:
|
||||
- name: build-image
|
||||
image: woodpeckerci/plugin-docker-buildx
|
||||
settings:
|
||||
username: "viktorbarzin"
|
||||
password:
|
||||
from_secret: dockerhub-pat
|
||||
repo:
|
||||
- viktorbarzin/infra
|
||||
- registry.viktorbarzin.lan:5050/infra
|
||||
logins:
|
||||
- registry: https://index.docker.io/v1/
|
||||
username: viktorbarzin
|
||||
password:
|
||||
from_secret: dockerhub-pat
|
||||
dockerfile: cli/Dockerfile
|
||||
context: cli
|
||||
auto_tag: true
|
||||
cache_from: type=registry,ref=registry.viktorbarzin.lan:5050/infra:buildcache
|
||||
cache_to: type=registry,ref=registry.viktorbarzin.lan:5050/infra:buildcache,mode=max
|
||||
buildkit_config: |
|
||||
[registry."registry.viktorbarzin.lan:5050"]
|
||||
http = true
|
||||
insecure = true
|
||||
```
|
||||
|
||||
Key changes:
|
||||
- `repo` is now a list — pushes to both Docker Hub and local registry
|
||||
- `logins` provides Docker Hub credentials explicitly (needed when `repo` is a list)
|
||||
- `cache_from`/`cache_to` use registry-based BuildKit cache on the local registry
|
||||
- `buildkit_config` allows HTTP access to the insecure local registry
|
||||
- `mode=max` caches ALL layers (including intermediate build stages)
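Since every pipeline repeats the same two cache settings with only the repo name changing, the pattern can be sketched as a small generator. This is an illustration only: `buildcache_settings` is a hypothetical helper, and the registry host and `buildcache` tag are simply the values used in this plan.

```bash
# Hypothetical helper: prints the cache_from / cache_to values for a repo.
# $1 = repo name, $2 = registry host (defaults to the one in this plan).
buildcache_settings() (
  repo=$1
  registry=${2:-registry.viktorbarzin.lan:5050}
  ref="type=registry,ref=${registry}/${repo}:buildcache"
  printf 'cache_from: %s\n' "$ref"
  # mode=max also exports intermediate-stage layers to the cache.
  printf 'cache_to: %s,mode=max\n' "$ref"
)
```

For example, `buildcache_settings f1-stream` prints the two lines used in the f1-stream pipeline in Task 8.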
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add .woodpecker/build-cli.yml
|
||||
git commit -m "[ci skip] add BuildKit layer caching and dual-push to build-cli pipeline"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Update f1-stream.yml pipeline with BuildKit caching
|
||||
|
||||
**Files:**
|
||||
- Modify: `.woodpecker/f1-stream.yml`
|
||||
|
||||
**Step 1: Update the pipeline**
|
||||
|
||||
Replace the entire file content with:
|
||||
|
||||
```yaml
|
||||
when:
|
||||
event: push
|
||||
path: "stacks/f1-stream/files/**"
|
||||
|
||||
clone:
|
||||
git:
|
||||
image: woodpeckerci/plugin-git
|
||||
settings:
|
||||
attempts: 5
|
||||
backoff: 10s
|
||||
|
||||
steps:
|
||||
- name: build-image
|
||||
image: woodpeckerci/plugin-docker-buildx
|
||||
settings:
|
||||
username: "viktorbarzin"
|
||||
password:
|
||||
from_secret: dockerhub-pat
|
||||
repo:
|
||||
- viktorbarzin/f1-stream
|
||||
- registry.viktorbarzin.lan:5050/f1-stream
|
||||
logins:
|
||||
- registry: https://index.docker.io/v1/
|
||||
username: viktorbarzin
|
||||
password:
|
||||
from_secret: dockerhub-pat
|
||||
dockerfile: stacks/f1-stream/files/Dockerfile
|
||||
context: stacks/f1-stream/files
|
||||
platforms: linux/amd64
|
||||
provenance: false
|
||||
tags: latest
|
||||
cache_from: type=registry,ref=registry.viktorbarzin.lan:5050/f1-stream:buildcache
|
||||
cache_to: type=registry,ref=registry.viktorbarzin.lan:5050/f1-stream:buildcache,mode=max
|
||||
buildkit_config: |
|
||||
[registry."registry.viktorbarzin.lan:5050"]
|
||||
http = true
|
||||
insecure = true
|
||||
|
||||
- name: deploy
|
||||
image: bitnami/kubectl
|
||||
commands:
|
||||
- kubectl -n f1-stream rollout restart deployment f1-stream
|
||||
- kubectl -n f1-stream rollout status deployment f1-stream --timeout=120s
|
||||
```
|
||||
|
||||
Same pattern as build-cli: dual-push + BuildKit cache. The `deploy` step is unchanged.
|
||||
|
||||
**Step 2: Commit**
|
||||
|
||||
```bash
|
||||
git add .woodpecker/f1-stream.yml
|
||||
git commit -m "[ci skip] add BuildKit layer caching and dual-push to f1-stream pipeline"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 9: Test end-to-end with a manual build trigger
|
||||
|
||||
**Step 1: Push changes to trigger the build-cli pipeline**
|
||||
|
||||
```bash
|
||||
git push origin master
|
||||
```
|
||||
|
||||
The `build-cli.yml` pipeline triggers on every push. Monitor it at `https://ci.viktorbarzin.me`.
|
||||
|
||||
**Step 2: Verify cache was populated**
|
||||
|
||||
After the first build completes, check the local registry has the cache:
|
||||
|
||||
```bash
|
||||
curl -s http://registry.viktorbarzin.lan:5050/v2/_catalog
|
||||
# Expected: {"repositories":["infra"]}
|
||||
|
||||
curl -s http://registry.viktorbarzin.lan:5050/v2/infra/tags/list
|
||||
# Expected: tags include "buildcache" and the auto-tagged version
|
||||
```
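The two checks above can be turned into pass/fail tests. A minimal sketch, assuming a plain substring match on the quoted name is good enough for these small responses (it is not a JSON parser); `response_lists` is a hypothetical helper name:

```bash
# Hypothetical helper: reads a /v2/_catalog or /v2/<repo>/tags/list JSON
# response on stdin and succeeds if the quoted name appears in it.
response_lists() (
  name=$1
  # -F: fixed string, -q: quiet; match the name including its quotes.
  grep -Fq -- "\"$name\""
)
```

For example, `curl -s http://registry.viktorbarzin.lan:5050/v2/infra/tags/list | response_lists buildcache` exits 0 once the cache tag has been pushed.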
|
||||
|
||||
**Step 3: Trigger a second build to verify cache hit**
|
||||
|
||||
Make a trivial change (e.g., update a comment in `cli/`) and push again. The build logs should show "importing cache manifest from registry.viktorbarzin.lan:5050/infra:buildcache" and skip unchanged layers.
|
||||
|
||||
**Step 4: Verify Docker Hub also has the image**
|
||||
|
||||
```bash
|
||||
curl -s https://hub.docker.com/v2/repositories/viktorbarzin/infra/tags/ | python3 -m json.tool | head -20
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 10: Verify cleanup script covers private registry
|
||||
|
||||
**Files:**
|
||||
- Review: `modules/docker-registry/cleanup-tags.sh`
|
||||
|
||||
**Step 1: Verify the script already handles multiple registries**
|
||||
|
||||
The existing script walks ALL subdirectories under `BASE` (`/opt/registry/data`). Since the private registry stores data at `/opt/registry/data/private/docker/registry/v2/repositories/`, it will automatically be picked up by the existing script without changes.
|
||||
|
||||
Verify by reading the script logic — `os.listdir(BASE)` iterates `dockerhub`, `ghcr`, `quay`, `k8s`, `kyverno`, and now `private`.
|
||||
|
||||
**Step 2: Consider whether to adjust the keep count**
|
||||
|
||||
The default `KEEP=10` is unlikely to matter for the private registry. The script only deletes when a repo has MORE than `KEEP` tags, and a repo here typically has 2-3 (e.g., `latest`, `buildcache`, maybe a version tag), so no cleanup will happen. This is fine.
|
||||
|
||||
No code changes needed — the script already works with the new registry.
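The keep-count rule described above can be sketched as simple arithmetic. This illustrates the stated rule (delete only when a repo has more than `KEEP` tags, keeping the most recent ones); it is not an excerpt from `cleanup-tags.sh`, and `prune_count` is a hypothetical name:

```bash
# Hypothetical helper: prints how many tags would be pruned for a repo.
# $1 = number of tags in the repo, $2 = keep count (defaults to 10).
prune_count() (
  total=$1
  keep=${2:-10}
  if [ "$total" -gt "$keep" ]; then
    echo $((total - keep))
  else
    # At or below the keep count, nothing is deleted.
    echo 0
  fi
)
```

With the 2-3 tags per repo expected here, `prune_count 3` prints `0`, which matches the conclusion that no cleanup will happen.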
|
||||
91
docs/plans/2026-02-28-network-visualization-design.md
Normal file
|
|
@ -0,0 +1,91 @@
|
|||
# Network Traffic Visualization Design
|
||||
|
||||
**Date**: 2026-02-28
|
||||
**Goal**: Real-time visualization of all network traffic — pod-to-pod (K8s) and full network (up to 192.168.1.1) — using Grafana as the single pane of glass.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
192.168.1.1 (ISP router)
|
||||
└── 10.0.20.1 (pfSense + softflowd) ──NetFlow UDP──► GoFlow2 (K8s)
|
||||
├── Proxmox (192.168.1.127) │
|
||||
│ └── K8s nodes (10.0.20.100-104) ▼
|
||||
│ └── Pods ◄──eBPF──► Caretta Prometheus
|
||||
├── TrueNAS (10.0.10.15) │
|
||||
└── Other devices ▼
|
||||
Grafana
|
||||
(Node Graph panels)
|
||||
```
|
||||
|
||||
Two complementary data paths:
|
||||
1. **Caretta** (eBPF DaemonSet) → tracks pod-to-pod TCP connections → Prometheus metrics → Grafana Node Graph
|
||||
2. **GoFlow2** (NetFlow collector) ← pfSense softflowd → Prometheus metrics → Grafana dashboards
|
||||
|
||||
## Component 1: Caretta
|
||||
|
||||
- **Stack**: `stacks/caretta/`
|
||||
- **Namespace**: `caretta`
|
||||
- **Deployment**: Helm release from `https://helm.groundcover.com/`, chart `caretta`
|
||||
- **Config**:
|
||||
- Disable bundled Grafana (`grafana.enabled: false`)
|
||||
- Disable bundled VictoriaMetrics (`victoria-metrics-single.enabled: false`)
|
||||
- DaemonSet runs eBPF agent on each node
|
||||
- Exposes Prometheus metrics on port 7117
|
||||
- **Key metric**: `caretta_links_observed{client_name, client_namespace, server_name, server_namespace, server_port}`
|
||||
- **Grafana**: ConfigMap dashboard with Node Graph panel, label `grafana_dashboard: "1"`
|
||||
- **Resources**: ~100Mi RAM, ~50m CPU per node
|
||||
|
||||
## Component 2: GoFlow2
|
||||
|
||||
- **Stack**: `stacks/goflow2/`
|
||||
- **Namespace**: `goflow2`
|
||||
- **Deployment**: Raw Terraform (Deployment + Service) — single binary, no Helm chart needed
|
||||
- **Image**: `netsampler/goflow2`
|
||||
- **Ports**:
|
||||
- UDP 2055: NetFlow v9 receiver (from pfSense)
|
||||
- TCP 8080: Prometheus metrics endpoint
|
||||
- **Service**: NodePort for UDP 2055 so pfSense (10.0.20.1) can reach it on any node IP
|
||||
- **Key metrics**: `flow_bytes`, `flow_packets` with labels for src/dst IP, port, protocol
|
||||
- **Grafana**: ConfigMap dashboard showing network flows (top talkers, protocol breakdown, inter-VLAN traffic)
|
||||
- **Resources**: ~100Mi RAM, ~50m CPU (single pod, not DaemonSet)
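Because pfSense will be pointed at a fixed NodePort, it is worth checking that the chosen port is one Kubernetes will accept. A minimal sketch, assuming the cluster uses the default NodePort range of 30000-32767 (a cluster started with a custom `--service-node-port-range` would differ); `valid_nodeport` is a hypothetical helper name:

```bash
# Hypothetical helper: succeeds if the port is inside the default
# Kubernetes NodePort range (30000-32767).
valid_nodeport() (
  port=$1
  [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
)
```

For example, `valid_nodeport 32055` succeeds, while `valid_nodeport 2055` (the container's NetFlow port) fails, which is why the Service needs a separate NodePort value.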
|
||||
|
||||
## Component 3: pfSense softflowd
|
||||
|
||||
- **Host**: 10.0.20.1 (SSH as admin)
|
||||
- **Package**: `softflowd` (install via pfSense package manager)
|
||||
- **Config**:
|
||||
- Monitor LAN interface(s)
|
||||
- Export NetFlow v9 to `<k8s-node-ip>:<goflow2-nodeport>` (UDP)
|
||||
- Tracking level: full (track individual connections)
|
||||
- **Note**: This is a manual SSH step — pfSense is not Terraform-managed
|
||||
|
||||
## Component 4: Prometheus Integration
|
||||
|
||||
Two new scrape targets in `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` (`extraScrapeConfigs`):
|
||||
|
||||
```yaml
|
||||
- job_name: 'caretta'
|
||||
static_configs:
|
||||
- targets: ["caretta.caretta.svc.cluster.local:7117"]
|
||||
|
||||
- job_name: 'goflow2'
|
||||
static_configs:
|
||||
- targets: ["goflow2.goflow2.svc.cluster.local:8080"]
|
||||
```
|
||||
|
||||
Requires re-applying the platform stack.
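Both targets follow the in-cluster Service DNS pattern `<service>.<namespace>.svc.cluster.local:<port>`. A minimal sketch of how they are formed, assuming the default cluster domain `cluster.local`; `svc_target` is a hypothetical helper name:

```bash
# Hypothetical helper: builds a Prometheus static target from a Service
# name, its namespace, and the metrics port.
svc_target() (
  service=$1
  namespace=$2
  port=$3
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
)
```

`svc_target caretta caretta 7117` and `svc_target goflow2 goflow2 8080` reproduce the two targets in the scrape config above.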
|
||||
|
||||
## Deployment Order
|
||||
|
||||
1. Apply `stacks/caretta/` — deploys eBPF DaemonSet
|
||||
2. Apply `stacks/goflow2/` — deploys NetFlow collector
|
||||
3. Re-apply `stacks/platform/` — adds Prometheus scrape targets
|
||||
4. SSH to pfSense — install softflowd, configure NetFlow export to GoFlow2 NodePort
|
||||
5. Verify in Grafana — confirm both dashboards show data
|
||||
|
||||
## Grafana Dashboards
|
||||
|
||||
Two dashboards, both auto-loaded via sidecar (ConfigMap label `grafana_dashboard: "1"`):
|
||||
|
||||
1. **K8s Pod Topology** (Caretta): Node Graph panel showing pods as nodes, TCP connections as edges, byte counts as edge weights
|
||||
2. **Network Flows** (GoFlow2): Top talkers, protocol breakdown, inter-VLAN traffic, external destinations
|
||||
445
docs/plans/2026-02-28-network-visualization-plan.md
Normal file
|
|
@ -0,0 +1,445 @@
|
|||
# Network Traffic Visualization Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Deploy Caretta (pod-to-pod eBPF topology) and GoFlow2 + pfSense softflowd (full network NetFlow) with Grafana dashboards for real-time network visualization.
|
||||
|
||||
**Architecture:** Two data paths feed into existing Prometheus+Grafana: (1) Caretta eBPF DaemonSet tracks pod TCP connections, (2) pfSense exports NetFlow to GoFlow2 collector pod. Both expose Prometheus metrics scraped by existing Prometheus, visualized in Grafana Node Graph panels.
|
||||
|
||||
**Tech Stack:** Terraform/Terragrunt, Helm (Caretta), raw K8s resources (GoFlow2), pfSense SSH (softflowd), Prometheus, Grafana
|
||||
|
||||
**Design doc:** `docs/plans/2026-02-28-network-visualization-design.md`
|
||||
|
||||
---
|
||||
|
||||
### Task 1: Create Caretta Terraform stack
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/caretta/terragrunt.hcl`
|
||||
- Create: `stacks/caretta/main.tf`
|
||||
|
||||
**Step 1: Create the terragrunt.hcl**
|
||||
|
||||
```hcl
|
||||
# stacks/caretta/terragrunt.hcl
|
||||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Create main.tf with Helm release**
|
||||
|
||||
```hcl
|
||||
variable "tls_secret_name" { type = string }
|
||||
|
||||
resource "kubernetes_namespace" "caretta" {
|
||||
metadata {
|
||||
name = "caretta"
|
||||
labels = {
|
||||
tier = local.tiers.cluster
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "helm_release" "caretta" {
|
||||
namespace = kubernetes_namespace.caretta.metadata[0].name
|
||||
name = "caretta"
|
||||
repository = "https://helm.groundcover.com/"
|
||||
chart = "caretta"
|
||||
version = "0.0.16"
|
||||
|
||||
set {
|
||||
name = "victoria-metrics-single.enabled"
|
||||
value = "false"
|
||||
}
|
||||
set {
|
||||
name = "grafana.enabled"
|
||||
value = "false"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Create secrets symlink**
|
||||
|
||||
Run: `cd stacks/caretta && ln -s ../../secrets secrets`
|
||||
|
||||
**Step 4: Apply**
|
||||
|
||||
Run: `cd stacks/caretta && terragrunt apply --non-interactive`
|
||||
|
||||
**Step 5: Verify DaemonSet is running**
|
||||
|
||||
Run: `kubectl --kubeconfig $(pwd)/config get daemonset -n caretta`
|
||||
Expected: Caretta DaemonSet with 5 pods (one per node)
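The readiness check can be made scriptable by comparing the ready and desired pod counts. A minimal sketch, assuming the status is passed in as `<ready>/<desired>`; `daemonset_ready` is a hypothetical helper name, and the DaemonSet name `caretta` in the usage note is an assumption about what the Helm chart creates:

```bash
# Hypothetical helper: takes "<ready>/<desired>" and succeeds only if
# both numbers are equal and non-zero.
daemonset_ready() (
  status=$1
  ready=${status%%/*}     # text before the first "/"
  desired=${status##*/}   # text after the last "/"
  [ -n "$ready" ] && [ "$ready" = "$desired" ] && [ "$ready" -gt 0 ]
)
```

The input can be produced with `kubectl --kubeconfig "$(pwd)/config" get daemonset caretta -n caretta -o jsonpath='{.status.numberReady}/{.status.desiredNumberScheduled}'`, which should print `5/5` on a healthy five-node cluster.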
|
||||
|
||||
**Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/caretta/
|
||||
git commit -m "[ci skip] deploy caretta eBPF pod topology visualization"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 2: Add Caretta Grafana dashboard
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/caretta/main.tf`
|
||||
|
||||
**Step 1: Download dashboard JSON**
|
||||
|
||||
Run: `curl -sL https://raw.githubusercontent.com/groundcover-com/caretta/master/chart/dashboard.json > stacks/caretta/dashboard.json`
|
||||
|
||||
**Step 2: Add ConfigMap to main.tf**
|
||||
|
||||
Append to `stacks/caretta/main.tf`:
|
||||
|
||||
```hcl
|
||||
resource "kubernetes_config_map" "caretta_dashboard" {
|
||||
metadata {
|
||||
name = "caretta-grafana-dashboard"
|
||||
namespace = kubernetes_namespace.caretta.metadata[0].name
|
||||
labels = {
|
||||
grafana_dashboard = "1"
|
||||
}
|
||||
}
|
||||
data = {
|
||||
"caretta-dashboard.json" = file("${path.module}/dashboard.json")
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Apply**
|
||||
|
||||
Run: `cd stacks/caretta && terragrunt apply --non-interactive`
|
||||
|
||||
**Step 4: Verify dashboard appears in Grafana**
|
||||
|
||||
Open `https://grafana.viktorbarzin.me` → Dashboards → search "Caretta"
|
||||
Expected: Dashboard visible with Node Graph panel (may be empty until Prometheus scrape is configured)
|
||||
|
||||
**Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/caretta/
|
||||
git commit -m "[ci skip] add caretta grafana dashboard via sidecar configmap"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 3: Create GoFlow2 Terraform stack
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/goflow2/terragrunt.hcl`
|
||||
- Create: `stacks/goflow2/main.tf`
|
||||
|
||||
**Step 1: Create the terragrunt.hcl**
|
||||
|
||||
```hcl
|
||||
# stacks/goflow2/terragrunt.hcl
|
||||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Create main.tf with Deployment + Services**
|
||||
|
||||
```hcl
variable "tls_secret_name" { type = string }

resource "kubernetes_namespace" "goflow2" {
  metadata {
    name = "goflow2"
    labels = {
      tier = local.tiers.cluster
    }
  }
}

resource "kubernetes_deployment" "goflow2" {
  metadata {
    name      = "goflow2"
    namespace = kubernetes_namespace.goflow2.metadata[0].name
  }
  spec {
    replicas = 1
    selector {
      match_labels = {
        app = "goflow2"
      }
    }
    template {
      metadata {
        labels = {
          app = "goflow2"
        }
      }
      spec {
        container {
          name  = "goflow2"
          image = "netsampler/goflow2:v2.2.1"
          args  = ["-listen", "netflow://:2055", "-transport", "stdout", "-format", "json"]

          port {
            name           = "netflow"
            container_port = 2055
            protocol       = "UDP"
          }
          port {
            name           = "metrics"
            container_port = 8080
            protocol       = "TCP"
          }

          resources {
            requests = {
              cpu    = "50m"
              memory = "64Mi"
            }
            limits = {
              cpu    = "200m"
              memory = "256Mi"
            }
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "goflow2_metrics" {
  metadata {
    name      = "goflow2"
    namespace = kubernetes_namespace.goflow2.metadata[0].name
  }
  spec {
    selector = {
      app = "goflow2"
    }
    port {
      name        = "metrics"
      port        = 8080
      target_port = 8080
      protocol    = "TCP"
    }
  }
}

resource "kubernetes_service" "goflow2_netflow" {
  metadata {
    name      = "goflow2-netflow"
    namespace = kubernetes_namespace.goflow2.metadata[0].name
  }
  spec {
    type = "NodePort"
    selector = {
      app = "goflow2"
    }
    port {
      name        = "netflow"
      port        = 2055
      target_port = 2055
      protocol    = "UDP"
      node_port   = 32055
    }
  }
}
```
|
||||
|
||||
**Step 3: Create secrets symlink**
|
||||
|
||||
Run: `cd stacks/goflow2 && ln -s ../../secrets secrets`
|
||||
|
||||
**Step 4: Apply**
|
||||
|
||||
Run: `cd stacks/goflow2 && terragrunt apply --non-interactive`
|
||||
|
||||
**Step 5: Verify pod is running**
|
||||
|
||||
Run: `kubectl --kubeconfig $(pwd)/config get pods -n goflow2`
|
||||
Expected: 1 goflow2 pod running
|
||||
|
||||
**Step 6: Verify NodePort is accessible**
|
||||
|
||||
Run: `kubectl --kubeconfig $(pwd)/config get svc -n goflow2 goflow2-netflow`
|
||||
Expected: NodePort 32055/UDP
|
||||
|
||||
**Step 7: Commit**
|
||||
|
||||
```bash
git add stacks/goflow2/
git commit -m "[ci skip] deploy goflow2 netflow collector for network visualization"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 4: Add Prometheus scrape targets for Caretta and GoFlow2
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` (append to extraScrapeConfigs)
|
||||
|
||||
**Step 1: Append scrape jobs**
|
||||
|
||||
Add at the end of `extraScrapeConfigs` (before the final blank line at line 882):
|
||||
|
||||
```yaml
- job_name: 'caretta'
  static_configs:
    - targets:
        - "caretta-caretta.caretta.svc.cluster.local:7117"
  metrics_path: '/metrics'
- job_name: 'goflow2'
  static_configs:
    - targets:
        - "goflow2.goflow2.svc.cluster.local:8080"
  metrics_path: '/metrics'
```
|
||||
|
||||
**Step 2: Apply platform stack**
|
||||
|
||||
Run: `cd stacks/platform && terragrunt apply --non-interactive`
|
||||
|
||||
**Step 3: Verify Prometheus targets**
|
||||
|
||||
Open `https://grafana.viktorbarzin.me` → Explore → Prometheus → query `up{job="caretta"}` and `up{job="goflow2"}`
|
||||
Expected: Both return `1`
|
||||
|
||||
**Step 4: Verify Caretta metrics flowing**
|
||||
|
||||
Query: `caretta_links_observed`
|
||||
Expected: Multiple time series with client_name/server_name labels showing pod connections
|
||||
|
||||
**Step 5: Commit**
|
||||
|
||||
```bash
git add stacks/platform/modules/monitoring/prometheus_chart_values.tpl
git commit -m "[ci skip] add caretta and goflow2 prometheus scrape targets"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 5: Install and configure softflowd on pfSense
|
||||
|
||||
**Files:** None (SSH to pfSense)
|
||||
|
||||
**Step 1: SSH to pfSense and install softflowd**
|
||||
|
||||
Run: `ssh admin@10.0.20.1 "pkg install -y softflowd"`
|
||||
|
||||
If `softflowd` is available via pfSense package manager instead:
|
||||
Run: `ssh admin@10.0.20.1 "pfSsh.php playback installpkg softflowd"`
|
||||
|
||||
**Step 2: Determine LAN interface name**
|
||||
|
||||
Run: `ssh admin@10.0.20.1 "ifconfig -l"`
|
||||
Expected: Identify the LAN interface (likely `vtnet1` or `igb1`)
|
||||
|
||||
**Step 3: Configure softflowd**
|
||||
|
||||
Pick any K8s node IP (e.g., 10.0.20.100) with NodePort 32055:
|
||||
|
||||
Run:
|
||||
```bash
ssh admin@10.0.20.1 "softflowd -i <LAN_INTERFACE> -n 10.0.20.100:32055 -v 9 -t maxlife=300"
```
|
||||
|
||||
Flags:
|
||||
- `-i <interface>`: Monitor this interface
|
||||
- `-n 10.0.20.100:32055`: Send NetFlow v9 to GoFlow2 NodePort
|
||||
- `-v 9`: NetFlow version 9
|
||||
- `-t maxlife=300`: Max flow lifetime 5 minutes
|
||||
|
||||
**Step 4: Verify flows are arriving at GoFlow2**
|
||||
|
||||
Run: `kubectl --kubeconfig $(pwd)/config logs -n goflow2 -l app=goflow2 --tail=20`
|
||||
Expected: JSON flow records appearing in stdout
|
||||
|
||||
**Step 5: Make softflowd persistent**
|
||||
|
||||
Ensure softflowd starts on boot. On pfSense/FreeBSD:
|
||||
Run: `ssh admin@10.0.20.1 'echo "softflowd_enable=\"YES\"" >> /etc/rc.conf && echo "softflowd_flags=\"-i <LAN_INTERFACE> -n 10.0.20.100:32055 -v 9\"" >> /etc/rc.conf'`
|
||||
|
||||
---
|
||||
|
||||
### Task 6: Add GoFlow2 Grafana dashboard
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/goflow2/dashboard.json`
|
||||
- Modify: `stacks/goflow2/main.tf`
|
||||
|
||||
**Step 1: Create a GoFlow2 dashboard JSON**
|
||||
|
||||
Create `stacks/goflow2/dashboard.json` — a Grafana dashboard with panels for:
|
||||
- Top talkers by bytes (bar chart, query: `topk(10, sum by (src_addr, dst_addr) (rate(flow_bytes[5m])))`)
|
||||
- Protocol breakdown (pie chart, query: `sum by (proto) (rate(flow_bytes[5m]))`)
|
||||
- Flows over time (time series, query: `sum(rate(flow_packets[5m]))`)
|
||||
|
||||
Note: Exact metric names will depend on GoFlow2's Prometheus output — verify after Task 5 by querying `{job="goflow2"}` in Prometheus. Adjust dashboard queries to match actual metric names.
|
||||
|
||||
**Step 2: Add ConfigMap to main.tf**
|
||||
|
||||
Append to `stacks/goflow2/main.tf`:
|
||||
|
||||
```hcl
resource "kubernetes_config_map" "goflow2_dashboard" {
  metadata {
    name      = "goflow2-grafana-dashboard"
    namespace = kubernetes_namespace.goflow2.metadata[0].name
    labels = {
      grafana_dashboard = "1"
    }
  }
  data = {
    "goflow2-dashboard.json" = file("${path.module}/dashboard.json")
  }
}
```
|
||||
|
||||
**Step 3: Apply**
|
||||
|
||||
Run: `cd stacks/goflow2 && terragrunt apply --non-interactive`
|
||||
|
||||
**Step 4: Verify in Grafana**
|
||||
|
||||
Open `https://grafana.viktorbarzin.me` → Dashboards → search "GoFlow2"
|
||||
Expected: Dashboard with network flow data from pfSense
|
||||
|
||||
**Step 5: Commit**
|
||||
|
||||
```bash
git add stacks/goflow2/
git commit -m "[ci skip] add goflow2 grafana dashboard for network flow visualization"
```
|
||||
|
||||
---
|
||||
|
||||
### Task 7: End-to-end verification
|
||||
|
||||
**Step 1: Verify Caretta topology**
|
||||
|
||||
Open Grafana → Caretta Dashboard → Service Map panel
|
||||
Expected: Node graph showing pods connected by edges with byte counts
|
||||
|
||||
**Step 2: Verify GoFlow2 flows**
|
||||
|
||||
Open Grafana → GoFlow2 Dashboard
|
||||
Expected: Network flow data showing traffic between pfSense segments
|
||||
|
||||
**Step 3: Generate test traffic and confirm it appears**
|
||||
|
||||
Run: `kubectl --kubeconfig $(pwd)/config exec -n default deploy/some-pod -- curl -s https://example.com > /dev/null`
|
||||
Expected: New edge appears in Caretta for the pod, new flow in GoFlow2 for the external connection
|
||||
|
||||
**Step 4: Push all changes**
|
||||
|
||||
Run: `git push origin master`
|
||||
354
docs/plans/2026-02-28-storage-reliability-design.md
Normal file
|
|
@ -0,0 +1,354 @@
|
|||
# Storage Reliability: Database Replication + SQLite Consolidation
|
||||
|
||||
**Date**: 2026-02-28
|
||||
**Status**: Revised (v2) — incorporates research agent findings
|
||||
**Goal**: Eliminate data corruption risk from NFS outages by moving databases off NFS
|
||||
|
||||
## Problem
|
||||
|
||||
All 70+ services store data on a single TrueNAS VM (10.0.10.15) via NFS. When this VM crashes or hangs:
|
||||
|
||||
- **22 services** risk **data corruption** (databases with WAL/fsync requirements on NFS)
|
||||
- **12 services** experience downtime but no corruption (media, configs)
|
||||
- The shared PostgreSQL alone backs 12 services — a single NFS hiccup can corrupt data for all of them
|
||||
|
||||
SQLite-over-NFS is fundamentally broken (advisory locking unreliable, WAL mode unsafe).
|
||||
|
||||
## Constraints
|
||||
|
||||
- Zero cost — all self-hosted, OSS
|
||||
- Must preserve backup workflow (consolidate to TrueNAS → rsync to backup NAS)
|
||||
- Stop-and-verify after each service migration
|
||||
- No data loss tolerance
|
||||
|
||||
## Single-Host Limitation (Explicit Acknowledgment)
|
||||
|
||||
All K8s nodes are VMs on a single Proxmox host (192.168.1.127). This means:
|
||||
|
||||
**Replication PROTECTS against**: individual VM crash/restart, NFS outage,
|
||||
individual node rebuild, pod OOM/eviction, software-level failures.
|
||||
|
||||
**Replication does NOT protect against**: Proxmox host failure, physical
|
||||
disk failure, power loss — all replicas die simultaneously.
|
||||
|
||||
Given this, the plan uses **minimal replication** (1 primary + 1 replica
|
||||
for PostgreSQL, single instance for Redis) rather than full 3-instance
|
||||
clusters. The primary reliability gain comes from moving off NFS to local
|
||||
disk with proper fsync semantics, not from replication count.
|
||||
|
||||
## Design
|
||||
|
||||
### Strategy Overview
|
||||
|
||||
```
BEFORE: All services  → NFS (TrueNAS VM) → single point of failure

AFTER:  Databases     → local disk (proper fsync, no NFS SPOF)
        SQLite apps   → migrated to shared PostgreSQL where supported
        Media/configs → NFS (TrueNAS, non-critical path)
        Backups       → all consolidate to NFS → rsync to backup NAS
```
|
||||
|
||||
### Component 1: PostgreSQL via CloudNativePG
|
||||
|
||||
**Current**: Single PostgreSQL 16 pod on NFS (`/mnt/main/postgresql/data`)
|
||||
using custom image `viktorbarzin/postgres:16-master` (postgis + pgvector + pgvecto-rs).
|
||||
|
||||
**Target**: CloudNativePG operator with 2-instance cluster on local disk.
|
||||
|
||||
CloudNativePG (CNCF project, v1.28+, supports K8s 1.34 and PG 14-18):
|
||||
- Automatic primary/replica failover
|
||||
- Streaming replication
|
||||
- Declarative CRD-based management (Terraform/Terragrunt compatible)
|
||||
- Built-in monolith import mode (better than manual pg_dumpall)
|
||||
- Built-in PgBouncer pooler CRD
|
||||
|
||||
Architecture:
|
||||
```
CloudNativePG Cluster (namespace: dbaas)
├── Primary (worker node A) — local PVC via local-path-provisioner
├── Replica (worker node B) — local PVC, streaming replication
└── Services: <cluster>-rw (read-write), <cluster>-ro (read-only)
```
|
||||
|
||||
**Migration approach**: Use CNPG's native monolith import mode, which
|
||||
connects to the running old PostgreSQL and imports databases + roles
|
||||
using pg_dump -Fd per database. Superior to manual pg_dumpall.
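
As a rough illustration of the import described above, a CNPG `Cluster` could be declared from Terraform with `kubernetes_manifest`. This is a sketch under assumptions, not the plan's actual manifest: the cluster name `pg-cluster`, the external cluster alias `old-pg`, the Secret name, the database list, and the 50Gi size are all placeholders. It also assumes the CNPG CRDs are already installed, since `kubernetes_manifest` needs the CRD to exist at plan time.

```hcl
# Sketch only: "pg-cluster", "old-pg", the Secret name, the database list and
# the 50Gi size are placeholders chosen for illustration.
resource "kubernetes_manifest" "pg_cluster" {
  manifest = {
    apiVersion = "postgresql.cnpg.io/v1"
    kind       = "Cluster"
    metadata = {
      name      = "pg-cluster"
      namespace = "dbaas"
    }
    spec = {
      instances = 2 # 1 primary + 1 streaming replica
      imageName = "viktorbarzin/postgres:16-master"

      storage = {
        storageClass = "local-path"
        size         = "50Gi"
      }

      # Import databases and roles from the still-running old PostgreSQL.
      bootstrap = {
        initdb = {
          import = {
            type      = "monolith"
            databases = ["resume", "health"] # hypothetical database names
            roles     = ["*"]
            source = {
              externalCluster = "old-pg"
            }
          }
        }
      }

      externalClusters = [{
        name = "old-pg"
        connectionParameters = {
          host   = "postgresql.dbaas.svc.cluster.local"
          user   = "postgres"
          dbname = "postgres"
        }
        password = {
          name = "old-pg-superuser" # hypothetical Secret holding the old password
          key  = "password"
        }
      }]
    }
  }
}
```

The import runs once, when the cluster is first bootstrapped, so the throwaway-cluster test in Phase 0 is the natural place to confirm that the custom image and the role import behave as expected.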
|
||||
|
||||
**Service endpoint strategy**: Create an ExternalName Service called
|
||||
`postgresql` in namespace `dbaas` pointing to the CNPG `-rw` service.
|
||||
This preserves `var.postgresql_host` = `postgresql.dbaas.svc.cluster.local`
|
||||
with zero changes to dependent services.
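
A minimal sketch of such an alias Service, assuming the CNPG cluster is named `pg-cluster` (a placeholder; the real `-rw` service name follows whatever the cluster is called):

```hcl
# Sketch only: "pg-cluster" is a hypothetical CNPG cluster name.
resource "kubernetes_service" "postgresql_alias" {
  metadata {
    name      = "postgresql"
    namespace = "dbaas"
  }
  spec {
    type = "ExternalName"
    # Resolved by cluster DNS as a CNAME to the CNPG read-write service.
    external_name = "pg-cluster-rw.dbaas.svc.cluster.local"
  }
}
```

Because the existing Service for the old pod presumably already uses the name `postgresql`, it would have to be removed or renamed in the same apply. An ExternalName Service is only a DNS alias, so clients keep connecting on the port they already use.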
|
||||
|
||||
**Special cases**:
|
||||
- Authentik: Replace manual PgBouncer deployment with CNPG's built-in
|
||||
Pooler CRD, or update PgBouncer to point to CNPG's `-rw` service
|
||||
- Init containers (woodpecker, trading-bot): Enable `enableSuperuserAccess: true`
|
||||
in CNPG Cluster spec — CNPG strips SUPERUSER from imported roles by default
|
||||
- Custom image: Test `viktorbarzin/postgres:16-master` with CNPG first.
|
||||
Move `shared_preload_libraries=vectors.so` to CNPG `postgresql.parameters`
|
||||
(CNPG overrides container CMD). Tag format may need adjusting.
|
||||
|
||||
**Backup**: Keep existing pg_dumpall CronJob, pointed at new CNPG endpoint.
|
||||
CNPG's native WAL archiving requires S3-compatible backend (not NFS) —
|
||||
adding MinIO is a future enhancement, not a blocker.
|
||||
|
||||
Dependent services (12): authentik, n8n, dawarich, tandoor, linkwarden,
|
||||
netbox, woodpecker, rybbit, affine, health, resume, trading-bot
|
||||
|
||||
Resource overhead: ~2GB RAM total (2 instances), ~50GB local disk per node
|
||||
|
||||
### Component 2: Redis — Single Instance on Local Disk
|
||||
|
||||
**Current**: Single redis-stack pod on NFS (`/mnt/main/redis`).
|
||||
RDB background save takes 39 seconds on NFS (should be <1s on local disk).
|
||||
|
||||
**Finding**: redis-stack modules (RedisJSON, RediSearch, RedisTimeSeries,
|
||||
RedisBloom, RedisGears) are completely unused. Zero module commands in
|
||||
`INFO commandstats`. All 11 services use plain Redis commands only
|
||||
(GET, SET, BullMQ queues, Celery broker, caching).
|
||||
|
||||
**Finding**: No service stores critical primary data in Redis. All use it
|
||||
for job queues and caching. Losing Redis data means: users re-login,
|
||||
jobs retry, caches rebuild. Inconvenient but never catastrophic.
|
||||
|
||||
**Finding**: None of the 11 services support Sentinel-aware connections.
|
||||
Redis Sentinel would require a proxy layer with no reliability gain on
|
||||
a single physical host.
|
||||
|
||||
**Target**: Single `redis:7-alpine` (or `valkey:9`) on local PVC.
|
||||
Drop redis-stack — modules are unused overhead (~100MB RAM saved).
|
||||
|
||||
Architecture:
|
||||
```
Redis 7 (single instance)
├── Local PVC via local-path-provisioner (fast RDB saves)
├── K8s Service: redis.redis.svc.cluster.local (unchanged)
└── Hourly CronJob: cp dump.rdb → NFS:/mnt/main/redis-backup/
```
|
||||
|
||||
No client changes needed. Same service endpoint. Same Redis commands.
|
||||
|
||||
Resource overhead: ~550MB RAM (today's ~650MB minus the ~100MB of unused module overhead), ~1GB local disk
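
The local PVC for this component could look roughly like the following. It is a sketch under assumptions: the claim name is a placeholder, and `local-path` is the StorageClass name that local-path-provisioner ships by default, which may differ in this cluster.

```hcl
# Sketch only: the claim name is a placeholder, and "local-path" is assumed to
# be the StorageClass created by local-path-provisioner.
resource "kubernetes_persistent_volume_claim" "redis_local" {
  metadata {
    name      = "redis-data-local"
    namespace = "redis"
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "local-path"
    resources {
      requests = {
        storage = "1Gi"
      }
    }
  }
  # The default local-path StorageClass binds on first consumer, so the claim
  # stays Pending until a pod mounts it; do not block the apply on binding.
  wait_until_bound = false
}
```

The same shape would apply to the other single-instance components that move to local disk, with the size adjusted per component.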
|
||||
|
||||
### Component 3: MySQL — Single Instance on Local Disk
|
||||
|
||||
**Current**: Single MySQL pod on NFS (`/mnt/main/mysql`)
|
||||
**Target**: Single MySQL on local PVC
|
||||
|
||||
Services on MySQL (8): hackmd, speedtest, onlyoffice, crowdsec,
|
||||
paperless-ngx, real-estate-crawler, url-shortener, grafana
|
||||
|
||||
Evaluate per-service whether migration to PostgreSQL is feasible
|
||||
(reduces operational complexity to one DB engine). Do during
|
||||
implementation research phase.
|
||||
|
||||
**Backup**: Keep existing mysqldump CronJob.
|
||||
|
||||
### Component 4: Immich PostgreSQL
|
||||
|
||||
**Current**: Dedicated PostgreSQL + pgvector on NFS
|
||||
(`ghcr.io/immich-app/postgres:15-vectorchord0.3.0-pgvectors0.2.0`)
|
||||
|
||||
**Target**: Move to local PVC (same image, same single instance).
|
||||
Immich's PG has specialized extensions (VectorChord, pgvectors) that
|
||||
may not be compatible with CNPG operand images. Simpler to keep as
|
||||
standalone PG on local disk.
|
||||
|
||||
### Component 5: ClickHouse (Rybbit)
|
||||
|
||||
**Current**: Single ClickHouse on NFS (`/mnt/main/clickhouse`)
|
||||
**Target**: Move to local PVC (single instance). Analytics data is
|
||||
rebuildable. ClickHouse replication is not justified for a homelab.
|
||||
|
||||
### Component 6: SQLite App Consolidation to PostgreSQL
|
||||
|
||||
**REVISED based on per-app research:**
|
||||
|
||||
Apps confirmed safe to migrate:
|
||||
|
||||
| App | Config mechanism | Migration tool | Risk | Notes |
|-----|-----------------|---------------|------|-------|
| Forgejo | `[database]` in app.ini | `forgejo dump --database postgres` | Moderate | Git repos stay on NFS |
| FreshRSS | `DB_HOST` env vars | OPML export/import (fresh install) | Low | PG is the recommended backend |
| Open WebUI | `DATABASE_URL` env var | None (start fresh) | Low | Chat history is disposable |
|
||||
|
||||
**Apps REMOVED from migration plan:**
|
||||
|
||||
| App | Reason |
|-----|--------|
| **Headscale** | Project EXPLICITLY DISCOURAGES PostgreSQL: "highly discouraged, only supported for legacy reasons. All new development and testing are SQLite." Migrating risks VPN stability. |
| **MeshCentral** | Uses NeDB (document store), not SQLite. NeDB→PG migration path is poorly documented and risky. |
|
||||
|
||||
Apps confirmed SQLite/BoltDB-only (stay on NFS):
|
||||
|
||||
| App | Storage engine | Mitigation |
|-----|---------------|------------|
| Headscale | SQLite (recommended by project) | Accept (project-recommended config) |
| Vaultwarden | SQLite | Defer (migration too risky for password vault) |
| Uptime Kuma | SQLite (v2 adds MariaDB, not PG) | Accept or Litestream |
| Navidrome | SQLite only | Accept or Litestream |
| Audiobookshelf | SQLite only | Accept or Litestream |
| Calibre-Web | SQLite (Calibre format) | Accept (format constraint) |
| Wealthfolio | SQLite only | Accept or Litestream |
| MeshCentral | NeDB (document store) | Accept |
| Diun | bbolt (BoltDB fork) | Accept (rebuildable state) |
|
||||
|
||||
### Component 7: Monitoring Stack
|
||||
|
||||
Prometheus, Loki, Alertmanager use specialized storage (TSDB, BoltDB).
|
||||
Cannot migrate to PostgreSQL. Prometheus WAL is already on tmpfs (good).
|
||||
|
||||
Recommendation: Move to local PVCs. Losing metrics history on node
|
||||
failure is acceptable for a homelab.
|
||||
|
||||
### Component 8: What Stays on NFS (unchanged)
|
||||
|
||||
All ~35 LOW risk services: media files, configs, caches, static content.
|
||||
Immich photos, Jellyfin media, Audiobookshelf audiobooks, Calibre ebooks,
|
||||
Frigate recordings, downloads, backups, model caches, etc.
|
||||
|
||||
NFS failure for these = temporary unavailability, not corruption.
|
||||
|
||||
## Backup Strategy
|
||||
|
||||
```
CNPG PostgreSQL → pg_dumpall CronJob (daily)  → NFS:/mnt/main/postgresql-backup/
MySQL           → mysqldump CronJob (daily)   → NFS:/mnt/main/mysql-backup/
Redis           → RDB copy CronJob (hourly)   → NFS:/mnt/main/redis-backup/
Immich PG       → pg_dump CronJob (daily)     → NFS:/mnt/main/immich-pg-backup/
Litestream      → continuous SQLite backup    → NFS:/mnt/main/litestream/ (optional)
Media/configs   → already on NFS

NFS (TrueNAS) → rsync → Backup NAS (unchanged)
```
|
||||
|
||||
All backups still consolidate to TrueNAS. Rsync-to-backup-NAS workflow
|
||||
is completely unchanged.
|
||||
|
||||
**Note**: CNPG's native WAL archiving requires S3-compatible storage
|
||||
(not NFS). Adding MinIO for PITR capability is a future enhancement.
|
||||
The pg_dumpall CronJob provides adequate backup for a homelab.
|
||||
|
||||
## Migration Order (Safety-First)
|
||||
|
||||
Each phase: research → backup → migrate → verify → user confirms → next.
|
||||
|
||||
Before each service migration, a research subagent will:
|
||||
1. Confirm current setup and configuration
|
||||
2. Research online best practices and documentation
|
||||
3. Scrutinize the migration plan for that specific service
|
||||
4. Present findings for review before execution
|
||||
|
||||
### Phase 0: Infrastructure Prerequisites
|
||||
- Verify RAM headroom (current overcommit must be addressed first)
|
||||
- Add dedicated local virtual disks to K8s worker nodes (via Proxmox)
|
||||
- Verify local-path-provisioner is configured for new disks
|
||||
- Install CloudNativePG operator (Helm)
|
||||
- Test CNPG with custom PostgreSQL image (throwaway cluster)
|
||||
|
||||
### Phase 1: PostgreSQL Migration (highest impact, most preparation)
|
||||
1. Deploy throwaway CNPG cluster to test image compatibility and import
|
||||
2. Full pg_dumpall backup to NFS
|
||||
3. Deploy production CNPG cluster with monolith import from running PG
|
||||
4. Create ExternalName Service for backwards compatibility
|
||||
5. Migrate ONE low-risk service first (e.g., `resume` or `health`)
|
||||
6. Verify for 24-48 hours
|
||||
7. Migrate remaining services one at a time, verify each
|
||||
8. Migrate authentik LAST (identity provider — highest blast radius)
|
||||
9. Keep old PG pod scaled to 0 for one week as rollback safety net
|
||||
10. Decommission old PG only after stability confirmed
|
||||
|
||||
### Phase 2: Redis Migration
|
||||
1. RDB snapshot backup to NFS
|
||||
2. Deploy single redis:7-alpine on local PVC (same namespace, new pod)
|
||||
3. Restore RDB snapshot
|
||||
4. Update redis Service to point to new pod
|
||||
5. Verify all 11 dependent services
|
||||
6. Add hourly RDB backup CronJob to NFS
|
||||
7. Decommission old redis-stack pod
|
||||
|
||||
### Phase 3: MySQL Migration
|
||||
1. mysqldump backup
|
||||
2. Deploy single MySQL on local PVC
|
||||
3. Restore dump
|
||||
4. Verify all 8 dependent services
|
||||
5. Research per-service PostgreSQL migration feasibility (future work)
|
||||
|
||||
### Phase 4: Immich PostgreSQL
|
||||
1. pg_dump backup
|
||||
2. Move Immich PG to local PVC (same image, same config)
|
||||
3. Verify Immich functionality (upload, search, face recognition)
|
||||
|
||||
### Phase 5: SQLite Apps → PostgreSQL
|
||||
Migrate one at a time, safest first:
|
||||
- 5a. FreshRSS (lowest risk — fresh install with OPML import)
- 5b. Open WebUI (low risk — start fresh, chat history disposable)
- 5c. Forgejo (moderate risk — use forgejo dump, verify git operations)
|
||||
|
||||
### Phase 6: ClickHouse + Monitoring
|
||||
- 6a. ClickHouse → local PVC
- 6b. Prometheus → local PVC
- 6c. Loki → local PVC
- 6d. Alertmanager → local PVC
|
||||
|
||||
### Phase 7: Cleanup + Optional Enhancements
|
||||
- Remove old NFS directories from nfs_directories.txt
|
||||
- Update nfs_exports.sh
|
||||
- Optional: Add Litestream for SQLite-only apps
|
||||
- Optional: Add MinIO for CNPG WAL archiving (PITR capability)
|
||||
- Optional: Evaluate MySQL→PostgreSQL consolidation
|
||||
|
||||
## Rollback Plan (per component)
|
||||
|
||||
**PostgreSQL**: Old pod kept scaled to 0 with NFS data intact. Rollback =
|
||||
scale old pod back up, revert ExternalName Service. Pre-migration
|
||||
pg_dumpall available if NFS data is stale.
|
||||
|
||||
**Redis**: Old redis-stack pod kept scaled to 0. Rollback = scale up,
|
||||
revert Service. Pre-migration RDB snapshot on NFS.
|
||||
|
||||
**MySQL**: Same pattern — old pod scaled to 0, mysqldump on NFS.
|
||||
|
||||
**SQLite apps**: Original SQLite databases remain on NFS untouched.
|
||||
Rollback = remove DATABASE_URL env var, restart pod.
|
||||
|
||||
## Resource Budget
|
||||
|
||||
| Component | RAM | Local Disk |
|-----------|-----|-----------|
| CloudNativePG (2 instances) | ~2GB | ~50GB/node (2 nodes) |
| Redis 7 (single instance) | ~550MB | ~1GB |
| MySQL (single instance) | ~1GB | ~20GB |
| Immich PG (single instance) | ~500MB | ~10GB |
| CNPG Operator | ~200MB | None |
| **Total new overhead** | **~4.25GB** | **~81GB across 2 nodes** |
|
||||
|
||||
**RAM WARNING**: The Proxmox host has 142GB of physical RAM with ~156GB allocated to running VMs (already ~10% overcommitted). This plan adds ~4.25GB and frees ~1.5GB by dropping the redis-stack modules and decommissioning the old NFS-backed DB pods (postgresql, mysql, redis-stack), for a net increase of ~2.75GB. Monitor swap usage closely.
|
||||
|
||||
Consider stopping unused VMs (PBS is already stopped, Windows10 uses
|
||||
8GB and may not need to run continuously).
|
||||
|
||||
## Monitoring Additions
|
||||
|
||||
After migration, add alerts for:
|
||||
- CNPG replication lag
|
||||
- CNPG instance count (< 2 = degraded)
|
||||
- Local disk space on `/opt/local-path-provisioner` per node
|
||||
- Redis RDB save failures
|
||||
- Backup CronJob failures (pg_dumpall, mysqldump, RDB copy)
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- [ ] PostgreSQL, MySQL, Redis, Immich PG, ClickHouse all on local disk
|
||||
- [ ] TrueNAS VM restart causes zero data corruption
|
||||
- [ ] TrueNAS VM restart only affects media/config services (temporary unavailability)
|
||||
- [ ] All backups still consolidate to TrueNAS for rsync to backup NAS
|
||||
- [ ] Each migrated service verified working before proceeding to next
|
||||
- [ ] Rollback tested for PostgreSQL before decommissioning old pod
|
||||
219
docs/plans/2026-03-01-nfs-csi-migration-design.md
Normal file
|
|
@ -0,0 +1,219 @@
|
|||
# NFS CSI Driver Migration: Inline Volumes → PV/PVC with Soft Mounts
|
||||
|
||||
**Date**: 2026-03-01
|
||||
**Status**: Draft
|
||||
**Complements**: `2026-02-28-storage-reliability-design.md` (databases → local disk)
|
||||
**Goal**: Eliminate stale NFS mount hangs, add mount health checking, and create a storage abstraction layer for all NFS-dependent services
|
||||
|
||||
## Problem
|
||||
|
||||
56 services use inline NFS volumes (`nfs {}` in pod specs). This pattern has three compounding issues:
|
||||
|
||||
1. **Stale mounts hang forever**: Inline NFS defaults to `hard,timeo=600` mount options. When TrueNAS is unreachable (reboot, network blip, NFS export change), the kernel retries indefinitely. Pods show `Running 1/1` but are completely frozen with zero listening sockets. The only fix is force-deleting the pod.
|
||||
|
||||
2. **No mount health checking**: kubelet has no visibility into NFS mount health. Liveness probes only check application health, not filesystem access. A stale mount is invisible to the scheduler.
|
||||
|
||||
3. **No storage abstraction**: NFS server IP and export paths are hardcoded into every pod spec via `var.nfs_server`. Changing the backend (different NFS server, different protocol) requires editing 56 stacks.
|
||||
|
||||
## Constraints
|
||||
|
||||
- Zero data migration — same NFS paths, same TrueNAS server, same directories
|
||||
- Services must keep working during migration (no downtime per service beyond a pod restart)
|
||||
- Must work with existing Terragrunt architecture (per-stack state isolation)
|
||||
- Must not break services that will later move to local disk (per storage-reliability design)
|
||||
|
||||
## Design
|
||||
|
||||
### Architecture
|
||||
|
||||
```
BEFORE:
  Pod spec → inline nfs {} block → kubelet mount -t nfs (hard,timeo=600) → TrueNAS
  (no health check, hangs on stale mount, server IP in every stack)

AFTER:
  Terraform module → PV (CSI driver ref) + PVC → Pod spec references PVC
  CSI driver mounts with soft,timeo=30,retrans=3 → TrueNAS
  (health-checked, fails fast on stale mount, server IP in module only)
```
|
||||
|
||||
### Component 1: NFS CSI Driver (Helm chart in platform stack)
|
||||
|
||||
Deploy `csi-driver-nfs` v4.11+ via Helm in `stacks/platform/modules/nfs-csi/`.
|
||||
|
||||
The driver runs as:
|
||||
- **Controller**: 1 replica (handles PV provisioning)
|
||||
- **Node DaemonSet**: 1 per node (handles mount/unmount operations)
|
||||
|
||||
Resource footprint: ~50MB RAM per node, ~10m CPU idle.
|
||||
|
||||
The driver itself does not change NFS behavior — it delegates to the kernel NFS client. The value is:
|
||||
- Mount options are configurable per-StorageClass (not hardcoded kernel defaults)
|
||||
- CSI health checking can detect unhealthy volumes
|
||||
- Standard K8s storage API (PV/PVC/StorageClass) instead of inline volumes
|
||||
|
||||
### Component 2: StorageClass
|
||||
|
||||
```hcl
resource "kubernetes_storage_class" "nfs_truenas" {
  metadata { name = "nfs-truenas" }
  # The Terraform kubernetes provider names this argument storage_provisioner.
  storage_provisioner = "nfs.csi.k8s.io"
  reclaim_policy      = "Retain"
  volume_binding_mode = "Immediate"

  mount_options = [
    "soft",      # Return -EIO instead of hanging forever
    "timeo=30",  # 3-second timeout per NFS RPC call (timeo is in tenths of a second)
    "retrans=3", # Retry 3 times before giving up (~9 sec total)
    "actimeo=5", # 5-second attribute cache (balance freshness vs perf)
  ]

  parameters = {
    server = var.nfs_server
    share  = "/mnt/main"
  }
}
```
|
||||
|
||||
Key mount option differences vs current defaults:
|
||||
|
||||
| Option | Current (inline) | New (CSI) | Effect |
|--------|-----------------|-----------|--------|
| `hard` vs `soft` | `hard` (default) | `soft` | I/O errors instead of infinite hang |
| `timeo` | 600 (60 sec) | 30 (3 sec) | Faster failure detection |
| `retrans` | 3 | 3 | Same retry count, but 3s per attempt not 60s |
| `actimeo` | unset (kernel defaults: 3-60 sec for files, 30-60 sec for directories) | 5 (5 sec) | Fresher attribute cache |
| Total stale detection | **~3 minutes** | **~9 seconds** | 20x faster |
|
||||
|
||||
### Component 3: Shared Terraform Module (`modules/kubernetes/nfs_volume/`)
|
||||
|
||||
Creates a PV + PVC pair for each NFS mount point. Hides boilerplate.
|
||||
|
||||
**Interface**:
|
||||
```hcl
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"      # PV and PVC name (must be unique cluster-wide)
  namespace  = "myservice"           # PVC namespace
  nfs_server = var.nfs_server        # From terraform.tfvars
  nfs_path   = "/mnt/main/myservice" # NFS export path
  # Optional:
  # storage      = "10Gi"            # Default: 10Gi (informational for NFS)
  # access_modes = ["ReadWriteMany"] # Default: RWX
}
```
|
||||
|
||||
**Outputs**:
|
||||
- `claim_name` — PVC name to reference in pod spec
|
||||
|
||||
**Module creates**:
|
||||
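1. `kubernetes_persistent_volume` — CSI-backed; it must carry the soft-mount options itself, because StorageClass mount options are only applied to dynamically provisioned volumes, not to pre-created PVs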
|
||||
2. `kubernetes_persistent_volume_claim` — bound to the PV, namespaced
|
||||
|
||||
PVs are cluster-scoped, so `name` must be globally unique. Convention: `<service>-<purpose>` (e.g., `openclaw-tools`, `privatebin-data`).
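
One possible shape for the module internals is sketched below. The design does not spell them out, so everything here is an assumption: the resource names, the hardcoded `nfs-truenas` class name, and in particular the choice to repeat the mount options on the PV. That last point reflects how Kubernetes treats statically created PVs, which do not inherit mount options from their StorageClass.

```hcl
# modules/kubernetes/nfs_volume/main.tf (sketch; internals are assumed)
variable "name" { type = string }
variable "namespace" { type = string }
variable "nfs_server" { type = string }
variable "nfs_path" { type = string }

variable "storage" {
  type    = string
  default = "10Gi"
}

variable "access_modes" {
  type    = list(string)
  default = ["ReadWriteMany"]
}

resource "kubernetes_persistent_volume" "this" {
  metadata {
    name = var.name
  }
  spec {
    capacity = {
      storage = var.storage
    }
    access_modes                     = var.access_modes
    persistent_volume_reclaim_policy = "Retain"
    storage_class_name               = "nfs-truenas"
    # Set on the PV itself: a pre-created PV does not pick up the
    # StorageClass mount options.
    mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = var.name # must be unique per volume
        volume_attributes = {
          server = var.nfs_server
          share  = var.nfs_path
        }
      }
    }
  }
}

resource "kubernetes_persistent_volume_claim" "this" {
  metadata {
    name      = var.name
    namespace = var.namespace
  }
  spec {
    access_modes       = var.access_modes
    storage_class_name = "nfs-truenas"
    # Bind explicitly to the PV created above.
    volume_name = kubernetes_persistent_volume.this.metadata[0].name
    resources {
      requests = {
        storage = var.storage
      }
    }
  }
}

output "claim_name" {
  value = kubernetes_persistent_volume_claim.this.metadata[0].name
}
```

Setting `volume_name` on the claim pins it to its own PV, which avoids a claim from another stack binding to the wrong volume when several PVs share the same class and size.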
|
||||
|
||||
### Component 4: Stack Migration (Mechanical Change)
|
||||
|
||||
Each stack changes from:
|
||||
```hcl
# OLD: inline NFS
volume {
  name = "data"
  nfs {
    server = var.nfs_server
    path   = "/mnt/main/myservice"
  }
}
```
|
||||
|
||||
To:
|
||||
```hcl
# NEW: module call (outside pod spec)
module "nfs_data" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "myservice-data"
  namespace  = "myservice"
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/myservice"
}

# NEW: PVC reference (in pod spec, replaces nfs {} block)
volume {
  name = "data"
  persistent_volume_claim {
    claim_name = module.nfs_data.claim_name
  }
}
```
|
||||
|
||||
Volume mount blocks (`volume_mount {}`) are **completely unchanged**.
|
||||
|
||||
### Component 5: Platform Module Migration
|
||||
|
||||
Platform modules (redis, dbaas, monitoring, etc.) that use NFS follow the same pattern, but the module path is `../../../../modules/kubernetes/nfs_volume` (two levels deeper, because they live under `stacks/platform/modules/<name>/`). The `nfs_server` variable is already passed through `stacks/platform/main.tf`.
|
||||
|
||||
Some platform modules use explicit PV/PVC already (Loki, Prometheus). These get updated to use the CSI driver backend instead of the native NFS PV source.
|
||||
|
||||
### What Does NOT Change
|
||||
|
||||
- NFS export paths on TrueNAS (no `nfs_directories.txt` changes)
|
||||
- NFS server configuration
|
||||
- Volume mount paths inside containers
|
||||
- Sub-path usage patterns
|
||||
- Container images or application config
|
||||
- Services that will move to local disk later (per storage-reliability design) — they get CSI mounts as an interim improvement, then move off NFS entirely
|
||||
|
||||
## Migration Order
|
||||
|
||||
Services grouped by risk. Each batch: apply → verify pods running → verify app accessible → next batch.
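The per-batch loop (apply, verify pods, verify the app, move on) could be scripted roughly as follows. This is a sketch under assumptions: the namespace is taken to equal the stack name, `KUBECONFIG_PATH`, the 120 second timeout, and the `<stack>.viktorbarzin.me` hostname pattern are illustrative, and `DRY_RUN` defaults to printing commands rather than running them.

```bash
# Sketch only. DRY_RUN=1 (the default) prints each command instead of
# executing it; set DRY_RUN=0 to run for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

migrate_batch() {
  for stack in "$@"; do
    # apply -> verify pods running -> verify app accessible -> next stack
    run sh -c "cd stacks/$stack && terragrunt apply --non-interactive" || return 1
    run kubectl --kubeconfig "${KUBECONFIG_PATH:-config}" wait pod --all \
      -n "$stack" --for=condition=Ready --timeout=120s || return 1
    run curl -sfI "https://$stack.viktorbarzin.me" || return 1
  done
}
```

Calling `migrate_batch privatebin resume speedtest` in dry-run mode lists the nine commands for the pilot batch. The function stops at the first failing step, so a broken stack does not get buried under later applies.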
|
||||
|
||||
### Phase 0: Infrastructure
|
||||
1. Deploy NFS CSI driver Helm chart (platform module)
|
||||
2. Create `nfs-truenas` StorageClass
|
||||
3. Create `modules/kubernetes/nfs_volume/` shared module
|
||||
|
||||
### Phase 1: Low-Risk Pilot (3 services)
|
||||
Pick 3 simple, single-volume services to validate the pattern:
|
||||
- `privatebin` (1 volume, low traffic)
|
||||
- `resume` (1 volume, personal site); chosen instead of `echo`, which is stateless and has no volume to migrate
|
||||
- `speedtest` (1 volume, low traffic)
|
||||
|
||||
### Phase 2: Simple Services (single NFS volume each, ~20 services)
|
||||
Mechanical migration of all single-volume stacks. Can be parallelized.
|
||||
|
||||
### Phase 3: Multi-Volume Services (~15 services)
|
||||
Services with 2-4 NFS volumes (openclaw, servarr, immich, etc.). More module calls but same pattern.
|
||||
|
||||
### Phase 4: Platform Modules (~9 modules)
|
||||
Monitoring stack, Redis, dbaas PVs, etc. These live in `stacks/platform/modules/` and need the module path adjusted.
|
||||
|
||||
### Phase 5: Cleanup
|
||||
- Update CLAUDE.md documentation (new NFS volume pattern)
|
||||
- Update `setup-project` skill to use module pattern for new services
|
||||
- Verify all services healthy
|
||||
|
||||
## Rollback
|
||||
|
||||
Per-service rollback: revert the stack to inline `nfs {}` and `terragrunt apply`. The data never moved — it's the same NFS path. PV/PVC objects get destroyed by Terraform, pod remounts inline. Takes 1 minute per service.
|
||||
|
||||
Full rollback: remove CSI driver and StorageClass from platform stack, revert all stacks. No data impact.
|
||||
|
||||
## Risks
|
||||
|
||||
1. **`soft` mount I/O errors**: Apps that don't handle I/O errors gracefully may crash instead of hanging. This is strictly better — a crash triggers a restart with a fresh mount, vs hanging forever. But some apps may log noisy errors during brief NFS blips.
|
||||
|
||||
2. **PV naming conflicts**: PV names are cluster-global. Must ensure uniqueness. Convention `<service>-<purpose>` handles this.
|
||||
|
||||
3. **Terraform state churn**: Each service gains 2 new resources (PV + PVC) and loses the inline volume (implicit, not tracked). The `terragrunt apply` will show resource additions but no deletions (inline volumes aren't separate TF resources). Pod will be recreated.
|
||||
|
||||
4. **CSI driver resource overhead**: ~50MB RAM + 10m CPU per node (5 nodes = ~250MB cluster-wide). Acceptable.
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- [ ] NFS CSI driver deployed and healthy on all 5 nodes
|
||||
- [ ] `nfs-truenas` StorageClass created with soft mount options
|
||||
- [ ] `modules/kubernetes/nfs_volume/` module created and tested
|
||||
- [ ] All 56 NFS-dependent services migrated from inline to PV/PVC
|
||||
- [ ] No service downtime beyond a single pod restart during migration
|
||||
- [ ] Simulated NFS outage (TrueNAS NFS service pause) results in pod restart (not hang)
|
||||
- [ ] Documentation and skills updated for new pattern
|
||||
774
docs/plans/2026-03-01-nfs-csi-migration-plan.md
Normal file
|
|
@ -0,0 +1,774 @@
|
|||
# NFS CSI Driver Migration Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Replace all inline NFS volumes with CSI-backed PV/PVC using soft mount options to eliminate stale mount hangs.
|
||||
|
||||
**Architecture:** Deploy the NFS CSI driver as a platform Helm module, create a shared Terraform module for PV/PVC boilerplate, then mechanically migrate all 56 NFS-dependent services from inline `nfs {}` to `persistent_volume_claim {}` referencing the shared module.
|
||||
|
||||
**Tech Stack:** csi-driver-nfs (Helm), Terraform/Terragrunt, Kubernetes PV/PVC/StorageClass
|
||||
|
||||
**Design doc:** `docs/plans/2026-03-01-nfs-csi-migration-design.md`
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Create the NFS CSI Driver Platform Module
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/platform/modules/nfs-csi/main.tf`
|
||||
- Modify: `stacks/platform/main.tf` (add module block)
|
||||
|
||||
**Step 1: Create the module directory**
|
||||
|
||||
```bash
|
||||
mkdir -p stacks/platform/modules/nfs-csi
|
||||
```
|
||||
|
||||
**Step 2: Write the NFS CSI module**
|
||||
|
||||
Create `stacks/platform/modules/nfs-csi/main.tf`:
|
||||
|
||||
```hcl
|
||||
variable "tier" { type = string }
|
||||
variable "nfs_server" { type = string }
|
||||
|
||||
resource "kubernetes_namespace" "nfs_csi" {
|
||||
metadata {
|
||||
name = "nfs-csi"
|
||||
labels = {
|
||||
tier = var.tier
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "helm_release" "nfs_csi_driver" {
|
||||
namespace = kubernetes_namespace.nfs_csi.metadata[0].name
|
||||
create_namespace = false
|
||||
name = "csi-driver-nfs"
|
||||
atomic = true
|
||||
timeout = 300
|
||||
|
||||
repository = "https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts"
|
||||
chart = "csi-driver-nfs"
|
||||
|
||||
values = [yamlencode({
|
||||
controller = {
|
||||
replicas = 1
|
||||
resources = {
|
||||
requests = { cpu = "10m", memory = "32Mi" }
|
||||
limits = { cpu = "100m", memory = "128Mi" }
|
||||
}
|
||||
}
|
||||
node = {
|
||||
resources = {
|
||||
requests = { cpu = "10m", memory = "32Mi" }
|
||||
limits = { cpu = "100m", memory = "128Mi" }
|
||||
}
|
||||
}
|
||||
storageClass = {
|
||||
create = false # We create it ourselves below for full control
|
||||
}
|
||||
})]
|
||||
}
|
||||
|
||||
resource "kubernetes_storage_class" "nfs_truenas" {
|
||||
metadata {
|
||||
name = "nfs-truenas"
|
||||
}
|
||||
storage_provisioner = "nfs.csi.k8s.io"
|
||||
reclaim_policy = "Retain"
|
||||
volume_binding_mode = "Immediate"
|
||||
|
||||
mount_options = [
|
||||
"soft",
|
||||
"timeo=30",
|
||||
"retrans=3",
|
||||
"actimeo=5",
|
||||
]
|
||||
|
||||
parameters = {
|
||||
server = var.nfs_server
|
||||
share = "/mnt/main"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Wire the module into `stacks/platform/main.tf`**
|
||||
|
||||
Add after the `cnpg` module block (around line 318):
|
||||
|
||||
```hcl
|
||||
module "nfs-csi" {
|
||||
source = "./modules/nfs-csi"
|
||||
tier = local.tiers.cluster
|
||||
nfs_server = var.nfs_server
|
||||
}
|
||||
```
|
||||
|
||||
**Step 4: Verify with plan**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | head -80
|
||||
```
|
||||
|
||||
Expected: Plan shows 3 new resources (`kubernetes_namespace`, `helm_release`, `kubernetes_storage_class`). No changes to existing resources.
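The expected counts can be checked mechanically instead of by eye. The helper below is a sketch that reads plan output on stdin and prints the add, change, and destroy counts from Terraform's `Plan: N to add, N to change, N to destroy.` summary line. The function name is hypothetical, and it assumes uncoloured output.

```bash
# Print "<add> <change> <destroy>" from Terraform plan output on stdin.
# Prints nothing if no summary line is present (e.g. "No changes.").
plan_counts() {
  sed -n 's/.*Plan:.* \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy.*/\1 \2 \3/p' \
    | head -n 1
}
```

A possible use, assuming Terragrunt passes `-no-color` through to Terraform: `terragrunt plan --non-interactive -no-color 2>&1 | plan_counts`. For this task the result should be `3 0 0`. For a single-volume stack migration it should be `2 1 0`.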
|
||||
|
||||
**Step 5: Apply**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt apply --non-interactive
|
||||
```
|
||||
|
||||
**Step 6: Verify CSI driver is running**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi
|
||||
kubectl --kubeconfig $(pwd)/config get storageclass nfs-truenas
|
||||
```
|
||||
|
||||
Expected: the controller pod and the node DaemonSet pods (one per node, 5 in total) are all Running. StorageClass `nfs-truenas` exists with provisioner `nfs.csi.k8s.io`.
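To turn the "all Running" expectation into a pass/fail check, the pod list can be filtered for anything in another state. This is a minimal sketch with a hypothetical function name. It reads `kubectl get pods --no-headers` output on stdin and relies on STATUS being the third column.

```bash
# Print the names of pods whose STATUS column is not "Running".
# Empty output means every listed pod is Running.
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}
```

For instance, `kubectl --kubeconfig $(pwd)/config get pods -n nfs-csi --no-headers | pods_not_running` should print nothing once the driver is healthy. Namespaces that contain finished Job pods would show those as `Completed`, so the filter is best suited to namespaces like this one.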
|
||||
|
||||
**Step 7: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/nfs-csi/ stacks/platform/main.tf
|
||||
git commit -m "[ci skip] add NFS CSI driver platform module with nfs-truenas StorageClass"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Create the Shared `nfs_volume` Module
|
||||
|
||||
**Files:**
|
||||
- Create: `modules/kubernetes/nfs_volume/main.tf`
|
||||
|
||||
**Step 1: Write the module**
|
||||
|
||||
Create `modules/kubernetes/nfs_volume/main.tf`:
|
||||
|
||||
```hcl
|
||||
variable "name" {
|
||||
description = "Unique name for PV and PVC (convention: <service>-<purpose>)"
|
||||
type = string
|
||||
}
|
||||
|
||||
variable "namespace" {
|
||||
description = "Kubernetes namespace for the PVC"
|
||||
type = string
|
||||
}
|
||||
|
||||
variable "nfs_server" {
|
||||
description = "NFS server address"
|
||||
type = string
|
||||
}
|
||||
|
||||
variable "nfs_path" {
|
||||
description = "NFS export path (e.g. /mnt/main/myservice)"
|
||||
type = string
|
||||
}
|
||||
|
||||
variable "storage" {
|
||||
description = "Storage capacity (informational for NFS)"
|
||||
type = string
|
||||
default = "10Gi"
|
||||
}
|
||||
|
||||
variable "access_modes" {
|
||||
description = "PV/PVC access modes"
|
||||
type = list(string)
|
||||
default = ["ReadWriteMany"]
|
||||
}
|
||||
|
||||
resource "kubernetes_persistent_volume" "this" {
|
||||
metadata {
|
||||
name = var.name
|
||||
}
|
||||
spec {
|
||||
capacity = {
|
||||
storage = var.storage
|
||||
}
|
||||
access_modes = var.access_modes
|
||||
persistent_volume_reclaim_policy = "Retain"
|
||||
storage_class_name = "nfs-truenas"
|
||||
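volume_mode = "Filesystem"
# A statically created PV does not inherit mountOptions from its StorageClass, so set them here.
mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]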
|
||||
|
||||
persistent_volume_source {
|
||||
csi {
|
||||
driver = "nfs.csi.k8s.io"
|
||||
volume_handle = var.name
|
||||
volume_attributes = {
|
||||
server = var.nfs_server
|
||||
share = var.nfs_path
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_persistent_volume_claim" "this" {
|
||||
metadata {
|
||||
name = var.name
|
||||
namespace = var.namespace
|
||||
}
|
||||
spec {
|
||||
access_modes = var.access_modes
|
||||
storage_class_name = "nfs-truenas"
|
||||
volume_name = kubernetes_persistent_volume.this.metadata[0].name
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
storage = var.storage
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
output "claim_name" {
|
||||
description = "PVC name to use in pod spec persistent_volume_claim blocks"
|
||||
value = kubernetes_persistent_volume_claim.this.metadata[0].name
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Format**
|
||||
|
||||
```bash
|
||||
terraform fmt modules/kubernetes/nfs_volume/main.tf
|
||||
```
|
||||
|
||||
**Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add modules/kubernetes/nfs_volume/
|
||||
git commit -m "[ci skip] add shared nfs_volume module for CSI-backed PV/PVC creation"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Pilot Migration — `privatebin`
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/privatebin/main.tf`
|
||||
|
||||
This is the first real migration. Validates the pattern end-to-end.
|
||||
|
||||
**Step 1: Read current state**
|
||||
|
||||
Current NFS volume in `stacks/privatebin/main.tf`:
|
||||
|
||||
```hcl
|
||||
# Lines 71-77 — volume block in pod spec
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
path = "/mnt/main/privatebin"
|
||||
server = var.nfs_server
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Volume mount (lines 54-58, UNCHANGED):
|
||||
```hcl
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/srv/data"
|
||||
sub_path = "data"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Add module call**
|
||||
|
||||
Add before the `kubernetes_deployment` resource (e.g., after the ingress_factory module, before the deployment):
|
||||
|
||||
```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "privatebin-data"
|
||||
namespace = kubernetes_namespace.privatebin.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/privatebin"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Replace inline NFS volume with PVC reference**
|
||||
|
||||
Replace the volume block (lines 71-77):
|
||||
|
||||
```hcl
|
||||
# OLD:
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
path = "/mnt/main/privatebin"
|
||||
server = var.nfs_server
|
||||
}
|
||||
}
|
||||
|
||||
# NEW:
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_data.claim_name
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Do NOT touch the `volume_mount` block — it stays identical.
|
||||
|
||||
**Step 4: Plan and verify**
|
||||
|
||||
```bash
|
||||
cd stacks/privatebin && terragrunt plan --non-interactive
|
||||
```
|
||||
|
||||
Expected: 2 resources added (PV + PVC), deployment updated in-place (volume source changed). No resources destroyed (inline volumes aren't tracked as separate TF resources).
|
||||
|
||||
**Step 5: Apply**
|
||||
|
||||
```bash
|
||||
cd stacks/privatebin && terragrunt apply --non-interactive
|
||||
```
|
||||
|
||||
**Step 6: Verify the pod is running with CSI mount**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n privatebin
|
||||
kubectl --kubeconfig $(pwd)/config describe pod -n privatebin -l app=privatebin | grep -A5 "Volumes:"
|
||||
```
|
||||
|
||||
Expected: Pod running. Volume shows `Type: PersistentVolumeClaim` with `ClaimName: privatebin-data`, NOT `Type: NFS`.
|
||||
|
||||
**Step 7: Verify the app works**
|
||||
|
||||
```bash
|
||||
curl -sI https://privatebin.viktorbarzin.me | head -5
|
||||
```
|
||||
|
||||
Expected: HTTP 200 (or 302 redirect to the paste page).
|
||||
|
||||
**Step 8: Verify mount options**
|
||||
|
||||
```bash
|
||||
# SSH to the node running the pod and check mount options
|
||||
NODE=$(kubectl --kubeconfig $(pwd)/config get pod -n privatebin -l app=privatebin -o jsonpath='{.items[0].spec.nodeName}')
|
||||
ssh wizard@$(kubectl --kubeconfig $(pwd)/config get node $NODE -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}') "mount | grep privatebin"
|
||||
```
|
||||
|
||||
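Expected: the mount options include `soft`, `timeo=30`, and `retrans=3` (NOT the old `hard` default). `actimeo=5` may be displayed as its expanded `acregmin`/`acregmax`/`acdirmin`/`acdirmax` values rather than literally.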
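The soft-versus-hard check is the part that matters, and it can be tested on the option string alone. The helper below is a sketch with a hypothetical function name. It takes the comma-separated option list (the text inside the parentheses of a `mount` output line) and succeeds only when `soft` is present and `hard` is not.

```bash
# Return 0 if the comma-separated mount option string is a soft mount.
is_soft_mount() {
  case ",$1," in
    *,hard,*) return 1 ;;  # explicit hard mount
    *,soft,*) return 0 ;;  # soft mount confirmed
    *)        return 1 ;;  # neither listed: treat as not verified
  esac
}
```

On the node, the option list can be pulled out with something like `mount | grep privatebin | sed 's/.*(\(.*\))/\1/'` and passed to `is_soft_mount`. Wrapping the string in commas avoids false matches on option names that merely contain the word `soft`.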
|
||||
|
||||
**Step 9: Commit**
|
||||
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
git add stacks/privatebin/main.tf
|
||||
git commit -m "[ci skip] privatebin: migrate NFS volume to CSI-backed PV/PVC with soft mount"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Pilot Migration — `resume`
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/resume/main.tf`
|
||||
|
||||
Same pattern as privatebin. Single NFS volume.
|
||||
|
||||
**Step 1: Add module call**
|
||||
|
||||
Add before the `kubernetes_deployment.resume` resource:
|
||||
|
||||
```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "resume-data"
|
||||
namespace = kubernetes_namespace.resume.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/resume"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Replace inline NFS volume with PVC reference**
|
||||
|
||||
In the `resume` deployment's pod spec, replace:
|
||||
|
||||
```hcl
|
||||
# OLD:
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
server = var.nfs_server
|
||||
path = "/mnt/main/resume"
|
||||
}
|
||||
}
|
||||
|
||||
# NEW:
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_data.claim_name
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Plan, apply, verify**
|
||||
|
||||
```bash
|
||||
(cd stacks/resume && terragrunt plan --non-interactive)
|
||||
(cd stacks/resume && terragrunt apply --non-interactive)
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n resume
|
||||
curl -sI https://resume.viktorbarzin.me | head -5
|
||||
```
|
||||
|
||||
**Step 4: Commit**
|
||||
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
git add stacks/resume/main.tf
|
||||
git commit -m "[ci skip] resume: migrate NFS volume to CSI-backed PV/PVC with soft mount"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Pilot Migration — `speedtest`
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/speedtest/main.tf`
|
||||
|
||||
**Step 1: Add module call**
|
||||
|
||||
```hcl
|
||||
module "nfs_config" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "speedtest-config"
|
||||
namespace = kubernetes_namespace.speedtest.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/speedtest"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2: Replace inline NFS volume**
|
||||
|
||||
```hcl
|
||||
# OLD:
|
||||
volume {
|
||||
name = "config"
|
||||
nfs {
|
||||
server = var.nfs_server
|
||||
path = "/mnt/main/speedtest"
|
||||
}
|
||||
}
|
||||
|
||||
# NEW:
|
||||
volume {
|
||||
name = "config"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_config.claim_name
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Step 3: Plan, apply, verify**
|
||||
|
||||
```bash
|
||||
(cd stacks/speedtest && terragrunt plan --non-interactive)
|
||||
(cd stacks/speedtest && terragrunt apply --non-interactive)
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n speedtest
|
||||
curl -sI https://speedtest.viktorbarzin.me | head -5
|
||||
```
|
||||
|
||||
**Step 4: Commit**
|
||||
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
git add stacks/speedtest/main.tf
|
||||
git commit -m "[ci skip] speedtest: migrate NFS volume to CSI-backed PV/PVC with soft mount"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Batch Migration — Simple Single-Volume Stacks
|
||||
|
||||
After pilots are verified, migrate the remaining single-volume stacks. These all follow the exact same mechanical pattern.
|
||||
|
||||
**Files to modify** (one `main.tf` each — apply and verify each individually):
|
||||
|
||||
| Stack | Volume Name | PV Name | NFS Path |
|
||||
|-------|------------|---------|----------|
|
||||
| `audiobookshelf` | `data` | `audiobookshelf-data` | `/mnt/main/audiobookshelf` |
|
||||
| `calibre` | `data` | `calibre-data` | `/mnt/main/calibre-web-automated` |
|
||||
| `changedetection` | `data` | `changedetection-data` | `/mnt/main/changedetection` |
|
||||
| `diun` | `data` | `diun-data` | `/mnt/main/diun` |
|
||||
| `excalidraw` | `data` | `excalidraw-data` | `/mnt/main/excalidraw` |
|
||||
| `forgejo` | `data` | `forgejo-data` | `/mnt/main/forgejo` |
|
||||
| `freshrss` | `data` | `freshrss-data` | `/mnt/main/freshrss` |
|
||||
| `hackmd` | `data` | `hackmd-data` | `/mnt/main/hackmd` |
|
||||
| `health` | `data` | `health-data` | `/mnt/main/health` |
|
||||
| `isponsorblocktv` | `data` | `isponsorblocktv-data` | `/mnt/main/isponsorblocktv` |
|
||||
| `meshcentral` | `data` | `meshcentral-data` | `/mnt/main/meshcentral` |
|
||||
| `n8n` | `data` | `n8n-data` | `/mnt/main/n8n` |
|
||||
| `navidrome` | `data` | `navidrome-data` | `/mnt/main/navidrome` |
|
||||
| `netbox` | `data` | `netbox-data` | `/mnt/main/netbox` |
|
||||
| `ntfy` | `data` | `ntfy-data` | `/mnt/main/ntfy` |
|
||||
| `onlyoffice` | `data` | `onlyoffice-data` | `/mnt/main/onlyoffice` |
|
||||
| `owntracks` | `data` | `owntracks-data` | `/mnt/main/owntracks` |
|
||||
| `privatebin` | _(done in Task 3)_ | | |
|
||||
| `resume` | _(done in Task 4)_ | | |
|
||||
| `send` | `data` | `send-data` | `/mnt/main/send` |
|
||||
| `speedtest` | _(done in Task 5)_ | | |
|
||||
| `tandoor` | `data` | `tandoor-data` | `/mnt/main/tandoor` |
|
||||
| `wealthfolio` | `data` | `wealthfolio-data` | `/mnt/main/wealthfolio` |
|
||||
| `whisper` | `data` | `whisper-data` | `/mnt/main/whisper` |
|
||||
| `atuin` | `data` | `atuin-data` | `/mnt/main/atuin` |
|
||||
| `matrix` | `data` | `matrix-data` | `/mnt/main/matrix` |
|
||||
| `ollama` | `data` | `ollama-data` | `/mnt/main/ollama` |
|
||||
| `poison-fountain` | `data` | `poison-fountain-data` | `/mnt/main/poison-fountain` |
|
||||
| `woodpecker` | `data` | `woodpecker-data` | `/mnt/main/woodpecker` |
|
||||
| `ytdlp` | `data` | `ytdlp-data` | `/mnt/main/ytdlp` |
|
||||
| `stirling-pdf` | `data` | `stirling-pdf-data` | `/mnt/main/stirling-pdf` |
|
||||
| `paperless-ngx` | `data` | `paperless-ngx-data` | `/mnt/main/paperless-ngx` |
|
||||
| `grampsweb` | `data` | `grampsweb-data` | `/mnt/main/grampsweb` |
|
||||
| `trading-bot` | `data` | `trading-bot-data` | `/mnt/main/trading-bot` |
|
||||
|
||||
**For each stack, the pattern is identical:**
|
||||
|
||||
1. Read `stacks/<service>/main.tf` to find the exact NFS volume block and its volume name
|
||||
2. Add `module "nfs_<volume_name>"` call with the correct PV name, namespace, and NFS path
|
||||
3. Replace `nfs {}` block with `persistent_volume_claim { claim_name = module.nfs_<volume_name>.claim_name }`
|
||||
4. `cd stacks/<service> && terragrunt apply --non-interactive`
|
||||
5. Verify pod is running: `kubectl --kubeconfig $(pwd)/config get pods -n <service>`
|
||||
6. Verify app is accessible: `curl -sI https://<service>.viktorbarzin.me | head -5`
|
||||
|
||||
**Important**: Read each `main.tf` first — volume names, NFS paths, and namespace references vary. The table above is a guide, not a source of truth. Some stacks may have different volume names or multiple NFS paths under a parent directory.
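Since the table is only a guide, it helps to ask the repository which stacks still carry an inline NFS block. The sketch below uses a hypothetical function name and lists every `main.tf` under a directory that contains a line opening an `nfs {` block. It also matches `nfs {` inside an existing `persistent_volume_source`, which is intended, because those need migrating too.

```bash
# List main.tf files under $1 (default: stacks) that still contain an
# `nfs {` block, i.e. stacks that have not been migrated yet.
stacks_with_inline_nfs() {
  find "${1:-stacks}" -name main.tf \
    -exec grep -lE '^[[:space:]]*nfs[[:space:]]*\{' {} + | sort
}
```

Running `stacks_with_inline_nfs stacks` from the repo root before and after each batch gives a shrinking to-do list. An empty result means no `main.tf` under that directory still opens an `nfs {` block. NFS volumes declared in other `.tf` files are not covered by this sketch.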
|
||||
|
||||
**Commit after every 3-5 stacks:**
|
||||
|
||||
```bash
|
||||
git add stacks/audiobookshelf/main.tf stacks/calibre/main.tf stacks/changedetection/main.tf
|
||||
git commit -m "[ci skip] migrate audiobookshelf, calibre, changedetection NFS volumes to CSI PV/PVC"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 7: Multi-Volume Stack Migration
|
||||
|
||||
These stacks have 2+ NFS volumes. Each needs multiple module calls.
|
||||
|
||||
**Files to modify** (read each `main.tf` first to get exact volume names and paths):
|
||||
|
||||
| Stack | Expected NFS Volumes | Notes |
|
||||
|-------|---------------------|-------|
|
||||
| `openclaw` | 4: tools, home, workspace, data | 3 containers share volumes |
|
||||
| `immich` | Multiple: library, upload, thumbs, etc. | Check exact paths from nfs_directories.txt |
|
||||
| `servarr` | Parent + 7 sub-stacks, each with NFS | Factory pattern, check each sub-module |
|
||||
| `frigate` | Multiple: config, media, recordings | GPU service |
|
||||
| `dawarich` | Multiple | Check main.tf |
|
||||
| `ebook2audiobook` | Multiple | GPU service |
|
||||
| `f1-stream` | Multiple | Check main.tf |
|
||||
| `real-estate-crawler` | Multiple | Check main.tf |
|
||||
| `nextcloud` | Multiple | Custom LimitRange, complex stack |
|
||||
| `rybbit` | Multiple: clickhouse data, etc. | Check main.tf |
|
||||
| `osm_routing` | Multiple | Check main.tf |
|
||||
| `affine` | Multiple | Check main.tf |
|
||||
|
||||
**Pattern is the same — just more module calls:**
|
||||
|
||||
```hcl
|
||||
# Example for openclaw (4 volumes)
|
||||
module "nfs_tools" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "openclaw-tools"
|
||||
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/openclaw/tools"
|
||||
}
|
||||
|
||||
module "nfs_home" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "openclaw-home"
|
||||
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/openclaw/home"
|
||||
}
|
||||
|
||||
module "nfs_workspace" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "openclaw-workspace"
|
||||
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/openclaw/workspace"
|
||||
}
|
||||
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "openclaw-data"
|
||||
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/openclaw/data"
|
||||
}
|
||||
|
||||
# Then in pod spec:
|
||||
volume {
|
||||
name = "tools"
|
||||
persistent_volume_claim { claim_name = module.nfs_tools.claim_name }
|
||||
}
|
||||
volume {
|
||||
name = "openclaw-home"
|
||||
persistent_volume_claim { claim_name = module.nfs_home.claim_name }
|
||||
}
|
||||
# ... etc
|
||||
```
|
||||
|
||||
**Step for each**: Read main.tf → identify all `nfs {}` blocks → add module calls → replace volume blocks → plan → apply → verify.
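For stacks with several volumes, the repetitive module blocks can be generated and then reviewed by hand. This is a sketch with a hypothetical function name. It assumes the layout of the openclaw example, namely `/mnt/main/<service>/<purpose>` paths and a `kubernetes_namespace` resource named after the service. Both assumptions must be checked against the stack's `main.tf`.

```bash
# Print one nfs_volume module block per purpose for a service.
# Usage: emit_nfs_modules <service> <purpose>...
emit_nfs_modules() {
  service="$1"
  shift
  for purpose in "$@"; do
    cat <<EOF
module "nfs_${purpose}" {
  source     = "../../modules/kubernetes/nfs_volume"
  name       = "${service}-${purpose}"
  namespace  = kubernetes_namespace.${service}.metadata[0].name
  nfs_server = var.nfs_server
  nfs_path   = "/mnt/main/${service}/${purpose}"
}

EOF
  done
}
```

`emit_nfs_modules openclaw tools home workspace data` prints the four blocks shown in the example above. The `volume {}` blocks in the pod spec still need editing manually, because volume names do not always match the purpose.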
|
||||
|
||||
**Commit after each multi-volume stack** (these are more complex, commit individually):
|
||||
|
||||
```bash
|
||||
git add stacks/openclaw/main.tf
|
||||
git commit -m "[ci skip] openclaw: migrate 4 NFS volumes to CSI PV/PVC with soft mount"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 8: Platform Module Migration
|
||||
|
||||
These modules are under `stacks/platform/modules/` and reference shared modules at `../../../../modules/kubernetes/nfs_volume`.
|
||||
|
||||
**Files to modify:**
|
||||
|
||||
| Module | Current Storage Pattern | Notes |
|
||||
|--------|----------------------|-------|
|
||||
| `monitoring/prometheus.tf` | Existing PV/PVC with native NFS source | Change PV source from `nfs {}` to `csi {}` |
|
||||
| `monitoring/loki.tf` | Existing PV/PVC with native NFS source | Same |
|
||||
| `monitoring/grafana.tf` | Existing PV (alertmanager) with native NFS | Same |
|
||||
| `redis/main.tf` | Inline NFS or PV | Check current pattern |
|
||||
| `dbaas/` | PV for PostgreSQL, MySQL backup | Check current pattern |
|
||||
| `technitium/` | Inline NFS | Standard migration |
|
||||
| `headscale/` | Inline NFS | Standard migration |
|
||||
| `vaultwarden/` | Inline NFS | Standard migration |
|
||||
| `uptime-kuma/` | Inline NFS | Standard migration |
|
||||
| `mailserver/` | Inline NFS | Standard migration |
|
||||
| `infra-maintenance/` | Inline NFS | Standard migration |
|
||||
|
||||
**For existing PV/PVC resources** (monitoring stack), the change is different — replace the `persistent_volume_source` block:
|
||||
|
||||
```hcl
|
||||
# OLD (in prometheus.tf):
|
||||
persistent_volume_source {
|
||||
nfs {
|
||||
path = "/mnt/main/prometheus"
|
||||
server = var.nfs_server
|
||||
}
|
||||
}
|
||||
|
||||
# NEW:
|
||||
persistent_volume_source {
|
||||
csi {
|
||||
driver = "nfs.csi.k8s.io"
|
||||
volume_handle = "prometheus-data"
|
||||
volume_attributes = {
|
||||
server = var.nfs_server
|
||||
share = "/mnt/main/prometheus"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
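Also add `storage_class_name = "nfs-truenas"` to the PV spec, and set `mount_options = ["soft", "timeo=30", "retrans=3", "actimeo=5"]` there explicitly: a statically created PV does not inherit mount options from its StorageClass.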
|
||||
|
||||
**For inline NFS volumes** in platform modules, use the shared module with the longer path:
|
||||
|
||||
```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../../../modules/kubernetes/nfs_volume"
|
||||
name = "technitium-data"
|
||||
namespace = kubernetes_namespace.technitium.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/technitium"
|
||||
}
|
||||
```
|
||||
|
||||
**Apply as one platform apply:**
|
||||
|
||||
```bash
|
||||
cd stacks/platform && terragrunt apply --non-interactive
|
||||
```
|
||||
|
||||
**Verify all platform services:**
|
||||
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n monitoring
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n redis
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n dbaas
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n technitium
|
||||
# ... etc
|
||||
```
|
||||
|
||||
**Commit:**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/
|
||||
git commit -m "[ci skip] platform: migrate all NFS volumes to CSI PV/PVC with soft mount"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 9: Update Documentation and Skills
|
||||
|
||||
**Files:**
|
||||
- Modify: `.claude/CLAUDE.md` (update NFS Volume Pattern section)
|
||||
- Modify: `.claude/skills/setup-project/SKILL.md` (update new service template to use module)
|
||||
|
||||
**Step 1: Update CLAUDE.md NFS Volume Pattern**
|
||||
|
||||
Replace the existing NFS Volume Pattern section with:
|
||||
|
||||
```markdown
|
||||
### NFS Volume Pattern
|
||||
**Use the `nfs_volume` shared module** for all NFS volumes. This creates CSI-backed PV/PVC with soft mount options (no stale mount hangs):
|
||||
\```hcl
|
||||
module "nfs_data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "<service>-data" # Must be globally unique
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
nfs_server = var.nfs_server
|
||||
nfs_path = "/mnt/main/<service>"
|
||||
}
|
||||
|
||||
# In pod spec:
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_data.claim_name
|
||||
}
|
||||
}
|
||||
\```
|
||||
For platform modules, use `source = "../../../../modules/kubernetes/nfs_volume"`.
|
||||
|
||||
**Legacy pattern (DO NOT use for new services):** Inline `nfs {}` blocks mount with `hard,timeo=600` defaults which hang forever on stale mounts.
|
||||
```
|
||||
|
||||
**Step 2: Update setup-project skill**
|
||||
|
||||
Update the new service template in `.claude/skills/setup-project/SKILL.md` to use the module pattern instead of inline NFS.
|
||||
|
||||
**Step 3: Commit**
|
||||
|
||||
```bash
|
||||
git add .claude/
|
||||
git commit -m "[ci skip] update NFS volume documentation to use CSI-backed nfs_volume module"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 10: Validation — Simulate NFS Outage
|
||||
|
||||
**This is a manual verification step. Do NOT automate.**
|
||||
|
||||
After all services are migrated, simulate an NFS blip to confirm the stale mount fix works:
|
||||
|
||||
1. Pick a low-risk service (e.g., `privatebin`)
|
||||
2. On TrueNAS, temporarily block NFS to the K8s network (iptables rule or pause NFS for 30 seconds)
|
||||
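3. Observe: the pod should get I/O errors within roughly 9 seconds (`timeo=30` is 3 seconds, retried `retrans=3` times; client retry backoff can stretch this somewhat) instead of hanging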
|
||||
4. If the pod has a liveness probe that touches the filesystem, it should restart automatically
|
||||
5. After NFS recovers, verify the pod re-mounts cleanly
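The 9 second figure comes from simple arithmetic on the mount options, sketched below with a hypothetical function name. `timeo` is expressed in tenths of a second, and the client gives up after `retrans` attempts. The calculation ignores retry backoff, so it is a lower bound, not an exact prediction.

```bash
# Nominal soft-mount failure window in seconds: timeo (deciseconds) * retrans.
# Ignores the NFS client's retry backoff, so real failures may take longer.
nominal_soft_timeout_s() {
  timeo_ds="$1"
  retrans="$2"
  echo $(( timeo_ds * retrans / 10 ))
}
```

With the values from the `nfs-truenas` StorageClass, `nominal_soft_timeout_s 30 3` gives 9. The same arithmetic applied to the hard-mount default of `timeo=600` shows why a 30 second pause is a reasonable test length: a single retry interval there is already 60 seconds.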
|
||||
|
||||
**Do NOT run this on production without a maintenance window.** This is a "when you're ready" validation, not part of the automated migration.
|
||||
237
docs/plans/2026-03-01-traefik-resilience-design.md
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
# Traefik Resilience Hardening Design
|
||||
|
||||
**Date**: 2026-03-01
|
||||
**Status**: Approved
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Traefik is the single ingress point for 70+ services. It has downstream dependencies (ForwardAuth to Poison Fountain, ForwardAuth to Authentik) that are **fail-closed** with **unlimited timeouts**. If these dependencies go down or hang, the entire cluster's public-facing services return 502 or hang indefinitely.
|
||||
|
||||
Additionally, no PodDisruptionBudgets exist, all 3 Traefik replicas can land on the same node, and there are no retries for transient backend failures.
|
||||
|
||||
## Current State

### Dependency Map (Request Path)

```
Client → Cloudflare → MetalLB (10.0.20.202) → Traefik (1 of 3 replicas)
  → rate-limit .................... IN-PROCESS
  → csp-headers ................... IN-PROCESS
  → crowdsec (plugin) ............. FAIL-OPEN ✓ (already resilient)
  → ai-bot-block (ForwardAuth) .... FAIL-CLOSED ✗ (Poison Fountain)
  → anti-ai-headers ............... IN-PROCESS
  → strip-accept-encoding ......... IN-PROCESS
  → anti-ai-trap-links (plugin) ... IN-PROCESS
  → [if protected=true]:
      → authentik-forward-auth ....... FAIL-CLOSED ✗ (Authentik outpost)
  → Backend Service
```

### Risk Assessment

| Dependency | Fail Mode | Blast Radius | Likelihood | Mitigation |
|---|---|---|---|---|
| Poison Fountain (ai-bot-block) | FAIL-CLOSED | ALL services (default middleware) | Medium (tier 4-aux, 2 replicas) | NONE |
| Authentik (forward auth) | FAIL-CLOSED | Protected services (~4) | Low (3 replicas, tier 1-cluster) | Alert only |
| CrowdSec LAPI | FAIL-OPEN | None | Low | Fully configured |
| Response header timeout | Unlimited (0s) | ALL services (hung backend) | Medium | NONE |
| Pod scheduling | All on same node possible | ALL services | Medium | NONE |
| Node drain | Can evict all replicas | ALL services | During maintenance | NONE |

## Design

### 1. ForwardAuth Resilience (Nginx Resilience Proxies)

#### 1a. AI Bot Block → Fail-Open

Deploy a small nginx reverse proxy in front of Poison Fountain:
- Normal operation: proxies request to `poison-fountain:8080/auth`, returns its response
- Poison Fountain down: nginx catches 502/503/504, returns **200** (allow all traffic)
- The other 4 anti-AI layers (headers, trap links, tarpit, poison content) still work

Update the `ai-bot-block` ForwardAuth middleware to point at the nginx proxy instead of directly at Poison Fountain.

**Nginx config sketch:**
```nginx
upstream poison_fountain {
  server poison-fountain.poison-fountain.svc.cluster.local:8080;
}
server {
  listen 8080;
  location /auth {
    proxy_pass http://poison_fountain;
    proxy_connect_timeout 3s;
    proxy_read_timeout 5s;
    proxy_intercept_errors on;
    error_page 502 503 504 =200 /fallback-allow;
  }
  location = /fallback-allow {
    return 200;
  }
  location /healthz {
    return 200 "ok";
  }
}
```
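The fail-open rule above is narrow: only gateway errors are rewritten to 200, so an explicit block (403) from Poison Fountain still reaches Traefik unchanged. A minimal shell sketch of that mapping follows. `proxy_decision` is a hypothetical helper written for illustration only and is not part of the deployed config.

```bash
#!/usr/bin/env bash
# proxy_decision UPSTREAM_STATUS
# Hypothetical helper mirroring the error_page rule in the sketch:
# gateway errors (Poison Fountain unreachable, timed out, or unavailable)
# become 200; every other status is passed through unchanged.
proxy_decision() {
  case "$1" in
    502|503|504) echo 200 ;;
    *)           echo "$1" ;;
  esac
}

proxy_decision 403   # known AI crawler: still blocked
proxy_decision 504   # upstream timed out: request is allowed
```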
**Deployment**: 2 replicas, tier `0-core`, topology spread across nodes, minimal resources (10m CPU, 16Mi memory).

#### 1b. Authentik → BasicAuth Fallback

Deploy a similar nginx proxy in front of Authentik's outpost:
- Normal operation: proxies to `ak-outpost-...:9000`, returns Authentik's response (SSO)
- Authentik down: falls back to nginx `auth_basic` with htpasswd credentials from a Kubernetes secret
- Protected services remain accessible to admins via basicAuth during Authentik outages

Update the `authentik-forward-auth` middleware to point at the nginx proxy.

**Nginx config sketch:**
```nginx
upstream authentik {
  server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
}
server {
  listen 9000;
  location /outpost.goauthentik.io/auth/traefik {
    proxy_pass http://authentik;
    proxy_connect_timeout 3s;
    proxy_read_timeout 5s;
    proxy_intercept_errors on;
    error_page 502 503 504 = @fallback_auth;
  }
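  location @fallback_auth {
    auth_basic "Emergency Access";
    auth_basic_user_file /etc/nginx/htpasswd;
    # `return` runs in the rewrite phase, before auth_basic is checked,
    # so the 200 is issued from a second location reached via try_files.
    try_files /nonexistent @fallback_ok;
  }
  location @fallback_ok {
    # Return 200 with required headers if basicAuth passes
    add_header X-authentik-username $remote_user;
    return 200;
  }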
  location /healthz {
    return 200 "ok";
  }
}
```

**htpasswd secret**: Generated from existing admin credentials, stored in a Kubernetes secret, mounted into the nginx pod.

### 2. Pod Scheduling & Disruption Protection

#### 2a. Traefik Topology Spread + PDB

Add to Traefik Helm values:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: traefik

podDisruptionBudget:
  enabled: true
  minAvailable: 2
```
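With 3 replicas and `minAvailable: 2`, a node drain can evict one Traefik pod at a time and is blocked once only 2 pods are healthy. The arithmetic behind the `allowedDisruptions` value that `kubectl get pdb` reports can be sketched as below. `allowed_disruptions` is a hypothetical helper, and the sketch assumes an integer `minAvailable` rather than a percentage.

```bash
#!/usr/bin/env bash
# allowed_disruptions HEALTHY MIN_AVAILABLE
# Prints how many pods may be voluntarily evicted at once (never negative).
allowed_disruptions() {
  healthy=$1
  min_available=$2
  allowed=$(( healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # all replicas healthy: one eviction allowed
allowed_disruptions 2 2   # one replica already down: drains are blocked
```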
#### 2b. Authentik PDB

Add to Authentik Helm values:
```yaml
server:
  pdb:
    enabled: true
    minAvailable: 2
```

#### 2c. Poison Fountain Tier Bump

Change Poison Fountain namespace tier from `4-aux` to `1-cluster`:
- File: `stacks/poison-fountain/main.tf`
- Change: `tier = local.tiers.aux` → `tier = local.tiers.cluster`
- Effect: priority bumped from 200K to 800K, preemption enabled, LimitRange defaults change (512Mi default memory, max 4Gi)

### 3. Timeout & Backend Protection

#### 3a. Response Header Timeout

Change from unlimited to 30s:
```
--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s
```

Prevents hung backends from holding Traefik goroutines indefinitely.

#### 3b. ForwardAuth Proxy Timeouts

The nginx resilience proxies use 3s connect / 5s read timeouts. If the upstream doesn't respond within 5s, the fallback activates. This is much faster than waiting for the backend to eventually time out.

#### 3c. Retry Middleware

Add a `retry` middleware to the default chain in ingress_factory:
```yaml
retry:
  attempts: 2
  initialInterval: 100ms
```
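Handles transient failures from backends that are restarting. Traefik retries only when the backend does not reply (connection-level errors); once a backend answers, the response is returned as-is, whatever its status code, so 5xx responses are not retried.
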
### 4. Monitoring & Alerting

#### 4a. PoisonFountainDown Alert

```yaml
- alert: PoisonFountainDown
  expr: kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
```

#### 4b. Alert Inhibition

When `TraefikDown` fires, suppress `PoisonFountainDown`.

#### 4c. ForwardAuthFailing Alert

Track when the nginx resilience proxies are serving fallback responses (meaning the real auth services are down):

```yaml
- alert: ForwardAuthFailing
  expr: rate(nginx_upstream_responses_total{status_code="502"}[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "ForwardAuth fallback active - check Authentik/Poison Fountain"
```

(The exact metric depends on the nginx exporter configuration; a custom approach, such as logging fallback hits and counting them with promtail, may be needed.)

## Files to Modify

| File | Change |
|---|---|
| `stacks/platform/modules/traefik/main.tf` | Add topology spread, PDB, response header timeout |
| `stacks/platform/modules/traefik/middleware.tf` | Update ForwardAuth addresses to point at resilience proxies, add retry middleware |
| `stacks/poison-fountain/main.tf` | Change tier to `1-cluster`, add resilience proxy deployment |
| `stacks/platform/modules/authentik/main.tf` | Add PDB, add auth resilience proxy deployment |
| `modules/kubernetes/ingress_factory/main.tf` | Add retry middleware to default chain |
| `stacks/platform/modules/monitoring/prometheus_chart_values.tpl` | Add PoisonFountainDown alert, ForwardAuthFailing alert, alert inhibition |

## Out of Scope

- Circuit breakers (per-service complexity not worth it for homelab)
- Plugin pre-baking into Docker image (accepted risk)
- Active health checks on backends (K8s readiness probes sufficient)

## Rollback Plan

Each change is independent and can be reverted individually:
- Resilience proxies: revert ForwardAuth addresses back to direct service URLs
- PDBs: remove from Helm values
- Timeouts: revert to `0s`
- Retry middleware: remove from ingress_factory chain
- Alerts: remove from Prometheus config

941
docs/plans/2026-03-01-traefik-resilience-plan.md
Normal file
@ -0,0 +1,941 @@
# Traefik Resilience Hardening Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Make Traefik resilient against downstream dependency failures (ForwardAuth services, hung backends) while preventing pod scheduling and disruption issues.

**Architecture:** Deploy nginx resilience proxies in front of fail-closed ForwardAuth services (Poison Fountain, Authentik), add PodDisruptionBudgets, topology spread constraints, response timeouts, retry middleware, and monitoring alerts.

**Tech Stack:** Terraform/Terragrunt, Kubernetes, Nginx, Traefik CRDs, Prometheus

---
### Task 1: Bump Poison Fountain tier from aux to cluster

This is the simplest change and has no dependencies. Bumping the tier ensures Poison Fountain isn't evicted under memory pressure.

**Files:**
- Modify: `stacks/poison-fountain/main.tf:10` (namespace tier label)
- Modify: `stacks/poison-fountain/main.tf:52` (deployment tier label)

**Step 1: Change namespace tier**

In `stacks/poison-fountain/main.tf`, line 10, change:
```hcl
tier = local.tiers.aux
```
to:
```hcl
tier = local.tiers.cluster
```

**Step 2: Change deployment tier label**

In `stacks/poison-fountain/main.tf`, line 52, change:
```hcl
tier = local.tiers.aux
```
to:
```hcl
tier = local.tiers.cluster
```

**Step 3: Verify the plan**

Run:
```bash
cd stacks/poison-fountain && terragrunt plan --non-interactive 2>&1 | tail -30
```
Expected: Plan shows namespace and deployment label changes from `4-aux` to `1-cluster`. No resource destruction.

**Step 4: Apply**

Run:
```bash
cd stacks/poison-fountain && terragrunt apply --non-interactive
```

**Step 5: Verify the new LimitRange and PriorityClass**

Run:
```bash
kubectl --kubeconfig $(pwd)/config describe limitrange tier-defaults -n poison-fountain
kubectl --kubeconfig $(pwd)/config get pods -n poison-fountain -o jsonpath='{.items[*].spec.priorityClassName}'
```
Expected: LimitRange shows `1-cluster` defaults (512Mi default memory, max 4Gi). Priority class is `tier-1-cluster`.

**Step 6: Commit**

```bash
git add stacks/poison-fountain/main.tf
git commit -m "[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress)"
```

---
### Task 2: Deploy bot-block resilience proxy (nginx fail-open in front of Poison Fountain)

Deploy an nginx reverse proxy in the `traefik` namespace that proxies to Poison Fountain's `/auth` endpoint and returns 200 (allow) if Poison Fountain is unreachable.

**Files:**
- Modify: `stacks/platform/modules/traefik/main.tf` (add nginx deployment, service, configmap)
- Modify: `stacks/platform/modules/traefik/middleware.tf:287` (update ai-bot-block ForwardAuth address)

**Step 1: Add nginx configmap for bot-block proxy**

Add to end of `stacks/platform/modules/traefik/main.tf` (before the closing of the file):

```hcl
# Resilience proxy for ai-bot-block ForwardAuth
# Returns 200 (allow all) when Poison Fountain is unreachable
resource "kubernetes_config_map" "bot_block_proxy_config" {
  metadata {
    name = "bot-block-proxy-config"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "default.conf" = <<-EOT
      upstream poison_fountain {
        server poison-fountain.poison-fountain.svc.cluster.local:8080;
      }
      server {
        listen 8080;
        location /auth {
          proxy_pass http://poison_fountain;
          proxy_connect_timeout 3s;
          proxy_read_timeout 5s;
          proxy_send_timeout 5s;
          proxy_intercept_errors on;
          error_page 502 503 504 =200 /fallback-allow;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Forwarded-Proto $scheme;
        }
        location = /fallback-allow {
          internal;
          return 200 "allowed";
        }
        location /healthz {
          access_log off;
          return 200 "ok";
        }
      }
    EOT
  }
}
```

**Step 2: Add nginx deployment for bot-block proxy**

Add after the configmap:

```hcl
resource "kubernetes_deployment" "bot_block_proxy" {
  metadata {
    name = "bot-block-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "bot-block-proxy"
    }
  }

  spec {
    replicas = 2
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_unavailable = 0
        max_surge = 1
      }
    }
    selector {
      match_labels = {
        app = "bot-block-proxy"
      }
    }
    template {
      metadata {
        labels = {
          app = "bot-block-proxy"
        }
      }
      spec {
        topology_spread_constraint {
          max_skew = 1
          topology_key = "kubernetes.io/hostname"
          when_unsatisfiable = "DoNotSchedule"
          label_selector {
            match_labels = {
              app = "bot-block-proxy"
            }
          }
        }
        container {
          name = "nginx"
          image = "nginx:1-alpine"

          port {
            container_port = 8080
          }

          volume_mount {
            name = "config"
            mount_path = "/etc/nginx/conf.d"
            read_only = true
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 3
            period_seconds = 10
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 2
            period_seconds = 5
          }

          resources {
            requests = {
              cpu = "5m"
              memory = "16Mi"
            }
            limits = {
              cpu = "50m"
              memory = "32Mi"
            }
          }
        }

        volume {
          name = "config"
          config_map {
            name = kubernetes_config_map.bot_block_proxy_config.metadata[0].name
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "bot_block_proxy" {
  metadata {
    name = "bot-block-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "bot-block-proxy"
    }
  }

  spec {
    selector = {
      app = "bot-block-proxy"
    }
    port {
      name = "http"
      port = 8080
      target_port = 8080
    }
  }
}
```

**Step 3: Update ai-bot-block ForwardAuth address**

In `stacks/platform/modules/traefik/middleware.tf`, line 287, change:
```hcl
address = "http://poison-fountain.poison-fountain.svc.cluster.local:8080/auth"
```
to:
```hcl
address = "http://bot-block-proxy.traefik.svc.cluster.local:8080/auth"
```

**Step 4: Plan and verify**

Run:
```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"
```
Expected: 3 resources created (configmap, deployment, service), 1 resource updated (ai-bot-block middleware).

**Step 5: Apply**

Run:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 6: Verify the proxy is running and forwarding correctly**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=bot-block-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- http://localhost:8080/healthz
```
Expected: 2 pods Running. Health check returns "ok".

**Step 7: Test fail-open behavior**

Temporarily scale Poison Fountain to 0, verify the proxy returns 200:
```bash
kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=0
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/bot-block-proxy -- wget -qO- --timeout=10 http://localhost:8080/auth 2>&1
kubectl --kubeconfig $(pwd)/config scale deployment poison-fountain -n poison-fountain --replicas=2
```
Expected: With Poison Fountain at 0 replicas, the proxy returns 200 (fallback). After scaling back, normal forwarding resumes.

**Step 8: Commit**

```bash
git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf
git commit -m "[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down"
```
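The Step 7 fail-open test leaves Poison Fountain at 0 replicas if the middle command fails or the session is interrupted between commands. One way to guard against the first case is to wrap the three commands so the scale-up always runs. This is a sketch under assumptions: `fail_open_check` and the `KUBECTL` override are hypothetical names introduced here, and the commands themselves are the ones from Step 7.

```bash
#!/usr/bin/env bash
# Sketch only. KUBECTL is a hypothetical override hook: set KUBECTL=echo for a
# dry run, or KUBECTL="kubectl --kubeconfig /path/to/config" to match Step 7.
KUBECTL=${KUBECTL:-kubectl}

# fail_open_check [REPLICAS]
# Scales Poison Fountain to 0, queries the proxy, then restores REPLICAS.
fail_open_check() {
  replicas=${1:-2}   # replica count to restore afterwards
  status=0
  $KUBECTL scale deployment poison-fountain -n poison-fountain --replicas=0 || return 1
  # With no Poison Fountain pods left, the proxy should answer 200 "allowed".
  $KUBECTL exec -n traefik deploy/bot-block-proxy -- \
    wget -qO- --timeout=10 http://localhost:8080/auth || status=$?
  # Always scale back up, even when the check above failed.
  $KUBECTL scale deployment poison-fountain -n poison-fountain --replicas="$replicas" || status=$?
  return "$status"
}

# Usage: fail_open_check 2
```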
---
### Task 3: Deploy auth resilience proxy (nginx basicAuth fallback in front of Authentik)

Deploy an nginx proxy that forwards to Authentik's outpost and falls back to basicAuth when Authentik is unreachable.

**Files:**
- Modify: `stacks/platform/modules/traefik/main.tf` (add nginx deployment, service, configmap, htpasswd secret)
- Modify: `stacks/platform/modules/traefik/middleware.tf:36` (update authentik ForwardAuth address)
- Modify: `stacks/platform/modules/traefik/main.tf:1` (add variable for htpasswd)

**Step 1: Add htpasswd variable**

Add to top of `stacks/platform/modules/traefik/main.tf` (after existing variables):
```hcl
variable "auth_fallback_htpasswd" {
  type = string
  description = "htpasswd-format string for emergency basicAuth fallback when Authentik is down"
  sensitive = true
}
```

**Step 2: Generate htpasswd and add to terraform.tfvars**

Run (to generate a bcrypt htpasswd entry):
```bash
PASSWORD="$(openssl rand -base64 16)"; echo "password: $PASSWORD"   # record the plaintext somewhere safe
htpasswd -nbB admin "$PASSWORD"   # prints the admin:... entry for terraform.tfvars
```
Add the output to `terraform.tfvars`:
```hcl
auth_fallback_htpasswd = "admin:$2y$05$..." # Generated value
```

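An empty or malformed `auth_fallback_htpasswd` would leave the emergency login unusable, so it may be worth checking the generated entry before saving it. The following is a rough sketch that assumes the usual 60-character bcrypt layout (`$2y$`, a two-digit cost, then 53 salt and hash characters). `is_bcrypt_htpasswd` is a hypothetical helper, not an existing tool.

```bash
#!/usr/bin/env bash
# is_bcrypt_htpasswd ENTRY
# Succeeds when ENTRY looks like "user:$2y$NN$<53 chars>", i.e. the shape of a
# bcrypt line printed by `htpasswd -nbB`. It does not verify the password.
is_bcrypt_htpasswd() {
  printf '%s\n' "$1" | grep -Eq '^[^:]+:\$2[aby]\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
}

# Usage: is_bcrypt_htpasswd "$ENTRY" || echo "entry does not look like bcrypt" >&2
```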
**Step 3: Pass variable through platform module**

In `stacks/platform/main.tf`, find the traefik module block and add:
```hcl
auth_fallback_htpasswd = var.auth_fallback_htpasswd
```

Add to `stacks/platform/main.tf` variables (if not already present):
```hcl
variable "auth_fallback_htpasswd" {
  type = string
  sensitive = true
  default = ""
}
```

**Step 4: Add nginx configmap, secret, deployment, and service for auth proxy**

Add to end of `stacks/platform/modules/traefik/main.tf`:

```hcl
# Resilience proxy for Authentik ForwardAuth
# Falls back to basicAuth when Authentik is unreachable
resource "kubernetes_secret" "auth_proxy_htpasswd" {
  metadata {
    name = "auth-proxy-htpasswd"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "htpasswd" = var.auth_fallback_htpasswd
  }
}

resource "kubernetes_config_map" "auth_proxy_config" {
  metadata {
    name = "auth-proxy-config"
    namespace = kubernetes_namespace.traefik.metadata[0].name
  }

  data = {
    "default.conf" = <<-EOT
      upstream authentik {
        server ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000;
      }
      server {
        listen 9000;

        # Main auth endpoint - proxy to Authentik, fallback to basicAuth
        location /outpost.goauthentik.io/auth/traefik {
          proxy_pass http://authentik;
          proxy_connect_timeout 3s;
          proxy_read_timeout 5s;
          proxy_send_timeout 5s;
          proxy_intercept_errors on;
          error_page 502 503 504 = @fallback_auth;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Forwarded-Proto $scheme;
          proxy_set_header X-Original-URL $scheme://$http_host$request_uri;
        }

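        location @fallback_auth {
          auth_basic "Emergency Access";
          auth_basic_user_file /etc/nginx/htpasswd;
          # `return` runs in the rewrite phase, before auth_basic is checked,
          # so the 200 is issued from a second location reached via try_files.
          try_files /nonexistent @fallback_ok;
        }

        location @fallback_ok {
          add_header X-authentik-username $remote_user always;
          add_header X-Auth-Fallback "true" always;
          return 200;
        }
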
        # Pass through other outpost paths (for OAuth flows when Authentik IS up)
        location /outpost.goauthentik.io/ {
          proxy_pass http://authentik;
          proxy_connect_timeout 3s;
          proxy_read_timeout 10s;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Forwarded-Proto $scheme;
        }

        location /healthz {
          access_log off;
          return 200 "ok";
        }
      }
    EOT
  }
}

resource "kubernetes_deployment" "auth_proxy" {
  metadata {
    name = "auth-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "auth-proxy"
    }
  }

  spec {
    replicas = 2
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_unavailable = 0
        max_surge = 1
      }
    }
    selector {
      match_labels = {
        app = "auth-proxy"
      }
    }
    template {
      metadata {
        labels = {
          app = "auth-proxy"
        }
      }
      spec {
        topology_spread_constraint {
          max_skew = 1
          topology_key = "kubernetes.io/hostname"
          when_unsatisfiable = "DoNotSchedule"
          label_selector {
            match_labels = {
              app = "auth-proxy"
            }
          }
        }
        container {
          name = "nginx"
          image = "nginx:1-alpine"

          port {
            container_port = 9000
          }

          volume_mount {
            name = "config"
            mount_path = "/etc/nginx/conf.d"
            read_only = true
          }
          volume_mount {
            name = "htpasswd"
            mount_path = "/etc/nginx/htpasswd"
            sub_path = "htpasswd"
            read_only = true
          }

          liveness_probe {
            http_get {
              path = "/healthz"
              port = 9000
            }
            initial_delay_seconds = 3
            period_seconds = 10
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 9000
            }
            initial_delay_seconds = 2
            period_seconds = 5
          }

          resources {
            requests = {
              cpu = "5m"
              memory = "16Mi"
            }
            limits = {
              cpu = "50m"
              memory = "32Mi"
            }
          }
        }

        volume {
          name = "config"
          config_map {
            name = kubernetes_config_map.auth_proxy_config.metadata[0].name
          }
        }
        volume {
          name = "htpasswd"
          secret {
            secret_name = kubernetes_secret.auth_proxy_htpasswd.metadata[0].name
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "auth_proxy" {
  metadata {
    name = "auth-proxy"
    namespace = kubernetes_namespace.traefik.metadata[0].name
    labels = {
      app = "auth-proxy"
    }
  }

  spec {
    selector = {
      app = "auth-proxy"
    }
    port {
      name = "http"
      port = 9000
      target_port = 9000
    }
  }
}
```

**Step 5: Update authentik ForwardAuth address**

In `stacks/platform/modules/traefik/middleware.tf`, line 36, change:
```hcl
address = "http://ak-outpost-authentik-embedded-outpost.authentik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"
```
to:
```hcl
address = "http://auth-proxy.traefik.svc.cluster.local:9000/outpost.goauthentik.io/auth/traefik"
```

**Step 6: Plan and verify**

Run:
```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be created|will be updated|Plan:"
```
Expected: 4 resources created (secret, configmap, deployment, service), 1 resource updated (authentik-forward-auth middleware).

**Step 7: Apply**

Run:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 8: Verify proxy is running**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app=auth-proxy
kubectl --kubeconfig $(pwd)/config exec -n traefik deploy/auth-proxy -- wget -qO- http://localhost:9000/healthz
```
Expected: 2 pods Running. Health check returns "ok".

**Step 9: Commit**

```bash
git add stacks/platform/modules/traefik/main.tf stacks/platform/modules/traefik/middleware.tf stacks/platform/main.tf
git commit -m "[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down"
```

Note: Do NOT commit terraform.tfvars in this step (it contains the htpasswd secret and is git-crypt encrypted; it will be included in the next push automatically).

---
### Task 4: Add Traefik topology spread, PDB, and response timeout

**Files:**
- Modify: `stacks/platform/modules/traefik/main.tf:26-205` (Helm values)

**Step 1: Add topology spread constraints to Traefik Helm values**

In `stacks/platform/modules/traefik/main.tf`, after the `tolerations = []` line (line 204), add:

```hcl
topologySpreadConstraints = [{
  maxSkew = 1
  topologyKey = "kubernetes.io/hostname"
  whenUnsatisfiable = "DoNotSchedule"
  labelSelector = {
    matchLabels = {
      "app.kubernetes.io/name" = "traefik"
    }
  }
}]

podDisruptionBudget = {
  enabled = true
  minAvailable = 2
}
```

**Step 2: Change response header timeout**

In `stacks/platform/modules/traefik/main.tf`, line 184, change:
```hcl
"--serversTransport.forwardingTimeouts.responseHeaderTimeout=0s",
```
to:
```hcl
"--serversTransport.forwardingTimeouts.responseHeaderTimeout=30s",
```

**Step 3: Plan and verify**

Run:
```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
```
Expected: Helm release will be updated in-place.

**Step 4: Apply**

Run:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 5: Verify topology spread**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide
```
Expected: 3 pods on 3 different nodes.

**Step 6: Verify PDB**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get pdb -n traefik
```
Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.

**Step 7: Commit**

```bash
git add stacks/platform/modules/traefik/main.tf
git commit -m "[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout"
```
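The Step 5 expectation (3 pods on 3 different nodes) can also be checked by script instead of by eye. A minimal sketch follows. `spread_ok` is a hypothetical helper that reads one node name per line on stdin, and the usage comment reuses the namespace and label selector from Step 5.

```bash
#!/usr/bin/env bash
# spread_ok EXPECTED
# Reads one node name per line on stdin and succeeds only if the pods
# occupy exactly EXPECTED distinct nodes.
spread_ok() {
  expected=$1
  # sort -u collapses duplicates; grep -c . counts the non-empty lines left.
  distinct=$(sort -u | grep -c . || true)
  [ "$distinct" -eq "$expected" ]
}

# Usage against the cluster:
#   kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | spread_ok 3
```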
---
### Task 5: Add Authentik PDB

**Files:**
- Modify: `stacks/platform/modules/authentik/values.yaml`

**Step 1: Add PDB configuration to Authentik Helm values**

In `stacks/platform/modules/authentik/values.yaml`, add after the `server:` section (after line 33, before `global:`):

```yaml
  pdb:
    enabled: true
    minAvailable: 2
```

So the server section becomes:
```yaml
server:
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: "2"
      memory: 1Gi
  ingress:
    enabled: false
  podAnnotations:
    diun.enable: true
    diun.include_tags: "^202[0-9].[0-9]+.*$"
  pdb:
    enabled: true
    minAvailable: 2
```

**Step 2: Plan and verify**

Run:
```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
```
Expected: Helm release will be updated.

**Step 3: Apply**

Run:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 4: Verify PDB**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get pdb -n authentik
```
Expected: PDB with minAvailable=2, currentHealthy=3, allowedDisruptions=1.

**Step 5: Commit**

```bash
git add stacks/platform/modules/authentik/values.yaml
git commit -m "[ci skip] add Authentik PDB (minAvailable=2)"
```

---
### Task 6: Add retry middleware to ingress factory

**Files:**
- Modify: `stacks/platform/modules/traefik/middleware.tf` (add retry middleware)
- Modify: `modules/kubernetes/ingress_factory/main.tf:112-113` (add to default chain)

**Step 1: Add retry middleware CRD**

Add to end of `stacks/platform/modules/traefik/middleware.tf`:

```hcl
# Retry middleware for transient backend failures (502/503 during restarts)
resource "kubernetes_manifest" "middleware_retry" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind = "Middleware"
    metadata = {
      name = "retry"
      namespace = kubernetes_namespace.traefik.metadata[0].name
    }
    spec = {
      retry = {
        attempts = 2
        initialInterval = "100ms"
      }
    }
  }

  depends_on = [helm_release.traefik]
}
```

**Step 2: Add retry middleware to ingress factory default chain**

In `modules/kubernetes/ingress_factory/main.tf`, line 112, the middleware chain starts with rate-limit. Add retry as the first middleware (retries should wrap the entire chain):

Change line 112-113 from:
```hcl
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
  var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
```
to:
```hcl
"traefik.ingress.kubernetes.io/router.middlewares" = join(",", compact(concat([
  "traefik-retry@kubernetescrd",
  var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
```

**Step 3: Plan both stacks**

Run:
```bash
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
```
Expected: 1 resource created (retry middleware).

Note: The ingress_factory change will take effect the next time any service stack is applied (it's a module used by all stacks). The middleware CRD must exist first.

**Step 4: Apply platform stack**

Run:
```bash
cd stacks/platform && terragrunt apply --non-interactive
```

**Step 5: Verify retry middleware exists**

Run:
```bash
kubectl --kubeconfig $(pwd)/config get middleware -n traefik retry
```
Expected: Middleware `retry` exists.

**Step 6: Commit**

```bash
git add stacks/platform/modules/traefik/middleware.tf modules/kubernetes/ingress_factory/main.tf
git commit -m "[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain"
```

---
### Task 7: Add Prometheus alerts and inhibition rules
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`
|
||||
|
||||
**Step 1: Add PoisonFountainDown alert**
|
||||
|
||||
In `stacks/platform/modules/monitoring/prometheus_chart_values.tpl`, in the "Critical Services" alert group (after the AuthentikDown alert, around line 435), add:
|
||||
|
||||
```yaml
|
||||
- alert: PoisonFountainDown
|
||||
expr: (kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Poison Fountain is down - AI bot blocking degraded to fail-open"
|
||||
```
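The `or on() vector(0)` part of the expression is easy to misread, so the following Python sketch models what it is meant to achieve: when the metric has no samples at all (for example, the Deployment was deleted), the expression falls back to a single `0` and the alert still fires. The function names are illustrative and the model ignores labels and the `for: 2m` hold period.

```python
def available_or_zero(samples: list[float]) -> list[float]:
    """Model `metric or on() vector(0)`: fall back to a single 0 when the metric is absent."""
    return samples if samples else [0.0]


def alert_fires(samples: list[float]) -> bool:
    """Model the `< 1` comparison: fire if any resulting sample is below 1."""
    return any(value < 1 for value in available_or_zero(samples))


print(alert_fires([1.0]))  # one replica available
print(alert_fires([0.0]))  # deployment exists, nothing available
print(alert_fires([]))     # metric missing entirely
```

Without the fallback, a missing metric would produce an empty result and the alert would stay silent in exactly the case where the service is gone.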
|
||||
|
||||
**Step 2: Add ForwardAuthFallbackActive alert**
|
||||
|
||||
In the "Traefik Ingress" alert group (after the TraefikHighOpenConnections alert, around line 587), add:
|
||||
|
||||
```yaml
|
||||
- alert: ForwardAuthFallbackActive
|
||||
expr: |
|
||||
(kube_deployment_status_replicas_available{namespace="poison-fountain", deployment="poison-fountain"} or on() vector(0)) < 1
|
||||
or (kube_deployment_status_replicas_available{namespace="authentik", deployment="goauthentik-server"} or on() vector(0)) < 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "ForwardAuth resilience proxy is serving fallback responses - check Poison Fountain and Authentik"
|
||||
```
|
||||
|
||||
**Step 3: Add alert inhibition rule**
|
||||
|
||||
In the `inhibit_rules` section (around line 63), add after the existing TraefikDown inhibition:
|
||||
|
||||
```yaml
|
||||
# Traefik down makes Poison Fountain alerts redundant
|
||||
- source_matchers:
|
||||
- alertname = TraefikDown
|
||||
target_matchers:
|
||||
- alertname =~ "PoisonFountainDown|ForwardAuthFallbackActive"
|
||||
```
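As a rough illustration of what this inhibition rule does, the Python sketch below filters a set of firing alert names the way the rule is intended to behave. It is a simplification under stated assumptions: it looks only at `alertname`, ignores any `equal` labels (the rule above defines none), and the function name `notifiable` is made up for the example.

```python
import re

SOURCE_ALERT = "TraefikDown"
TARGET_PATTERN = re.compile(r"PoisonFountainDown|ForwardAuthFallbackActive")


def notifiable(firing: set[str]) -> set[str]:
    """Return the alerts that would still notify after applying the inhibition rule."""
    if SOURCE_ALERT not in firing:
        return set(firing)
    # Regex matchers are anchored, so the whole alert name must match.
    return {name for name in firing if not TARGET_PATTERN.fullmatch(name)}


print(sorted(notifiable({"TraefikDown", "PoisonFountainDown", "ForwardAuthFallbackActive"})))
print(sorted(notifiable({"PoisonFountainDown"})))
```

When Traefik itself is down, only `TraefikDown` notifies; when Traefik is healthy, the Poison Fountain alerts pass through unchanged.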
|
||||
|
||||
**Step 4: Plan and verify**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
cd stacks/platform && terragrunt plan --non-interactive 2>&1 | grep -E "will be|Plan:"
|
||||
```
|
||||
Expected: Helm release updated (Prometheus config changes).
|
||||
|
||||
**Step 5: Apply**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
cd stacks/platform && terragrunt apply --non-interactive
|
||||
```
|
||||
|
||||
**Step 6: Verify alerts are loaded**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/api/v1/rules 2>/dev/null | python3 -c "import sys,json; rules=[r['name'] for g in json.load(sys.stdin)['data']['groups'] for r in g['rules']]; print('PoisonFountainDown:', 'PoisonFountainDown' in rules); print('ForwardAuthFallbackActive:', 'ForwardAuthFallbackActive' in rules)"
|
||||
```
|
||||
Expected: Both alerts show `True`.
|
||||
|
||||
**Step 7: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/platform/modules/monitoring/prometheus_chart_values.tpl
|
||||
git commit -m "[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Task 8: Final verification and push
|
||||
|
||||
**Step 1: Run cluster health check**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
bash scripts/cluster_healthcheck.sh --quiet
|
||||
```
|
||||
Expected: No new WARN/FAIL related to our changes.
|
||||
|
||||
**Step 2: Verify all resilience proxies are running**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l "app in (bot-block-proxy,auth-proxy)" -o wide
|
||||
kubectl --kubeconfig $(pwd)/config get pods -n traefik -l app.kubernetes.io/name=traefik -o wide
|
||||
kubectl --kubeconfig $(pwd)/config get pdb -A
|
||||
```
|
||||
Expected: All proxy pods running on different nodes, Traefik pods spread across nodes, PDBs for Traefik and Authentik.
|
||||
|
||||
**Step 3: Test a public service is still accessible**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
curl -s -o /dev/null -w "%{http_code}" https://viktorbarzin.me
|
||||
```
|
||||
Expected: 200 (or 301/302 redirect). Not 502.
|
||||
|
||||
**Step 4: Push all commits**
|
||||
|
||||
Ask user for confirmation, then:
|
||||
```bash
|
||||
git push origin master
|
||||
```
|
||||
280
docs/plans/2026-03-02-security-observability-design.md
Normal file
|
|
@ -0,0 +1,280 @@
|
|||
# Security Observability Layer — Design Document
|
||||
|
||||
**Date**: 2026-03-02
|
||||
**Status**: Approved
|
||||
**Approach**: Tetragon-Centric (Approach A)
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The cluster has strong perimeter security (CrowdSec, Traefik middleware chain, Cloudflare WAF) and good monitoring (Prometheus, Loki, Grafana), but lacks:
|
||||
- Runtime security monitoring (syscall-level container activity)
|
||||
- Egress visibility (what pods connect to externally)
|
||||
- HTTPS inspection capability (even on-demand)
|
||||
- Network segmentation (no NetworkPolicies — any pod can reach any pod)
|
||||
- Firewall log centralization (pfSense logs not in Loki)
|
||||
- Unified security dashboard
|
||||
|
||||
## Requirements
|
||||
|
||||
- **Threat model**: Defense in depth — external attacks, compromised containers, lateral movement, data exfiltration
|
||||
- **TLS inspection**: Connection metadata (SNI/IP/bytes) by default, selective deep inspection on-demand
|
||||
- **Alerting**: Slack (existing channel)
|
||||
- **Resource budget**: <5GB RAM total for new tooling
|
||||
- **Enforcement**: Observe & alert now, enforce later
|
||||
- **CNI**: Calico (confirmed, with GlobalNetworkPolicy CRD support)
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ Existing Stack │
|
||||
│ Prometheus ← scrape ← Tetragon metrics │
|
||||
│ Loki ← Alloy ← Tetragon event logs │
|
||||
│ ← pfSense syslog │
|
||||
│ ← CoreDNS query logs │
|
||||
│ Grafana ← Unified Security Dashboard │
|
||||
│ Alertmanager → Slack │
|
||||
└─────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────┐ ┌──────────────────┐ ┌─────────────────────┐
|
||||
│ Tetragon │ │ Kyverno Policy │ │ mitmproxy │
|
||||
│ (DaemonSet) │ │ Reporter (1 pod) │ │ (on-demand, 1 pod) │
|
||||
│ eBPF agent │ │ │ │ HTTPS inspection │
|
||||
│ per node │ │ Violations → │ │ for suspect pods │
|
||||
│ │ │ Prometheus + │ │ │
|
||||
│ Monitors: │ │ Grafana │ │ Transparent proxy │
|
||||
│ • processes │ │ │ │ via NetworkPolicy │
|
||||
│ • network │ └──────────────────┘ └─────────────────────┘
|
||||
│ • files │
|
||||
│ • syscalls │ ┌──────────────────┐
|
||||
│ │ │ Inspektor Gadget │
|
||||
└─────────────┘ │ (temporary) │
|
||||
│ Auto-generate │
|
||||
│ NetworkPolicies │
|
||||
│ from observed │
|
||||
│ traffic baseline │
|
||||
└──────────────────┘
|
||||
|
||||
┌────────────────────────────────────────────────┐
|
||||
│ Calico NetworkPolicies │
|
||||
│ (Generated from baseline, enforced gradually) │
|
||||
│ Default deny egress + allow known connections │
|
||||
└────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Data Flows
|
||||
|
||||
1. **Tetragon** → Prometheus (metrics) + stdout → Alloy → Loki (events)
|
||||
2. **pfSense** → syslog UDP → Alloy syslog receiver → Loki
|
||||
3. **CoreDNS** → uncomment `log` → stdout → Alloy → Loki
|
||||
4. **Kyverno Policy Reporter** → Prometheus (violation metrics)
|
||||
5. **Grafana** ← queries all sources → Unified Security Dashboard
|
||||
6. **Alertmanager** → Slack (security-specific alert rules)
|
||||
|
||||
## Component Details
|
||||
|
||||
### 1. Tetragon (Runtime Security + Network Visibility)
|
||||
|
||||
**Purpose**: eBPF-based kernel-level monitoring of process execution, network connections, file access, and privilege escalation.
|
||||
|
||||
**Deployment**:
|
||||
- Helm chart: `cilium/tetragon` (CNCF project, part of Cilium ecosystem)
|
||||
- Type: DaemonSet on all 5 nodes
|
||||
- Resources: ~80-120MB RAM/node, ~50m CPU idle
|
||||
- Tier: `1-cluster`
|
||||
- Namespace: `tetragon`
|
||||
- New stack: `stacks/tetragon/`
|
||||
|
||||
**TracingPolicy CRDs** (what to monitor):
|
||||
|
||||
| Policy | Detects | Severity |
|
||||
|--------|---------|----------|
|
||||
| Privilege escalation | `setuid(0)`, `setgid(0)`, dangerous capabilities | Critical |
|
||||
| Reverse shell | Shell process with outbound connection to external IP | Critical |
|
||||
| Crypto miner | Connections to mining pool ports (3333, 14444, etc.) | Warning |
|
||||
| Container escape | `mount` syscalls, `/proc/self/ns/*` access, `nsenter` | Critical |
|
||||
| Sensitive file access | Reads of `/etc/shadow`, K8s service account tokens | Warning |
|
||||
| Unexpected egress | Outbound connections to non-private IPs (log all) | Info |
|
||||
| Unexpected binaries | Shells spawning in non-shell containers | Warning |
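
The "Unexpected egress" row depends on telling private destinations from public ones. A minimal sketch of that classification, assuming the destination address has already been extracted from a Tetragon event as a string (the function name `is_external` is illustrative, not part of Tetragon):

```python
import ipaddress


def is_external(destination: str) -> bool:
    """Return True when the destination is a globally routable address."""
    # is_global is False for RFC 1918 ranges, loopback, link-local and CGNAT space.
    return ipaddress.ip_address(destination).is_global


for address in ("10.0.20.100", "192.168.1.1", "8.8.8.8"):
    print(address, is_external(address))
```

Using `is_global` rather than a hand-written list of private ranges also treats carrier-grade NAT space (such as the range commonly used by Tailscale-style VPNs) as internal, which may or may not be what the policy should do.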
|
||||
|
||||
**Observe → Enforce path**:
|
||||
- Start: `TracingPolicy` (observe + alert only)
|
||||
- Later: add a `Sigkill` action (`matchActions`) to the same `TracingPolicy` resources so Tetragon can kill offending processes
|
||||
|
||||
**Integration**:
|
||||
- Prometheus metrics via pod annotations (auto-scraped by existing `kubernetes-pods` job)
|
||||
- Events as JSON to stdout → Alloy → Loki
|
||||
- New Prometheus alert rules for critical Tetragon events
|
||||
|
||||
### 2. pfSense Log Collection
|
||||
|
||||
**Purpose**: Centralize firewall logs into Loki for correlation with cluster security events.
|
||||
|
||||
**Implementation**:
|
||||
- Deploy a small syslog-receiver Deployment (1 replica) with a MetalLB LoadBalancer IP
|
||||
- Forward received syslog to Loki via `loki.write`
|
||||
- OR add `loki.source.syslog` to existing Alloy config
|
||||
- Configure pfSense: Status → System Logs → Settings → Remote Logging → point to syslog receiver IP:1514
|
||||
|
||||
**Recommended approach**: Dedicated syslog receiver Deployment (not Alloy DaemonSet) because:
|
||||
- Stable LoadBalancer IP for pfSense to target
|
||||
- Doesn't couple to a specific node
|
||||
- Can parse `filterlog` CSV format independently
|
||||
|
||||
**Parse pfSense filterlog**: Extract interface, action (pass/block), direction, source IP, dest IP, protocol, port into Loki labels.
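A minimal sketch of the extraction step, assuming the commonly documented comma-separated layout of an IPv4 TCP/UDP `filterlog` line. The field positions and the sample line are assumptions for illustration and should be checked against logs from the actual pfSense version; other protocols and IPv6 use different layouts and are rejected here.

```python
import csv

# Assumed field positions for an IPv4 TCP/UDP filterlog line.
FIELDS = {
    "interface": 4,
    "action": 6,
    "direction": 7,
    "protocol": 16,
    "src_ip": 18,
    "dst_ip": 19,
    "src_port": 20,
    "dst_port": 21,
}


def parse_filterlog(line: str) -> dict[str, str]:
    """Extract the fields named in the design from one filterlog CSV payload."""
    row = next(csv.reader([line]))
    if len(row) <= max(FIELDS.values()) or row[8] != "4" or row[16] not in {"tcp", "udp"}:
        raise ValueError("not an IPv4 TCP/UDP filterlog line")
    return {name: row[index] for name, index in FIELDS.items()}


# Made-up sample payload (documentation addresses), not taken from real logs.
sample = "5,,,1000000103,igb0,match,block,in,4,0x0,,64,0,0,DF,6,tcp,60,203.0.113.7,10.0.20.100,51234,443,0,S,1,,1024,,mss"
print(parse_filterlog(sample))
```

In practice only low-cardinality fields such as action, direction and interface are usually good label candidates; addresses and ports may be better kept in the log line itself.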
|
||||
|
||||
**Resource cost**: ~50-100MB for the syslog receiver pod.
|
||||
|
||||
### 3. CoreDNS Query Logging
|
||||
|
||||
**Purpose**: Detect DNS tunneling, C2 callbacks, unusual domain lookups.
|
||||
|
||||
**Implementation**: Uncomment `#log` → `log` in CoreDNS ConfigMap (`stacks/platform/modules/technitium/main.tf`).
|
||||
|
||||
**Scope**: Only enable on the main zone (`.`), NOT the `viktorbarzin.lan` zone (Technitium already logs those to MySQL).
|
||||
|
||||
**Alert rules for Loki**:
|
||||
- High NXDOMAIN rate from a single pod
|
||||
- DNS tunneling signatures (subdomain labels >40 chars)
|
||||
- Queries to known malicious TLDs
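The label-length signature above can be sketched in a few lines of Python. This is only an illustration of the heuristic, assuming the query name has already been extracted from a CoreDNS log line; the threshold constant mirrors the ">40 chars" rule and the function name is hypothetical.

```python
MAX_LABEL_LENGTH = 40  # threshold from the alert rule above


def looks_like_tunneling(query_name: str) -> bool:
    """Flag a query whose name contains any label longer than the threshold."""
    labels = query_name.rstrip(".").split(".")
    return any(len(label) > MAX_LABEL_LENGTH for label in labels)


print(looks_like_tunneling("grafana.viktorbarzin.me."))
print(looks_like_tunneling("a" * 41 + ".example.com."))
```

A length check alone will produce some false positives (long CDN or tracking hostnames), so in an alert rule it would likely be combined with a rate threshold per source pod.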
|
||||
|
||||
**Resource cost**: 0 additional (just increased log volume in Loki).
|
||||
|
||||
### 4. NetworkPolicy Strategy (Calico)
|
||||
|
||||
**Purpose**: Restrict pod-to-pod and pod-to-external traffic using Calico NetworkPolicies.
|
||||
|
||||
**Phased rollout**:
|
||||
|
||||
| Phase | Action | Timeline |
|
||||
|-------|--------|----------|
|
||||
| Observe | Deploy Inspektor Gadget, capture 24-48h traffic baseline | Week 1 |
|
||||
| Generate | `kubectl gadget advise network-policy` per namespace | Week 1 |
|
||||
| Review | Convert to Terraform `kubernetes_network_policy` resources | Week 2 |
|
||||
| Enforce (low-risk) | Apply to aux-tier namespaces first | Week 3 |
|
||||
| Enforce (all) | Gradually apply to edge, cluster, core tiers | Week 4+ |
|
||||
|
||||
**Key policies**:
|
||||
- Default deny egress for aux-tier namespaces
|
||||
- Allow DNS (port 53) + known external endpoints per service
|
||||
- Block inter-namespace traffic except known dependencies (redis, postgresql, loki)
|
||||
|
||||
**Inspektor Gadget**:
|
||||
- CNCF Sandbox project, ~80MB/node as DaemonSet
|
||||
- Temporary deployment — remove after baseline capture (~400MB total while running)
|
||||
- `kubectl gadget advise network-policy` auto-generates policies from observed traffic
|
||||
|
||||
**Resource cost**: 0 permanent (Calico already enforces). ~400MB temporary.
|
||||
|
||||
### 5. mitmproxy (On-Demand HTTPS Inspection)
|
||||
|
||||
**Purpose**: Deep HTTPS traffic inspection for specific suspicious pods during incident investigation.
|
||||
|
||||
**Deployment**:
|
||||
- Single-replica Deployment, **scaled to 0 by default**
|
||||
- Namespace: `mitmproxy`
|
||||
- New stack: `stacks/mitmproxy/`
|
||||
- Web UI at `mitmproxy.viktorbarzin.lan` (local-only access)
|
||||
|
||||
**Usage workflow**:
|
||||
1. Scale to 1: `kubectl scale deployment mitmproxy --replicas=1 -n mitmproxy`
|
||||
2. Apply Calico NetworkPolicy redirecting suspect pod's egress through mitmproxy
|
||||
3. Mount mitmproxy CA cert into target pod's trust store
|
||||
4. Inspect traffic via web UI
|
||||
5. Scale back to 0 when done
|
||||
|
||||
**Resource cost**: ~200MB when active, 0 when scaled to 0.
|
||||
|
||||
### 6. Kyverno Policy Reporter
|
||||
|
||||
**Purpose**: Surface Kyverno policy violations (currently in audit mode) in Grafana dashboards.
|
||||
|
||||
**Deployment**:
|
||||
- Add as sub-chart or separate Helm release in Kyverno stack
|
||||
- 1 replica Deployment
|
||||
- Exports metrics to Prometheus
|
||||
- ~50MB RAM
|
||||
|
||||
**Integration**:
|
||||
- Prometheus scrapes Policy Reporter metrics
|
||||
- Grafana dashboard shows violations by policy, namespace, severity
|
||||
|
||||
### 7. Unified Security Dashboard + Alert Rules
|
||||
|
||||
**Grafana Dashboard** layout:
|
||||
|
||||
| Row | Panels | Data Source |
|
||||
|-----|--------|-------------|
|
||||
| Overview | Active CrowdSec bans, Tetragon alerts/24h, Kyverno violations/24h, pfSense blocks/24h | Prometheus |
|
||||
| Attack Timeline | Combined time series of all security events | Prometheus |
|
||||
| Runtime Security | Suspicious processes, privilege escalations, file access alerts | Loki (Tetragon) |
|
||||
| Network | Top egress destinations by namespace, unusual DNS queries, pfSense blocks | Loki + Prometheus |
|
||||
| Policy | Kyverno violations by policy/namespace/severity | Prometheus (Policy Reporter) |
|
||||
|
||||
**New Prometheus Alert Rules**:
|
||||
|
||||
| Alert | Trigger | Severity |
|
||||
|-------|---------|----------|
|
||||
| `TetragonPrivilegeEscalation` | setuid(0) in non-system container | Critical |
|
||||
| `TetragonReverseShell` | Shell + outbound connection | Critical |
|
||||
| `TetragonCryptoMiner` | Connection to mining pool ports | Warning |
|
||||
| `TetragonUnexpectedEgress` | Pod → unexpected external IP | Warning |
|
||||
| `SuspiciousDNSQuery` | High NXDOMAIN rate or subdomain labels >40 chars | Warning |
|
||||
| `PfSenseHighBlockRate` | >100 blocks/min from single source | Warning |
|
||||
| `KyvernoViolationSpike` | >10 violations in 5 minutes | Warning |
|
||||
|
||||
## Resource Budget
|
||||
|
||||
| Component | Type | Steady-State RAM | Notes |
|
||||
|-----------|------|-----------------|-------|
|
||||
| Tetragon | DaemonSet (5 nodes) | ~500MB | Runtime security + egress |
|
||||
| Syslog receiver | Deployment (1) | ~75MB | pfSense logs |
|
||||
| Kyverno Policy Reporter | Deployment (1) | ~50MB | Violation metrics |
|
||||
| mitmproxy | Deployment (0/1) | 0 (200MB active) | On-demand only |
|
||||
| CoreDNS logging | Config change | 0 | More Loki volume |
|
||||
| Inspektor Gadget | Temporary DaemonSet | 0 (~400MB while running) | Removed after baseline |
|
||||
| **Total steady-state** | | **~625MB** | Well under 5GB budget |
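
The steady-state total can be re-derived from the per-component figures. The sketch below assumes 100MB per node for Tetragon (the midpoint of the ~80-120MB estimate given earlier) and treats the 5GB budget as 5 × 1024MB; the variable names are illustrative.

```python
NODES = 5
TETRAGON_MB_PER_NODE = 100  # assumed midpoint of the ~80-120MB/node estimate
BUDGET_MB = 5 * 1024

steady_state_mb = {
    "tetragon": NODES * TETRAGON_MB_PER_NODE,
    "syslog_receiver": 75,
    "kyverno_policy_reporter": 50,
    "mitmproxy": 0,         # scaled to 0 by default
    "coredns_logging": 0,   # config change only
    "inspektor_gadget": 0,  # removed after baseline capture
}

total_mb = sum(steady_state_mb.values())
print(f"steady-state: {total_mb}MB, headroom: {BUDGET_MB - total_mb}MB")
```

Even with mitmproxy (~200MB) and Inspektor Gadget (~400MB) running at the same time, the sum stays well below the budget.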
|
||||
|
||||
## Implementation Phases
|
||||
|
||||
### Phase 1: Core Observability (~550MB)
|
||||
1. Deploy Tetragon with TracingPolicy CRDs
|
||||
2. Enable CoreDNS query logging
|
||||
3. Deploy Kyverno Policy Reporter
|
||||
4. Add Prometheus alert rules for Tetragon events
|
||||
|
||||
### Phase 2: Log Centralization (+~75MB permanent)
|
||||
5. Deploy syslog receiver for pfSense logs
|
||||
6. Configure pfSense remote syslog
|
||||
7. Build unified Grafana security dashboard
|
||||
|
||||
### Phase 3: Network Segmentation (+0MB permanent, ~400MB temporary)
|
||||
8. Deploy Inspektor Gadget temporarily
|
||||
9. Capture 24-48h traffic baseline
|
||||
10. Generate and review NetworkPolicies
|
||||
11. Apply policies gradually (aux → edge → cluster → core)
|
||||
12. Remove Inspektor Gadget
|
||||
|
||||
### Phase 4: On-Demand Inspection (+0MB permanent)
|
||||
13. Deploy mitmproxy (scaled to 0)
|
||||
14. Document investigation workflow
|
||||
|
||||
## New Terraform Stacks
|
||||
|
||||
- `stacks/tetragon/` — Helm chart + TracingPolicy CRDs + Prometheus rules
|
||||
- `stacks/mitmproxy/` — On-demand HTTPS inspection proxy
|
||||
|
||||
## Modified Stacks
|
||||
|
||||
- `stacks/platform/modules/monitoring/` — Alloy syslog or syslog receiver, Grafana dashboard, alert rules
|
||||
- `stacks/platform/modules/technitium/` — CoreDNS log uncomment
|
||||
- `stacks/platform/modules/kyverno/` — Policy Reporter sub-chart
|
||||
|
||||
## Existing Stack (No Changes Needed)
|
||||
|
||||
- CrowdSec (IDS/IPS with Traefik bouncer) — already covers external attack detection
|
||||
- Prometheus + Alertmanager — alert routing infrastructure ready
|
||||
- Loki + Alloy — log pipeline ready, just needs new sources
|
||||
- Caretta — eBPF service map complements Tetragon's process-level view
|
||||
- GoFlow2 — NetFlow data complements Tetragon's connection tracking
|
||||
- Calico — CNI with full NetworkPolicy enforcement ready
|
||||
73
docs/plans/2026-03-03-cluster-hardening-design.md
Normal file
|
|
@ -0,0 +1,73 @@
|
|||
# Cluster Hardening Design
|
||||
|
||||
**Date**: 2026-03-03
|
||||
**Status**: Approved
|
||||
**Scope**: Service availability, failure detection, DNS HA
|
||||
|
||||
## Context
|
||||
|
||||
Reliability audit identified gaps in failure detection (most services lack health probes), NFS monitoring (backbone for 70+ services has no dedicated alerting), and DNS high availability (AXFR-based secondary doesn't sync settings/blocklists).
|
||||
|
||||
## Decisions
|
||||
|
||||
- No PDBs for now — revisit when adding more replicas
|
||||
- No NetworkPolicies in this phase — covered by security observability design
|
||||
- Replicate only critical infra (DNS); apps stay at 1 replica
|
||||
- Keep databases on NFS; harden via monitoring, not migration
|
||||
- Backup/DR items (MinIO, rsync, PBS, runbooks) deferred to a separate effort
|
||||
|
||||
## Items
|
||||
|
||||
### 1. etcd Backup Alerts — DONE
|
||||
|
||||
- `EtcdBackupStale`: fires critical if last successful backup > 36h
|
||||
- `EtcdBackupNeverSucceeded`: fires critical if backup has never completed
|
||||
- etcd backup image updated to `registry.k8s.io/etcd:3.6.5-0` (matches cluster)
|
||||
- Applied 2026-03-03
|
||||
|
||||
### 2. Liveness & Readiness Probes
|
||||
|
||||
Add HTTP probes to Terraform-managed deployments. Use conservative timing to avoid flooding services with probe requests and restarting pods on brief blips:
|
||||
- `periodSeconds: 30`
|
||||
- `failureThreshold: 5` (150s before restart)
|
||||
- `initialDelaySeconds: 15`
|
||||
- `timeoutSeconds: 5`
|
||||
|
||||
Use known health endpoints where available, fall back to `GET /` on container port.
|
||||
Start with tier-0/tier-1 services, then extend to tier-3/tier-4.
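The "150s before restart" figure follows directly from the probe settings. A small sketch of the arithmetic, as an approximation that ignores the per-probe timeout and scheduling jitter (the function name is illustrative):

```python
PERIOD_SECONDS = 30
FAILURE_THRESHOLD = 5
INITIAL_DELAY_SECONDS = 15


def seconds_until_restart(period: int, failure_threshold: int) -> int:
    """Approximate time from the first failed liveness probe to a container restart."""
    return period * failure_threshold


print(seconds_until_restart(PERIOD_SECONDS, FAILURE_THRESHOLD))
```

The same product also bounds how long a readiness failure takes to remove a pod from Service endpoints, which is worth keeping in mind for user-facing services.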
|
||||
|
||||
### 3. NFS Health Monitoring
|
||||
|
||||
- **Prometheus alert**: `NFSServerDown` via blackbox exporter TCP probe on `10.0.10.15:2049`, fires critical after 2 minutes
|
||||
- **Uptime Kuma**: TCP monitor on `10.0.10.15:2049`
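Both monitors boil down to a TCP connect check against the NFS port. A minimal sketch of the equivalent check in Python, useful for ad-hoc testing from a workstation or pod; it is not how blackbox exporter or Uptime Kuma are implemented, and the helper name is hypothetical.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # NFS server address and port taken from the alert definition above.
    print("NFS reachable:", tcp_port_open("10.0.10.15", 2049))
```

A successful TCP connect only shows that the port is accepting connections; it does not prove that exports are mountable or responsive.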
|
||||
|
||||
### 4. Technitium DNS Clustering
|
||||
|
||||
Migrate from AXFR zone transfers to Technitium's built-in clustering:
|
||||
|
||||
**Architecture change**:
|
||||
- Convert primary + secondary Deployments → single StatefulSet with 2 replicas
|
||||
- Add headless Service for stable pod DNS names
|
||||
- Separate NFS volumes per replica (existing pattern preserved)
|
||||
|
||||
**Clustering setup**:
|
||||
- Cluster domain: `dns.viktorbarzin.lan` (permanent)
|
||||
- Pod-0: primary (`/api/admin/cluster/init`)
|
||||
- Pod-1: secondary (`/api/admin/cluster/initJoin`)
|
||||
- HTTPS auto-enabled with self-signed certs (internal only)
|
||||
- One-shot setup Job after StatefulSet is running
|
||||
|
||||
**What clustering syncs** (vs AXFR which only syncs zone records):
|
||||
- Zones (via catalog zone — auto-syncs new zones)
|
||||
- Blocklists and allowed lists
|
||||
- DNS applications and their configs
|
||||
- Users, groups, permissions, API tokens
|
||||
- Settings
|
||||
|
||||
**Requires maintenance window**: brief DNS outage during StatefulSet migration.
|
||||
|
||||
## Implementation Order
|
||||
|
||||
1. NFS health monitoring (low effort, no disruption)
|
||||
2. Health probes (medium effort, rolling restarts)
|
||||
3. Technitium clustering (high effort, requires maintenance window)
|
||||
210
docs/plans/2026-03-07-k8s-portal-onboarding-plan.md
Normal file
|
|
@ -0,0 +1,210 @@
|
|||
# K8s Portal Onboarding Hub — Implementation Plan (v2)
|
||||
|
||||
## Goals
|
||||
1. Fix broken kubeconfig/OIDC setup script (users can't connect)
|
||||
2. Add markdown-driven onboarding hub for non-technical users
|
||||
3. Complete contributor onboarding (git, PR workflow, Codex setup)
|
||||
|
||||
---
|
||||
|
||||
## Part 1: Fix Setup Script Bugs
|
||||
|
||||
### Bug 1 — Empty CA cert (CRITICAL)
|
||||
**Root cause**: ConfigMap `k8s-portal-config` has `ca.crt = ""`. The kubeconfig gets empty `certificate-authority-data`, causing TLS failures.
|
||||
|
||||
**Fix**:
|
||||
1. Extract K8s API CA cert: `kubectl get configmap -n kube-system kube-root-ca.crt -o jsonpath='{.data.ca\.crt}'`
|
||||
2. Verify it matches the API server cert: `openssl s_client -connect 10.0.20.100:6443 -showcerts </dev/null 2>/dev/null | openssl x509 -issuer -noout`, then compare the issuer with the CA cert subject
|
||||
3. Add `variable "k8s_ca_cert" { type = string }` to `main.tf`
|
||||
4. Add the cert value to `config.tfvars` (it's public, not a secret)
|
||||
5. Use in ConfigMap: `"ca.crt" = var.k8s_ca_cert`
|
||||
6. Pass through `stacks/platform/main.tf` module call
|
||||
|
||||
**Double-base64 risk**: The Node.js code does `Buffer.from(caCert).toString('base64')` on the PEM text. This creates base64-of-PEM, which is what kubectl expects in `certificate-authority-data`. Verified: this is the standard kubeconfig format written by `kubectl config set-cluster --certificate-authority=<file> --embed-certs=true`. Double encoding only happens if the ConfigMap value is itself already base64-encoded, so `ca.crt` must hold the raw PEM text.
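To check for double encoding after the fix, the encoding step can be reproduced outside the portal. The sketch below mirrors the Node.js `Buffer.from(caCert).toString('base64')` call in Python, using a placeholder PEM string rather than a real certificate; both function names are illustrative.

```python
import base64

# Placeholder PEM text for illustration only, not a real certificate.
PEM = "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n"


def to_ca_data(ca_pem: str) -> str:
    """Equivalent of Buffer.from(caCert).toString('base64') on the PEM text."""
    return base64.b64encode(ca_pem.encode()).decode()


def is_double_encoded(ca_data: str) -> bool:
    """Decoding once should yield PEM text; anything else suggests double encoding."""
    return not base64.b64decode(ca_data).lstrip().startswith(b"-----BEGIN")


good = to_ca_data(PEM)
bad = to_ca_data(to_ca_data(PEM))  # what happens if ca.crt already held base64
print(is_double_encoded(good), is_double_encoded(bad))
```

The same check can be applied to the `certificate-authority-data` value in a downloaded kubeconfig during the end-to-end test.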
|
||||
|
||||
### Bug 2 — Missing VPN prerequisite
|
||||
**Root cause**: Kubeconfig points to `https://10.0.20.100:6443` (internal IP). No VPN = no connection.
|
||||
|
||||
**Fix**: Add VPN setup as step 0 in both:
|
||||
- The existing homepage (`+page.svelte`) — prominent callout box
|
||||
- The new onboarding page — full enrollment instructions
|
||||
|
||||
### Bug 3 — Headscale enrollment is admin-gated
|
||||
**Fix**: Document the complete flow:
|
||||
1. User installs Tailscale app
|
||||
2. User runs `tailscale login --login-server https://headscale.viktorbarzin.me`
|
||||
3. User sends the registration URL to Viktor (via Slack/email — provide contact)
|
||||
4. Viktor approves on Headscale
|
||||
5. User is now on the VPN
|
||||
|
||||
### Bug 4 — `kubectl get pods` vs `kubectl get namespaces`
|
||||
**Fix**: Change homepage `+page.svelte` to say `kubectl get namespaces` (consistent with setup script).
|
||||
|
||||
### Bug 5 — Unused `openid` scope fix
|
||||
**NOT a bug**: kubelogin always adds `openid` automatically. Remove from the plan. The real investigation is: verify Authentik's `kubernetes` OIDC provider returns `groups` claim in the ID token.
|
||||
|
||||
### Bug 6 — Heredoc quoting no-op
|
||||
**Fix**: Remove the useless `escapedKubeconfig` replace on line 49 of `script/+server.ts` — the quoted heredoc delimiter makes it irrelevant.
|
||||
|
||||
### Files to Modify
|
||||
- `stacks/platform/modules/k8s-portal/main.tf` — add `k8s_ca_cert` variable, update ConfigMap
|
||||
- `stacks/platform/main.tf` — pass `k8s_ca_cert` to module
|
||||
- `config.tfvars` — add the CA cert value
|
||||
- `files/src/routes/setup/script/+server.ts` — remove useless quote escaping
|
||||
- `files/src/routes/download/+server.ts` — same CA cert fix applies here (identical code)
|
||||
- `files/src/routes/+page.svelte` — add VPN callout, fix verification command
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Content System — Skip mdsvex, Use Direct Svelte
|
||||
|
||||
### Why NOT mdsvex
|
||||
- Svelte 5.53.0 broke mdsvex (unresolved as of today)
|
||||
- Requires pinning Svelte to <5.53, which conflicts with security updates
|
||||
- Runes mode in layouts is broken in mdsvex
|
||||
- The content is 5 small pages authored by one person — mdsvex is overkill
|
||||
- Build complexity and image size increase for minimal benefit
|
||||
|
||||
### Alternative: Write content directly in Svelte components
|
||||
Each content page is a Svelte component with inline HTML/text:
|
||||
```svelte
|
||||
<!-- src/routes/onboarding/+page.svelte -->
|
||||
<article class="content">
|
||||
<h1>Getting Started</h1>
|
||||
<p>Welcome! Follow these steps...</p>
|
||||
...
|
||||
</article>
|
||||
```
|
||||
|
||||
**Advantages**:
|
||||
- Zero new dependencies
|
||||
- Works with any Svelte 5 version
|
||||
- Content is still just HTML/text in clearly named files
|
||||
- Can add Svelte interactivity later (copy buttons, progress tracking)
|
||||
|
||||
**Trade-off**: Content edits require touching `.svelte` files instead of `.md`. For 5 pages maintained by one person (or an AI), this is fine. If content grows significantly, revisit mdsvex later when Svelte 5 compatibility is stable.
|
||||
|
||||
### Shared Content Styling
|
||||
Create `src/lib/content.css` with the docs-style layout:
|
||||
```css
|
||||
.content { max-width: 768px; margin: 2rem auto; font-family: system-ui; line-height: 1.6; }
|
||||
.content h1 { border-bottom: 1px solid #e0e0e0; padding-bottom: 0.5rem; }
|
||||
.content pre { background: #1e1e1e; color: #d4d4d4; padding: 1rem; border-radius: 6px; }
|
||||
.content code { background: #f0f0f0; padding: 2px 6px; border-radius: 3px; }
|
||||
.content .callout { background: #fff3cd; border-left: 4px solid #ffc107; padding: 1rem; margin: 1rem 0; }
|
||||
.content .danger { background: #f8d7da; border-left: 4px solid #dc3545; }
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Part 3: Route Structure
|
||||
|
||||
```
|
||||
src/routes/
|
||||
├── +layout.svelte ← Nav bar (Home, Onboarding, Architecture, Services, Contributing, Troubleshooting)
|
||||
├── +page.svelte ← Identity + VPN callout + Get Started (UPDATED)
|
||||
├── onboarding/+page.svelte ← Step-by-step guide
|
||||
├── architecture/+page.svelte ← How the cluster works
|
||||
├── services/+page.svelte ← Service catalog
|
||||
├── contributing/+page.svelte ← PR workflow
|
||||
├── troubleshooting/+page.svelte ← Common issues
|
||||
├── setup/+page.svelte ← Existing kubectl install
|
||||
├── setup/script/+server.ts ← Existing auto-setup (FIXED)
|
||||
└── download/+server.ts ← Existing kubeconfig download (FIXED)
|
||||
```
|
||||
|
||||
### Navigation Layout (`+layout.svelte`)
|
||||
Simple horizontal nav, active page highlighted:
|
||||
```svelte
|
||||
<nav>
|
||||
<a href="/">Home</a>
|
||||
<a href="/onboarding">Getting Started</a>
|
||||
<a href="/architecture">Architecture</a>
|
||||
<a href="/services">Services</a>
|
||||
<a href="/contributing">Contributing</a>
|
||||
<a href="/troubleshooting">Help</a>
|
||||
</nav>
|
||||
<slot />
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Part 4: Page Content
|
||||
|
||||
### `/onboarding` — Getting Started (non-technical, step-by-step)
|
||||
|
||||
**Step 0 — Join the VPN**
|
||||
- "The cluster is on a private network. You need VPN access first."
|
||||
- Install Tailscale: link to tailscale.com/download
|
||||
- Run: `tailscale login --login-server https://headscale.viktorbarzin.me`
|
||||
- "This will open a browser with a registration URL. Send that URL to Viktor via [Slack/email]. He'll approve your device within a few hours."
|
||||
- "Once approved, you're connected! Test: `ping 10.0.20.100`"
|
||||
|
||||
**Step 1 — Log in to the portal**
|
||||
- "Visit https://k8s-portal.viktorbarzin.me and sign in with your Authentik account"
|
||||
- "If you don't have an account, ask Viktor to create one"
|
||||
|
||||
**Step 2 — Set up kubectl**
|
||||
- macOS: `bash <(curl -fsSL "https://k8s-portal.viktorbarzin.me/setup/script?os=mac")`
|
||||
- Linux: `bash <(curl -fsSL "https://k8s-portal.viktorbarzin.me/setup/script?os=linux")`
|
||||
- Windows: "Use WSL2 and follow the Linux instructions"
|
||||
- macOS prerequisite: "Requires Homebrew. Install it first if you don't have it: [link]"
|
||||
|
||||
**Step 3 — Verify access**
|
||||
- Run: `kubectl get namespaces`
|
||||
- "This will open a browser for you to log in. After login, you should see a list of namespaces."
|
||||
- Show expected output example
|
||||
|
||||
**Step 4 — Clone the repo**
|
||||
- `git clone https://github.com/ViktorBarzin/infra.git`
|
||||
|
||||
**Step 5 — Install your AI assistant (optional)**
|
||||
- Install Codex: `npm install -g @openai/codex`
|
||||
- "Codex reads AGENTS.md from the repo and knows how to work with the cluster"
|
||||
|
||||
**Step 6 — Your first change**
|
||||
- Walk-through: create branch, edit a file, push, open PR, watch CI
|
||||
|
||||
### `/architecture` — How It Works
|
||||
- Simplified: "Proxmox runs VMs → VMs form a K8s cluster → services run as pods"
|
||||
- Storage, networking, DNS in plain English
|
||||
- Tier system: "critical services restart first, optional services restart last"
|
||||
|
||||
### `/services` — What's Running
|
||||
- Table: service name, URL, what it does
|
||||
- Top services highlighted (Nextcloud, Grafana, Uptime Kuma, etc.)
|
||||
|
||||
### `/contributing` — How to Contribute
|
||||
- Branch → edit → PR → review → CI applies
|
||||
- "What you CAN change" vs "what needs Viktor's review"
|
||||
- The NEVER list (kubectl apply, secrets in plaintext, NFS restart)
|
||||
|
||||
### `/troubleshooting` — Common Issues
|
||||
- "Can't connect to the cluster" → VPN + KUBECONFIG
|
||||
- "Permission denied on kubectl" → namespace access
|
||||
- "Pod is crashing" → check logs
|
||||
- "PR CI failed" → read Woodpecker logs
|
||||
- "Need a new secret" → ask Viktor
|
||||
|
||||
---
|
||||
|
||||
## Part 5: Build & Deploy
|
||||
|
||||
1. Make code changes (bug fixes + new pages)
|
||||
2. Run locally: `cd files && npm install && npm run dev`, then verify all pages
|
||||
3. Test kubeconfig: verify CA cert is present and valid
|
||||
4. Build Docker image: `docker build -t viktorbarzin/k8s-portal:latest .`
|
||||
5. Push to registry
|
||||
6. `terragrunt apply` to deploy
|
||||
7. End-to-end test on a fresh machine
|
||||
|
||||
---
|
||||
|
||||
## Implementation Order
|
||||
1. Fix CA cert (immediate — unblocks setup script)
|
||||
2. Fix homepage (VPN callout, correct verification command)
|
||||
3. Remove useless heredoc escaping
|
||||
4. Add nav layout
|
||||
5. Create 5 content pages (onboarding, architecture, services, contributing, troubleshooting)
|
||||
6. Build, push, deploy
|
||||
7. End-to-end test
|
||||
366
docs/plans/2026-03-07-sops-migration-design.md
Normal file
|
|
@ -0,0 +1,366 @@
|
|||
# SOPS Multi-User Secrets Migration — Design Document (v3)
|
||||
|
||||
## Goal
|
||||
Enable non-technical operators to manage cluster services via PR → review → merge → CI apply, without access to secrets. Viktor retains full local apply capability.
|
||||
|
||||
## Current State
|
||||
- **terraform.tfvars**: 211 variables (mix of secrets + non-secret config), git-crypt encrypted as a whole
|
||||
- **secrets/**: TLS certs, deploy keys, NFS config — git-crypt encrypted (binary files)
|
||||
- **.gitattributes**: encrypts `*.tfvars`, `*.tfstate`, `secrets/**`
|
||||
- **Woodpecker CI**: unlocks git-crypt via K8s ConfigMap, applies `stacks/platform/` on push
|
||||
- **Terragrunt**: loads `terraform.tfvars` via `required_var_files` for all stacks
|
||||
|
||||
## Design
|
||||
|
||||
### 1. Split terraform.tfvars into Two Files
|
||||
|
||||
**`config.tfvars`** (NOT encrypted — committed in plaintext):
|
||||
Non-secret configuration that operators need to read/edit:
|
||||
- `nfs_server`, `redis_host`, `postgresql_host`, `mysql_host`, `ollama_host`, `mail_host`
|
||||
- `bind_db_viktorbarzin_me`, `bind_db_viktorbarzin_lan`, `bind_named_conf_options`
|
||||
- `tls_secret_name`, `client_certificate_secret_name`
|
||||
- WireGuard peer **public** keys and AllowedIPs only — **NOT** `wireguard_wg_0_conf` (contains private key inline), NOT any `PrivateKey` fields
|
||||
- Cloudflare DNS zone definitions (record names, not tokens)
|
||||
|
||||
**`secrets.sops.json`** (SOPS-encrypted, per-value, JSON format):
|
||||
All actual secrets, including complex types. JSON format chosen because:
|
||||
- `sops -d` outputs the same format as input — JSON in, JSON out
|
||||
- Terraform natively supports `*.auto.tfvars.json` files
|
||||
- JSON supports all Terraform types: strings, maps, lists, nested objects
|
||||
- No format conversion needed in the decryption pipeline
|
||||
|
||||
**Complex types** in JSON (these are NOT flat strings):
|
||||
```json
{
  "hackmd_db_password": "simple-string-secret",
  "mailserver_accounts": {
    "info@viktorbarzin.me": "password1",
    "admin@viktorbarzin.me": "password2"
  },
  "homepage_credentials": {
    "technitium": {"token": "abc123"},
    "crowdsec": {"username": "user", "password": "pass"}
  },
  "k8s_users": {
    "viktor": {"role": "admin", "email": "v@example.com", "namespaces": []}
  },
  "xray_reality_clients": [
    {"id": "uuid-here", "flow": "xtls-rprx-vision"}
  ],
  "webhook_handler_ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n",
  "wireguard_wg_0_conf": "[Interface]\nPrivateKey = ...\nAddress = ...\n\n[Peer]\n..."
}
```
|
||||
|
||||
### 2. SOPS Configuration
|
||||
|
||||
```yaml
# .sops.yaml
creation_rules:
  - path_regex: ^secrets\.sops\.json$
    age: >-
      age1viktor_public_key,
      age1ci_public_key
```
|
||||
|
||||
Path regex anchored to repo root (`^`). All secrets encrypted to Viktor + CI.
|
||||
|
||||
### 3. Terragrunt Changes
|
||||
|
||||
```hcl
# terragrunt.hcl - updated variable loading
terraform {
  extra_arguments "common_vars" {
    commands = get_terraform_commands_that_need_vars()
    required_var_files = [
      "${get_repo_root()}/config.tfvars"
    ]
  }

  extra_arguments "secrets" {
    commands = get_terraform_commands_that_need_vars()
    optional_var_files = [
      "${get_repo_root()}/secrets.auto.tfvars.json"
    ]
  }

  # Safety check: fail loudly if secrets file is missing (prevents silent apply with empty secrets)
  before_hook "check_secrets" {
    commands = ["apply", "plan", "destroy"]
    execute  = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
  }
}
```
|
||||
|
||||
**Global decrypt-once wrapper** (run instead of raw terragrunt):
|
||||
```bash
#!/usr/bin/env bash
# scripts/tg - wrapper: decrypt then terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
SOPS_FILE="$REPO_ROOT/secrets.sops.json"
OUT_FILE="$REPO_ROOT/secrets.auto.tfvars.json"

# Decrypt to a temp file first, then move into place, so a failed
# decryption never leaves a partial secrets file behind.
if [ ! -f "$OUT_FILE" ] && [ -f "$SOPS_FILE" ]; then
  TEMP=$(mktemp "$OUT_FILE.XXXXXX")
  trap "rm -f '$TEMP'" EXIT
  sops -d "$SOPS_FILE" > "$TEMP"
  mv "$TEMP" "$OUT_FILE"
  echo "Decrypted secrets → secrets.auto.tfvars.json"
fi

exec terragrunt "$@"
```
|
||||
|
||||
Usage: `scripts/tg apply --non-interactive` instead of `terragrunt apply --non-interactive`.
|
||||
|
||||
**Why not before_hook/after_hook for decryption?** When using `run --all`, each of 70+ stacks would run hooks in parallel, all writing to the same file — race condition. The wrapper decrypts once.
|
||||
|
||||
**Why before_hook for the existence check?** It's read-only (just `test -f`) — safe in parallel. Fails loudly if someone forgets to decrypt, instead of silently applying with empty secrets.
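
One gap worth noting in the wrapper as written: it decrypts only when `secrets.auto.tfvars.json` is missing, so a stale decrypted copy can survive after `secrets.sops.json` changes (for example after a `git pull`). The following is a minimal sketch of a freshness check, assuming bash; the `needs_decrypt` helper name is hypothetical and not part of the current `scripts/tg`.

```bash
#!/usr/bin/env bash
# Hypothetical helper for scripts/tg (not in the original wrapper).
# Returns 0 (true) when the decrypted file is missing or older than the
# encrypted source, and 1 when no decryption is needed or possible.
needs_decrypt() {
  local sops_file="$1" out_file="$2"
  if [ ! -f "$sops_file" ]; then
    return 1                       # nothing to decrypt from
  fi
  if [ ! -f "$out_file" ]; then
    return 0                       # never decrypted
  fi
  [ "$sops_file" -nt "$out_file" ] # source changed since last decrypt
}
```

In the wrapper, the existing condition could then become `if needs_decrypt "$SOPS_FILE" "$OUT_FILE"; then ... fi`, leaving the temp-file-and-move logic unchanged.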
|
||||
|
||||
### 4. File Protection
|
||||
|
||||
**.gitignore** (add these entries):
|
||||
```
/secrets.auto.tfvars.json
/secrets.auto.tfvars.json.*
```
|
||||
|
||||
**.gitattributes** changes (done atomically in Phase 4):
|
||||
```
# KEEP for binary files
secrets/** filter=git-crypt diff=git-crypt
*.tfstate filter=git-crypt diff=git-crypt

# REMOVED: *.tfvars filter=git-crypt diff=git-crypt
```
|
||||
|
||||
### 5. Woodpecker CI Pipeline Changes
|
||||
|
||||
**default.yml**:
|
||||
```yaml
steps:
  - name: prepare
    image: alpine
    commands:
      - "apk update && apk add jq curl git git-crypt"
      # git-crypt for secrets/ directory (TLS certs, deploy key)
      # Note: K8s Secret .data values are base64-encoded by the API
      - |
        curl -k https://10.0.20.100:6443/api/v1/namespaces/woodpecker/secrets/git-crypt-key \
          -H "Authorization:Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
          | jq -r '.data.key' | base64 -d > /tmp/key
      - "git-crypt unlock /tmp/key && rm /tmp/key"
      # Install SOPS to workspace (shared across steps via workspace volume)
      - "wget -qO ./sops https://github.com/getsops/sops/releases/download/v3.9.4/sops-v3.9.4.linux.amd64"
      # sha256sum -c expects two spaces between the hash and the filename
      - "echo '848ac8ee4b4e3ae1e72a58f0e9bae04b3e85ca59fa06f0dcd2d32b76542e8417  ./sops' | sha256sum -c"
      - "chmod +x ./sops"
      # Write age key to file (Woodpecker from_secret injects as env var, not file)
      - "echo \"$SOPS_AGE_KEY\" > /tmp/age-key.txt"
      - "SOPS_AGE_KEY_FILE=/tmp/age-key.txt ./sops -d secrets.sops.json > secrets.auto.tfvars.json"
      - "shred -u /tmp/age-key.txt"
    environment:
      SOPS_AGE_KEY:
        from_secret: sops_age_key  # CI's age private key material

  - name: terragrunt-plan
    image: alpine
    commands:
      - "apk update && apk add curl unzip git openssh-client"
      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
      - "chmod 755 /usr/local/bin/terragrunt"
      - "cd stacks/platform && terragrunt plan --non-interactive -out=tfplan 2>&1 | grep -v 'sensitive'"
    when:
      event: pull_request

  - name: terragrunt-apply
    image: alpine
    commands:
      - "apk update && apk add curl unzip git openssh-client"
      - "wget -qO /tmp/tf.zip https://releases.hashicorp.com/terraform/1.5.7/terraform_1.5.7_linux_amd64.zip"
      - "unzip -o /tmp/tf.zip -d /usr/local/bin/ && chmod 755 /usr/local/bin/terraform"
      - "wget -qO /usr/local/bin/terragrunt https://github.com/gruntwork-io/terragrunt/releases/download/v0.99.4/terragrunt_linux_amd64"
      - "chmod 755 /usr/local/bin/terragrunt"
      - "cd stacks/platform && terragrunt apply --non-interactive -auto-approve"
    when:
      event: push
      branch: master

  - name: cleanup-and-push
    image: alpine
    commands:
      - "rm -f secrets.auto.tfvars.json secrets.auto.tfvars.json.*"
      - "apk update && apk add openssh-client git git-crypt"
      - "mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts"
      - "chmod 400 secrets/deploy_key"
      - "git add stacks/ state/ .woodpecker/ || true"
      - "git remote set-url origin git@github.com:ViktorBarzin/infra.git"
      - "git commit -m 'Woodpecker CI deploy commit [CI SKIP]' || echo 'No changes'"
      - "GIT_SSH_COMMAND='ssh -i ./secrets/deploy_key -o IdentitiesOnly=yes' git push origin master"
    when:
      - event: push
        branch: master
      - status: [success, failure]  # Always clean up, even on failure

  - name: slack
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"Woodpecker CI: infra pipeline ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      - status: [success, failure]
```
|
||||
|
||||
**renew-tls.yml** — ALSO update this pipeline:
|
||||
- Change `git add .` to `git add secrets/ state/` in the `commit-certs` step
|
||||
- Same defense-in-depth as default.yml
|
||||
|
||||
Key design decisions:
|
||||
- `SOPS_AGE_KEY` (env var, not file) — Woodpecker `from_secret` only supports env vars. The prepare step writes it to a temp file, uses `SOPS_AGE_KEY_FILE`, then `shred`s the file
|
||||
- SOPS binary in workspace (shared volume) — not per-container `/usr/local/bin/`
|
||||
- `cleanup-and-push` runs on `status: [success, failure]` — always cleans up decrypted file
|
||||
- `git add stacks/ state/ .woodpecker/` — never `git add .`
|
||||
- Plan output filtered through `grep -v sensitive` — belt-and-suspenders with `sensitive = true`
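
A caveat on the plan filter: in a plain `cmd | grep -v ...` pipeline the step's exit status comes from `grep`, so a failed `terragrunt plan` can look successful, and `grep -v` itself exits 1 when it prints nothing. The sketch below keeps the filter but propagates the plan's exit status. It assumes a shell that supports `pipefail`, such as bash; `run_filtered` is a hypothetical name, not something the pipeline currently defines.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, hide lines containing "sensitive",
# and return the command's own exit status rather than grep's.
run_filtered() {
  (
    set -o pipefail
    # "|| true" stops grep's "no lines printed" status (1) from failing the
    # pipeline; with pipefail, a non-zero status from "$@" still surfaces.
    "$@" 2>&1 | { grep -v 'sensitive' || true; }
  )
}
```

A plan step could then call `run_filtered terragrunt plan --non-interactive -out=tfplan`. The filter remains a secondary safeguard; `sensitive = true` on the variables is still the primary control.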
|
||||
|
||||
### 6. Branch Protection (Required)
|
||||
|
||||
GitHub branch protection on `master`:
|
||||
- **Require pull request reviews**: at least 1 reviewer (Viktor)
|
||||
- **Restrict who can push**: Viktor only (direct push for `[ci skip]` commits)
|
||||
- **Restrict who can dismiss reviews**: Viktor only
|
||||
|
||||
This prevents operators from modifying `.woodpecker/`, `terragrunt.hcl`, or `.sops.yaml` without review.
|
||||
|
||||
**Residual risk**: An operator can add `provisioner "local-exec" { command = "echo ${var.secret}" }` in a PR. Viktor must catch this in review. Mitigated by: (1) PR review is required, (2) `sensitive = true` hides values in plan output, (3) `local-exec` provisioners are unusual in this codebase and should be flagged during review.
|
||||
|
||||
### 7. K8s RBAC for Operators
|
||||
|
||||
Scoped operator role — no cluster-wide secrets access:
|
||||
|
||||
```hcl
resource "kubernetes_cluster_role" "operator" {
  metadata { name = "cluster-operator" }
  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log", "services", "endpoints", "configmaps", "events"]
    verbs      = ["get", "list", "watch"]
  }
  rule {
    api_groups = ["apps"]
    resources  = ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs      = ["get", "list", "watch"]
  }
}

# Per-namespace full access (edit role includes secrets within namespace; accepted residual risk)
resource "kubernetes_role_binding" "operator_namespace" {
  for_each = toset(var.operator_namespaces)
  metadata {
    name      = "operator-access"
    namespace = each.value
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "edit"
  }
  subject {
    kind = "Group"
    name = "operators"
  }
}
```
|
||||
|
||||
**Excluded namespaces** (never in `operator_namespaces`): `woodpecker`, `kube-system`, `dbaas`, `monitoring`, `authentik`.
|
||||
|
||||
### 8. Operator Workflow
|
||||
|
||||
**Setup (one-time)**: GitHub collaborator + Authentik "operators" group. No encryption keys, no local tools beyond git.
|
||||
|
||||
**Day-to-day**: Create branch → edit → push → open PR → Viktor reviews → merge → CI applies → Slack notification.
|
||||
|
||||
**kubectl**: `kubectl oidc-login` → Authentik → scoped to assigned namespaces.
|
||||
|
||||
**New secrets**: Comment on PR, Viktor adds to `secrets.sops.json`.
|
||||
|
||||
### 9. Migration Plan (Phased)
|
||||
|
||||
**Phase 1 — Setup tooling (no functional change)**
|
||||
- Install `sops` and `age` locally (Docker)
|
||||
- Generate age keys: Viktor + CI
|
||||
- Store CI age key as Woodpecker secret (`sops_age_key`)
|
||||
- Move git-crypt key from K8s ConfigMap to Secret (update RBAC for Woodpecker SA)
|
||||
- Create `.sops.yaml` config file
|
||||
- Add `/secrets.auto.tfvars.json` to `.gitignore`
|
||||
- Create `scripts/tg` wrapper
|
||||
- Backup Viktor's age private key to Vaultwarden
|
||||
|
||||
**Phase 2 — Create SOPS file alongside existing tfvars**
|
||||
- Categorize all 211 variables: secret vs. non-secret (WireGuard private keys → secrets)
|
||||
- Extract non-secret config into `config.tfvars` (plaintext)
|
||||
- Extract secrets into `secrets.sops.json` (JSON, including complex types: maps, lists, nested objects)
|
||||
- Encrypt with SOPS
|
||||
- Verify round-trip: `sops -d secrets.sops.json | jq .` produces valid JSON
|
||||
- Verify SSH keys: `sops -d secrets.sops.json | jq -r '.truenas_ssh_private_key' | ssh-keygen -l -f -`
|
||||
- Verify complex types: `sops -d secrets.sops.json | jq '.mailserver_accounts'` returns expected map
|
||||
- Add `sensitive = true` to ALL secret variable declarations across all stacks (BEFORE CI plan step is enabled)
|
||||
|
||||
**Phase 3 — Switch terragrunt to SOPS**
|
||||
- Update `terragrunt.hcl`: `config.tfvars` (required) + `secrets.auto.tfvars.json` (optional) + existence check hook
|
||||
- Test: `scripts/tg apply --non-interactive` works per-stack
|
||||
- Test: `scripts/tg run --all -- plan` works (no race condition)
|
||||
- Test failure mode: delete `secrets.auto.tfvars.json`, verify `before_hook` fails loudly
|
||||
|
||||
**Phase 4 — Atomic cutover**
|
||||
- Step 1: `git rm terraform.tfvars` (removes file while git-crypt filter still active — clean deletion)
|
||||
- Step 2: Remove `*.tfvars filter=git-crypt` from `.gitattributes`
|
||||
- Step 3: `git commit` both changes
|
||||
|
||||
**Phase 5 — Update CI pipelines**
|
||||
- Update `.woodpecker/default.yml` with new pipeline
|
||||
- Update `.woodpecker/renew-tls.yml`: change `git add .` to `git add secrets/ state/`
|
||||
- Add `sops_age_key` Woodpecker secret
|
||||
- Enable GitHub branch protection on master
|
||||
- Test: CI pipeline applies successfully
|
||||
|
||||
**Phase 6 — Security hardening**
|
||||
- Create scoped operator RBAC role
|
||||
- Remove `secrets` from `power-user` ClusterRole
|
||||
- Update CLAUDE.md and AGENTS.md documentation
|
||||
|
||||
**Phase 7 — Onboard operator**
|
||||
- Add as GitHub collaborator
|
||||
- Create Authentik account in "operators" group
|
||||
- Walk through first PR workflow
|
||||
|
||||
### 10. Rollback Plan
|
||||
- **Phase 1-2**: No functional change — delete SOPS artifacts
|
||||
- **Phase 3**: Revert `terragrunt.hcl` to load `terraform.tfvars`
|
||||
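- **Phase 4+**: Re-add the `*.tfvars` rule to `.gitattributes` first, then restore the file with `git checkout <cutover-commit>~1 -- terraform.tfvars`. Checkout runs the git-crypt smudge filter, whereas `git show <rev>:terraform.tfvars` prints the raw encrypted blob, and `HEAD~1` is only correct immediately after the cutover commit. Backfill any secrets added during the SOPS period.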
|
||||
- Git-crypt stays functional for `secrets/` and `*.tfstate`
|
||||
|
||||
### 11. What Stays with git-crypt
|
||||
- `secrets/` directory: TLS certs, deploy keys (binary)
|
||||
- `*.tfstate` files: Terraform state
|
||||
- git-crypt key: K8s **Secret** in `woodpecker` namespace (migrated from ConfigMap)
|
||||
|
||||
### 12. Security Considerations
|
||||
- **Decrypted file**: temporary, `.gitignore`d, never staged by CI, cleaned up on success AND failure
|
||||
- **CI staging**: `git add stacks/ state/ .woodpecker/` — never `git add .` (all pipelines)
|
||||
- **Age key in CI**: `SOPS_AGE_KEY` env var → written to temp file → `SOPS_AGE_KEY_FILE` → `shred` after use
|
||||
- **Age key backup**: Viktor's in Vaultwarden. CI's as Woodpecker secret
|
||||
- **Branch protection**: Operators cannot modify CI pipeline, terragrunt.hcl, or .sops.yaml without review
|
||||
- **RBAC**: Operator role excludes cluster-wide secrets. Namespace `edit` role allows secrets within assigned namespaces (accepted residual risk). Excluded: woodpecker, kube-system, dbaas, monitoring, authentik
|
||||
- **Terraform variables**: `sensitive = true` on all secret vars — applied in Phase 2 BEFORE plan step is enabled
|
||||
- **Plan output**: filtered through `grep -v sensitive` as belt-and-suspenders
|
||||
- **`local-exec` exfiltration**: residual risk mitigated by PR review requirement — Viktor must review all PRs
|
||||
- **State files**: contain secret values, git-crypt encrypted. Future: remote backend
|
||||
- **Rotation**: new CI age key → re-encrypt → update Woodpecker secret → rotate affected secrets
|
||||
- **Git history**: old `terraform.tfvars` remains git-crypt encrypted in history — recoverable only with git-crypt key (K8s Secret, not accessible to operators)
|
||||
882
docs/plans/2026-03-28-storage-migration-truenas-elimination.md
Normal file
|
|
@ -0,0 +1,882 @@
|
|||
# Storage Migration: TrueNAS Elimination via Proxmox CSI + Host NFS
|
||||
|
||||
**Date**: 2026-03-28
|
||||
**Status**: Reviewed (3 rounds, all CRITICAL/IMPORTANT issues resolved)
|
||||
**Goal**: Eliminate TrueNAS VM entirely, replacing it with Proxmox CSI (block storage for databases) and NFS served directly from the Proxmox host (for app data and backups). Recover 16 vCPU + 16 GB RAM, eliminate double-CoW ZFS corruption, simplify storage stack from 2 CSI drivers to 1 CSI driver + host NFS.
|
||||
|
||||
## Problem
|
||||
|
||||
The current storage architecture has a fundamental design flaw: TrueNAS runs as a VM with 7 thin-provisioned LVs forming a ZFS STRIPE (RAID0) on the same LVM-thin pool. This creates:
|
||||
|
||||
1. **Double Copy-on-Write**: ZFS CoW on top of LVM-thin CoW causes metadata contention under I/O pressure
|
||||
2. **56 permanent ZFS checksum errors**: Corruption detected but unrecoverable (no ZFS redundancy)
|
||||
3. **Single point of failure**: TrueNAS VM crash takes down all ~100 NFS shares + ~19 iSCSI targets
|
||||
4. **Resource waste**: 16 vCPU + 16 GB RAM dedicated to a storage VM when the Proxmox host could serve storage directly
|
||||
5. **Operational complexity**: Two CSI drivers (nfs-csi + democratic-csi), SSH keys, TrueNAS API, ZFS management
|
||||
|
||||
## Constraints
|
||||
|
||||
- Zero data loss tolerance — every migration step must have a rollback path
|
||||
- Preserve the existing 3-layer backup strategy (local snapshots, app-level CronJob dumps, offsite sync to Synology)
|
||||
- Preserve all Prometheus alerts and Grafana backup dashboard
|
||||
- Stop-and-verify after each phase — no big-bang migration
|
||||
- SCSI device limit: max 30 per VM (Proxmox VirtIO-SCSI controller). Must keep block PVs under this limit per node
|
||||
- Minimize downtime per service (target: <5 min per service migration)
|
||||
- All changes must be Terraform-managed
|
||||
|
||||
## Current State
|
||||
|
||||
### Hardware
|
||||
|
||||
All disks are **hardware RAID** arrays presented by the Dell PERC H730 Mini controller as single logical disks. No software RAID (mdadm) is involved. `pvcreate` operates directly on `/dev/sdX`.
|
||||
|
||||
| Disk | Size | RAID | Current Use | Current VG | Proposed Use |
|------|------|------|-------------|------------|--------------|
| sda (SAS 10K) | 1.1 TiB | HW RAID1 | **UNUSED** (no partitions, no VG) | None | Host NFS (thick LV, ext4) |
| sdb (Samsung SSD) | 931 GiB | Single | 256G TrueNAS VM disk, 675G free | VG "ssd" (already exists) | Proxmox CSI SSD tier (thin pool in existing VG) |
| sdc (HDD 7200rpm) | 10.7 TiB | HW RAID1 | VG "pve": all VMs + TrueNAS data | VG "pve" (already exists) | VM boots + Proxmox CSI HDD tier (existing thin pool "data") |
|
||||
|
||||
### ZFS Corruption Status
|
||||
|
||||
Before migrating data, verify which files are affected by the 56 ZFS checksum errors:
|
||||
```bash
|
||||
ssh root@10.0.10.15 'zpool status -v main | tail -20'
|
||||
```
|
||||
If critical user data (Immich photos, documents) is corrupted, restore those files from Synology backup BEFORE migration. Do not migrate known-corrupted data.
|
||||
|
||||
### Storage Usage
|
||||
|
||||
| Category | Current Backend | Size | PV Count |
|----------|----------------|------|----------|
| App data (NFS) | TrueNAS ZFS → NFS | ~1.39 TiB | ~45 |
| Database block (iSCSI) | TrueNAS ZFS → iSCSI | ~120 GiB | ~5 |
| Database block (StatefulSet) | TrueNAS ZFS → iSCSI (Helm VCT) | ~100 GiB | ~8 |
| Backup CronJob targets | TrueNAS ZFS → NFS | ~50 GiB | ~8 |
| No storage (stateless) | N/A | 0 | 0 |
|
||||
|
||||
### Services Requiring RWX (Shared Across Multiple Deployments)
|
||||
|
||||
Only 8 NFS paths are genuinely shared:
|
||||
|
||||
| NFS Path | Shared Between | Resolution |
|----------|---------------|------------|
| servarr/downloads | qbittorrent, lidarr, prowlarr, listenarr | Pin all to same node + subPath on single block PV, OR keep on host NFS |
| servarr/lidarr | lidarr + soulseek | Same: node affinity |
| servarr/qbittorrent | qbittorrent + readarr | Same: node affinity |
| audiobookshelf/audiobooks | audiobookshelf + qbittorrent | Same: node affinity |
| whisper (disabled) | whisper + piper | Disabled; migrate when re-enabled |
| audiblez (disabled) | audiblez + audiblez-web | Disabled; migrate when re-enabled |
| osm-routing (disabled) | osrm-foot + osrm-bicycle | Disabled; migrate when re-enabled |
| poison-fountain | 2 replicas of same Deployment | Scale to 1 or use StatefulSet |
|
||||
|
||||
**Decision**: All shared volumes stay on host NFS. No need to solve RWX with block storage — the SCSI budget is better spent on databases.
|
||||
|
||||
## Target Architecture
|
||||
|
||||
### Storage Tiers
|
||||
|
||||
```
Tier 1: proxmox-ssd (Proxmox CSI, block, RWO)
  Backend: LVM-thin pool on sdb (SSD)
  For: Databases requiring low-latency I/O
  Capacity: ~800 GiB
  Expected PVs: ~15 (across 5 nodes, ~3 per node)

Tier 2: proxmox-hdd (Proxmox CSI, block, RWO)
  Backend: Existing LVM-thin pool "data" on sdc (HDD)
  For: Large sequential I/O (Prometheus TSDB, Ollama models)
  Capacity: ~6 TiB free in existing pool
  Expected PVs: ~5 (across 5 nodes, ~1 per node)

Tier 3: nfs-host (NFS from Proxmox host, RWX/RWO)
  Backend: Thick LV on sda (SAS), ext4, exported via nfs-kernel-server
  For: App data, media, configs, backup targets, shared volumes
  Capacity: 1 TiB
  Expected PVs: ~35 (no SCSI limit, just directories)
```
|
||||
|
||||
### SCSI Budget
|
||||
|
||||
| Node | Boot Disk | CSI SSD PVs | CSI HDD PVs | Total | Limit |
|------|-----------|-------------|-------------|-------|-------|
| k8s-master | 1 | 1 (Vault) | 0 | 2 | 30 |
| k8s-node1 | 1 | 2 (CNPG replica, Redis replica) | 1 (Ollama) | 4 | 30 |
| k8s-node2 | 1 | 3 (CNPG primary, MySQL primary, Vaultwarden) | 1 (Prometheus) | 5 | 30 |
| k8s-node3 | 1 | 3 (MySQL replica, Redis master, Vault) | 0 | 4 | 30 |
| k8s-node4 | 1 | 3 (CNPG replica, MySQL replica, Vault) | 0 | 4 | 30 |
|
||||
|
||||
**Headroom**: 25+ free SCSI slots per node. Future growth is not a concern.
|
||||
|
||||
Note: Exact node assignments will be determined by K8s scheduler anti-affinity rules. The above is illustrative to demonstrate SCSI budget feasibility.
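
The budget arithmetic can be checked mechanically. A small sketch, using the illustrative per-node counts from the table above and the 30-device limit from the Constraints section; the `scsi_total` function name is hypothetical.

```bash
#!/usr/bin/env bash
SCSI_LIMIT=30   # per-VM limit stated in Constraints

# scsi_total BOOT SSD_PVS HDD_PVS -> prints the number of SCSI devices in use
scsi_total() {
  echo $(( $1 + $2 + $3 ))
}

# Illustrative assignments from the SCSI budget table: node boot ssd hdd
while read -r node boot ssd hdd; do
  total=$(scsi_total "$boot" "$ssd" "$hdd")
  echo "$node: $total used, $(( SCSI_LIMIT - total )) free"
done <<'EOF'
k8s-master 1 1 0
k8s-node1 1 2 1
k8s-node2 1 3 1
k8s-node3 1 3 0
k8s-node4 1 3 0
EOF
```

Rerunning this with real assignments once the scheduler has placed the pods is a cheap way to confirm the headroom claim still holds.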
|
||||
|
||||
### Backup Architecture (3 Layers Preserved)
|
||||
|
||||
#### Layer 1: Local Snapshots
|
||||
|
||||
**Block PVs (Proxmox CSI)**: LVM-thin snapshots via cron on PVE host.
|
||||
|
||||
```bash
# /etc/cron.d/lvm-snapshots on Proxmox host
# Snapshot all CSI-provisioned thin LVs every 12h, retain 3 days
0 */12 * * * root /usr/local/bin/lvm-thin-snapshot.sh
```
|
||||
|
||||
Script logic:
|
||||
1. Enumerate thin LVs matching `csi-*` naming pattern
|
||||
2. `lvcreate -s -n <lv>-snap-$(date +%Y%m%d%H%M) <vg>/<lv>`
|
||||
3. Prune snapshots older than 3 days: `lvremove -f <old-snaps>`
|
||||
4. Push success/failure metric to Pushgateway
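
The script logic above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `lvm-thin-snapshot.sh`: the single `VG` variable, the metric names, the Pushgateway job name, and the placeholder `PUSHGATEWAY` URL are all hypothetical, and the `csi-` prefix is taken from the naming pattern mentioned in step 1.

```bash
#!/usr/bin/env bash
# Sketch of the snapshot script; every default below is an assumption.
VG="${VG:-ssd}"                          # assumed: one VG per invocation
RETENTION_DAYS="${RETENTION_DAYS:-3}"
PUSHGATEWAY="${PUSHGATEWAY:-http://pushgateway.example:9091}"  # placeholder

# Build a snapshot name: <lv>-snap-<YYYYMMDDHHMM>
snapshot_name() {
  echo "${1}-snap-${2}"
}

# True when the timestamp embedded in a snapshot name is older than the
# cutoff. Both are fixed-width YYYYMMDDHHMM, so string order is time order.
snapshot_expired() {
  local stamp="${1##*-snap-}" cutoff="$2"
  [[ "$stamp" < "$cutoff" ]]
}

snapshot_and_prune() {
  local now cutoff status=0 lv snap
  now=$(date +%Y%m%d%H%M)
  cutoff=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d%H%M)

  # Steps 1-2: snapshot each CSI LV, skipping LVs that are snapshots.
  for lv in $(lvs --noheadings -o lv_name "$VG" | tr -d ' ' \
              | grep '^csi-' | grep -v -- '-snap-'); do
    lvcreate -s -n "$(snapshot_name "$lv" "$now")" "$VG/$lv" || status=1
  done

  # Step 3: remove snapshots older than the retention window.
  for snap in $(lvs --noheadings -o lv_name "$VG" | tr -d ' ' \
                | grep -- '-snap-'); do
    if snapshot_expired "$snap" "$cutoff"; then
      lvremove -f "$VG/$snap" || status=1
    fi
  done

  # Step 4: report to Pushgateway (metric names are illustrative).
  printf 'lvm_snapshot_last_status %d\nlvm_snapshot_last_run_timestamp %d\n' \
    "$status" "$(date +%s)" \
    | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/lvm-snapshot"
  return "$status"
}
```

The file only defines functions; the real script would end with a call to `snapshot_and_prune`, once per VG that holds CSI volumes. A timestamp metric alongside the status gives the `LVMSnapshotStale` alert something to measure staleness against.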
|
||||
|
||||
**NFS data (host ext4)**: The thick LV on sda cannot use LVM-thin snapshots. This is a **known RPO degradation**: current ZFS snapshots provide <1s RPO for NFS data, while the new architecture has 6h RPO (next offsite sync interval) for file-level recovery.
|
||||
|
||||
Mitigations:
|
||||
- Databases have their own Layer 2 CronJob backups (daily/6h dumps) — no regression there
|
||||
- App data (photos, documents, configs) relies on offsite sync every 6h + the Synology copy
|
||||
- For critical files (Immich photos), the 6h RPO window is acceptable because Immich writes are append-only (new photos) — accidental deletion is the main risk, and that's caught within 6h
|
||||
- If tighter RPO is needed later, convert sda from thick to thin provisioning to enable LVM-thin snapshots
|
||||
|
||||
#### Layer 2: Application-Level CronJob Backups (UNCHANGED)
|
||||
|
||||
All existing backup CronJobs continue as-is. The only change is the NFS server IP in `config.tfvars`:
|
||||
|
||||
```hcl
# Before
nfs_server = "10.0.10.15" # TrueNAS VM

# After
nfs_server = "10.0.10.1" # Proxmox host (existing mgmt VLAN IP)
```
|
||||
|
||||
Backup CronJobs write to `/srv/nfs/<service>-backup/` on the host, same as they wrote to `/mnt/main/<service>-backup/` on TrueNAS.
|
||||
|
||||
| Backup | Schedule | Retention | Change |
|--------|----------|-----------|--------|
| PostgreSQL (pg_dumpall) | Daily 00:00 | 14 days | NFS path only |
| MySQL (mysqldump) | Daily 00:30 | 14 days | NFS path only |
| etcd (etcdctl snapshot) | Weekly Sun 01:00 | 30 days | NFS path only |
| Vault (raft snapshot) | Weekly Sun 02:00 | 30 days | NFS path only |
| Redis (BGSAVE) | Weekly Sun 03:00 | 30 days | NFS path only |
| Vaultwarden (sqlite3 .backup) | Every 6h | 30 days | NFS path only |
| Prometheus (TSDB snapshot) | Monthly 1st Sun | 2 copies | NFS path only |
| Immich PG | Daily 00:00 | 14 days | NFS path only |
|
||||
|
||||
#### Layer 3: Offsite Sync (rclone to Synology NAS — SIMPLIFIED)
|
||||
|
||||
Replace TrueNAS Cloud Sync with a cron job on the Proxmox host:
|
||||
|
||||
```bash
# /etc/cron.d/offsite-sync on Proxmox host
# Incremental sync every 6h
0 */6 * * * root /usr/local/bin/offsite-sync.sh
# Full sync weekly Sunday 09:00
0 9 * * 0 root /usr/local/bin/offsite-sync.sh --full
```
|
||||
|
||||
Incremental sync uses `rsync` (or `rclone copy`) with `--files-from` based on `find -newer /srv/nfs/.last-sync`. Full sync uses `rclone sync`. Same Synology destination: `sftp://192.168.1.13/Backup/Viki/truenas`.
|
||||
|
||||
Same excludes as current: servarr/downloads, prometheus, loki, frigate recordings.
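
The incremental and full modes described above might look like the sketch below. It is an illustration under assumptions, not the real `offsite-sync.sh`: the rclone remote name `synology`, the exact exclude paths, and the function names are hypothetical. Because rclone's `--files-from` is not meant to be combined with other filter flags, the incremental path applies the excludes in `find` instead.

```bash
#!/usr/bin/env bash
# Sketch of the offsite sync; every default below is an assumption.
SRC="${SRC:-/srv/nfs}"
MARKER="${MARKER:-$SRC/.last-sync}"
REMOTE="${REMOTE:-synology:Backup/Viki/truenas}"   # assumed rclone remote
EXCLUDES=(servarr/downloads prometheus loki frigate/recordings)  # assumed paths

# Print files under $1 modified after marker file $2, relative to $1,
# skipping everything below the excluded paths.
changed_files() {
  local src="$1" marker="$2" args=() ex
  for ex in "${EXCLUDES[@]}"; do
    args+=( -not -path "$src/$ex/*" )
  done
  find "$src" -type f -newer "$marker" "${args[@]}" -printf '%P\n'
}

# Copy only changed files. The marker is advanced to the time this run
# STARTED, so files modified during the copy are picked up next time.
# Requires an existing marker; the first run should be a full sync.
incremental_sync() {
  local list started rc
  list=$(mktemp); started=$(mktemp)
  changed_files "$SRC" "$MARKER" > "$list"
  rclone copy --files-from "$list" "$SRC" "$REMOTE" \
    && touch -r "$started" "$MARKER"
  rc=$?
  rm -f "$list" "$started"
  return "$rc"
}

# Mirror the whole tree (including deletions), honoring the excludes.
full_sync() {
  local args=() ex started rc
  started=$(mktemp)
  for ex in "${EXCLUDES[@]}"; do
    args+=( --exclude "$ex/**" )
  done
  rclone sync "${args[@]}" "$SRC" "$REMOTE" && touch -r "$started" "$MARKER"
  rc=$?
  rm -f "$started"
  return "$rc"
}
```

The file only defines functions; the real script would dispatch to `full_sync` when invoked with `--full` and to `incremental_sync` otherwise, then push a success or failure metric for the `OffsiteSyncStale` and `OffsiteSyncFailing` alerts.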
|
||||
|
||||
#### Monitoring (ALL PRESERVED)
|
||||
|
||||
| Alert | Current | New | Change |
|-------|---------|-----|--------|
| PostgreSQLBackupStale (36h) | Pushgateway | Pushgateway | None |
| MySQLBackupStale (36h) | Pushgateway | Pushgateway | None |
| EtcdBackupStale (8d) | Pushgateway | Pushgateway | None |
| VaultBackupStale (8d) | Pushgateway | Pushgateway | None |
| VaultwardenBackupStale (8d) | Pushgateway | Pushgateway | None |
| RedisBackupStale (8d) | Pushgateway | Pushgateway | None |
| PrometheusBackupStale (32d) | Pushgateway | Pushgateway | None |
| VaultwardenIntegrity | Pushgateway | Pushgateway | None |
| CloudSyncStale (8d) | TrueNAS metric | **OffsiteSyncStale** | Rename, source changes to PVE cron |
| CloudSyncFailing | TrueNAS metric | **OffsiteSyncFailing** | Rename, source changes to PVE cron |
| N/A | N/A | **LVMSnapshotStale** | NEW: alert if CSI LV snapshot cron fails |
|
||||
|
||||
Grafana backup dashboard: Update data source for offsite sync panels. All other panels unchanged.
|
||||
|
||||
## Migration Phases
|
||||
|
||||
### Phase 0: Preparation (No Downtime)
|
||||
|
||||
**Duration**: 2-4 hours
|
||||
|
||||
#### 0.0: Pre-flight Checks
|
||||
|
||||
1. **Verify sda is usable** (hardware RAID, no partitions):
|
||||
```bash
|
||||
lsblk /dev/sda # Should show no partitions
|
||||
cat /proc/mdstat # Should show no mdadm arrays using sda
|
||||
smartctl -a /dev/sda # Verify disk health
|
||||
```
|
||||
|
||||
2. **Verify sdb VG exists and has free space**:
|
||||
```bash
|
||||
vgs ssd # Should show VG "ssd" with ~675G free
|
||||
lvs ssd # Should show only vm-9000-disk-0 (256G)
|
||||
```
|
||||
|
||||
3. **Verify Proxmox host IP on management VLAN**:
|
||||
```bash
|
||||
ip addr show vmbr0 # Should show 10.0.10.1/24 or similar
|
||||
```
|
||||
|
||||
4. **Verify NFS ports reachable from K8s VLAN** (pfSense routing):
|
||||
```bash
|
||||
# From any k8s node:
|
||||
nc -zv 10.0.10.1 2049 # NFS
|
||||
nc -zv 10.0.10.1 111 # rpcbind
|
||||
```
|
||||
If blocked, add pfSense rule: VLAN 20 (10.0.20.0/24) → VLAN 10, dst ports 111,2049, allow TCP/UDP.
|
||||
|
||||
5. **Resolve Pushgateway endpoint** for PVE host scripts (lvm-snapshot, offsite-sync):
|
||||
```bash
|
||||
# Option A: Use Traefik ingress if Pushgateway has one
|
||||
curl -s http://pushgateway.viktorbarzin.me/metrics | head -1
|
||||
# Option B: Use the Service ClusterIP (works only if the PVE host can route to the cluster service network)
|
||||
kubectl get svc -n monitoring pushgateway -o jsonpath='{.spec.clusterIP}:{.spec.ports[0].port}'
|
||||
# Option C: Use any K8s node IP + NodePort
|
||||
kubectl get svc -n monitoring pushgateway -o jsonpath='{.spec.ports[0].nodePort}'
|
||||
```
|
||||
Update `PUSHGATEWAY=` in both scripts with the resolved endpoint. Verify with:
|
||||
```bash
|
||||
echo "test_metric 1" | curl --data-binary @- http://<resolved>:9091/metrics/job/test
|
||||
```
|
||||
|
||||
6. **Check ZFS corruption scope** (identify affected files before migration):
|
||||
```bash
|
||||
ssh root@10.0.10.15 'zpool status -v main | tail -30'
|
||||
```
|
||||
If critical data is in the error list, restore from Synology BEFORE proceeding.
|
||||
|
||||
#### 0.1: Create VG and LV on sda (Host NFS)
|
||||
|
||||
```bash
pvcreate /dev/sda
vgcreate sas /dev/sda
# Use nearly full capacity: sda is 1.1 TiB, reserve ~50G for VG metadata/overhead
lvcreate -L 1050G -n nfs-data sas
mkfs.ext4 -L nfs-data /dev/sas/nfs-data
mkdir -p /srv/nfs
echo '/dev/sas/nfs-data /srv/nfs ext4 defaults 0 2' >> /etc/fstab
mount /srv/nfs
```
|
||||
|
||||
**Capacity pre-validation** (MUST run before Phase 1):
|
||||
```bash
|
||||
# Check uncompressed data sizes on TrueNAS for largest consumers
|
||||
ssh root@10.0.10.15 'zfs list -o name,used,refer,compressratio -r main | sort -k2 -h | tail -20'
|
||||
```
|
||||
If total uncompressed NFS data exceeds 1 TiB, keep Immich (~800 GiB, largest consumer) on a separate thin LV in the `pve` VG:
|
||||
```bash
|
||||
# Only if needed: create Immich-specific thin LV on HDD (auto-grows in thin pool)
|
||||
lvcreate -V 1T --thinpool data -n immich-data pve
|
||||
mkfs.ext4 /dev/pve/immich-data
|
||||
mkdir /srv/nfs-immich
|
||||
echo '/dev/pve/immich-data /srv/nfs-immich ext4 defaults 0 2' >> /etc/fstab
|
||||
mount /srv/nfs-immich
|
||||
# Add to /etc/exports: /srv/nfs-immich 10.0.20.0/24(rw,sync,no_subtree_check,no_root_squash)
|
||||
```
|
||||
|
||||
#### 0.2: Create LVM-thin Pool on sdb (SSD Tier)
|
||||
|
||||
VG "ssd" already exists on sdb. Create a thin pool in the free space:
|
||||
|
||||
```bash
|
||||
# Verify free space in VG
|
||||
vgdisplay ssd | grep Free
|
||||
|
||||
# Create thin pool with explicit metadata sizing (1% of data = 6G, allows thousands of snapshots)
|
||||
lvcreate -L 600G --poolmetadatasize 6G --thinpool ssd-data ssd
|
||||
```
|
||||
|
||||
Note: After TrueNAS shutdown frees the 256G disk in Phase 4, expand with `lvextend -L +200G /dev/ssd/ssd-data`.
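
The ~1% metadata ratio used above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `pool_meta_gib` is a hypothetical helper (not part of LVM), and the 1% ratio is this plan's own convention rather than an LVM requirement.

```bash
# Hypothetical helper: metadata size (GiB) for a thin pool at the plan's ~1% ratio.
pool_meta_gib() {
  data_gib="$1"
  # Integer arithmetic, rounded up so the metadata LV is never undersized.
  echo $(( (data_gib + 99) / 100 ))
}

pool_meta_gib 600   # initial pool in Phase 0.2, prints 6
pool_meta_gib 800   # after the planned +200G extension, prints 8
```

The 2 GiB difference between the two results matches the `--poolmetadatasize +2G` step that accompanies the Phase 4 extension.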
|
||||
|
||||
#### 0.3: Register Proxmox Storage IDs
|
||||
|
||||
The Proxmox CSI plugin requires **Proxmox storage IDs** (configured in Datacenter → Storage), not raw LVM names. Register the SSD thin pool as a new storage:
|
||||
|
||||
```bash
|
||||
# Register SSD thin pool in Proxmox storage config
|
||||
pvesm add lvmthin ssd-csi --vgname ssd --thinpool ssd-data
|
||||
|
||||
# Verify it was added
|
||||
pvesm status | grep ssd-csi
|
||||
|
||||
# Verify existing HDD storage ID (should already exist as "local-lvm")
|
||||
pvesm status | grep local-lvm
|
||||
```
|
||||
|
||||
The HDD tier uses the existing `local-lvm` Proxmox storage ID (already configured for VM boot disks).
|
||||
|
||||
#### 0.4: Install NFS Server on Proxmox Host
|
||||
|
||||
```bash
|
||||
apt-get install -y nfs-kernel-server
|
||||
```
|
||||
|
||||
Configure `/etc/exports`:
|
||||
```
|
||||
# Export entire /srv/nfs to K8s VLAN (10.0.20.0/24)
|
||||
# no_root_squash overrides the default root_squash so root-owned writes from pods keep working (see note below)
|
||||
/srv/nfs 10.0.20.0/24(rw,sync,no_subtree_check,no_root_squash)
|
||||
```
|
||||
|
||||
Note: `no_root_squash` is used because many services (LinuxServer.io containers, backup CronJobs) write as root. This matches the current TrueNAS NFS export behavior. Security impact is limited — only K8s nodes on VLAN 20 can access this export, and they're trusted.
|
||||
|
||||
```bash
|
||||
exportfs -ra
|
||||
systemctl enable --now nfs-kernel-server
|
||||
# Verify from a k8s node:
|
||||
# showmount -e 10.0.10.1
|
||||
```
|
||||
|
||||
#### 0.5: Install Proxmox CSI Plugin
|
||||
|
||||
1. Create Proxmox API token with required roles:
|
||||
```bash
|
||||
# On Proxmox host
|
||||
pveum user add csi@pve
|
||||
pveum aclmod / -user csi@pve -role PVEDatastoreUser,PVEVMAdmin,PVEAuditor
|
||||
pveum user token add csi@pve csi-token --privsep=0
|
||||
```
|
||||
Store the token in Vault (quote the token ID so an interactive shell does not treat `!` as history expansion): `vault kv put secret/viktor/proxmox_csi_token token_id='csi@pve!csi-token' token_secret=<secret>`
|
||||
|
||||
2. Deploy `proxmox-csi-plugin` Helm chart via new Terraform stack `stacks/proxmox-csi/`
|
||||
- Provisioner name: `csi.proxmox.sinextra.dev`
|
||||
- Configure cluster connection (Proxmox API URL, token)
|
||||
|
||||
3. Create StorageClasses (see Appendix B for full YAML):
|
||||
- `proxmox-ssd`: storage ID `ssd-csi`, `ssd: "true"`, `cache: none`
|
||||
- `proxmox-hdd`: storage ID `local-lvm`, `ssd: "false"`, `cache: writethrough`
|
||||
|
||||
4. Create VolumeSnapshotClass for LVM-thin snapshots
|
||||
|
||||
5. **Test on EVERY node** — create a test PVC, write data, read back, delete:
|
||||
```bash
|
||||
for i in 1 2 3 4; do
|
||||
# Create PVC with nodeAffinity to k8s-node$i, verify SCSI hotplug works
|
||||
kubectl apply -f test-pvc-node$i.yaml
|
||||
# Verify: kubectl get pvc, kubectl describe pv
|
||||
# Clean up
|
||||
kubectl delete -f test-pvc-node$i.yaml
|
||||
done
|
||||
```
|
||||
Also test on k8s-master. If SCSI hotplug fails on any node, investigate before proceeding.
|
||||
|
||||
6. **Test VolumeSnapshot**: Create a snapshot of the test PVC, restore to new PVC, verify data integrity. This validates the backup path BEFORE any production migration.
|
||||
|
||||
#### 0.6: Configure NFS for K8s
|
||||
|
||||
The existing NFS CSI driver (`nfs.csi.k8s.io`) supports multiple StorageClasses. Create a new StorageClass `nfs-host` pointing at the Proxmox host:
|
||||
|
||||
```yaml
|
||||
apiVersion: storage.k8s.io/v1
|
||||
kind: StorageClass
|
||||
metadata:
|
||||
name: nfs-host
|
||||
provisioner: nfs.csi.k8s.io
|
||||
parameters:
|
||||
server: 10.0.10.1 # Proxmox host on mgmt VLAN
|
||||
share: /srv/nfs
|
||||
mountOptions:
|
||||
- soft
|
||||
- timeo=30
|
||||
- retrans=3
|
||||
- actimeo=5
|
||||
reclaimPolicy: Retain
|
||||
volumeBindingMode: Immediate
|
||||
```
|
||||
|
||||
Keep the old `nfs-truenas` StorageClass active during migration. Services are migrated one at a time by updating their PV/PVC to use the new server.
|
||||
|
||||
Note: For services using the `nfs_volume` Terraform module (static PV/PVC), the migration involves changing the `nfs_server` parameter in the module call, not switching StorageClasses. The new StorageClass is for any future dynamically provisioned NFS PVCs.
|
||||
|
||||
#### 0.7: Set Up LVM Snapshot Cron
|
||||
|
||||
Install `/usr/local/bin/lvm-thin-snapshot.sh` on Proxmox host:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Snapshot all CSI-provisioned thin LVs
|
||||
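set -euo pipefail
shopt -s lastpipe  # run the final `while read` of a pipeline in this shell, so STATUS=1 set inside the prune loop is not lost in a subshell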
|
||||
PUSHGATEWAY="http://PUSHGATEWAY_NODEPORT_IP:PORT" # MUST resolve before Phase 0.7. Scripts run on PVE host (not in K8s), so use NodePort or Traefik ingress. Find with: kubectl get svc -n monitoring pushgateway -o wide
|
||||
RETENTION_DAYS=3
|
||||
STATUS=0
|
||||
|
||||
for vg in ssd pve; do
|
||||
# Get list of CSI LVs (names starting with "csi-", excluding existing snapshots)
|
||||
for lv in $(lvs --noheadings -o lv_name "$vg" 2>/dev/null | awk '/csi-/ && !/snap-/ {print $1}'); do
|
||||
snap_name="${lv}-snap-$(date +%Y%m%d%H%M)"
|
||||
# LVM-thin snapshots don't need -L (no pre-allocated CoW area — they share the thin pool)
|
||||
if lvcreate -s -n "$snap_name" "$vg/$lv" 2>&1; then
|
||||
echo "Created snapshot: $vg/$snap_name"
|
||||
else
|
||||
echo "FAILED to snapshot: $vg/$lv" >&2
|
||||
STATUS=1
|
||||
fi
|
||||
done
|
||||
done
|
||||
|
||||
# Prune old snapshots (parse timestamp from snapshot name, not lv_time which is unreliable)
|
||||
find_and_remove_old_snaps() {
|
||||
local vg="$1"
|
||||
local cutoff_epoch
|
||||
cutoff_epoch=$(date -d "-${RETENTION_DAYS} days" +%s)
|
||||
|
||||
lvs --noheadings -o lv_name "$vg" 2>/dev/null | awk '/snap-/ {print $1}' | while read -r snap; do
|
||||
# Extract timestamp from name: ...-snap-YYYYMMDDHHMM
|
||||
timestamp=$(echo "$snap" | grep -oP 'snap-\K\d{12}' || echo "")
|
||||
if [[ -n "$timestamp" ]]; then
|
||||
snap_epoch=$(date -d "${timestamp:0:8} ${timestamp:8:2}:${timestamp:10:2}" +%s 2>/dev/null || echo "0")
|
||||
if [[ "$snap_epoch" -lt "$cutoff_epoch" && "$snap_epoch" -gt 0 ]]; then
|
||||
echo "Removing old snapshot: $vg/$snap"
|
||||
lvremove -f "$vg/$snap" || STATUS=1
|
||||
fi
|
||||
fi
|
||||
done
|
||||
}
|
||||
find_and_remove_old_snaps ssd
|
||||
find_and_remove_old_snaps pve
|
||||
|
||||
# Push metrics
|
||||
cat <<EOF | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/lvm-snapshots"
|
||||
lvm_snapshot_last_success_timestamp $(date +%s)
|
||||
lvm_snapshot_last_status $STATUS
|
||||
EOF
|
||||
```
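
The pruning logic depends on recovering a timestamp from each snapshot name. A minimal sketch of that parsing step, useful for dry-running the retention check without touching LVM, might look like this. `snap_epoch` and `snap_is_expired` are hypothetical names, GNU `date` is assumed (as on a Debian-based Proxmox host), and the sketch parses in UTC for reproducibility, whereas the script above uses local time.

```bash
# Hypothetical helpers for dry-running the retention check (no LVM calls).

# Print the epoch encoded in a name ending in -snap-YYYYMMDDHHMM; fail otherwise.
snap_epoch() {
  ts=$(printf '%s\n' "$1" | sed -n 's/.*snap-\([0-9]\{12\}\)$/\1/p')
  [ -n "$ts" ] || return 1
  ymd=$(printf '%s' "$ts" | cut -c1-8)
  hh=$(printf '%s' "$ts" | cut -c9-10)
  mm=$(printf '%s' "$ts" | cut -c11-12)
  date -u -d "$ymd $hh:$mm" +%s
}

# Succeed if the snapshot named $1 is older than the cutoff epoch $2.
snap_is_expired() {
  epoch=$(snap_epoch "$1") || return 1
  [ "$epoch" -lt "$2" ]
}

cutoff=$(date -u -d "-3 days" +%s)   # same 3-day retention as RETENTION_DAYS
if snap_is_expired "csi-example-snap-202001010000" "$cutoff"; then
  echo "would remove csi-example-snap-202001010000"
fi
```

Names that do not end in a 12-digit timestamp are rejected rather than guessed at, which mirrors the script's behaviour of skipping anything it cannot parse.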
|
||||
|
||||
Configure cron: `/etc/cron.d/lvm-snapshots`
|
||||
```
|
||||
0 */12 * * * root /usr/local/bin/lvm-thin-snapshot.sh >> /var/log/lvm-snapshots.log 2>&1
|
||||
```
|
||||
|
||||
#### 0.8: Set Up Offsite Sync Cron
|
||||
|
||||
Install rclone and configure Synology remote:
|
||||
|
||||
```bash
|
||||
apt-get install -y rclone
|
||||
rclone config create synology sftp \
|
||||
host=192.168.1.13 \
|
||||
user=root \
|
||||
key_file=/root/.ssh/synology_key
|
||||
```
|
||||
|
||||
Install `/usr/local/bin/offsite-sync.sh`:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Offsite sync to Synology NAS using rclone (consistent tooling for both modes)
|
||||
set -euo pipefail
|
||||
PUSHGATEWAY="http://10.0.20.X:9091"
|
||||
SRC="/srv/nfs"
|
||||
DST="synology:/Backup/Viki/truenas"
|
||||
EXCLUDES="--exclude servarr/downloads/** --exclude prometheus/** --exclude loki/** --exclude frigate/recordings/**"
|
||||
STATUS=0
|
||||
BYTES=0
|
||||
|
||||
if [[ "${1:-}" == "--full" ]]; then
|
||||
# Full weekly sync — mirrors source to destination, removes orphans on dest
|
||||
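# Record the exit code inline: under `set -e` + pipefail a failing pipeline aborts the script before a separate STATUS=$? line could run
rclone sync "$SRC" "$DST" $EXCLUDES --stats-one-line -v 2>&1 | tee /var/log/offsite-sync.log || STATUS=$?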
|
||||
else
|
||||
# Incremental: copy changed files only (rclone checks mod time + size, no deletions)
|
||||
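# Record the exit code inline: under `set -e` + pipefail a failing pipeline aborts the script before a separate STATUS=$? line could run
rclone copy "$SRC" "$DST" $EXCLUDES --stats-one-line -v 2>&1 | tee /var/log/offsite-sync.log || STATUS=$?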
|
||||
fi
|
||||
|
||||
BYTES=$(du -sb "$SRC" 2>/dev/null | cut -f1)
|
||||
|
||||
cat <<EOF | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/offsite-sync"
|
||||
offsite_sync_last_success_timestamp $(date +%s)
|
||||
offsite_sync_last_status $STATUS
|
||||
offsite_sync_source_bytes $BYTES
|
||||
EOF
|
||||
```
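
One subtlety is worth checking in any script that combines `set -euo pipefail` with a `command | tee` pipeline: a failing pipeline aborts the script on the spot, so the metrics push at the end is never reached unless the exit code is captured as part of the same command list. The sketch below demonstrates the capture pattern in isolation. `capture_status` is a hypothetical helper, and `sh -c 'exit 3'` merely stands in for a failing `rclone` run.

```bash
#!/bin/bash
# Hypothetical demo: keep a failing pipeline's exit code under `set -e` + pipefail.
# The function body is a subshell, so the shell options do not leak to the caller.
capture_status() (
  set -eo pipefail
  status=0
  # `|| status=$?` records the code and also stops `set -e` from aborting here.
  "$@" 2>&1 | tee /dev/null || status=$?
  echo "$status"
)

capture_status sh -c 'exit 3'   # stand-in for a failed sync, prints 3
capture_status true             # stand-in for a successful sync, prints 0
```

With this pattern the script always reaches the Pushgateway step, so a failed sync shows up as a non-zero `offsite_sync_last_status` instead of as a silently missing push.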
|
||||
|
||||
Configure cron: `/etc/cron.d/offsite-sync`
|
||||
```
|
||||
0 */6 * * * root /usr/local/bin/offsite-sync.sh >> /var/log/offsite-sync.log 2>&1
|
||||
0 9 * * 0 root /usr/local/bin/offsite-sync.sh --full >> /var/log/offsite-sync.log 2>&1
|
||||
```
|
||||
|
||||
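Test with an empty `/srv/nfs/` in the default incremental mode (`rclone copy`, which never deletes) to verify connectivity to the Synology. Do not run `--full` against an empty source: `rclone sync` makes the destination match the source and would delete the existing backup.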
|
||||
|
||||
#### 0.9: Add Prometheus Alerts
|
||||
|
||||
Add to monitoring stack:
|
||||
- `LVMSnapshotStale`: no successful LVM snapshot push in **24h** (snapshots run every 12h — alerts after 2 missed cycles)
|
||||
- `OffsiteSyncStale`: no successful offsite sync in 8d
|
||||
- `OffsiteSyncFailing`: last sync exit code != 0
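
Until the alert rules exist, the same conditions can be spot-checked by hand against the Pushgateway's text output. This is a rough sketch that assumes the metric names pushed by the two host scripts; `metric_value` and `is_stale` are hypothetical helpers, and the sample input is made up for illustration.

```bash
# Hypothetical helpers for spot-checking the pushed metrics by hand.

# Read Prometheus text format on stdin and print the first sample named $1
# as an integer (Pushgateway may print values such as 1.7e+09).
metric_value() {
  awk -v name="$1" '$1 !~ /^#/ {
    split($1, a, "{")
    if (a[1] == name) { printf "%.0f\n", $NF; exit }
  }'
}

# Succeed if timestamp $1 is more than $2 seconds older than "now" ($3).
is_stale() {
  [ $(( $3 - $1 )) -gt "$2" ]
}

sample='# TYPE lvm_snapshot_last_success_timestamp untyped
lvm_snapshot_last_success_timestamp{job="lvm-snapshots"} 1000
lvm_snapshot_last_status{job="lvm-snapshots"} 0'

last=$(printf '%s\n' "$sample" | metric_value lvm_snapshot_last_success_timestamp)
if is_stale "$last" 86400 100000; then   # 86400 s = the 24h LVMSnapshotStale window
  echo "LVM snapshots are stale"
fi
```

Against the real endpoint, the sample would be replaced by something like `curl -s "$PUSHGATEWAY/metrics"` and the third argument by `$(date +%s)`.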
|
||||
|
||||
Update Grafana backup dashboard:
|
||||
- Add "LVM Snapshot Age" panel (stat, source: `lvm_snapshot_last_success_timestamp`)
|
||||
- Add "Offsite Sync Status" panel (stat, source: `offsite_sync_last_status`)
|
||||
- Rename "Cloud Sync" panels to "Offsite Sync"
|
||||
|
||||
### Phase 1: Migrate NFS App Data (Low-Risk, Bulk)
|
||||
|
||||
**Duration**: 1-2 weekends
|
||||
**Downtime per service**: <5 minutes
|
||||
**Rollback**: Switch PV back to old NFS path
|
||||
|
||||
Migrate the ~35 single-pod NFS volumes from TrueNAS to host NFS. These are the lowest-risk migrations — single replica Deployments with non-critical data.
|
||||
|
||||
**For each service**:
|
||||
|
||||
1. Scale deployment to 0: `kubectl scale deploy/<name> -n <ns> --replicas=0`
|
||||
2. Verify all pods terminated: `kubectl get pods -n <ns> -l app=<name>` (must show no Running pods — prevents race condition during rsync)
|
||||
3. rsync data with checksum verification: `rsync -av --checksum --delete root@10.0.10.15:/mnt/main/<service>/ /srv/nfs/<service>/`
|
||||
4. Verify: compare file counts and total size:
|
||||
```bash
|
||||
ssh root@10.0.10.15 "find /mnt/main/<service> -type f | wc -l"
|
||||
find /srv/nfs/<service> -type f | wc -l
|
||||
ssh root@10.0.10.15 "du -sh /mnt/main/<service>"
|
||||
du -sh /srv/nfs/<service>
|
||||
```
|
||||
5. Update Terraform: Change `nfs_server` in `nfs_volume` module call to `10.0.10.1` and `nfs_path` from `/mnt/main/<service>` to `/srv/nfs/<service>`
|
||||
6. `terragrunt apply` — updates PV to point at host NFS
|
||||
7. Scale deployment to 1
|
||||
8. Verify service is healthy (check logs, Uptime Kuma, service-specific smoke test)
|
||||
9. Mark old TrueNAS directory as migrated (don't delete yet)
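
Steps 1 to 3 are repeated for every service, so they are easy to wrap in a small helper. The following is only a sketch under stated assumptions: `run` and `migrate_service` are hypothetical names, the `app=<name>` label is taken from step 2, and the helper defaults to a dry run that prints the commands instead of executing them.

```bash
# Hypothetical wrapper for Phase 1 steps 1-3. Defaults to a dry run.
DRY_RUN="${DRY_RUN:-1}"
OLD_NFS="root@10.0.10.15"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Usage: migrate_service <deployment-name> <namespace>
migrate_service() {
  svc="$1"; ns="$2"
  run kubectl scale "deploy/$svc" -n "$ns" --replicas=0
  # Block until the pods are gone so rsync never races a live writer.
  # (kubectl wait reports an error if no matching pods exist at all.)
  run kubectl wait --for=delete pod -l "app=$svc" -n "$ns" --timeout=120s
  run rsync -av --checksum --delete "$OLD_NFS:/mnt/main/$svc/" "/srv/nfs/$svc/"
}

migrate_service privatebin privatebin
```

Setting `DRY_RUN=0` executes the commands for real. The file-count comparison, Terraform change, and scale-up (steps 4 to 8) are deliberately left manual so each service gets a human check before it goes live again.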
|
||||
|
||||
**Stacks requiring re-apply**: All stacks with `module.nfs_volume` calls. Identify with:
|
||||
```bash
|
||||
grep -rl 'module.*nfs_volume\|nfs_server' infra/stacks/*/main.tf | sort
|
||||
```
|
||||
Apply order: non-critical services first (waves 1-5), then platform services (wave 6), then the servarr suite, backup CronJob targets, and Immich (waves 7-9).
|
||||
|
||||
**Capacity checkpoint after each wave**:
|
||||
```bash
|
||||
df -h /srv/nfs
|
||||
# If >80% full, STOP and either:
|
||||
# a. Extend the LV: lvextend -L +50G /dev/sas/nfs-data && resize2fs /dev/sas/nfs-data
|
||||
# b. Move Immich to separate thin LV on HDD (see Phase 0.1 overflow plan)
|
||||
```
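
The checkpoint can also be scripted so that each wave ends with an explicit go/no-go. A minimal sketch, assuming POSIX `df -P` output; `usage_pct` and `over_limit` are hypothetical helpers, and `NFS_MOUNT` is an illustrative variable that falls back to `/` so the snippet runs anywhere.

```bash
# Hypothetical helper: print the used-space percentage (digits only) of the
# filesystem holding path $1, parsed from POSIX `df -P` output.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is strictly above the limit (80 by default, per the checkpoint).
over_limit() {
  [ "$(usage_pct "$1")" -gt "${2:-80}" ]
}

target="${NFS_MOUNT:-/}"   # set NFS_MOUNT=/srv/nfs on the Proxmox host
if over_limit "$target"; then
  echo "STOP: $target is above 80% full"
else
  echo "OK: $target is at $(usage_pct "$target")%"
fi
```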
|
||||
|
||||
**Migration order** (low-risk first):
|
||||
|
||||
| Wave | Services | Rationale |
|
||||
|------|----------|-----------|
|
||||
| 1 | privatebin, stirling-pdf, excalidraw, send, resume, jsoncrack | Stateless-ish, low data |
|
||||
| 2 | ntfy, diun, owntracks, health, f1-stream | Small data, single pod |
|
||||
| 3 | actualbudget (x3), isponsorblocktv, affine | Small data, low traffic |
|
||||
| 4 | hackmd, paperless-ngx, matrix | Medium data, more important |
|
||||
| 5 | meshcentral (3 vols), roundcubemail (2 vols) | Multi-volume services |
|
||||
| 6 | ytdlp (2 vols), uptime-kuma, technitium (x2) | Platform services — extra care |
|
||||
| 7 | servarr suite (all components) | Complex shared volumes, keep on NFS |
|
||||
| 8 | Backup CronJob targets (postgresql-backup, mysql-backup, vault-backup, etc.) | Must verify backup CronJobs still work after |
|
||||
| 9 | Immich (~800 GiB) | Largest dataset — use two-pass rsync to minimize downtime (see below) |
|
||||
|
||||
**Immich migration (Wave 9)** — two-pass rsync to minimize downtime:
|
||||
1. **Pass 1** (Immich still running): `rsync -av --checksum root@10.0.10.15:/mnt/main/immich/ /srv/nfs/immich/` — bulk copy ~800 GiB while service is live (30-60 min, no downtime)
|
||||
2. Scale Immich to 0
|
||||
3. **Pass 2** (delta only): `rsync -av --checksum --delete root@10.0.10.15:/mnt/main/immich/ /srv/nfs/immich/` — syncs only changes since Pass 1 (1-5 min)
|
||||
4. Update Terraform, apply, scale to 1
|
||||
5. Verify: upload a test photo, check ML classification, browse thumbnails
|
||||
|
||||
**Disabled services** (whisper, audiblez, grampsweb, tandoor, etc.): Update Terraform to point at new NFS but don't rsync data (no data to migrate while disabled). rsync when re-enabled.
|
||||
|
||||
### Phase 2: Migrate Databases to Proxmox CSI SSD
|
||||
|
||||
**Duration**: 1 weekend
|
||||
**Downtime per service**: 5-15 minutes
|
||||
**Rollback**: Point `postgresql_host` back at the old CNPG cluster (kept until it is decommissioned in 2a); MySQL/Redis restore from dump
|
||||
|
||||
This is the highest-value migration — databases get local SSD instead of NFS-over-ZFS-over-LVM-thin.
|
||||
|
||||
**Migration Order** (dependency-aware):
|
||||
|
||||
| Day | Databases | Rationale |
|
||||
|-----|-----------|-----------|
|
||||
| Day 1 | 2a: CNPG PostgreSQL, 2b: MySQL, 2e: Vaultwarden | Independent of each other — can run in parallel |
|
||||
| Day 2 | 2d: Redis | Authentik depends on both PG + Redis. Migrate Redis only AFTER verifying CNPG migration is stable |
|
||||
| Day 3 | 2c: Vault | All services (ESO, Authentik, backup CronJobs) depend on Vault. Migrate LAST after all other DBs are verified stable |
|
||||
|
||||
**Terraform state handling**: A PVC's `storageClassName` is immutable, so moving a workload to a new StorageClass means creating new PVCs. For each database migration:
|
||||
1. The old PVCs will become orphaned (reclaimPolicy: Retain keeps the PV)
|
||||
2. After verifying the new database is stable (24h), manually clean up:
|
||||
```bash
|
||||
# Delete orphaned PVCs
|
||||
kubectl delete pvc <old-pvc-name> -n <namespace>
|
||||
# Delete orphaned PVs (verify they're in "Released" state first)
|
||||
kubectl get pv | grep Released
|
||||
kubectl delete pv <old-pv-name>
|
||||
```
|
||||
3. Old TrueNAS iSCSI zvols will be cleaned up in Phase 4
|
||||
|
||||
#### 2a: CNPG PostgreSQL
|
||||
|
||||
Use dump/restore approach (safer than cross-storage streaming replication, which can fail when the underlying filesystem changes):
|
||||
|
||||
1. Take fresh `pg_dumpall` from existing cluster (Layer 2 backup, plus an extra manual dump)
|
||||
2. Verify the CNPG operand image includes all required extensions (pgvector, pgvecto-rs, etc.) — the current cluster uses `viktorbarzin/postgres:16-master` custom image. Build a compatible CNPG image or verify extensions are available.
|
||||
3. Create new CNPG Cluster resource with `storageClass: proxmox-ssd` (fresh init)
|
||||
4. Restore dump to new cluster: `cat dump.sql | kubectl exec -i <new-primary> -- psql -U postgres`
|
||||
5. Update `postgresql_host` in `config.tfvars` to new cluster service (e.g., `pg-cluster-rw.dbaas.svc.cluster.local` — keep same name if possible to minimize changes)
|
||||
6. `terragrunt apply` across all consuming stacks (12+ stacks — use `grep -rl postgresql_host stacks/` to enumerate)
|
||||
7. Verify all services connect successfully:
|
||||
- Authentik: login via web UI
|
||||
- Woodpecker: trigger a test pipeline
|
||||
- Immich: upload a test photo
|
||||
- Grafana: load a dashboard
|
||||
- All others: check pod logs for DB connection errors
|
||||
8. Decommission old CNPG cluster after 24h of verified operation
|
||||
|
||||
#### 2b: MySQL InnoDB Cluster
|
||||
|
||||
1. Take a fresh mysqldump of all databases (Layer 2 backup)
|
||||
2. Create new MySQL InnoDB Cluster Helm release with `storageClass: proxmox-ssd`
|
||||
3. Restore dump to new cluster
|
||||
4. Update `mysql_host` in `config.tfvars`
|
||||
5. `terragrunt apply` across consuming stacks
|
||||
6. Verify all MySQL-backed services (speedtest, wrongmove, grafana, etc.)
|
||||
7. Decommission old MySQL cluster
|
||||
|
||||
#### 2c: Vault Raft
|
||||
|
||||
**Pre-migration coordination** (before scaling Vault to 0):
|
||||
1. Verify no Woodpecker pipelines are queued/running
|
||||
2. Scale Woodpecker to 0 to prevent deploys during window
|
||||
3. Verify no backup CronJobs are currently running: `kubectl get pods -A -l job-name --field-selector=status.phase=Running` (should return nothing)
|
||||
4. Do NOT run `terragrunt apply` on any stack during the 10-15 min window
|
||||
|
||||
**WARNING**: Do NOT seal Vault during migration. Sealing breaks ESO (43+ ExternalSecrets), Authentik, and all backup CronJobs that read Vault. Instead, use a graceful shutdown + data copy approach.
|
||||
|
||||
1. Take Vault raft snapshot (Layer 2 backup + manual snapshot for safety)
|
||||
2. Scale Vault StatefulSet to 0 (graceful shutdown — pods terminate cleanly, no seal needed)
|
||||
3. Note: During this window (~10-15 min), ESO cannot refresh secrets. Existing K8s Secrets remain valid but won't be rotated. No pod restarts should be triggered. **Do NOT run `terragrunt apply` on any stack during this window.**
|
||||
4. Create new Vault Helm release with `storageClass: proxmox-ssd`
|
||||
5. Copy raft data from old PVCs to new PVCs (use a temporary pod or `kubectl cp` from the backup)
|
||||
6. Start new Vault StatefulSet
|
||||
7. Unseal all replicas, verify cluster health: `vault status`, `vault operator raft list-peers`
|
||||
8. Verify all secrets accessible: `vault kv get secret/viktor`
|
||||
9. Verify ESO connectivity: `kubectl get clustersecretstore vault-kv -o jsonpath='{.status.conditions}'`
|
||||
10. Decommission old Vault StatefulSet PVCs after 24h verification
|
||||
|
||||
#### 2d: Redis
|
||||
|
||||
1. Trigger BGSAVE on current Redis
|
||||
2. Scale Redis to 0
|
||||
3. Create new Redis Helm release with `storageClass: proxmox-ssd`
|
||||
4. Copy RDB dump to new PV
|
||||
5. Start new Redis, verify data
|
||||
6. Update `redis_host` in `config.tfvars` if changed
|
||||
7. Decommission old Redis PVCs
|
||||
|
||||
#### 2e: Vaultwarden
|
||||
|
||||
1. Run sqlite3 `.backup` (Layer 2 backup)
|
||||
2. Scale Vaultwarden to 0
|
||||
3. Create new PVC with `storageClass: proxmox-ssd`
|
||||
4. Copy SQLite database to new PV
|
||||
5. Update Vaultwarden deployment to use new PVC
|
||||
6. Scale to 1, verify via web UI + Bitwarden client sync
|
||||
7. Verify backup CronJob still works with new PVC mount
|
||||
|
||||
### Phase 3: Migrate Large Stateful Workloads to Proxmox CSI HDD
|
||||
|
||||
**Duration**: 1 evening
|
||||
**Downtime per service**: 10-30 minutes (Prometheus has large TSDB)
|
||||
|
||||
#### 3a: Prometheus
|
||||
|
||||
1. Create new PVC with `storageClass: proxmox-hdd`, size 200Gi
|
||||
2. Scale Prometheus to 0
|
||||
3. rsync TSDB data from old iSCSI PV to new block PV (may take 20-30 min for ~27GB)
|
||||
4. Update Prometheus Helm values to use new StorageClass
|
||||
5. Start Prometheus, verify metrics continuity
|
||||
6. Decommission old iSCSI PVC
|
||||
|
||||
#### 3b: Ollama
|
||||
|
||||
1. Create new PVC with `storageClass: proxmox-hdd`
|
||||
2. Scale Ollama to 0
|
||||
3. rsync models from old NFS to new block PV
|
||||
4. Update deployment
|
||||
5. Verify model loading
|
||||
6. Decommission old NFS volume
|
||||
|
||||
### Phase 4: TrueNAS Shutdown and Cleanup
|
||||
|
||||
**Duration**: 1 evening
|
||||
**Prerequisites**: All services migrated and verified for at least 1 week with no issues
|
||||
|
||||
1. **Final verification**:
|
||||
- All services healthy (Uptime Kuma green)
|
||||
- All backup CronJobs running (Grafana dashboard green)
|
||||
- Offsite sync to Synology running (check Pushgateway metrics)
|
||||
- No pods mounting TrueNAS NFS or iSCSI
|
||||
|
||||
2. **Shutdown TrueNAS VM**:
|
||||
```bash
|
||||
qm shutdown 9000
|
||||
```
|
||||
|
||||
3. **Monitor for 1 week** (matches success criteria): Watch for any services that silently depended on TrueNAS. Check Uptime Kuma, Grafana backup dashboard, and Prometheus alerts daily.
|
||||
|
||||
4. **Reclaim resources** (only after 1-week verification — once LVs are removed, TrueNAS rollback is impossible):
|
||||
- Remove TrueNAS VM definition from Terraform
|
||||
- Remove the 7 thin LVs (scsi1-scsi7) that were TrueNAS ZFS vdevs — frees ~1.7 TiB in thin pool:
|
||||
```bash
|
||||
# List TrueNAS LVs
|
||||
lvs pve | grep 'vm-9000'
|
||||
# Remove each one
|
||||
lvremove -f /dev/pve/vm-9000-disk-1
|
||||
# ... repeat for disk-2 through disk-7
|
||||
```
|
||||
- Remove TrueNAS SSD disk (vm-9000-disk-0 on sdb) — frees 256 GiB on SSD VG:
|
||||
```bash
|
||||
lvremove -f /dev/ssd/vm-9000-disk-0
|
||||
```
|
||||
- Expand SSD thin pool with reclaimed space (safe to do online with active thin volumes). Extend both data and metadata proportionally:
|
||||
```bash
|
||||
lvextend -L +200G /dev/ssd/ssd-data
|
||||
lvextend --poolmetadatasize +2G /dev/ssd/ssd-data # Keep metadata at ~1% of data
|
||||
lvs ssd/ssd-data # Verify new size
|
||||
```
|
||||
|
||||
5. **Remove old CSI drivers**:
|
||||
- Remove `democratic-csi` (iSCSI) Helm release and Terraform stack
|
||||
- Remove old `nfs-truenas` StorageClass (keep `nfs-host`)
|
||||
- Remove TrueNAS SSH key from Vault
|
||||
- Remove TrueNAS API credentials from Vault
|
||||
|
||||
6. **Update documentation**:
|
||||
- Update `infra/docs/architecture/storage.md`
|
||||
- Update `infra/docs/architecture/backup-dr.md`
|
||||
- Update `infra/.claude/CLAUDE.md` storage sections
|
||||
- Update `AGENTS.md` if storage references exist
|
||||
|
||||
7. **Synology backup path**: Keep the existing path `truenas` on Synology — renaming would cause rclone to re-upload everything. The path name is cosmetic; the content is what matters. Add a note file at the root: `echo "Source: PVE host /srv/nfs (migrated from TrueNAS $(date))" > /srv/nfs/.source-info`
|
||||
|
||||
### Phase 5: Post-Migration Hardening
|
||||
|
||||
1. **LVM snapshot monitoring**: Verify Prometheus scrapes LVM snapshot metrics, Grafana panels show snapshot age and count
|
||||
2. **Offsite sync monitoring**: Verify Prometheus alerts for OffsiteSyncStale/Failing
|
||||
3. **Disaster recovery test**: Restore a database from backup to verify the full backup→restore path works end-to-end
|
||||
4. **Capacity alerting**: Add alerts for:
|
||||
- SSD thin pool >80% full
|
||||
- HDD thin pool >80% full
|
||||
- NFS thick LV >85% full
|
||||
5. **Update memory/CLAUDE.md**: Store the new architecture mapping
|
||||
6. **Proxmox CSI VolumeSnapshot test**: Create a VolumeSnapshot of a database PV, restore it to a new PVC, verify data integrity
|
||||
|
||||
## Rollback Plan
|
||||
|
||||
Each phase is independently rollbackable:
|
||||
|
||||
| Phase | Rollback Procedure | Data Loss Risk |
|
||||
|-------|-------------------|----------------|
|
||||
| Phase 0 | Remove Proxmox CSI, NFS server, crons. No service impact | None |
|
||||
| Phase 1 | Switch PV back to TrueNAS NFS path. rsync delta back | None (TrueNAS still has original data) |
|
||||
| Phase 2 | Point services back at the old CNPG cluster; MySQL restore from dump; Vault restore from raft snapshot | Minimal (since last dump) |
|
||||
| Phase 3 | Re-create iSCSI PVC, rsync back | None |
|
||||
| Phase 4 | Boot TrueNAS VM, re-attach LVs (only possible before LV reclaim in step 4 — 1-week window) | N/A (only done after full verification) |
|
||||
|
||||
## Risk Register
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|
||||
|------|-----------|--------|------------|
|
||||
| Proxmox CSI plugin bug / incompatibility | Medium | High | Test extensively in Phase 0; keep TrueNAS alive until Phase 4 |
|
||||
| SCSI hotplug fails on specific VM | Low | Medium | Test on each node in Phase 0; fallback to NFS for that node |
|
||||
| NFS kernel server performance worse than TrueNAS | Low | Low | TrueNAS was double-CoW; host NFS on SAS 10K disk should be faster |
|
||||
| Proxmox API token permissions insufficient | Low | Low | Test all CSI operations in Phase 0 before any migration |
|
||||
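| rclone offsite sync misses files without zfs diff | Low | Medium | rclone walks the full tree and compares size and modification time on every run, so no change list is needed; the weekly `--full` sync reconciles any drift. Accept slightly longer runtime |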
|
||||
| LVM thin pool fills during migration | Low | High | Monitor pool usage during Phase 1-3; current usage is 37% |
|
||||
| Service depends on TrueNAS in unexpected way | Low | Medium | 1-week monitoring period in Phase 4 (step 3) after shutdown, before LVs are reclaimed |
|
||||
| Proxmox host reboot disrupts NFS + block PVs simultaneously | Medium | High | This is same as current (TrueNAS VM is on same host). No regression. Schedule reboots during maintenance windows |
|
||||
| CNPG custom image missing extensions after migration | Low | High | Verify extensions (pgvector, pgvecto-rs) in CNPG image before migration; build custom image if needed |
|
||||
| NFS ports blocked by pfSense between VLANs | Medium | High | Test NFS connectivity from K8s nodes to PVE host in Phase 0.0 pre-flight |
|
||||
| Corrupted ZFS data migrated to new storage | Low | High | Check `zpool status -v` before migration; restore corrupted files from Synology backup first |
|
||||
|
||||
## Success Criteria
|
||||
|
||||
- [ ] All services healthy on new storage for 1+ week
|
||||
- [ ] All backup CronJobs green on Grafana dashboard
|
||||
- [ ] Offsite sync to Synology running with metrics
|
||||
- [ ] LVM snapshot cron running with metrics
|
||||
- [ ] TrueNAS VM shut down and resources reclaimed
|
||||
- [ ] No double-CoW — single LVM-thin CoW layer only
|
||||
- [ ] 16 vCPU + 16 GB RAM freed for K8s workloads
|
||||
- [ ] SCSI budget: ≤5 devices per node average, no single node exceeding 10
|
||||
- [ ] DR test: successfully restore at least 1 database from backup on new infrastructure
|
||||
|
||||
## Appendix A: Proxmox Host NFS vs TrueNAS NFS
|
||||
|
||||
| Property | TrueNAS NFS | Host NFS |
|
||||
|----------|-------------|----------|
|
||||
| CoW layers | 2 (ZFS + LVM-thin) | 0 (thick LV, ext4) |
|
||||
| Checksumming | ZFS (but can't repair — RAID0) | None (ext4) |
|
||||
| Compression | lz4 (1.26×) | None |
|
||||
| Network hop | VM NIC → bridge → physical | Direct on host |
|
||||
| RAM overhead | 16 GB (ZFS ARC) | ~0 (kernel NFS is lightweight) |
|
||||
| Management UI | TrueNAS WebUI | /etc/exports (text file) |
|
||||
| Snapshot quality | ZFS (excellent but corrupted) | LVM thick — no snapshots (use backups) |
|
||||
| Effective capacity | ~1.26× via lz4 compression (~800G on disk for 1 TiB logical) | 1:1 (no compression). Allocate 1 TiB for ~1 TiB of data. Monitor usage: current NFS data is 1.39 TiB as stored on ZFS and will need at least that much on ext4, although already-compressed media (Immich photos/video) gains little from lz4 in the first place |
|
||||
|
||||
**Note on capacity**: Losing ZFS lz4 compression (1.26×) means effective capacity drops. Current NFS data is 1.39 TiB compressed. Uncompressed, this could be ~1.75 TiB (1.39 × 1.26). The 1050G thick LV on sda is unlikely to be sufficient for all data. **Mitigation**: Monitor during Phase 1 migration. If approaching 85%, either (a) extend the LV into the ~50G left unallocated in Phase 0.1 (sda has 1.1 TiB total), or (b) keep large datasets (Immich ~800G) on a separate LV on sdc's thin pool.
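
The estimate in the note can be reproduced, and re-run with fresh numbers from the capacity pre-validation, using a couple of lines of shell. This is a sketch only: `logical_tib` and `fits_in_lv` are hypothetical helpers, and the inputs are the figures quoted in this appendix.

```bash
# Hypothetical helper: estimate logical (uncompressed) size in TiB from the
# on-disk size ($1, TiB) and the ZFS compressratio ($2).
logical_tib() {
  awk -v used="$1" -v ratio="$2" 'BEGIN { printf "%.2f\n", used * ratio }'
}

# Succeed if an estimate of $1 TiB fits in an LV of $2 GiB.
fits_in_lv() {
  awk -v tib="$1" -v lv_gib="$2" 'BEGIN { exit !(tib * 1024 <= lv_gib) }'
}

est=$(logical_tib 1.39 1.26)
echo "estimated uncompressed size: ${est} TiB"
if ! fits_in_lv "$est" 1050; then
  echo "does not fit in the 1050G nfs-data LV; plan the Immich overflow LV"
fi
```

The ratio is a pool-wide average, so the real figure depends on which datasets actually move to `/srv/nfs`; excluded paths and already-compressed media pull the estimate down.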
|
||||
|
||||
## Appendix A.1: Superseded Plans
|
||||
|
||||
This plan **supersedes** the pending "iSCSI PV pin & rename migration" plan (`~/.claude/plans/ticklish-singing-donut.md`). That plan proposed renaming iSCSI PVs on TrueNAS — since TrueNAS is being eliminated entirely, the rename migration is no longer needed. All iSCSI PVs will be replaced with Proxmox CSI block PVs in Phase 2-3.
|
||||
|
||||
## Appendix B: Proxmox CSI StorageClass Definitions
|
||||
|
||||
**Important**: The `storage` parameter must reference a **Proxmox storage ID** (as configured in Datacenter → Storage in the Proxmox UI), NOT the raw LVM thin pool name. The SSD storage must be registered in Phase 0.3 before these StorageClasses will work.
|
||||
|
||||
```yaml
|
||||
# proxmox-ssd StorageClass
|
||||
apiVersion: storage.k8s.io/v1
|
||||
kind: StorageClass
|
||||
metadata:
|
||||
name: proxmox-ssd
|
||||
provisioner: csi.proxmox.sinextra.dev
|
||||
parameters:
|
||||
storage: ssd-csi # Proxmox storage ID (registered in Phase 0.3, points to ssd/ssd-data thin pool)
|
||||
ssd: "true"
|
||||
cache: none # Required for databases — ensures fsync reaches disk
|
||||
reclaimPolicy: Retain
|
||||
volumeBindingMode: WaitForFirstConsumer
|
||||
allowVolumeExpansion: true
|
||||
|
||||
---
|
||||
# proxmox-hdd StorageClass
|
||||
apiVersion: storage.k8s.io/v1
|
||||
kind: StorageClass
|
||||
metadata:
|
||||
name: proxmox-hdd
|
||||
provisioner: csi.proxmox.sinextra.dev
|
||||
parameters:
|
||||
storage: local-lvm # Proxmox storage ID (already exists, points to pve/data thin pool)
|
||||
ssd: "false"
|
||||
cache: writethrough # Balance performance and safety for TSDB/model workloads
|
||||
reclaimPolicy: Retain
|
||||
volumeBindingMode: WaitForFirstConsumer
|
||||
allowVolumeExpansion: true
|
||||
```
|
||||
|
||||
Note: `volumeBindingMode: WaitForFirstConsumer` ensures PVs are created on the same node as the pod, preventing cross-node scheduling issues. Combined with anti-affinity rules on database StatefulSets, this spreads block PVs across nodes and avoids SCSI budget concentration.
|
||||
|
||||
## Appendix C: SCSI Device Distribution
|
||||
|
||||
Proxmox CSI hotplugs SCSI devices into VMs. Each VM supports up to 30 SCSI devices (scsi0-scsi29). With boot disk using scsi0, 29 slots remain per node.
|
||||
|
||||
Current plan uses ~14 block PVs total across 5 nodes:
|
||||
- Databases (CNPG ×3, MySQL ×3, Redis ×2, Vault ×3, Vaultwarden ×1) = 12
|
||||
- Large workloads (Prometheus ×1, Ollama ×1) = 2
|
||||
- Total: 14 PVs across 5 nodes = ~3 per node average
|
||||
|
||||
Remaining capacity: 14 PVs using ~3 SCSI slots per node leaves ~26 free slots per node. Even if scheduler imbalance puts 8-10 on one node, that's still well under the 29-slot limit. Anti-affinity rules on database StatefulSets ensure spread.
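
The slot arithmetic above is simple enough to keep as a reusable check when more block PVs are added later. A small sketch, with `free_slots_per_node` as a hypothetical helper and all figures taken from this appendix:

```bash
# Hypothetical helper: free SCSI slots per node if PVs were spread evenly.
# Arguments: total block PVs, node count, usable slots per node.
free_slots_per_node() {
  total="$1"; nodes="$2"; slots="$3"
  per_node=$(( (total + nodes - 1) / nodes ))   # ceiling division
  echo $(( slots - per_node ))
}

free_slots_per_node 14 5 29    # even spread, prints 26
echo $(( 29 - 10 ))            # worst case considered above (10 PVs on one node), prints 19
```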
|
||||
|
||||
## Appendix D: Data Sizes for Migration Planning
|
||||
|
||||
| Service | Current Size (approx) | Migration Method | Expected Duration |
|---------|-----------------------|------------------|-------------------|
| Immich | ~800 GiB (photos/video) | rsync NFS→NFS | 30-60 min |
| servarr/downloads | ~200 GiB | rsync NFS→NFS | 15-30 min |
| ytdlp | ~50 GiB | rsync NFS→NFS | 5-10 min |
| Prometheus TSDB | ~27 GiB | rsync iSCSI→block | 5-10 min |
| CNPG PostgreSQL | ~10 GiB | pg_dumpall / restore | 10-15 min |
| MySQL InnoDB | ~5 GiB | mysqldump/restore | 5 min |
| All other NFS services | <5 GiB each | rsync NFS→NFS | <2 min each |
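
A rough way to sanity-check the durations in this table is to derive them from size and sustained throughput. The sketch below is only that: `eta_minutes` is a hypothetical helper, and the 300 MiB/s figure is an assumed value that should be replaced with a measured rsync rate.

```bash
#!/usr/bin/env bash
# Illustrative estimate: minutes to copy SIZE_GIB at a sustained MIB_PER_S.
# The throughput is an assumption to be measured, not a value from the plan.
eta_minutes() {
  local size_gib=$1 mib_per_s=$2
  local total_mib=$(( size_gib * 1024 ))
  local per_minute=$(( mib_per_s * 60 ))
  echo $(( (total_mib + per_minute - 1) / per_minute ))   # round up to whole minutes
}

eta_minutes 800 300   # Immich at an assumed 300 MiB/s: prints 46
```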
|
||||
88
docs/plans/2026-04-03-proxmox-csi-cleanup-todo.md
Normal file
|
|
@ -0,0 +1,88 @@
|
|||
# Proxmox CSI Migration — Cleanup TODO
|
||||
|
||||
**Date**: 2026-04-03
|
||||
**Status**: Pending (do when confident everything is stable)
|
||||
**Prerequisites**: All services healthy on proxmox-lvm for 1+ week
|
||||
|
||||
## Context
|
||||
|
||||
The iSCSI → Proxmox CSI migration is complete. All 13 block PVCs are on `proxmox-lvm`, all 41 databases (21 PG + 20 MySQL) restored and verified. This doc tracks the remaining cleanup.
|
||||
|
||||
## TODO
|
||||
|
||||
### 1. Remove democratic-csi iSCSI stack
|
||||
|
||||
Frees 5 pods (~500Mi RAM), removes unused CSI driver.
|
||||
|
||||
```bash
# Delete Helm release
KUBECONFIG=./config helm delete democratic-csi-iscsi -n iscsi-csi

# Delete namespace
kubectl delete namespace iscsi-csi

# Remove iscsi-truenas StorageClass (verify no PVCs reference it first)
kubectl get pvc -A | grep iscsi-truenas # should only show orphaned PVCs
kubectl delete storageclass iscsi-truenas

# Remove Terraform stack (or mark as disabled)
# Option A: delete stacks/iscsi-csi/ directory
# Option B: keep for reference, remove from CI pipeline
```
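
The "verify no PVCs reference it first" check can be made mechanical. The sketch below is stricter than the step above: it refuses while any PVC, orphaned or not, still names the class, so it fits best after step 2. The function names are hypothetical; the `custom-columns` output is used because the default table shifts columns for Pending PVCs.

```bash
#!/usr/bin/env bash
# Count PVCs that still reference a StorageClass.
# Reads "namespace name storageclass" rows on stdin, e.g. from:
#   kubectl get pvc -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName
pvcs_using_class() {
  local class=$1
  awk -v sc="$class" '$3 == sc { n++ } END { print n + 0 }'
}

# Hypothetical guard: delete the class only once nothing references it.
safe_delete_storageclass() {
  local class=$1 count
  count=$(kubectl get pvc -A --no-headers \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName \
    | pvcs_using_class "$class")
  if [ "$count" -gt 0 ]; then
    echo "refusing: $count PVC(s) still use $class" >&2
    return 1
  fi
  kubectl delete storageclass "$class"
}
```

Calling `safe_delete_storageclass iscsi-truenas` would then replace the manual grep-and-delete pair.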
|
||||
|
||||
### 2. Delete orphaned iSCSI PVCs
|
||||
|
||||
These are old copies from before the migration. No pods mount them.
|
||||
|
||||
```bash
# Verify nothing mounts them
for pvc in old-pg-data old-mysql-data; do
  kubectl get pods -n dbaas -o json | grep -q "$pvc" && echo "IN USE: $pvc" || echo "SAFE: $pvc"
done

# Delete helper PVCs
kubectl delete pvc old-pg-data old-mysql-data -n dbaas

# Delete old service PVCs
kubectl delete pvc nextcloud-data-iscsi -n nextcloud
kubectl delete pvc novelapp-data -n novelapp
kubectl delete pvc vaultwarden-data-iscsi -n vaultwarden
kubectl delete pvc ebooks-calibre-config-iscsi -n ebooks
```
|
||||
|
||||
### 3. Clean up TrueNAS iSCSI zvols
|
||||
|
||||
After deleting PVCs, the underlying PVs (reclaimPolicy: Retain) and TrueNAS zvols remain.
|
||||
|
||||
```bash
# Delete Released PVs (-r: skip the delete when nothing matches)
kubectl get pv | grep Released | grep iscsi-truenas | awk '{print $1}' | xargs -r kubectl delete pv

# SSH to TrueNAS and list the CSI zvols (-r recurses into main/iscsi)
ssh root@10.0.10.15 'zfs list -t volume -r main/iscsi | grep csi-'
# Review list, then destroy each:
# zfs destroy main/iscsi/<zvol-name>
```
|
||||
|
||||
### 4. Remove Vault secrets (optional)
|
||||
|
||||
These were used by democratic-csi SSH driver. No longer needed.
|
||||
|
||||
```bash
# Overwrite the values in secret/platform (used by stacks/iscsi-csi/main.tf);
# the keys stay in place with a REMOVED placeholder
vault kv patch secret/platform truenas_api_key=REMOVED truenas_ssh_private_key=REMOVED
```
|
||||
|
||||
### 5. Update CLAUDE.md
|
||||
|
||||
Remove iSCSI references from:
|
||||
- `infra/.claude/CLAUDE.md` — Storage & Backup Architecture section
|
||||
- `AGENTS.md` if any storage references
|
||||
|
||||
### 6. Commit and push
|
||||
|
||||
```bash
git add stacks/ebooks/main.tf docs/ .claude/
git commit -m "proxmox-csi cleanup: remove democratic-csi, delete orphaned PVCs [ci skip]"
git push
```
|
||||
265
docs/plans/2026-04-20-infra-audit-design.md
Normal file
|
|
@ -0,0 +1,265 @@
|
|||
# Infra Audit — 2026-04-20
|
||||
|
||||
**Status**: Design (post-research, post-challenge)
|
||||
**Author**: Viktor Barzin (audit run by Claude)
|
||||
**Scope**: `infra/` Terragrunt stacks + platform services (`claude-agent-service`, `claude-memory-mcp`, `beadboard`, `broker-sync`)
|
||||
**Goals**: Reliability · Declarative-first · Reduced maintenance overhead · Maintained scalability
|
||||
**Method**: 5 parallel research agents (R1 Reliability, R2 Declarative, R3 Maintenance, R4 Scalability, R5 Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog below.
|
||||
|
||||
## Context
|
||||
|
||||
The home-lab has grown into a mature stack (105 Tier-1 Terragrunt stacks + 6 Tier-0 SOPS, CNPG, Vault+ESO, Kyverno, Traefik, Authentik, CrowdSec, Woodpecker CI, Redis-Sentinel, MySQL-standalone, Proxmox-NFS). Recent work has been consolidation: MySQL InnoDB-Cluster → standalone (2026-04-16), Redis Phase 7 refactor (2026-04-19), NFS fsid=0 SEV1 post-mortem (2026-04-14), Authentik outpost /dev/shm fix (2026-04-18). This audit surveys what remains — what's brittle, what's manual, what's dark, what hasn't caught up to recent decisions — and ranks fixes by impact and by operator fatigue.
|
||||
|
||||
## Corrections up-front (challenger round)
|
||||
|
||||
Before reading the backlog, these findings from the research phase are **dropped, corrected, or reframed** — challengers spot-checked live state and proved them wrong, already-solved, or intentional-by-design. Being honest about this is the point of the challenge round:
|
||||
|
||||
| Finding as stated | Actual state | Action |
|---|---|---|
| R4#1: Worker nodes 86-91% memory saturation | Live `kubectl top nodes`: 44-51% across k8s-node{1-4} | **DROPPED** — bad metric pull |
| R4#2: Frigate CPU unbounded (1.5 CPU request, no limit) | Cluster policy is **all CPU limits removed** to avoid CFS throttling (`infra/.claude/CLAUDE.md` → Resource Management) | **DROPPED** — by design |
| R4#7: Redis no `maxmemory-policy` | `infra/stacks/redis/modules/redis/main.tf:254` sets `maxmemory-policy allkeys-lru` (Phase 7, 2026-04-19) | **DROPPED** — already solved |
| R2#1: 307 Kyverno lifecycle markers is a drift risk | Markers are the **canonical discoverability tag** — `ignore_changes` only accepts static attribute paths, snippet convention is the only viable path; reframe as *"markers are fine, missing markers are the risk"* | **REFRAMED** |
| R2#3: 140 `ignore_changes` blocks | Actual: **310** across `.tf` files (2.2× off) | **CORRECTED** |
| R3#10: 65 CronJobs | Actual: 59 (10% off) | **CORRECTED** |
| R1#1: 47 deployments missing probes | Actual: **115 missing at least one probe; 103 missing both** | **CORRECTED (much worse than reported)** |
| R1#9: MySQL standalone no HA/PDB | Intentional post-2026-04-16 migration from InnoDB Cluster. Backup + restore matter; HA is explicitly deferred. | **REFRAMED** — split into HA (deferred) / backup-restore (open) / connection pool (open) |
| R1#10: PDB gaps include Traefik, Authentik | Traefik & Authentik PDBs `minAvailable=2` exist (CLAUDE.md). The real gaps are **CrowdSec LAPI, Calico-apiserver, ESO webhook, Woodpecker-server** | **CORRECTED (list pruned)** |
| R5#2: 4 Kyverno security policies in Audit | **All 16 ClusterPolicies are in Audit** — zero in Enforce. | **CORRECTED (worse)** |
|
||||
|
||||
---
|
||||
|
||||
## Executive summary — top 5 cross-cutting themes
|
||||
|
||||
These are the themes that survive the challenge round and hit ≥2 concerns. Each headline is a 1-line hook; deep-dives below.
|
||||
|
||||
1. **Declarative escape hatches (NFS exports, master-node file provisioners, null_resource initializers)** — `/etc/exports` is not in Terraform, which is the **root cause of the 2026-04-14 SEV1**; 6 null_resources + 3 SSH file provisioners still orchestrate critical state. *Hits R2 + R1 + R3.*
|
||||
2. **Observability has blind spots where pain would actually come from** — no OOMKill alert routing, no NFS capacity monitor, no GPU utilization dashboard, no ESO refresh-lag alert, no CronJob success-rate summary. Alerts exist but they don't cover the operator's real failure modes. *Hits R1 + R3 + R4.*
|
||||
3. **Supply-chain hygiene: image pinning + Renovate + admission signing** — 84 `:latest` tags in production TF, zero Renovate/Dependabot across 18 repos (~15 hr/mo toil by estimate), no cosign/trivy on push. Single theme unifies security posture, maintenance toil, and determinism. *Hits R3 + R5.*
|
||||
4. **Reliability-probes & graceful shutdown are genuinely uneven** — 115 deployments missing at least one probe (incl. 103 missing both), 50+ Recreate deployments with no `terminationGracePeriodSeconds`/`preStop`. This is the quietly-largest reliability debt. *Hits R1 + R3 (pager toil).*
|
||||
5. **Backup coverage is uneven: 30+ PVCs lack app-level CronJobs** — Proxmox host snapshots cover the disk, but Forgejo (!), Affine, Paperless, Hackmd, Matrix, Owntracks have no app-aware dumps. Restore granularity is file-level, not entity-level. *Hits R1 + R5 (compliance) + R3 (restore rehearsal toil).*
|
||||
|
||||
Honourable mentions that didn't make top 5 but sit just below: Kyverno audit→enforce transition (security), ESO refresh-lag alert (secrets reliability), Vault hardening (audit log offsite, root-token K8s-secret scope), Cloudflared tunnel-token SPOF (not replica SPOF — those are 3), Dolt PVC sizing + backup.
|
||||
|
||||
---
|
||||
|
||||
## Scoring method
|
||||
|
||||
Two parallel rankings — scan both.
|
||||
|
||||
**Rank A — Impact × Reversibility (the original formula)**
|
||||
`score = Impact × (6 - Effort) × (6 - Risk)` — each dimension 1-5.
|
||||
|
||||
**Rank B — Operator fatigue weight**
|
||||
`score = Impact × (6 - Effort) × FatigueWeight` where `FatigueWeight = 3` if the finding introduces *daily/weekly manual toil* and `1` otherwise. This re-ranks by how much pain the unfixed state causes per month.
|
||||
|
||||
Both rankings below. When they agree, that's the clear signal. When they diverge, that's where Rank B (fatigue) wins — Viktor has stated operator fatigue dominates abstract risk for a solo-operator lab.
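
As a quick check of the two formulas, here is a minimal sketch in shell arithmetic. The function names are illustrative, and the fatigue weight of 3 for F01 is an assumption chosen because it reproduces that row's Rank B value.

```bash
#!/usr/bin/env bash
# Rank A: Impact × (6 - Effort) × (6 - Risk), each dimension scored 1-5.
rank_a() {
  local impact=$1 effort=$2 risk=$3
  echo $(( impact * (6 - effort) * (6 - risk) ))
}

# Rank B: Impact × (6 - Effort) × FatigueWeight (3 for recurring toil, else 1).
rank_b() {
  local impact=$1 effort=$2 fatigue_weight=$3
  echo $(( impact * (6 - effort) * fatigue_weight ))
}

rank_a 5 3 2   # F01 (Impact 5, Effort 3, Risk 2): prints 60
rank_b 5 3 3   # F01 with an assumed FatigueWeight of 3: prints 45
```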
|
||||
|
||||
---
|
||||
|
||||
## Ranked backlog (filtered, deduplicated, corrected)
|
||||
|
||||
Counts below reflect **post-challenge corrected numbers**. Every row has a reference verified either by a spot-check (file:line) or a live cluster command.
|
||||
|
||||
| ID | Title | Concerns | Impact | Effort | Risk | Rank A | Rank B | Refs |
|---|---|---|---:|---:|---:|---:|---:|---|
| F01 | NFS `/etc/exports` not in Terraform (SEV1 root cause) | R2+R1 | 5 | 3 | 2 | **60** | **45** | `infra/scripts/pve-nfs-exports`, PM 2026-04-14 |
| F02 | 115 deployments missing probes (103 missing both) | R1+R3 | 5 | 3 | 2 | **60** | **45** | `kubectl get deploy -A -o json` |
| F03 | Zero Renovate/Dependabot across 18 repos | R3+R5 | 4 | 2 | 1 | **80** | **48** | `find /home/wizard/code -name ".renovaterc*"` → 0 results |
| F04 | 84 `:latest` image tags in production TF | R3+R5+R4 | 4 | 2 | 2 | **64** | **48** | `grep -rn ':latest' infra/stacks` |
| F05 | No OOMKill / unschedulable / node-CPU alert | R1+R4+R3 | 5 | 3 | 1 | **75** | **45** | Grep Prometheus rules — no `OOMKilling` rule present |
| F06 | 6 `null_resource` DB initializers in `dbaas` stack | R2 | 4 | 3 | 3 | **36** | **36** | `grep -n null_resource infra/stacks/dbaas` |
| F07 | 3 SSH+file provisioners on k8s-master (audit, OIDC, etcd) | R2 | 4 | 3 | 3 | **36** | **36** | `stacks/platform/modules/rbac/apiserver-oidc.tf` |
| F08 | ESO refresh-lag alert missing (52 ExternalSecrets) | R1+R5+R3 | 4 | 2 | 1 | **80** | **48** | `stacks/external-secrets/` — no PrometheusRule for refresh lag |
| F09 | 30+ PVCs without app-level backup CronJobs | R1+R5 | 4 | 3 | 2 | **48** | **36** | Affine, Forgejo, Hackmd, Matrix, Owntracks, Paperless (no `*-backup` CJ) |
| F10 | Cloudflared tunnel-token SPOF (replicas OK, token shared) | R1+R5 | 3 | 4 | 2 | **24** | **8** | `stacks/cloudflared/` single tunnel credential |
| F11 | MySQL restore never rehearsed end-to-end | R1+R4+R3 | 4 | 2 | 2 | **64** | **48** | No `mysql-restore-drill` CJ; runbook untested post-migration |
| F12 | Kyverno policies all 16 in Audit — **sequence carefully** | R2+R5 | 4 | 3 | **4** | **24** | **24** | `kubectl get clusterpolicy` |
| F13 | 97 RollingUpdate deployments lack explicit surge bounds | R1 | 2 | 2 | 2 | **32** | **12** | TF defaults inherit from Helm/k8s (25%/25%) |
| F14 | CronJob success-rate dashboard + alert rollup missing | R3+R4 | 3 | 2 | 1 | **60** | **36** | `CronJobTooOld` rule — partial; no 24h rollup |
| F15 | Authentik outpost /dev/shm fix applied via Helm API only | R1+R5 | 3 | 2 | 2 | **48** | **48** | Not in TF — upgrade-reversion risk |
| F16 | Dolt (beads DB) no backup CronJob — 2Gi PVC near full | R1+R4 | 4 | 2 | 2 | **64** | **32** | `stacks/beads/` — no `dolt-backup` CJ |
| F17 | Vault StatefulSet `updateStrategy=OnDelete` (manual roll) | R1+R3 | 2 | 2 | 3 | **24** | **24** | `kubectl get sts -n vault -o yaml` |
| F18 | No NetworkPolicies cluster-wide | R4+R5 | 4 | **5** | **4** | **8** | **8** | `kubectl get netpol -A` → 0-2 |
| F19 | RBAC `oidc-power-user` has cluster-wide secrets r/w | R5 | 4 | 3 | 3 | **36** | **12** | `stacks/platform/modules/rbac/` |
| F20 | No image supply-chain verification (cosign, trivy on push) | R5 | 4 | 4 | 3 | **24** | **8** | No admission controller for signatures |
| F21 | Vault audit log offsite backup not configured | R5+R1 | 3 | 2 | 1 | **60** | **36** | `stacks/vault/` — no `audit-log-sync` CJ |
| F22 | Claude-agent, beadboard, broker-sync singletons | R1 | 2 | 2 | 2 | **32** | **12** | `kubectl get deploy -n claude-agent,beadboard,broker-sync` |
| F23 | 50+ Recreate deployments lack graceful-shutdown hooks | R1+R3 | 3 | 3 | 2 | **36** | **36** | `grep -L terminationGracePeriodSeconds stacks/**` |
| F24 | CoreDNS scaled via `kubectl scale` not TF | R2 | 3 | 2 | 2 | **48** | **32** | Command in runbook; no TF resource for replicas |
| F25 | GPU / inference-latency SLO unmonitored | R4+R5 | 3 | 3 | 2 | **36** | **36** | No dcgm dashboard; Frigate liveness checks only |
| F26 | Prometheus TSDB 200Gi — retention untracked | R4 | 2 | 2 | 1 | **40** | **20** | `stacks/monitoring/` |
| F27 | Pod Security Standards labels unset on all namespaces | R5 | 3 | 2 | 3 | **36** | **12** | `kubectl get ns -o json \| jq '.items[].metadata.labels'` |
| F28 | Authentik worker VPA upperBound 2.3× actual request | R4 | 2 | 2 | 2 | **32** | **20** | Goldilocks dashboard |
| F29 | 9 DB rotation targets, no post-rotation verification loop | R5+R3 | 3 | 2 | 2 | **48** | **36** | Vault DB engine every 7d; no auto-verify |
| F30 | Tier-0 SOPS workflow 7-step vs 3-step Tier-1 | R3 | 2 | 2 | 1 | **40** | **20** | `scripts/state-sync` — manual decrypt/encrypt/commit |
|
||||
|
||||
**Rank A leaders (top 8)**: F03, F08, F05, F11, F04, F16, F01, F02 — "big cluster wins, cheap to try"
|
||||
**Rank B leaders (top 8)**: F03, F04, F08, F11, F15, F01, F02, F05 — "what's paining you weekly"
|
||||
|
||||
F03 (Renovate), F08 (ESO refresh alert), F11 (MySQL restore drill) and F01 (NFS in TF) lead in **both** rankings → these are the clear "do first" candidates.
|
||||
|
||||
---
|
||||
|
||||
## Per-concern deep dives
|
||||
|
||||
### R1 — Reliability (18 raw → 11 real after challenge)
|
||||
|
||||
Filtered: corrected R1#1 and R1#10 (wrong numbers) and reframed R1#9 (intentional choice). What actually matters:
|
||||
|
||||
- **Probes (F02)** — 115 deployments missing at least one probe; 103 missing both. The corrected count is 2.4× the original claim. Worst offenders are batch workloads (CronJob-spawned) that legitimately skip probes — but long-lived ones (Affine, Hackmd, mailserver sidecars) genuinely need them. Triage: filter by `spec.replicas ≥ 1` and `containers[].command != ["/bin/sh","-c"]`-style short-runners, then add readiness+liveness one-by-one.
|
||||
- **Cloudflared tunnel token SPOF (F10)** — Replicas are 3 (per CLAUDE.md), so the agent finding "SPOF" framed as replicas is wrong. The real SPOF is the *tunnel credential*. Secondary tunnel with weighted Cloudflare DNS records is the honest fix — medium effort, low urgency unless tunnel CA rolls keys.
|
||||
- **PDB gaps (F13-like, excluded from table)** — After challenger correction, gaps are: CrowdSec LAPI (3 replicas, no PDB), ESO webhook+controller, Woodpecker-server. Not urgent — drain-test with `kubectl drain --dry-run` shows no current issue.
|
||||
- **App-level backups (F09)** — Proxmox host captures the PVC contents nightly via LVM snapshot + rsync with `--link-dest` weekly versioning, so file-level recovery is covered. But for databases inside PVCs (e.g. Affine's Postgres in-pod, Paperless' SQLite), app-aware dumps give transactional consistency. Audit pass: enumerate every PVC without a sibling `*-backup` CronJob, add one for the ones that host embedded DBs.
|
||||
- **MySQL restore drill (F11)** — Migrated 4 days ago. Runbook exists. End-to-end restore (dump → new DB → connect an app → verify) hasn't been rehearsed. SEV1 risk if a dump has been silently broken since migration.
|
||||
- **Vault update strategy (F17)** — `OnDelete` means helm upgrade leaves pods untouched; must manually `kubectl delete pod` to restart. Low impact (infrequent) but procedural toil.
|
||||
- **Dolt PVC near-full + no backup (F16)** — `bd list --status in_progress` runs against this DB; it's load-bearing for cross-session task state. Grow the PVC (resize annotation) + add dolt dump CronJob.
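
The F09 audit pass ("enumerate every PVC without a sibling `*-backup` CronJob") can be sketched as plain text processing over two `kubectl` listings. Matching by namespace is an assumption, since the text does not say how a CronJob is paired with a PVC, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# List namespaces that own a PVC but have no CronJob whose name ends in "-backup".
# Matching by namespace is an assumption; adjust if backups live elsewhere.
#   $1: saved output of `kubectl get cronjobs -A --no-headers` (namespace, name, ...)
#   $2: saved output of `kubectl get pvc -A --no-headers`      (namespace, name, ...)
namespaces_without_backup() {
  local cronjobs_file=$1 pvcs_file=$2
  awk '
    FILENAME == ARGV[1] { if ($2 ~ /-backup$/) covered[$1] = 1; next }  # CronJob listing
    !($1 in covered) && !seen[$1]++ { print $1 }                        # PVC listing
  ' "$cronjobs_file" "$pvcs_file"
}
```

Each namespace printed is a candidate for review; only those hosting embedded databases need an app-aware dump.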
|
||||
|
||||
### R2 — Declarative Coverage & Drift (16 raw → 8 real)
|
||||
|
||||
Filtered: dropped R2#1 (Kyverno markers are by-design), corrected R2#3 to 310.
|
||||
|
||||
- **NFS exports (F01)** — The file is git-managed at `infra/scripts/pve-nfs-exports` but deployed via `scp + exportfs -ra`, not Terraform. This is the exact path that caused the 2026-04-14 SEV1 (fsid=0 on wrong exports line). Options: (a) `null_resource` with `local-exec scp + remote-exec exportfs -ra` triggered on hash of content (partial — SSH dep); (b) new module `pve_host_config` that templates and SCPs multiple PVE-host artifacts with checksum verification. (b) is the cleaner long-term fix.
|
||||
- **Null-resource initializers (F06)** — 6 in `dbaas` (MySQL users, CNPG cluster, TF-state role, payslip DB, job-hunter DB). Some are genuinely unavoidable (bootstrapping DB before the DB exists); others could use `postgresql_grant` / `mysql_user` providers.
|
||||
- **SSH file provisioners on k8s-master (F07)** — `apiserver-oidc.tf`, `audit-policy.tf`, `etcd tuning`. One-way sync, no drift detection. A quick-wins plan already exists (`2026-02-22-node-drift-quick-wins-design.md`); continue and finish it.
|
||||
- **CoreDNS scaling manual (F24)** — Current runbook uses `kubectl scale`/`set env`/`set affinity`. Drift-prone; convert to `kubernetes_deployment` TF resource overriding the Helm chart's scale/affinity fields.
|
||||
- **MySQL InnoDB Cluster + operator TF resources still present** — Phase 4 cleanup. Low urgency, but removing reduces cognitive load on anyone reading `stacks/dbaas/`.
|
||||
- **Technitium readiness-gate null_resource with `timestamp()` trigger** — Runs every apply, 3-6 min wall time. Replace with a real health-check on `terraform_data` with `triggers_replace = { checksum = sha256(config) }`.
|
||||
- **GPU node taints + Proxmox CSI labels via null_resource kubectl** — No drift detection. Fix is in the `2026-02-22-node-drift-quick-wins-design.md` plan.
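
For F01, the hash-triggered deploy in option (a) might look like the sketch below when written as a standalone script. It is a sketch under assumptions: the `root@` login, the state-file name, and the lint rule (at most one `fsid=0` export) are illustrative. The lint would not catch `fsid=0` sitting on the wrong line, only a duplicated one.

```bash
#!/usr/bin/env bash
# Sketch of option (a): lint the exports file, then deploy only when its hash changed.

# Fail if more than one non-comment export line carries fsid=0.
lint_exports() {
  local file=$1 count
  count=$(grep -v '^[[:space:]]*#' "$file" | grep -c 'fsid=0[,)]' || true)
  [ "$count" -le 1 ]
}

# Copy and reload only when the content hash differs from the last deployed one.
deploy_exports() {
  local file=$1 host=$2 state=${3:-.pve-nfs-exports.sha256}
  local new old
  if ! lint_exports "$file"; then
    echo "lint failed: multiple fsid=0 exports" >&2
    return 1
  fi
  new=$(sha256sum "$file" | cut -d' ' -f1)
  old=$(cat "$state" 2>/dev/null || true)
  if [ "$new" = "$old" ]; then
    echo "unchanged"
    return 0
  fi
  scp "$file" "root@${host}:/etc/exports" \
    && ssh "root@${host}" exportfs -ra \
    && echo "$new" > "$state"
}
```

A call such as `deploy_exports infra/scripts/pve-nfs-exports <pve-host>` keeps the git-managed file as the source of truth while skipping no-op runs.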
|
||||
|
||||
### R3 — Maintenance overhead (18 raw → 10 real)
|
||||
|
||||
- **Renovate (F03)** — The single highest-leverage maintenance fix. 18 repos × ~0.8 hrs/month manual version sweep = real time. Add `.github/renovate.json` (grouping rules for Terraform providers, K8s provider, Docker images) + auto-merge patch-level. Start with `infra/` only; expand after 2 weeks.
|
||||
- **Image pinning (F04)** — 84 `:latest` tags in production TF. Root CLAUDE.md still says "use 8-char git SHA tags" but that's not enforced. Admission control via Kyverno `require-trusted-registries` is in Audit today — add a sibling policy `forbid-latest-tag` also in Audit. Separate from F03 because pin-to-SHA + Renovate is a synergistic pair.
|
||||
- **MySQL restore drill (F11)** — tracked under R1 for impact; also a maintenance item because the restore *procedure* has not been test-updated since migration.
|
||||
- **CronJob alert rollup (F14)** — 59 CronJobs; "which were healthy last 24h" takes ad-hoc `kubectl get jobs --sort-by` scrolling. Add a Grafana panel with `kube_cronjob_status_last_successful_time < now - 2×schedule` summary.
|
||||
- **Graceful-shutdown toil (F23)** — 50+ Recreate deployments without `terminationGracePeriodSeconds` or `preStop`. Noisy pager hits after node drain. One-off sweep: add a 30s `terminationGracePeriodSeconds` default via Kyverno mutation rule.
|
||||
- **Tier-0 SOPS workflow (F30)** — 7-step decrypt/edit/encrypt/commit vs Tier-1's 3-step. Combined `tg` wrapper flag `--edit <stack>` that auto-decrypts → EDITOR → auto-encrypts → commit in one command. Moderate win; low risk.
|
||||
- **Stale `in_progress` beads** — 7 stale tasks in `bd list --status in_progress` at audit start. Session-end hook checks this; 3-5 days without notes is the signal. CLAUDE.md covers the rule — it's followed-sometimes, not enforced.
|
||||
- **Runbook staleness** — no `last_reviewed` frontmatter on runbook MDs; trivial to add. One-off sweep then keep it honest.
|
||||
- **CI/CD template unification** — "GHA build → Woodpecker deploy" is the documented pattern for 10 repos; rest still on Woodpecker-only. Track as follow-ups per repo in `bd`.
|
||||
- **Kyverno DNS-config boilerplate 307 markers** — Not a problem (see correction at top). Do add a lint rule in CI that flags any `kubernetes_deployment` without `# KYVERNO_LIFECYCLE_V1` marker; that's the real drift risk.
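
The F14 staleness rule (last success older than twice the schedule interval) reduces to one comparison. The sketch below assumes Unix timestamps, which is what `kube_cronjob_status_last_successful_time` reports; the function name and the sample timestamp are illustrative.

```bash
#!/usr/bin/env bash
# A CronJob counts as stale when its last success is older than twice its interval.
# Arguments: last-success Unix timestamp, interval in seconds, optional "now".
cronjob_is_stale() {
  local last_success=$1 interval_seconds=$2 now=${3:-$(date +%s)}
  [ $(( now - last_success )) -gt $(( 2 * interval_seconds )) ]
}

# A daily job that last succeeded 3 days ago is stale; one from 30 hours ago is not.
now=1776643200
cronjob_is_stale $(( now - 3 * 86400 )) 86400 "$now" && echo "stale"
cronjob_is_stale $(( now - 30 * 3600 )) 86400 "$now" || echo "healthy"
```

The factor of two gives each job one missed run of slack before it is flagged.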
|
||||
|
||||
### R4 — Scalability (18 raw → 9 real)
|
||||
|
||||
Filtered: dropped R4#1 (metric mispull), R4#2 (CPU-limit policy), R4#7 (Phase 7 solved).
|
||||
|
||||
- **CNPG memory headroom** — Currently 2Gi limit. Top-line metric at quiet time; add a `ContainerNearOOM > 85%` rule that watches CNPG specifically (general rule exists; CNPG is Tier 0 so deserves explicit binding).
|
||||
- **HPA cluster-wide: zero** — Every stateless service is 1:1. Not urgent at current node-CPU 8-31%, but one big feature (Immich re-index, Authentik load spike) tips the balance. Pilot: HPA on Traefik (CPU-driven), observe, expand.
|
||||
- **Redis no HPA + HAProxy singleton** — Wire Sentinel into direct client access (Phase 8 of Redis refactor, per R1#11 of raw findings). Currently all 17 consumers go via HAProxy — the single-point bypass was deliberate (simpler client config), but the HAProxy is now the SPOF Sentinel was meant to prevent. Worth a plan doc (`plans/2026-MM-DD-redis-phase8-sentinel-clients.md`).
|
||||
- **PgBouncer pool sizing unknown** — Authentik has 3 pods, each opening N connections. At load spikes (big org sync), pool exhaustion. Short-term: `pgbouncer_show_pools` metric + alert at 80% util. Longer-term: pool-size tuning based on observed wait times.
|
||||
- **Prometheus TSDB (F26)** — 200Gi retention unquantified. Risk: disk fills → scrape gaps → audit blind. Add `kubelet_volume_stats_used_bytes{persistentvolumeclaim="prometheus-server"} > 0.85 * capacity` alert.
|
||||
- **NFS capacity not monitored** — PVE host has 1TB HDD LV. No `node_filesystem_avail_bytes` scrape from PVE host (it's outside the cluster). Install node_exporter on PVE host; scrape via Prometheus federation or remote_write.
|
||||
- **VPA quarterly review unscheduled** — Goldilocks is in `Initial` mode (not Auto, by design). Review is manual per quarter. Calendar event + runbook link.
|
||||
- **Registry single instance** — Registry outage = no pod restarts. Post-mortem 2026-04-19 documented a container-engine pin; replica count still 1. Consider HA registry backed by S3-compat store (MinIO in-cluster) for the second replica — but low urgency given probe CJ monitors integrity every 15m.
|
||||
- **No ResourceQuota utilization alert** — Quota exhaustion invisible until a pod refuses to schedule. Add a `kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.85` rule; `ignoring(type)` is needed because the two sides differ in the `type` label.
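
The capacity alerts in this list follow one pattern: a used/capacity ratio compared against a threshold. Below is a minimal sketch that renders such a rule for a PVC, using the kubelet volume metrics. The function, rule, and alert names and the default `monitoring` namespace are placeholders, and it assumes the Prometheus Operator `PrometheusRule` CRD is what the cluster consumes.

```bash
#!/usr/bin/env bash
# Render a PrometheusRule that fires when a PVC passes a usage ratio.
# All names are placeholders; pipe the output to `kubectl apply -f -`.
render_pvc_usage_rule() {
  local pvc=$1 ratio=${2:-0.85} namespace=${3:-monitoring}
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${pvc}-usage
  namespace: ${namespace}
spec:
  groups:
    - name: pvc-usage
      rules:
        - alert: PVCNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim="${pvc}"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="${pvc}"}
              > ${ratio}
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC ${pvc} is above ${ratio} of capacity"
EOF
}
```

For the Prometheus TSDB item this would be invoked as `render_pvc_usage_rule prometheus-server 0.85`.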
|
||||
|
||||
### R5 — Security & Secrets (21 raw → 13 real)
|
||||
|
||||
- **Vault `vault-unseal-key` K8s Secret (F21-related)** — Challenger A said it wasn't present; it is (`kubectl get secret -n vault`). Used by auto-unseal. RBAC on the secret should restrict to `vault-server` SA only. Audit the `role` + `rolebinding` in `stacks/vault/`.
|
||||
- **Vault audit log offsite (F21)** — Rotated logs not synced to NFS backup. Add a `vault-audit-log-sync` CronJob or append the audit log path to `nfs-change-tracker` inotify list (zero-Terraform change if the latter).
|
||||
- **Kyverno audit → enforce (F12) — sequence carefully** — All 16 policies are in Audit today. Naive switch to Enforce will block legitimate workloads (Loki, Frigate, nvidia-device-plugin, wireguard have privileged/host-ns requirements — all documented). Plan: (a) generate `Kyverno PolicyException` CRs for known-good workloads first; (b) enforce one policy at a time, 1-week observation; (c) start with `require-trusted-registries` (least breakage risk). **DANGEROUS TO EXECUTE NAIVELY — don't batch.**
|
||||
- **No NetworkPolicies (F18)** — Challenger correctly flagged the effort (5) and risk (4): wrong NetworkPolicy stops Authentik from reaching its DB in minutes. Approach: allow-list namespace-wide first (e.g. `authentik` ns can reach `dbaas` on 5432), expand over a month. Single biggest latent security improvement but needs runway.
|
||||
- **RBAC oidc-power-user secrets r/w cluster-wide (F19)** — Scope down: list which Authentik groups get this binding, remove `secrets:*` from the cluster role, add namespace-scoped RoleBindings where needed. Medium effort, high leverage.
|
||||
- **Image supply chain (F20)** — cosign verification + admission controller is the mature path. Trivy-on-push fits in GHA workflows. Both unblocked after F04 (pinning).
|
||||
- **`:latest` tags (overlap F04)** — Security aspect: signed-image admission requires stable refs.
|
||||
- **Privileged containers** — Loki, WireGuard, NVIDIA, Frigate known-exceptions. Document the exceptions inline (comment block on the TF resource) so future maintainers don't accidentally "fix" them.
|
||||
- **Git history plaintext secrets** — Challenger B flagged unverified. One way to verify cheaply: `git secrets --scan-history`. Add it as a pre-audit one-off.
|
||||
- **CrowdSec Metabase disabled, no Prometheus exporter** — R5#18. Enable the Prometheus exporter (no Metabase) for attack-pattern visibility; very cheap.
|
||||
- **cert-manager evaluation paused** — Documented pause; TLS rotation relies on Cloudflare wildcard. Confirm no local `Ingress` uses a self-managed cert that could expire silently. `kubectl get cert -A` → expect 0.
|
||||
- **Pod Security Standards (F27)** — Label every namespace `pod-security.kubernetes.io/enforce=restricted` (or baseline). Known-exception namespaces get explicit downgrades. Medium effort, paid back by making future admission decisions uniform.
|
||||
- **CrowdSec LAPI quorum** — 3 replicas but quorum/consensus behavior undocumented. One-page runbook: what happens if 1, 2, or 3 LAPI pods die.
|
||||
- **Authentik outpost fix (F15)** — Applied via API, not TF. Next Helm upgrade reverts. Add the `/dev/shm` emptyDir to `stacks/authentik/values.yaml` templatefile.
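
For the F27 rollout, one low-risk approach is to print the label commands for review before running anything. In this sketch the exception list is a placeholder and the namespace names are assumed to match the workload names; the real list should come from the documented privileged workloads.

```bash
#!/usr/bin/env bash
# Print (not run) the label commands for a Pod Security Standards rollout.
# The exception list is a placeholder; namespace names are assumptions.
PSS_EXCEPTIONS="${PSS_EXCEPTIONS:-frigate loki wireguard}"

pss_label_commands() {
  local default_level=${1:-restricted} ns level
  while read -r ns; do
    [ -z "$ns" ] && continue
    level=$default_level
    case " $PSS_EXCEPTIONS " in
      *" $ns "*) level=privileged ;;   # known exception gets an explicit downgrade
    esac
    echo "kubectl label namespace $ns pod-security.kubernetes.io/enforce=$level --overwrite"
  done
}

# Usage: kubectl get ns -o name | cut -d/ -f2 | pss_label_commands baseline
```

Reviewing the printed commands first keeps the downgrade list explicit and easy to diff.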
|
||||
|
||||
---
|
||||
|
||||
## Dangerous-to-execute (handle with care)
|
||||
|
||||
Flagged by challengers; each needs a gradual rollout plan, not a single commit.
|
||||
|
||||
1. **F12 — Kyverno Audit → Enforce en masse**. Write `PolicyException` CRs for known-safe workloads first. One policy per week. Observe.
|
||||
2. **F18 — NetworkPolicies cluster-wide**. Default-deny breaks inter-namespace lookups silently. Namespace-by-namespace rollout, with `kubectl logs -f` tailing the policy-engine events.
|
||||
3. **PDB additions without drain-test**. New PDB + tight `minAvailable` can deadlock during node cordons. `kubectl drain --dry-run` every new PDB on every node first.
|
||||
4. **F20 — Signed-image admission**. Must follow F04 (pinning). Un-pinned admission = half the cluster fails to pull.
|
||||
|
||||
## Gaps the agents missed
|
||||
|
||||
From challenger "GAPS" analyses, collated:
|
||||
|
||||
- **Disaster-recovery drill coverage** — backup docs are comprehensive (CLAUDE.md is extensive). End-to-end *restore* rehearsal frequency = never documented. Track per-component: MySQL, PostgreSQL/CNPG, Vault, etcd, NFS, registry blobs.
|
||||
- **Service mesh evaluation** — Never formally evaluated (Istio, Linkerd, Cilium-in-mesh-mode). Could subsume NetworkPolicy effort + mTLS + observability. Worth a design doc even if answer is "no, too much complexity for the gain."
|
||||
- **Chaos engineering coverage** — Zero. No pod-kill cron, no node-failure drill. Low urgency given maturity, but would validate F02 probe quality and F23 graceful-shutdown coverage cheaply.
|
||||
- **Operator onboarding friction** — Nobody else in the "lab team" but Emo exists in `claude-agent-service`. If Emo needs to take over a component for a week, what's the runbook?
|
||||
- **Alert noise / fatigue rate** — No finding measured how many alerts actually page vs. auto-resolve. `alertmanager_notifications_total` by receiver is the metric; needs a Grafana panel.
|
||||
- **Secrets-in-image-layers** — Docker images built locally may contain secrets from build env. `trivy image --scanners secret` on registry images is a one-off audit.
|
||||
- **Runbook → post-mortem → runbook-update loop** — Post-mortem 2026-04-14 produced runbook updates; no general tracker that every incident produces a runbook change.
|
||||
|
||||
## Alternative framings (from challengers, preserved for future reference)
|
||||
|
||||
- **Split "MySQL singleton" into 3 items** (HA / backup / pool). Accepted — see R1 and R4 treatment.
|
||||
- **6th concern: Observability & Pager Fatigue** — Considered; the themes already hit R1+R3+R4 under Theme 2 of the executive summary. Keeping 5 concerns but carving "Observability gaps" as a theme, not a new research axis.
|
||||
- **One-thing-this-weekend**: Challenger B nominated *NFS in Terraform*, Challenger A nominated *`:latest` tag sweep*. F01 wins on SEV1 prevention; F04 wins on toil. Both valid. Pick by energy level: F01 is 1 deliberate session; F04 is low-cognition grep-replace.
|
||||
- **Re-rank by operator fatigue (Rank B) always**. Partially accepted — presented side-by-side in the table.
|
||||
|
||||
---
|
||||
|
||||
## Recommended next moves

Ordered for a solo operator balancing SEV-prevention, fatigue reduction, and preserved energy for larger work:

**Week 1 (SEV-prevention + quick-wins, low cognitive load):**
- F01: NFS exports into a `pve_host_config` Terraform module (one deliberate session)
- F04: Sweep `:latest` tags, add Kyverno `forbid-latest-tag` in Audit
- F08: ESO refresh-lag PrometheusRule
- F05: OOMKill / Unschedulable / Node-CPU PrometheusRule

**Week 2 (fatigue reduction):**
- F03: Renovate in `infra/` only (narrow pilot)
- F14: CronJob success-rate Grafana panel + alert rollup
- F16: Dolt backup CronJob + PVC grow
- F11: First MySQL restore drill (scheduled, documented)

**Month 2 (durable fixes, gradual):**
- F06/F07: Replace null_resources + SSH provisioners with native TF resources, one at a time
- F02: Probe sweep — add readiness+liveness to the 20 long-lived deployments first
- F12: Kyverno Enforce transition, one policy per week
- F15: Authentik outpost /dev/shm into values.yaml

**Month 3+ (structural):**
- F18: NetworkPolicies — namespace-by-namespace
- F19: RBAC scope-down
- F20: Signed-image admission
- Service-mesh evaluation (design doc)
- Restore-drill calendar for every backup target

No beads tasks auto-filed by this audit — user decides which findings merit `bd create`.
|
||||
|
||||
---
|
||||
|
||||
## Appendix — verification references (spot-checked)

Every numeric claim in the backlog was confirmed by one of these commands at audit time (2026-04-20):

| Claim | Command | Result |
|---|---|---|
| Node memory 44-51% | `kubectl top nodes --no-headers` | k8s-node1: 45%, node2: 51%, node3: 49%, node4: 44%, master: 17% |
| 115 deploys missing ≥1 probe | `kubectl get deploy -A -o json \| jq '[.items[] \| select(.spec.template.spec.containers[0].readinessProbe == null or .spec.template.spec.containers[0].livenessProbe == null)] \| length'` | 115 |
| 103 deploys missing BOTH probes | same, with `and` | 103 |
| 310 ignore_changes blocks | `grep -r "ignore_changes" infra --include=*.tf --include=*.hcl \| wc -l` | 310 |
| 59 CronJobs | `kubectl get cronjobs -A --no-headers \| wc -l` | 59 |
| All 16 Kyverno ClusterPolicies in Audit | `kubectl get clusterpolicy -o jsonpath='...validationFailureAction...'` | 16/16 Audit, 0 Enforce |
| Redis `maxmemory-policy allkeys-lru` | `grep -n maxmemory-policy infra/stacks/redis` | `modules/redis/main.tf:254` |
| Zero Renovate configs | `find /home/wizard/code -name '.renovaterc*' -o -name 'renovate.json' \| grep -v node_modules` | 0 |
| Vault `vault-unseal-key` Secret exists | `kubectl get secret -n vault` | present (37d old) |
| NFS `/etc/exports` not in TF | `grep -rn 'fsid=' infra/stacks` | 0 matches; only `infra/scripts/pve-nfs-exports` |
| Frigate CPU limit by policy | `infra/.claude/CLAUDE.md` → "All CPU limits removed cluster-wide" | confirmed |
| MySQL standalone intentional | `infra/.claude/CLAUDE.md` → "migrated from InnoDB Cluster 2026-04-16" | confirmed |

Other claims (84 `:latest` tags, 52 ExternalSecrets, 30+ PVCs without backup CJs) were surfaced by research agents; challengers spot-checked a subset and agreed the order-of-magnitude holds. Full list in `/home/wizard/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` research digest.
|
||||
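The two probe-count rows in the table can be cross-checked offline. The sketch below mirrors the `jq` filter in Python, assuming the output of `kubectl get deploy -A -o json` has already been parsed into a dict. Like the original filter, it inspects only the first container of each pod template. The function name and the sample objects are illustrative, not part of the audit tooling.

```python
def count_missing_probes(deploy_list, require_both=False):
    """Count Deployments whose first container lacks probes.

    require_both=False mirrors the `or` filter (at least one probe missing);
    require_both=True mirrors the `and` variant (both probes missing).
    """
    count = 0
    for item in deploy_list.get("items", []):
        # Like the jq filter, only containers[0] is inspected.
        container = item["spec"]["template"]["spec"]["containers"][0]
        no_readiness = container.get("readinessProbe") is None
        no_liveness = container.get("livenessProbe") is None
        if require_both:
            missing = no_readiness and no_liveness
        else:
            missing = no_readiness or no_liveness
        if missing:
            count += 1
    return count


def _deploy(**probes):
    """Build a minimal, made-up Deployment object for the example."""
    container = dict(name="app", **probes)
    return {"spec": {"template": {"spec": {"containers": [container]}}}}


sample = {
    "items": [
        _deploy(readinessProbe={}, livenessProbe={}),  # fully probed
        _deploy(readinessProbe={}),                    # liveness missing
        _deploy(),                                     # both missing
    ]
}

print(count_missing_probes(sample))                     # missing at least one
print(count_missing_probes(sample, require_both=True))  # missing both
```

Because only `containers[0]` is checked, sidecar-heavy Deployments may be over- or under-counted; that limitation is inherited from the filter in the table.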
|
||||
## Deliverable disposition

- This document is the audit output.
- No `bd` tasks were created by the audit. Pick findings to ticket after reading.
- When filing: use `F##` as a tag, title with the finding's headline, acceptance criteria from the deep-dive paragraph, priority from Rank B.
- Plan file at `~/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` retains the full 91-finding digest + challenger reports for reference; can be deleted after any follow-up tickets are filed.
|
||||
142
docs/plans/2026-04-25-nfs-hostile-migration-design.md
Normal file
|
|
@ -0,0 +1,142 @@
|
|||
# NFS-Hostile Workload Migration — Design

**Date**: 2026-04-25
**Author**: Viktor (with Claude)
**Status**: Phase 1 done, Phase 2 in progress
**Beads**: code-gy7h (Vault), code-ahr7 (Immich PG)
|
||||
|
||||
## Problem

The 2026-04-22 Vault Raft leader deadlock (post-mortem
`2026-04-22-vault-raft-leader-deadlock.md`) was traced to NFS client
writeback stalls poisoning kernel state. Recovery took 2h43m and
required hard-resetting 3 of 4 cluster VMs. Two workload classes on
NFS are NFS-hostile per the criteria in
`infra/.claude/CLAUDE.md` ("Critical services MUST NOT use NFS"):

1. **Postgres with WAL fsync per commit** — Immich primary
2. **Vault Raft consensus log** — fsync per append-entry, 3 replicas

Everything else on NFS (47 PVCs, ~455 GiB) is correctly placed:
RWX media libraries, append-only backups, ML caches.
|
||||
|
||||
## Decision

Migrate exactly those two workload classes to
`proxmox-lvm-encrypted` (LUKS2 LVM-thin via Proxmox CSI). No iSCSI,
no RWX media migration, no backup-target migration.
|
||||
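The placement rule behind this decision can be written down as a small classifier. This is only a sketch of the reasoning in this document, not the rule text from `infra/.claude/CLAUDE.md`: the `Workload` fields, the function name, and the generic `"nfs"` label are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    fsync_per_write: bool      # e.g. Postgres WAL commits, Raft append-entries
    needs_rwx: bool = False    # shared read-write access from several pods


def pick_storage(workload: Workload) -> str:
    """Return a storage label for a workload (illustrative rule only)."""
    if workload.fsync_per_write and workload.needs_rwx:
        # Block volumes here are RWO, so this combination needs a human decision.
        raise ValueError(f"{workload.name}: fsync-heavy and RWX, decide manually")
    if workload.fsync_per_write:
        return "proxmox-lvm-encrypted"
    # Media libraries, append-only backups, and caches stay where they are.
    return "nfs"


print(pick_storage(Workload("vault-raft", fsync_per_write=True)))
print(pick_storage(Workload("immich-library", fsync_per_write=False, needs_rwx=True)))
```

The point of the sketch is that only the fsync-per-write property moves a workload off NFS; size and sensitivity are not inputs to this particular decision.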
|
||||
## Rationale

- Block storage decouples PG / Raft fsync from NFS client kernel
  state. The failure mode that triggered the post-mortem cannot recur
  for these workloads.
- `proxmox-lvm-encrypted` is the documented default for sensitive data
  (`infra/.claude/CLAUDE.md` storage decision rule). It already backs
  ~28 PVCs across the cluster — pattern is proven.
- The existing nightly `lvm-pvc-snapshot` PVE host script (03:00, 7-day
  retention) picks up new PVCs automatically via thin snapshots — no
  extra backup wiring needed for the live data side.
- LUKS2 satisfies the "encrypted at rest for sensitive data" requirement.
|
||||
|
||||
## Out of scope

- iSCSI evaluation (already retired 2026-04-13).
- RWX media (Immich library, music, ebooks) — correct placement.
- Backup target PVCs (`*-backup` on NFS) — append-only, NFS-tolerant.
- Prometheus 200 GiB — already on `proxmox-lvm`.
|
||||
|
||||
## Pattern per workload

### Immich PG (single replica, Deployment, Recreate strategy)

- Add new RWO PVC on `proxmox-lvm-encrypted`.
- Quiesce app pods (server + ML + frame).
- `pg_dumpall` from running NFS pod → local file.
- Swap deployment `claim_name` → encrypted PVC.
- PG bootstraps fresh on empty PVC; restore dump.
- REINDEX vector indexes (`clip_index`, `face_index`).
- Backup CronJob keeps writing to NFS module (correct: append-only).

### Vault Raft (3 replicas, StatefulSet, helm-managed)

- Change `dataStorage.storageClass` and `auditStorage.storageClass`
  from `nfs-proxmox` → `proxmox-lvm-encrypted`.
- StatefulSet `volumeClaimTemplates` is immutable → use
  `kubectl delete sts vault --cascade=orphan` then re-apply (memory
  pattern for VCT swaps).
- Per-pod rolling: delete pod + PVCs, controller recreates with new
  template. Auto-unseal sidecar handles unseal; raft `retry_join`
  rejoins cluster.
- 24h validation window between pods. Migrate non-leader pods first;
  step-down current leader before migrating it last.
- Backup target (`vault-backup-host` on NFS) stays on NFS.
|
||||
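The "non-leaders first, leader last" ordering for the Vault roll can be expressed as a tiny helper. This is a sketch only: the function name is made up, and in practice the leader would be read from `vault operator raft list-peers` rather than passed in by hand.

```python
def rolling_order(peers, leader):
    """Return the pod migration order: every non-leader first, the leader last.

    The leader is expected to be stepped down just before its own turn.
    """
    if leader not in peers:
        raise ValueError(f"leader {leader!r} is not in the peer list")
    followers = [peer for peer in peers if peer != leader]
    return followers + [leader]


peers = ["vault-0", "vault-1", "vault-2"]
print(rolling_order(peers, leader="vault-2"))
print(rolling_order(peers, leader="vault-0"))
```

Computing the order from the observed leader, instead of hard-coding pod names, avoids the situation where a plan assumes one pod is the leader and a different one actually holds leadership at execution time.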
|
||||
## Risks and rollbacks

### Immich PG

- pg_dumpall captures schema + data, not file-level state. Vector
  index versions matter (vchord 0.3.0 unchanged; vector 0.8.0 →
  0.8.1 is a minor automatic bump on `CREATE EXTENSION` — confirmed
  benign). Rollback: revert `claim_name`, scale apps; old NFS PVC
  retained for 7 days post-migration.

### Vault Raft

- Cluster keeps quorum from the 2 remaining replicas while one pod is
  swapped. Migrating the leader last avoids quorum churn.
- Recovery anchor: pre-migration `vault operator raft snapshot save`
  + nightly `vault-raft-backup` CronJob. RTO < 1h via snapshot
  restore.
|
||||
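The quorum claim for the Vault roll follows from standard Raft majority arithmetic. A minimal sketch, with illustrative function names:

```python
def quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voters."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be offline while the cluster still has quorum."""
    return voters - quorum(voters)


for n in (3, 5):
    print(n, quorum(n), tolerated_failures(n))
```

With 3 voters the majority is 2, so exactly one pod can be offline at a time. That is why the roll proceeds one pod at a time and verifies the peer list before touching the next one.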
|
||||
## Helm `securityContext.pod` replace-not-merge (Vault, discovered during execution)

The Vault helm chart sets pod-level securityContext defaults
(`fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true`)
from chart templates, not from values.yaml. When `main.tf` provided its own
`server.statefulSet.securityContext.pod = {fsGroupChangePolicy = "OnRootMismatch"}`
the helm rendering REPLACED the chart defaults
rather than merging into them. On NFS this was harmless
(`async, insecure` exports made the volume world-writable enough for any UID),
but on a fresh ext4 LV via Proxmox CSI the volume root is `root:root`
and vault user (UID 100) cannot open `/vault/data/vault.db`.

vault-1 and vault-2 happened to be Running with the correct
securityContext because their pod specs were written into etcd
**before** the customization landed; helm chart upgrades don't
restart pods, so the broken values lay dormant until vault-0 was
recreated by the orphan-deleted STS during this migration.

Resolution: provide all five fields (`fsGroup`, `fsGroupChangePolicy`,
`runAsGroup`, `runAsUser`, `runAsNonRoot`) explicitly in main.tf so
`runAsGroup=1000` etc. survive future chart bumps. Idempotent on
both fresh PVCs and existing pods.
|
||||
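The replace-versus-merge behaviour described above can be modelled with plain dictionaries. This is a model of the observed outcome only, not the chart's actual template logic; the function names are illustrative, while the field values are the ones quoted in this section.

```python
# Chart defaults and the single override, both quoted from the text above.
CHART_DEFAULTS = {
    "fsGroup": 1000,
    "runAsGroup": 1000,
    "runAsUser": 100,
    "runAsNonRoot": True,
}
OVERRIDE_ONLY_POLICY = {"fsGroupChangePolicy": "OnRootMismatch"}


def render_replace(defaults, override):
    """Model of the observed behaviour: a supplied block replaces the defaults."""
    return dict(override) if override else dict(defaults)


def render_merge(defaults, override):
    """What one might have expected: override keys layered on the defaults."""
    return {**defaults, **override}


broken = render_replace(CHART_DEFAULTS, OVERRIDE_ONLY_POLICY)
expected = render_merge(CHART_DEFAULTS, OVERRIDE_ONLY_POLICY)

# The resolution: spell out all five fields so replacement loses nothing.
EXPLICIT = {**CHART_DEFAULTS, "fsGroupChangePolicy": "OnRootMismatch"}
fixed = render_replace(CHART_DEFAULTS, EXPLICIT)

print(sorted(broken))     # only the policy key survives
print(fixed == expected)  # explicit values match the merged intent
```

Under replace semantics the `fsGroup` key disappears entirely, which is the field a fresh ext4 volume needs so the volume root is group-owned by the vault user's group.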
|
||||
## Init container chicken-and-egg (Immich PG, discovered during execution)

The pre-existing `write-pg-override-conf` init container on the
Immich PG deployment writes `postgresql.override.conf` directly to
`PGDATA`. On a populated NFS PVC this was harmless (the data directory
was already initialised). On the fresh encrypted PVC, the file made
`initdb` refuse the non-empty directory and the pod CrashLoopBackOff'd.

Resolution: gate the init container on `PG_VERSION` presence — first
boot skips the override write, PG `initdb`s cleanly; force a pod
restart and the second boot writes the override and PG loads
`vchord` / `vectors` / `pg_prewarm` before the dump restore. Change
is permanent and idempotent (correct on both fresh and initialised
PVCs). The forced restart is needed only once, on the first boot of a
fresh PVC.
|
||||
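The `PG_VERSION` gate can be sketched as follows. The real init container is presumably a shell one-liner; this Python version only illustrates the control flow. The function names and the override text are placeholders, not the contents of the actual `postgresql.override.conf`.

```python
import tempfile
from pathlib import Path


def is_initialised(pgdata: Path) -> bool:
    """A data directory that `initdb` has populated contains a PG_VERSION file."""
    return (pgdata / "PG_VERSION").is_file()


def write_override_if_initialised(pgdata: Path, override_text: str) -> bool:
    """Write the override file only once the directory has been initialised.

    Returns True when the file was written, False when the write was skipped
    so that `initdb` still finds an empty directory on first boot.
    """
    if not is_initialised(pgdata):
        return False
    (pgdata / "postgresql.override.conf").write_text(override_text)
    return True


# Simulate the two boots described above in a throwaway directory.
pgdata = Path(tempfile.mkdtemp())
first_boot = write_override_if_initialised(pgdata, "# placeholder settings\n")
(pgdata / "PG_VERSION").write_text("16\n")  # stands in for a completed initdb
second_boot = write_override_if_initialised(pgdata, "# placeholder settings\n")
print(first_boot, second_boot)
```

The gate is safe to leave in place permanently: on an already-initialised volume the check passes immediately and the container behaves as it did before the change.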
|
||||
## Verification

End-to-end DONE when:

- `kubectl get pvc -A | grep nfs-proxmox` returns only the
  `vault-backup-host` PVC (or zero, if backup PVC moves elsewhere).
- `vault operator raft list-peers` shows 3 voters on
  `proxmox-lvm-encrypted`, leader elected.
- Immich PG `\dx` matches pre-migration extensions (vector minor
  drift OK).
- `lvm-pvc-snapshot` captures new LVs in next 03:00 run.
- 7 consecutive days of clean backup CronJob runs and no new alerts.
|
||||
169
docs/plans/2026-04-25-nfs-hostile-migration-plan.md
Normal file
|
|
@ -0,0 +1,169 @@
|
|||
# NFS-Hostile Workload Migration — Plan

**Date**: 2026-04-25
**Design**: `2026-04-25-nfs-hostile-migration-design.md`
**Beads**: code-gy7h (Vault, epic), code-ahr7 (Immich PG)
|
||||
|
||||
## Phase 1 — Immich PG (DONE 2026-04-25)

| Step | Done |
|---|---|
| Snapshot extensions + row counts to `/tmp/immich-pre-migration-*` | ✓ |
| Quiesce `immich-server` + `immich-machine-learning` + `immich-frame` | ✓ |
| `pg_dumpall` → `/tmp/immich-pre-migration-<ts>.sql` (1.9 GB) | ✓ |
| Add `kubernetes_persistent_volume_claim.immich_postgresql_encrypted` (10Gi, autoresize 20Gi cap) | ✓ |
| Swap `claim_name` at `infra/stacks/immich/main.tf` deployment | ✓ |
| Patch init container to gate on `PG_VERSION` (chicken-and-egg fix) | ✓ |
| Force pod restart so override.conf gets written | ✓ |
| Restore dump | ✓ |
| `REINDEX clip_index`, `REINDEX face_index` | ✓ |
| Scale apps back up | ✓ |
| Verify: `\dx`, row counts (~111k assets), HTTP 200 internal/external | ✓ |
| LV present on PVE host (`vm-9999-pvc-...`) | ✓ |
|
||||
|
||||
### Phase 1 follow-ups (not blocking)

- Old NFS PVC `immich-postgresql-data-host` retained 7 days for
  rollback. After 2026-05-02: remove `module.nfs_postgresql_host`
  from `infra/stacks/immich/main.tf` and the CronJob's reference.
- Backup CronJob (`postgresql-backup`) still writes to the NFS
  module. After cleanup, point it at a dedicated backup PVC or to
  the existing `immich-backups` NFS share.
|
||||
|
||||
## Phase 2 — Vault Raft (DONE 2026-04-25)

**Phase 2 complete 2026-04-25; all 3 voters on `proxmox-lvm-encrypted`.**

### Pre-flight (T-0) — DONE 2026-04-25 15:50 UTC

- [x] Verify all 3 vault pods sealed=false, raft healthy.
- [x] Take fresh `vault operator raft snapshot save` (anchor saved at
      `/tmp/vault-pre-migration-20260425-155029.snap`, 1.5 MB).
- [ ] Optional: scale ESO to 0 — skipped (auto-unseal sidecar is
      independent; ESO refresh churn is non-disruptive for one swap).
- [x] Confirmed leader is **vault-2** → migrate vault-0 first
      (non-leader), vault-1 next, vault-2 last (with step-down).
      Plan originally assumed vault-0 was leader; same intent
      (non-leader first).
- [x] Thin pool headroom: 54.63% used, plenty for 6 × 2 GiB LVs.

### Step 0 — Helm values + StatefulSet swap — DONE 2026-04-25 16:08 UTC

- [x] Edit `infra/stacks/vault/main.tf`: change
      `dataStorage.storageClass` and `auditStorage.storageClass`
      from `nfs-proxmox` → `proxmox-lvm-encrypted`.
- [x] `kubectl -n vault delete sts vault --cascade=orphan` (StatefulSet
      `volumeClaimTemplates` is immutable; orphan keeps pods+PVCs
      alive while we recreate the controller with the new template).
- [x] `tg apply -target=helm_release.vault` → recreates STS with new
      VCT (full-stack `tg plan` blocks on unrelated
      for_each-with-apply-time-keys errors at lines 848/865/909/917; targeted
      apply on the helm release alone is the right scope here).
      Existing pods still on old NFS PVCs.
|
||||
|
||||
### Step 1 — Roll vault-0 first (non-leader) — DONE 2026-04-25 16:18 UTC

- [x] `kubectl -n vault delete pod vault-0 --grace-period=30`
- [x] `kubectl -n vault delete pvc data-vault-0 audit-vault-0`
- [x] STS controller recreated pod; new PVCs auto-provisioned on
      `proxmox-lvm-encrypted` (LVs `vm-9999-pvc-fb732fd7-...` data
      4.12%, `vm-9999-pvc-36451f42-...` audit 3.99%).
- [x] **Hit and fixed**: vault-0 CrashLoopBackOff'd with
      `permission denied` on `/vault/data/vault.db`. The helm chart's
      `statefulSet.securityContext.pod` block in main.tf only set
      `fsGroupChangePolicy`, replacing (not merging) the chart's defaults
      `fsGroup=1000, runAsGroup=1000, runAsUser=100, runAsNonRoot=true`.
      NFS exports made the missing fsGroup a
      no-op; ext4 LV needs it to chown the volume root for the
      vault user. Old vault-1/vault-2 pods were created before that
      block was added so they still had the chart-default
      securityContext from their original spec. Fix: provide all
      five fields explicitly in main.tf and re-apply. Same root
      cause will affect vault-1 and vault-2 swaps unless this stays
      in place.
- [x] Wait Ready; auto-unseal sidecar unsealed; `retry_join` rejoined
      raft cluster.
- [x] Verify: `vault operator raft list-peers` shows 3 voters,
      vault-0 follower, leader=vault-2. External HTTPS 200.

### Step 2 — 24h soak (SKIPPED per user direction 2026-04-25)

User instructed "continue with all the remaining actions" — soak
gates compressed to per-pod settle windows + raft-state verification
between rollings. No Raft alarms, no Vault errors observed at each
verification gate.
|
||||
|
||||
### Step 3 — Roll vault-1 — DONE 2026-04-25

- [x] Force-finalize PVCs to break re-mount race:
      `kubectl -n vault patch pvc data-vault-1 audit-vault-1 -p '{"metadata":{"finalizers":null}}' --type=merge`.
      (Initial pod-then-PVC delete recreated pod on the OLD NFS PVCs
      because pvc-protection finalizer hadn't cleared. Lesson learned
      and applied to vault-2 below.)
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.

### Step 4 — Settle window — DONE 2026-04-25

3-check verification over 90s; raft index advancing (2730010→2730012),
all 3 voters healthy.

### Step 5 — Roll vault-2 (leader) — DONE 2026-04-25

- [x] `vault operator step-down` on vault-2; vault-0 took leadership.
      Confirmed vault-0 active, vault-1+vault-2 standby before delete.
- [x] Snapshot anchor at `/tmp/vault-pre-vault2.snap` (1.5 MB) from new
      leader vault-0.
- [x] Force-finalize + delete PVCs + delete pod (lesson from vault-1).
- [x] Pod recreated on encrypted PVCs; auto-unsealed; rejoined raft.
- [x] `vault operator raft list-peers` shows 3 voters all healthy on
      encrypted storage; leader vault-0.
|
||||
|
||||
### Step 6 — Cleanup — DONE 2026-04-25

- [x] `kubectl get pvc -A` (all namespaces) shows zero PVCs on
      `nfs-proxmox` SC (only Released PVs remain → Phase 3).
- [x] Removed inline `kubernetes_storage_class.nfs_proxmox` from
      `infra/stacks/vault/main.tf` (was lines 29–42).
- [x] All 3 PVC pairs on `proxmox-lvm-encrypted`.
- [x] `vault operator raft autopilot state` healthy=true.
- [x] External `https://vault.viktorbarzin.me/v1/sys/health` = 200.
|
||||
|
||||
## Phase 3 — Released-PV cleanup (FOLLOW-UP)

### Step 3.1 — vault Released PVs — DONE 2026-04-25

6 vault NFS PVs (Released, `nfs-proxmox` SC, Retain policy) deleted
along with their NFS subdirectories on PVE host (~1.5 GB reclaimed):

| PV | Claim | Size on disk |
|---|---|---|
| pvc-004a5d3b-… | data-vault-2 | 45M |
| pvc-808a78ec-… | audit-vault-1 | 1.4M |
| pvc-918ee7c1-… | audit-vault-0 | 3.2M |
| pvc-9d2ddcb4-… | data-vault-0 | 46M |
| pvc-a659711d-… | data-vault-1 | 46M |
| pvc-d2e65109-… | audit-vault-2 | 1.4G |

Procedure: `kubectl delete pv <name>` (cluster object only — Retain
policy means CSI never touches NFS) then `rm -rf /srv/nfs/<dir>` on
192.168.1.127.
|
||||
|
||||
### Step 3.2 — Cluster-wide Released PV sweep (DEFERRED)

~50 other Released PVs persist across the cluster (~200 GiB on
`proxmox-lvm` and `proxmox-lvm-encrypted`). Out of scope for the
2026-04-25 NFS-hostile session per user direction. To reclaim:

1. List Released PVs, confirm LV exists on PVE.
2. `kubectl delete pv <name>`. With a `Retain` reclaim policy this
   removes only the cluster object; Kubernetes does not ask the CSI
   driver to delete the underlying LV, as in Step 3.1.
3. Remove any LV that survives manually:
   `lvremove pve/vm-9999-pvc-<uuid>`.
|
||||
|
||||
## Rollback

| Phase | Trigger | Action |
|---|---|---|
| 1 | Immich UI broken / data loss | Revert `claim_name`; restore from `/tmp/immich-pre-migration-*.sql` to old NFS PVC |
| 2 (mid-rolling) | Single pod broken | Delete the encrypted PVC; recreate with NFS SC explicitly; cluster keeps quorum from 2 healthy pods |
| 2 (post-rolling, raft corrupt) | Cluster-wide failure | `vault operator raft snapshot restore <pre-migration.snap>` |
| Catastrophic | All Vault data lost | Restore from latest `/srv/nfs/vault-backup/` snapshot via CronJob output |
|
||||
195
docs/plans/2026-05-07-forgejo-registry-consolidation-design.md
Normal file
|
|
@ -0,0 +1,195 @@
|
|||
# Forgejo Registry Consolidation — Design

**Date**: 2026-05-07
**Status**: Approved
|
||||
|
||||
## Problem

`registry-private` (the `registry:2` container on the docker-registry
VM at `10.0.20.10`) has hit `distribution#3324` corruption three
times in three weeks (2026-04-13, 2026-04-19, 2026-05-04). Each
incident required manual blob recovery and another round of
hardening to `cleanup-tags.sh` and the GC procedure. The integrity
probe catches it within 15 minutes now, but every hit still costs
~1h of cleanup, and we keep tightening the same loose screw.

Root cause is a known race in `distribution`: tag deletes that race
with concurrent garbage collection produce orphan OCI-index children.
Upstream has not patched it; our mitigations (probe, blob
fix-up script, idempotent cleanup) reduce blast radius but don't
remove the failure mode.

Forgejo (deployed for OAuth and personal repos at
`forgejo.viktorbarzin.me`) ships a built-in OCI registry as part of
the Packages feature, default-on in v11. Using it removes
`distribution`-the-engine from the path entirely, replaces it with
Forgejo's own implementation backed by Forgejo's DB+blob store, and
gets us source hosting + image hosting in one resource.

The PVE host RAM upgrade from 142GB to 272GB (memory id=569) means
the cluster can absorb the resource bump Forgejo needs for the
registry workload (384Mi → 1Gi).
|
||||
|
||||
## Decision

Move every image currently on `registry.viktorbarzin.me:5050` to
Forgejo's OCI registry at `forgejo.viktorbarzin.me`. Decommission
`registry-private` after a 14-day dual-push bake.

Pull-through caches for upstream registries (DockerHub, GHCR, Quay,
k8s.gcr, Kyverno) stay on the registry VM permanently — Forgejo
won't serve as a pull-through, so the chicken-and-egg of "Forgejo
pulling its own image through itself" never arises.
|
||||
|
||||
## Design

### Registry hostname

Image references become `forgejo.viktorbarzin.me/viktor/<image>:<tag>`.
The `viktor/` prefix is the Forgejo owner namespace; all current
private images ship under that single owner.

### Auth

Two service-account users:

| User | Scope | Vault key | Used by |
|---|---|---|---|
| `cluster-puller` | `read:package` | `secret/viktor/forgejo_pull_token` | cluster-wide `registry-credentials` Secret, monitoring probe |
| `ci-pusher` | `write:package` | `secret/ci/global/forgejo_push_token` | Woodpecker pipelines (synced via `vault-woodpecker-sync` CronJob) |

A third PAT (`secret/viktor/forgejo_cleanup_token`, also belongs to
`ci-pusher`) drives the retention CronJob — kept separate from the
push PAT so a leaked CI token doesn't immediately enable mass deletes.

PATs have no expiry. Rotation policy: regenerate via Forgejo Web UI
and `vault kv patch` if a leak is suspected; ESO/sync downstream is
automatic.
|
||||
|
||||
### Cluster pull path

`registry-credentials` is a single Secret in `kyverno` ns, cloned
into every namespace by the existing
`sync-registry-credentials` ClusterPolicy. We extend its
`dockerconfigjson` `auths` map with a fourth entry for
`forgejo.viktorbarzin.me`. **No new Secret, no new ClusterPolicy,
no `imagePullSecrets =` line edits across stacks.**

Containerd `hosts.toml` redirects `forgejo.viktorbarzin.me` → in-cluster
Traefik LB at `10.0.20.200`, the same pattern used for
`registry.viktorbarzin.me` → `10.0.20.10:5050`. Avoids hairpin NAT
through the WAN gateway for in-cluster pulls.
|
||||
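Extending the `auths` map amounts to adding one key to the existing `dockerconfigjson` document. The sketch below shows the shape of that change in Python. It is an illustration, not the Terraform in `registry-credentials.tf`; the three pre-existing hosts and the token value are placeholders.

```python
import base64
import json


def add_registry_auth(dockerconfig: dict, host: str, username: str, token: str) -> dict:
    """Return a copy of a dockerconfigjson document with one more `auths` entry."""
    auths = dict(dockerconfig.get("auths", {}))
    auths[host] = {
        "username": username,
        "password": token,
        # The `auth` field is base64("username:password").
        "auth": base64.b64encode(f"{username}:{token}".encode()).decode(),
    }
    return {**dockerconfig, "auths": auths}


# Placeholder for the three entries that already exist in the Secret.
existing = {
    "auths": {
        "registry-a.example": {"auth": "placeholder"},
        "registry-b.example": {"auth": "placeholder"},
        "registry-c.example": {"auth": "placeholder"},
    }
}

updated = add_registry_auth(
    existing, "forgejo.viktorbarzin.me", "cluster-puller", "dummy-token"
)
print(json.dumps(sorted(updated["auths"]), indent=2))
```

Because the change is additive, the existing entries keep working while the new host is introduced, which is what lets the cloned Secret roll out without touching any `imagePullSecrets` references.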
|
||||
### Push path

Woodpecker pipelines push to BOTH targets during the bake:

```yaml
- name: build-and-push
  image: woodpeckerci/plugin-docker-buildx
  settings:
    repo:
      - registry.viktorbarzin.me/<name>
      - forgejo.viktorbarzin.me/viktor/<name>
    logins:
      - registry: registry.viktorbarzin.me
        username:
          from_secret: registry_user
        password:
          from_secret: registry_password
      - registry: forgejo.viktorbarzin.me
        username:
          from_secret: forgejo_user
        password:
          from_secret: forgejo_push_token
```

The `vault-woodpecker-sync` CronJob (every 6h) propagates
`secret/ci/global` keys to every Woodpecker repo as global secrets.
|
||||
|
||||
### Retention

Forgejo's per-package "Cleanup Rules" UI is per-user runtime DB
state, not Terraform-driven. Retention runs as a CronJob in the
`forgejo` namespace, schedule `0 4 * * *`, that:

1. Lists all container packages under the `viktor` owner.
2. Groups by package name.
3. Keeps newest 10 versions + always keeps `latest`.
4. DELETEs the rest via `/api/v1/packages/{owner}/{type}/{name}/{version}`.

First 7 days run with `DRY_RUN=true` — script logs what it would
delete but issues no DELETE calls. After log review, flip the
`forgejo_cleanup_dry_run` local in `cleanup.tf` to false.
|
||||
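The selection logic in steps 2 to 4 can be sketched independently of the API calls. The real job is a shell script (`cleanup.sh`), so the Python below is only a model. It assumes each version carries a comparable creation timestamp and that `latest` does not count toward the 10 kept versions. Both are assumptions, as are all function names.

```python
def versions_to_delete(versions, keep=10, always_keep=("latest",)):
    """Pick the version tags to delete for one package.

    `versions` is a list of (tag, created) pairs, where `created` is any
    comparable timestamp. Protected tags are never deleted and, by
    assumption, do not count toward `keep`.
    """
    candidates = [v for v in versions if v[0] not in always_keep]
    newest_first = sorted(candidates, key=lambda v: v[1], reverse=True)
    return [tag for tag, _ in newest_first[keep:]]


def plan_cleanup(packages, keep=10):
    """Map each package name to the tags that would be deleted."""
    return {name: versions_to_delete(vs, keep) for name, vs in packages.items()}


def apply_cleanup(plan, delete, dry_run=True):
    """Log every planned deletion; call `delete` only when not in dry-run."""
    log = []
    for name, tags in plan.items():
        for tag in tags:
            log.append(f"{'would delete' if dry_run else 'deleting'} {name}:{tag}")
            if not dry_run:
                delete(name, tag)
    return log


# Made-up data: 12 numbered versions plus `latest`.
sample = {"app": [(f"v{i}", i) for i in range(12)] + [("latest", 5)]}
plan = plan_cleanup(sample)
deleted = []
print(apply_cleanup(plan, lambda n, t: deleted.append((n, t)), dry_run=True))
```

Keeping the plan step separate from the apply step is what makes the 7-day dry run cheap: the same plan is computed either way, and only the final call differs.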
|
||||
### Integrity monitoring

Mirror the existing `registry-integrity-probe` CronJob: walk
`/v2/_catalog`, walk every tag, HEAD every manifest + index child,
push `registry_manifest_integrity_*` metrics. Existing
Prometheus alerts fire on the `instance` label, so they cover both
probes automatically once the alert annotations are made
instance-aware (done in this change).
|
||||
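The walk performed by the probe can be sketched with the registry access abstracted behind callables, so the traversal can be read and tested without a network. This is not the probe's actual code; the three callables, the function name, and the sample data are hypothetical stand-ins for the catalog listing, tag listing, and HEAD requests.

```python
def find_broken_refs(repos, list_tags, list_children, head_ok):
    """Return (repo, tag, ref) triples for every reference that fails a HEAD check.

    repos:         iterable of repository names (from the catalog)
    list_tags:     repo -> iterable of tags
    list_children: (repo, tag) -> digests referenced by an index, or []
    head_ok:       (repo, ref) -> True when the manifest exists
    """
    failures = []
    for repo in repos:
        for tag in list_tags(repo):
            if not head_ok(repo, tag):
                failures.append((repo, tag, tag))
                continue  # children cannot be listed without the manifest
            for digest in list_children(repo, tag):
                if not head_ok(repo, digest):
                    failures.append((repo, tag, digest))
    return failures


# In-memory stand-in for a registry with one orphaned index child.
TAGS = {"viktor/app": ["1", "2"]}
CHILDREN = {("viktor/app", "1"): ["sha256:aaa", "sha256:bbb"], ("viktor/app", "2"): []}
PRESENT = {("viktor/app", "1"), ("viktor/app", "2"), ("viktor/app", "sha256:aaa")}

failures = find_broken_refs(
    TAGS,
    lambda repo: TAGS[repo],
    lambda repo, tag: CHILDREN[(repo, tag)],
    lambda repo, ref: (repo, ref) in PRESENT,
)
print(failures)
```

Checking index children, and not only the tagged manifest, matters because the corruption pattern described in the Problem section leaves the tag resolvable while a child of the OCI index is missing.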
|
||||
### Source migration

Projects currently living as plain dirs in the local-only monorepo
become standalone Forgejo repos. Two GitHub-hosted private repos
(`beadboard`, `claude-memory-mcp`) move to Forgejo and are archived
on GitHub.

CI standardises on Woodpecker for everything in scope. The two
projects that used GHA (build + Woodpecker-deploy via GHA-hosted
DockerHub push) keep DockerHub for legacy compatibility but their
canonical image source becomes Forgejo.
|
||||
|
||||
### Break-glass for infra-ci

`infra-ci` is the Docker image used by all infra Woodpecker
pipelines, including `default.yml` (terragrunt apply). If Forgejo is
unreachable at the moment we need to apply, `infra-ci` is
unreachable, and we can't apply our way out.

Mitigation: the dual-push step also runs `docker save | gzip` on the
built infra-ci image and writes the tarball to:

- `/opt/registry/data/private/_breakglass/infra-ci-<sha>.tar.gz` on
  the registry VM disk (Copy 1)
- `/srv/nfs/forgejo-breakglass/` on the NAS (Copy 2)

A `latest` symlink in each location points at the most recent tarball.
Recovery procedure (`docs/runbooks/forgejo-registry-breakglass.md`):
scp tarball → `docker load` → `ctr -n k8s.io images import` → fix
Forgejo via that node.
|
||||
|
||||
### Cutover style

**Dual-push bake**: pipelines push to both registries for ≥14 days.
Pods continue pulling from `registry.viktorbarzin.me`. After bake:

1. Per-project PR: flip `image=` lines in Terraform stacks. Pods
   re-pull naturally on the next rollout.
2. Phase 4: stop the `registry-private` container, remove its
   `auths` entry from the cluster Secret, and drop the containerd
   hosts.toml entry.
|
||||
|
||||
## Why not alternatives

| Option | Rejected because |
|---|---|
| Stay on `registry-private` | Three corruption incidents in three weeks; mitigation cost rising |
| Run a fresh registry container alongside (no Forgejo) | Same upstream, same `distribution#3324` failure mode |
| GHCR / DockerHub for all private images | Public-by-default model + push rate limits; loses owner-owned blob storage |
| Harbor | Heavier than Forgejo registry, would need its own DB + ingress, no source-hosting integration |
|
||||
|
||||
## Risks

See plan doc § "Risk register" for the full table. Top three:

1. **Forgejo registry hits the same corruption pattern.** Mitigated
   by 14-day bake + integrity probe within 15 min.
2. **Forgejo down → infra-ci unreachable → can't apply.** Mitigated
   by tarball break-glass on VM + NAS.
3. **Pod re-pulls fail after `image=` flip due to containerd cache
   poisoning.** Mitigated by hosts.toml deployment + per-project
   `kubectl rollout restart` in Phase 3.
|
||||
152
docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md
Normal file
|
|
@ -0,0 +1,152 @@
|
|||
# Forgejo Registry Consolidation — Plan

**Date**: 2026-05-07
**Status**: Approved — execution in progress (Phase 0)
**Design**: `2026-05-07-forgejo-registry-consolidation-design.md`

This is the implementation roadmap for migrating off `registry-private`
onto Forgejo's OCI registry. See the design doc for problem
statement and rationale. Execution spans 5 phases over ≥3 weeks.
|
||||
|
||||
## Phase 0 — Prepare Forgejo (1 PR, no cutover risk)

| Task | File / artifact |
|---|---|
| Bump Forgejo memory request+limit 384Mi → 1Gi | `infra/stacks/forgejo/main.tf` |
| Add `FORGEJO__packages__ENABLED=true` and `FORGEJO__packages__CHUNKED_UPLOAD_PATH=/data/tmp/package-upload` env vars (defensive — already default in v11) | `infra/stacks/forgejo/main.tf` |
| Bump Forgejo PVC 5Gi → 15Gi, auto-resize cap 20Gi → 50Gi | `infra/stacks/forgejo/main.tf` |
| Bump ingress `max_body_size = "5g"` (wired into ingress_factory as a Buffering middleware) | `infra/stacks/forgejo/main.tf`, `infra/modules/kubernetes/ingress_factory/main.tf` |
| Create `cluster-puller` (read:package), `ci-pusher` (write:package), and a third `cleanup` PAT on `ci-pusher`; store PATs in Vault | runbook: `docs/runbooks/forgejo-registry-setup.md` |
| Extend `registry-credentials` Secret with 4th `auths` entry for `forgejo.viktorbarzin.me` | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
| Add containerd `hosts.toml` entry redirecting `forgejo.viktorbarzin.me` → in-cluster Traefik LB `10.0.20.200` | `infra/stacks/infra/main.tf` cloud-init + new `infra/scripts/setup-forgejo-containerd-mirror.sh` for existing nodes |
| Forgejo retention CronJob (`0 4 * * *`, dry-run for first 7 days) | new `infra/stacks/forgejo/cleanup.tf` + `infra/stacks/forgejo/files/cleanup.sh` |
| Forgejo integrity probe CronJob (`*/15 * * * *`) | `infra/stacks/monitoring/modules/monitoring/main.tf` |
| Make existing alerts instance-aware so they cover both registries | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |

**Smoke test (must pass before declaring Phase 0 done):**

- `docker login forgejo.viktorbarzin.me` succeeds.
- Push a hello-world image to `forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds.
- `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` from a k8s
  node succeeds, using the auto-synced `registry-credentials` Secret.
- A fresh namespace gets the cloned Secret with 4 `auths` entries.
- Delete the smoketest package via API.
- Forgejo integrity probe completes once and pushes metrics.
|
||||
|
||||
## Phase 1 — Source migration (parallel-safe, no production impact)
|
||||
|
||||
For each project the recipe is identical:
|
||||
|
||||
1. `git init` + push to `forgejo.viktorbarzin.me/viktor/<name>` —
|
||||
register in Woodpecker via OAuth.
|
||||
2. Add `.woodpecker.yml` based on `payslip-ingest/.woodpecker.yml`.
|
||||
Push step uses `woodpeckerci/plugin-docker-buildx` with TWO
|
||||
`repo:` entries (dual-push).
|
||||
3. Confirm first build pushes to BOTH registries.
|
||||
|
||||
Projects (bake clock starts at "all dual-push"):
|
||||
|
||||
| Project | Action |
|
||||
|---|---|
|
||||
| `claude-agent-service` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `fire-planner` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `wealthfolio-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `hmrc-sync` | Extract from monorepo to Forgejo. New `.woodpecker.yml`. |
|
||||
| `freedify` | Push from monorepo to Forgejo. New `.woodpecker.yml`. (Upstream is gone.) |
|
||||
| `payslip-ingest` | Already on Forgejo. Add second `repo:` entry to `.woodpecker.yml`. |
|
||||
| `job-hunter` | Already on Forgejo. Add second `repo:` entry. |
|
||||
| `beadboard` | Push to Forgejo. New `.woodpecker.yml`. Disable GHA workflow. **Don't archive GitHub yet** (deferred to Phase 3). |
|
||||
| `claude-memory-mcp` | Push to Forgejo. New `.woodpecker.yml`. |
|
||||
| `infra-ci` | Edit `.woodpecker/build-ci-image.yml` to dual-push. ALSO `docker save \| gzip` to `/opt/registry/data/private/_breakglass/` on VM AND `/srv/nfs/forgejo-breakglass/` on NAS. Pin a `latest` symlink. |
|
||||
|
||||
Break-glass runbook (`docs/runbooks/forgejo-registry-breakglass.md`)
|
||||
documents the recovery path.
|
||||
|
||||
## Phase 2 — Bake (≥14 days)
|
||||
|
||||
- No `image=` lines change. Pods still pull from
|
||||
`registry.viktorbarzin.me`.
|
||||
- **Daily smoke check**: pull a recent image from Forgejo as
|
||||
`cluster-puller`, verify integrity (HEAD on manifest + each blob).
|
||||
- **Bake exit criteria**:
|
||||
- Zero `RegistryManifestIntegrityFailure` alerts on Forgejo.
|
||||
- Zero `ContainerNearOOM` for the forgejo pod.
|
||||
- Retention CronJob has run ≥14 times successfully.
|
||||
- At least one full Sunday GC cycle has elapsed.
|
||||
- Switch retention CronJob to `DRY_RUN=false` on day 7, observe
|
||||
until day 14.
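The bake exit criteria above can be turned into an explicit gate. The following is a hedged Python sketch: the `BakeStatus` fields are hypothetical names for numbers that would be collected by hand or from Prometheus, and nothing here is an existing script in the repo.

```python
from dataclasses import dataclass


@dataclass
class BakeStatus:
    days_elapsed: int
    integrity_failure_alerts: int  # RegistryManifestIntegrityFailure alerts on Forgejo
    near_oom_alerts: int           # ContainerNearOOM alerts for the forgejo pod
    retention_runs_ok: int         # successful retention CronJob runs
    sunday_gc_cycles: int          # full Sunday GC cycles elapsed


def unmet_criteria(status: BakeStatus) -> list[str]:
    """Return the exit criteria not yet satisfied; an empty list means Phase 3 can start."""
    checks = {
        "bake lasted at least 14 days": status.days_elapsed >= 14,
        "zero integrity-failure alerts": status.integrity_failure_alerts == 0,
        "zero near-OOM alerts": status.near_oom_alerts == 0,
        "retention CronJob ran >= 14 times": status.retention_runs_ok >= 14,
        "one full Sunday GC cycle elapsed": status.sunday_gc_cycles >= 1,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the unmet criteria rather than a bare boolean makes it obvious which condition is still blocking the cutover.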
|
||||
|
||||
## Phase 3 — Cutover (one PR per project, single session)
|
||||
|
||||
Order = lowest blast radius first. Each step:
|
||||
`image=` flip → `kubectl rollout restart` → verify pull from Forgejo.
|
||||
|
||||
1. `payslip-ingest` (`infra/stacks/payslip-ingest/main.tf`)
|
||||
2. `job-hunter` (`infra/stacks/job-hunter/main.tf`)
|
||||
3. `claude-agent-service` (`infra/stacks/claude-agent-service/main.tf`)
|
||||
4. `fire-planner` (`infra/stacks/fire-planner/main.tf`)
|
||||
5. `wealthfolio-sync` (`infra/stacks/wealthfolio/main.tf`)
|
||||
6. `freedify` (`infra/stacks/freedify/factory/main.tf`)
|
||||
7. `chrome-service` (`infra/stacks/chrome-service/main.tf`)
|
||||
8. `beads-server` / `beadboard` (`infra/stacks/beads-server/main.tf`).
|
||||
Then `gh repo archive ViktorBarzin/beadboard`.
|
||||
9. `infra-ci` — flip `image:` references in 4 `.woodpecker/*.yml`
|
||||
files in the infra repo. Verify next push to master applies cleanly.
|
||||
10. `claude-memory-mcp` — update `CLAUDE.md` install instruction from
|
||||
`claude plugins install github:ViktorBarzin/claude-memory-mcp` to
|
||||
`claude plugins install https://forgejo.viktorbarzin.me/viktor/claude-memory-mcp.git`.
|
||||
`gh repo archive ViktorBarzin/claude-memory-mcp`.
|
||||
|
||||
## Phase 4 — Decommission
|
||||
|
||||
| Step | File / location |
|
||||
|---|---|
|
||||
| Stop `registry-private` container on VM (10.0.20.10): edit `/opt/registry/docker-compose.yml`, comment out service, `docker compose up -d --remove-orphans`. (Manual SSH — cloud-init won't redeploy on TF apply per memory id=1078.) | live VM |
|
||||
| Update cloud-init template to match the new compose file | `infra/stacks/infra/main.tf:288` |
|
||||
| Delete `auths` entries for `registry.viktorbarzin.me` / `:5050` / `10.0.20.10:5050` from the dockerconfigjson | `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` |
|
||||
| Drop `registry.viktorbarzin.me` and `10.0.20.10:5050` `hosts.toml` entries on each node + cloud-init template | `infra/stacks/infra/main.tf` cloud-init + ad-hoc script |
|
||||
| After 1 week of no incidents, delete `/opt/registry/data/private/` blob storage on the VM (~2.6GB freed) | manual SSH |
|
||||
|
||||
## Phase 5 — Docs
|
||||
|
||||
In the same commit as the Phase 4 closing:
|
||||
|
||||
| Doc | Update |
|
||||
|---|---|
|
||||
| `docs/runbooks/registry-vm.md` | Note `registry-private` is gone; pull-through caches and break-glass tarballs only |
|
||||
| `docs/runbooks/registry-rebuild-image.md` | Replaced by NEW `forgejo-registry-rebuild-image.md` |
|
||||
| `docs/runbooks/forgejo-registry-rebuild-image.md` (NEW) | Forgejo PVC restore procedure |
|
||||
| `docs/runbooks/forgejo-registry-breakglass.md` (NEW) | infra-ci tarball recovery |
|
||||
| `docs/architecture/ci-cd.md` | Image registry section flips to Forgejo |
|
||||
| `docs/architecture/monitoring.md` | Integrity probe target updated |
|
||||
| `infra/.claude/CLAUDE.md` | Registry references updated |
|
||||
| `CLAUDE.md` (monorepo root) | claude-memory-mcp install URL updated |
|
||||
| `infra/.claude/reference/service-catalog.md` | Cross-reference checked |
|
||||
|
||||
## Critical files modified
|
||||
|
||||
| File | Phase | What |
|
||||
|---|---|---|
|
||||
| `infra/stacks/forgejo/main.tf` | 0 | Memory bump, packages env vars, PVC bump, ingress max_body_size |
|
||||
| `infra/stacks/forgejo/cleanup.tf` (NEW) | 0 | Retention CronJob |
|
||||
| `infra/stacks/forgejo/files/cleanup.sh` (NEW) | 0 | Retention script (mounted via ConfigMap) |
|
||||
| `infra/modules/kubernetes/ingress_factory/main.tf` | 0 | Wire `max_body_size` into a Traefik Buffering middleware |
|
||||
| `infra/stacks/kyverno/modules/kyverno/registry-credentials.tf` | 0 | Add 4th `auths` entry |
|
||||
| `infra/stacks/infra/main.tf` | 0 + 4 | Containerd hosts.toml block (add Forgejo, later remove registry-private); compose template update |
|
||||
| `infra/scripts/setup-forgejo-containerd-mirror.sh` (NEW) | 0 | One-shot rollout for existing nodes |
|
||||
| `infra/stacks/monitoring/modules/monitoring/main.tf` | 0 | Forgejo integrity probe CronJob |
|
||||
| `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` | 0 | Make alerts instance-aware |
|
||||
| `infra/stacks/monitoring/main.tf` | 0 | Plumb `forgejo_pull_token` into module |
|
||||
| `infra/.woodpecker/build-ci-image.yml` | 1 | Dual-push to add Forgejo target + tarball break-glass |
|
||||
| `<each-project>/.woodpecker.yml` | 1 | Dual-push (NEW for claude-agent-service, fire-planner, wealthfolio-sync, hmrc-sync, freedify, beadboard, claude-memory-mcp; EDIT for payslip-ingest, job-hunter) |
|
||||
| `infra/.woodpecker/{default,drift-detection,build-cli}.yml` | 3 | Flip `image:` to Forgejo for infra-ci |
|
||||
| `infra/stacks/{beads-server,chrome-service,claude-agent-service,fire-planner,freedify/factory,job-hunter,payslip-ingest,wealthfolio}/main.tf` | 3 | Flip `image =` to Forgejo |
|
||||
|
||||
## Verification
|
||||
|
||||
- **Push** (Phase 0/1): `docker push forgejo.viktorbarzin.me/viktor/<name>` visible in Forgejo Web UI under viktor/.
|
||||
- **Pull** (Phase 0): `crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1` succeeds with auto-synced Secret.
|
||||
- **Dual-push** (Phase 1): every Woodpecker pipeline run pushes to BOTH endpoints — confirmed via HEAD checks on `<reg>:<sha>` for both.
|
||||
- **Bake** (Phase 2): existing daily Forgejo `/api/healthz` external monitor stays green; integrity probe stays green; no `ContainerNearOOM` for forgejo pod.
|
||||
- **Cutover** (Phase 3): `kubectl rollout status deploy/<svc> -n <ns>` succeeds. `kubectl describe pod` shows the image was pulled from `forgejo.viktorbarzin.me`.
|
||||
- **Decommission** (Phase 4): `docker ps` on registry VM no longer shows `registry-private`. Brand-new namespace gets the Secret with only the Forgejo `auths` entry. Pull still works.
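The dual-push HEAD check from the list above could look roughly like this in Python. It is a sketch under assumptions: it uses the standard registry endpoint `HEAD /v2/<name>/manifests/<reference>`, it does not handle authentication (a private registry will answer 401 until credentials or a bearer token are added), and the function names are illustrative. The `opener` parameter exists only so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

# Manifest media types a registry may serve for an image tag.
ACCEPT = ", ".join([
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
])


def manifest_url(registry: str, repository: str, reference: str) -> str:
    """Build the registry API URL for one manifest."""
    return f"https://{registry}/v2/{repository}/manifests/{reference}"


def manifest_exists(url: str, opener=urllib.request.urlopen) -> bool:
    """Send a HEAD request; any HTTP error (404, 401, ...) counts as 'not present'."""
    request = urllib.request.Request(url, method="HEAD", headers={"Accept": ACCEPT})
    try:
        with opener(request) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False


def dual_push_ok(registries, repository, sha, opener=urllib.request.urlopen) -> bool:
    """True only when every registry already serves the `<repository>:<sha>` manifest."""
    return all(manifest_exists(manifest_url(r, repository, sha), opener) for r in registries)
```

The repository path on the old registry may differ from the `viktor/<name>` path used on Forgejo, so treat the shared `repository` argument as an assumption to verify per project.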
|
||||
180
docs/plans/2026-05-16-auto-upgrade-apps-design.md
Normal file
|
|
@ -0,0 +1,180 @@
|
|||
# Auto-Upgrade Apps Design
|
||||
|
||||
**Date**: 2026-05-16
|
||||
**Status**: Approved (brainstorm + grill complete; implementation pending)
|
||||
|
||||
> **UPDATE 2026-06-02 — decision #12 / Q1 reversed for OWNED apps.** The
|
||||
> original "uniform Keel-only, no per-repo `kubectl set image` step" call held
|
||||
> only for **upstream** images (which we can't build, so Keel poll-and-bump is
|
||||
> the only option). For **self-hosted apps we build**, CI now ALSO drives the
|
||||
> rollout: `build-and-push` tags `latest` + `:<sha>`, then a `deploy` step runs
|
||||
> `kubectl set image deployment/<app> ...:<sha>` + `rollout status`. Rationale
|
||||
> (memory id=3183, proven on tuya-bridge 2026-05-29): the pipeline is atomic
|
||||
> and deterministic — no wait for Keel's hourly poll, no risk of Keel resolving
|
||||
> `:latest` to a stale concrete tag. **Keel stays enrolled in parallel** as a
|
||||
> redundant net (it finds the just-deployed SHA already running → no-op), so
|
||||
> upstream apps and owned apps share one mental model. Enabled cluster-wide by
|
||||
> the `woodpecker-agent` SA being `cluster-admin` (no per-app RBAC). Owned apps
|
||||
> being rolled out to this pattern 2026-06-02; CronJobs in owned apps use
|
||||
> `:latest` + `imagePullPolicy: Always` instead of a deploy step.
|
||||
|
||||
## Problem
|
||||
|
||||
Three constraints in tension across the cluster's ~70 services:
|
||||
|
||||
1. **Keep apps at latest.** Most services drift behind upstream; manual bumps don't scale.
|
||||
2. **Stay Terraform-compatible.** Image refs live in `.tf`; we want declarative source of truth.
|
||||
3. **Don't let the pull-through cache serve stale `:latest`.** Cache layer must not lie about what `:latest` means today.
|
||||
|
||||
The previous `Diun → n8n → Service Upgrade Agent` flow handled (1) via changelog-reviewed PR bumps for third-party. Self-hosted services have inconsistent CI: 1 of 11 fully wired (CI builds + pushes + rolls out), 6 partially wired (build but no rollout trigger), 4 with no CI at all. Self-hosted services typically pull `forgejo.viktorbarzin.me/viktor/<name>:<8-char-sha>` with Terraform tracking each SHA in `var.image_tag`.
|
||||
|
||||
The user wants to simplify by retiring the changelog-review agent and moving to a pure "latest, always" model, with the cache freshness concern handled at the cache layer (already done — see Architecture §1).
|
||||
|
||||
## Decisions
|
||||
|
||||
| # | Decision | Notes |
|
||||
|---|----------|-------|
|
||||
| 1 | **Auto-roll for everything** (no PR-bump gate) | Retires the Service Upgrade Agent; Diun's role narrows to notification only |
|
||||
| 2 | **Actuator: Keel** ([keel.sh](https://keel.sh)) | Annotation-driven Deployment/StatefulSet/DaemonSet auto-update operator |
|
||||
| 3 | **Tag scheme: `:latest` where it exists, `:major` where it doesn't, glob+`ignore_changes` last resort** | `keel.sh/policy: force` for `:latest` / `:major`; tag string stays in Terraform |
|
||||
| 4 | **Opt-out-pure (no skip-list)** | Every workload auto-rolls, including Vault, CNPG, operators, CNI, CSI. User accepts recoverability risk |
|
||||
| 5 | **Phased rollout (Phase 0 foundation + 9 enrollment phases)** | Low-risk → bootstrap. Catch up to latest as we phase in. Each phase soaks ~1 week |
|
||||
| 6 | **Per-phase: single combined PR** | Switch image refs to floating tag + add to Kyverno mutate allowlist in same commit |
|
||||
| 7 | **Diun is the audit source for catch-up** | Existing 6h-poll already reports outdated images; export as worklist per phase |
|
||||
| 8 | **Polling, hourly** (`@every 1h`) | Not webhooks — single mechanism, all registries supported |
|
||||
| 9 | **Rollback: `kubectl rollout undo` → pin in Terraform → add `keel.sh/policy: never`** | Option (c) from the grill: undo immediately, then land the durable Terraform pin within 1h so it is in place before the next Keel poll |
|
||||
| 10 | **Implementation: Kyverno cluster-wide mutate** | One `ClusterPolicy` injects Keel annotations; phase boundary = the `namespaces` allowlist in its `match` block |
|
||||
| 11 | **Keel exempt from its own mutate** | One-line `exclude` entry for the `keel` namespace. A supervisor that updates itself has a uniquely bad failure mode |
|
||||
| 12 | **Uniform CI model for all self-hosted** | CI builds + pushes `:latest`, Keel polls and rolls. No per-repo `kubectl set image` step. Retires the GHA-migrated SHA-tag flow (memory id=388) |
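The rollback sequence in decision #9 can be written down as an ordered command list. This Python sketch only generates the `kubectl` commands; it assumes the workload is a Deployment and that `keel.sh/policy: never` is applied as a label (the form the Kyverno exclusion selects on). The Terraform pin remains a manual edit between the undo and the label step, and `rollback_commands` is a hypothetical helper name.

```python
def rollback_commands(namespace: str, deployment: str) -> list[str]:
    """Return the kubectl commands for a decision #9 rollback, in execution order."""
    target = f"deployment/{deployment}"
    return [
        # 1. Immediate undo to the previous revision.
        f"kubectl -n {namespace} rollout undo {target}",
        # 2. Wait until the previous revision is fully rolled out again.
        f"kubectl -n {namespace} rollout status {target}",
        # 3. After pinning the tag in Terraform, opt the workload out of auto-update.
        f"kubectl -n {namespace} label {target} keel.sh/policy=never --overwrite",
    ]
```

Generating the commands instead of running them keeps the sketch safe to paste into a runbook or review before execution.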
|
||||
|
||||
## Architecture
|
||||
|
||||
### 1. Cache freshness — already correct
|
||||
|
||||
Pull-through cache at `10.0.20.10` already splits caching by URL at the nginx layer:
|
||||
|
||||
- `location ~ /v2/.*/blobs/` → `proxy_cache_valid 200 24h` — blobs cached (content-addressed, immutable)
|
||||
- `location /v2/` (manifests) → pass through, no cache
|
||||
|
||||
Combined with `registry.proxy.ttl: 0` at the docker-registry layer, mutable manifests revalidate against upstream on every pull. **No cache changes needed for this design.** The CLAUDE.md note "Use 8-char git SHA tags — `:latest` causes stale pull-through cache" predates the nginx URL-split fix and should be updated as part of this work.
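To make the URL split concrete, here is a small Python sketch that classifies request paths the same way the two nginx `location` rules above do. It is an illustration only, not the nginx config: the regex is copied from the text, and `cache_policy` is a hypothetical name.

```python
import re

# Same pattern as the nginx rule `location ~ /v2/.*/blobs/`; nginx regex
# locations match anywhere in the path, hence `search` rather than `match`.
BLOB_PATH = re.compile(r"/v2/.*/blobs/")


def cache_policy(path: str) -> str:
    """Return 'cache-24h' for blob requests and 'pass-through' for everything else."""
    if BLOB_PATH.search(path):
        return "cache-24h"
    return "pass-through"
```

A manifest request such as `/v2/library/nginx/manifests/latest` falls through to the uncached branch, which is why a moved `:latest` is seen on the next pull.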
|
||||
|
||||
### 2. Detection — Keel polls upstream
|
||||
|
||||
Keel runs as a Deployment in its own namespace. It polls the registry for every annotated workload hourly (the schedule is configurable per workload via `keel.sh/pollSchedule`). On detection of a new digest under the watched tag:
|
||||
|
||||
- `keel.sh/policy: force` (for mutable tags `:latest`, `:16`, `:7`, etc.) → trigger Deployment update (pod template hash changes → restart)
|
||||
- `keel.sh/policy: minor` / `major` / `glob` (only for images that publish neither `:latest` nor a stable floating tag) → rewrite tag string on the Deployment; requires `lifecycle { ignore_changes = [...image] }`
|
||||
|
||||
### 3. Application — kubelet pull through the cache
|
||||
|
||||
When Keel triggers restart:
|
||||
|
||||
1. kubelet asks the cache (via containerd hosts.toml) for `image:tag` manifest.
|
||||
2. nginx passes the manifest request through to the docker-registry layer.
|
||||
3. docker-registry (with `proxy.ttl: 0`) passes through to upstream.
|
||||
4. Upstream returns current digest.
|
||||
5. kubelet pulls blobs (mostly cached at nginx layer; new blobs from upstream).
|
||||
6. New pod runs new image.
|
||||
|
||||
### 4. Annotation injection — Kyverno mutate
|
||||
|
||||
Single `ClusterPolicy` adds these annotations to every Deployment / StatefulSet / DaemonSet in opted-in namespaces:
|
||||
|
||||
```yaml
metadata:
  annotations:
    keel.sh/policy: force
    keel.sh/trigger: poll
    keel.sh/pollSchedule: "@every 1h"
```
|
||||
|
||||
Phase = a `match.any[].resources.namespaces` list. Phase advance = append namespaces. Keel namespace is excluded.
|
||||
|
||||
### 5. Terraform drift handling
|
||||
|
||||
Existing convention (`# KYVERNO_LIFECYCLE_V1` marker) handles `dns_config` injection. We extend with a new marker:
|
||||
|
||||
```hcl
lifecycle {
  ignore_changes = [
    spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
    metadata[0].annotations["keel.sh/policy"],
    metadata[0].annotations["keel.sh/trigger"],
    metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
  ]
}
```
|
||||
|
||||
This is added per workload as we phase in. Mechanical, grep-able.
|
||||
|
||||
## Phase ordering
|
||||
|
||||
| Phase | Set | Rationale |
|
||||
|-------|-----|-----------|
|
||||
| 0 | Foundation (Keel install, Kyverno ClusterPolicy with empty allowlist) | Build infra without enrolling anything |
|
||||
| 1 | Self-hosted (forgejo-hosted: ~11 services) | We own the code; failures are easy to diagnose |
|
||||
| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
|
||||
| 3 | Exporters, sidecars, utilities | Stateless |
|
||||
| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
|
||||
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk. **Nextcloud enrolled 2026-06-01** with two safeguards for the migration risk: F1 — `nextcloud-watchdog` CronJob runs `occ upgrade` when occ reports `needsDbUpgrade=true` (recovers an interrupted entrypoint upgrade); F2 — `chart_values.yaml` renders the live (Keel-bumped) image tag with a floor, so a helm re-render never downgrades below live. Scope is `patch` (Kyverno-stamped) == `minor` for Nextcloud (32.0.x only). See `stacks/nextcloud/main.tf`. |
|
||||
| 6 | Authentik | Auth outage |
|
||||
| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
|
||||
| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
|
||||
| 9 | Bootstrap (Vault, CNPG PG cluster, mysql-standalone) | Lose recoverability if broken |
|
||||
|
||||
Per-phase: combined PR → apply (catch-up rolls happen) → soak 1 week → next phase. If a service breaks repeatedly, apply rollback runbook (decision #9) and proceed; re-enroll later or leave pinned.
|
||||
|
||||
## Risk register
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|
||||
|------|-----------|--------|------------|
|
||||
| Bad upstream image rolls into prod | High | Service-level outage | Existing alerts (`KubePodCrashLooping`, `KubeletImagePullErrors`, `PodsStuckContainerCreating`); rollback runbook (decision #9) |
|
||||
| Catch-up rollout overwhelms cache | Medium | ImagePullBackOff cascade (memory id=603) | Rate-limit catch-up to ~5 rollouts/6h via `-target=` per phase; same pacing as retired Service Upgrade Agent (memory id=612) |
|
||||
| Calico / CSI auto-roll cascades (memory id=390: 26h outage) | Low-Medium | Cluster-level outage | Phase 8 is intentionally late; user opted into the risk; rollback to pinned chart version via Terraform |
|
||||
| Vault auto-rolls to broken image | Low | Loss of secrets sync; 43 ExternalSecrets stop reconciling | Phase 9 last; Tier 0 SOPS state allows manual recovery |
|
||||
| CNPG PG cluster auto-rolls to broken image | Low | Tier 1 Terraform state inaccessible; 105 stacks can't apply | Phase 9 last; Tier 0 stack `cnpg` is bootstrap-capable |
|
||||
| Helm-atomic-trap services (memory id=981) | Medium | `terraform apply` hangs in pending-rollback | Identify `helm_release` services with `atomic = true`; either remove atomic or skip from Keel |
|
||||
| Keel itself rolls to broken version | Low | Supervisor down; no auto-rolls until manual pin | Decision #11: exempt Keel from mutate |
|
||||
| Terraform drift after Kyverno injects annotation | High at first | Spurious diffs on every plan | KYVERNO_LIFECYCLE_V2 marker (Architecture §5); applied incrementally per phase |
|
||||
|
||||
## What we give up
|
||||
|
||||
- **Terraform no longer tracks deployed version.** Image refs in `.tf` say `:latest` or `:16`, but the running digest is whatever Keel pulled. To know what's running: `kubectl describe pod`. This is a deliberate trade — the previous SHA-pinned flow tracked version in TF but required N stack edits per deploy.
|
||||
- **No changelog review before rollout.** The Service Upgrade Agent's risk classification is gone. We rely on alerts to catch breakage post-deploy, not prevent it.
|
||||
- **CLAUDE.md SHA-tag rule is reversed for this design.** The "use 8-char git SHA tags" rule predates the nginx URL-split fix. New rule (post-rollout): "use floating tags + Keel annotation" — to be updated in both `infra/.claude/CLAUDE.md` and the repo-root `CLAUDE.md` once Phase 1 is stable.
|
||||
|
||||
## Decisions resolved post-grill
|
||||
|
||||
### Q1 — Uniform CI model for ALL self-hosted (resolved 2026-05-16)
|
||||
|
||||
Every self-hosted service moves to the same shape:
|
||||
|
||||
```
CI (GHA or Woodpecker) → build → push :latest (optionally also :<SHA> for traceability) → done
Keel → poll registry → detect new digest → trigger rollout
```
|
||||
|
||||
The 10 GHA-migrated repos (memory id=388: Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) drop the `Woodpecker API → kubectl set image` step. Their `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` files become obsolete; remove during Phase 1.
|
||||
|
||||
Terraform image refs for all self-hosted: `<registry>/<repo>:latest` (with `${var.image_tag}` defaulting to `"latest"` where the variable exists).
|
||||
|
||||
### Q2 — No-CI self-hosted services (resolution: uniform participation)
|
||||
|
||||
| Service | Action |
|
||||
|---------|--------|
|
||||
| `wealthfolio` | Switch Terraform to upstream `wealthfolio/wealthfolio:latest` (DockerHub). No CI needed. |
|
||||
| `chrome-service` | Verify whether `:v4` is a deliberate pin. If yes → tag stays, add `keel.sh/policy: never` label. If no → switch to `:latest` or `:major`. Investigate during Phase 1 prep. |
|
||||
| `beadboard` (used by `beads-server`) | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
|
||||
| `freedify` | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
|
||||
|
||||
## Open questions (still need resolution before Phase 1)
|
||||
|
||||
1. **`helm_release atomic = true` services**: count and identify before Phase 1. Either remove `atomic` (preferred — eliminates the memory id=981 trap), or skip from Kyverno mutate via per-namespace exclusion. Survey command: `grep -rn 'atomic.*true' infra/stacks/ infra/modules/`.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Cache TTL changes — current config is already correct (nginx URL-split).
|
||||
- Webhook-based Keel triggers — polling is sufficient for this cadence.
|
||||
- Replacing Diun — kept for notification visibility into new tags not yet under Keel annotation (during phase rollout).
|
||||
- Keel approval gate (`keel.sh/approvals: N`) — user wants unattended auto-roll.
|
||||
- Keel auto-rollback on health-check failure — out of scope for v1; revisit if breakage rate is high.
|
||||
322
docs/plans/2026-05-16-auto-upgrade-apps-plan.md
Normal file
|
|
@ -0,0 +1,322 @@
|
|||
# Auto-Upgrade Apps Implementation Plan
|
||||
|
||||
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
|
||||
|
||||
**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
|
||||
|
||||
**Architecture:** A cluster-wide Kyverno `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: "@every 1h"`). Keel polls the registries and triggers a rollout on each new digest. kubelet then pulls the fresh manifest via the nginx URL-split cache (manifests pass through, blobs are cached). Phase advance = expand the policy's `namespaces` allowlist.
|
||||
|
||||
**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
|
||||
|
||||
**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
|
||||
|
||||
**Key context:**
|
||||
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
|
||||
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
|
||||
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
|
||||
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see "What we give up" in the design doc).
|
||||
|
||||
---
|
||||
|
||||
## Phase 0 — Foundation
|
||||
|
||||
### Task 0.1: Resolve remaining open question
|
||||
|
||||
Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
|
||||
|
||||
Remaining open question:
|
||||
|
||||
**Helm-atomic services.** Survey:
|
||||
```bash
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
```
|
||||
|
||||
For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
|
||||
|
||||
---
|
||||
|
||||
### Task 0.2: Create the Keel stack
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/keel/terragrunt.hcl`
|
||||
- Create: `stacks/keel/main.tf`
|
||||
- Create: `stacks/keel/variables.tf`
|
||||
- Create: `stacks/keel/modules/keel/main.tf`
|
||||
|
||||
**Step 1:** Do **not** add `keel` to `locals.tier0_stacks` in `terragrunt.hcl`. Keel is Tier 1 (it depends on Kyverno and on registry access for the Keel image), so keep it in Tier 1.
|
||||
|
||||
**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
|
||||
|
||||
Key Helm values:
|
||||
- `polling.enabled: true`
|
||||
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
|
||||
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
|
||||
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
|
||||
|
||||
**Step 3:** Verify Keel can authenticate to every registry: the five reached via the local cache (Docker Hub, ghcr, quay, k8s.io, kyverno) and Forgejo, which it reaches directly.
|
||||
|
||||
**Acceptance:**
|
||||
- `kubectl -n keel get pod` shows Keel Ready.
|
||||
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
|
||||
|
||||
---
|
||||
|
||||
### Task 0.3: Author the Kyverno ClusterPolicy
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
|
||||
|
||||
**Step 1:** Define the ClusterPolicy `inject-keel-annotations`:
|
||||
|
||||
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-keel-annotations
spec:
  background: true
  rules:
    - name: add-keel-annotation
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet, DaemonSet]
              namespaces: [] # populated per phase
      exclude:
        any:
          - resources:
              namespaces: ["keel"] # decision #11
          - resources:
              # Workloads can opt out by setting this label
              selector:
                matchLabels:
                  keel.sh/policy: never
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              +(keel.sh/policy): force
              +(keel.sh/trigger): poll
              +(keel.sh/pollSchedule): "@every 1h"
```
|
||||
|
||||
- `+()` syntax adds only if not present (preserves per-workload overrides).
|
||||
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
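The combined effect of the allowlist, the two exclusions, and the `+()` add-if-absent anchors can be modelled in a few lines of Python. This is a simplified simulation for reasoning about the policy, not Kyverno's implementation; `inject_keel_annotations` and the plain-dict workload shape are assumptions made for the example.

```python
KEEL_DEFAULTS = {
    "keel.sh/policy": "force",
    "keel.sh/trigger": "poll",
    "keel.sh/pollSchedule": "@every 1h",
}


def inject_keel_annotations(workload: dict, allowlist: set[str]) -> dict:
    """Return the workload as the policy would leave it; the input is not modified."""
    meta = workload.get("metadata", {})
    namespace = meta.get("namespace")
    # match: only allowlisted namespaces; exclude: the keel namespace itself.
    if namespace not in allowlist or namespace == "keel":
        return workload
    # exclude: the per-workload opt-out label.
    if meta.get("labels", {}).get("keel.sh/policy") == "never":
        return workload
    annotations = dict(meta.get("annotations", {}))
    for key, value in KEEL_DEFAULTS.items():
        # `+(key)` semantics: add only when the key is absent.
        annotations.setdefault(key, value)
    return {**workload, "metadata": {**meta, "annotations": annotations}}
```

A workload that already sets `keel.sh/pollSchedule: "@every 6h"` keeps that value and only gains the two missing annotations.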
|
||||
|
||||
**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
|
||||
|
||||
**Acceptance:**
|
||||
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
|
||||
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
|
||||
|
||||
---
|
||||
|
||||
### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
|
||||
|
||||
**Files:**
|
||||
- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
|
||||
- Modify: `.claude/CLAUDE.md` — reference the V2 marker
|
||||
|
||||
Snippet to copy-paste:
|
||||
|
||||
```hcl
lifecycle {
  ignore_changes = [
    spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
    metadata[0].annotations["keel.sh/policy"],
    metadata[0].annotations["keel.sh/trigger"],
    metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
  ]
}
```
|
||||
|
||||
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Self-hosted (uniform model)
|
||||
|
||||
**Set:** all self-hosted services. Four sub-categories:
|
||||
|
||||
- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
|
||||
- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
|
||||
- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
|
||||
- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
|
||||
|
||||
### Task 1.1: Audit current image refs
|
||||
|
||||
```bash
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
```
|
||||
|
||||
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
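The tabulation step can start from the grep output. The Python sketch below parses `path:line` records as produced by `grep -r` (without `-n`) and extracts the repository and tag; it assumes each image reference sits in a single quoted string, and `parse_image_refs` is a hypothetical helper. CI type and action still have to be filled in by hand.

```python
import re

# Matches `image = "<registry-or-user>/<repo>[:<tag>]"` for the two prefixes in the grep above.
IMAGE_LINE = re.compile(
    r'image\s*=\s*"(?P<repo>(?:forgejo\.viktorbarzin\.me|viktorbarzin)/[^":]+)'
    r'(?::(?P<tag>[^"]+))?"'
)


def parse_image_refs(grep_output: str) -> list[tuple[str, str, str]]:
    """Return (file path, repository, tag) rows; an untagged image is reported as 'latest'."""
    rows = []
    for line in grep_output.splitlines():
        match = IMAGE_LINE.search(line)
        if match:
            path = line.split(":", 1)[0]
            rows.append((path, match.group("repo"), match.group("tag") or "latest"))
    return rows
```

Tags written as Terraform interpolations (for example `${var.image_tag}`) come through verbatim, which flags the stacks whose default value still needs checking.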
|
||||
|
||||
### Task 1.2: Per-service uniform conversion
|
||||
|
||||
For each Woodpecker-build-only service:
|
||||
1. Edit Terraform: `local.image_tag` / `var.image_tag` → `"latest"`.
|
||||
2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
|
||||
3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
|
||||
|
||||
For each GHA-migrated service:
|
||||
1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
|
||||
2. Add the KYVERNO_LIFECYCLE_V2 snippet.
|
||||
3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
|
||||
4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
|
||||
5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
|
||||
|
||||
For each no-CI service:
|
||||
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
|
||||
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
|
||||
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
|
||||
|
||||
For `kms-website`: only add the Keel annotation; CI changes optional.
|
||||
|
||||
### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
|
||||
|
||||
Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
|
||||
|
||||
```yaml
namespaces:
  - claude-agent-service
  - fire-planner
  - job-hunter
  - payslip-ingest
  - recruiter-responder
  - claude-memory-mcp
  - kms-website
  # GHA-migrated set:
  - website # or whatever the namespace is named per repo
  - k8s-portal
  - f1-stream
  - apple-health-data
  - audiblez-web
  - plotting-book
  - insta2spotify
  - audiobook-search
  - council-complaints
  # No-CI set:
  - beads-server
  - chrome-service
  - freedify
  - wealthfolio
```
|
||||
|
||||
Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
|
||||
|
||||
Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
|
||||
|
||||
### Task 1.4: Soak
|
||||
|
||||
1 week. Monitor:
|
||||
- Slack `#deployments` for Keel rollout notifications.
|
||||
- `KubePodCrashLooping` alerts.
|
||||
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
|
||||
|
||||
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
|
||||
|
||||
**Acceptance:**
|
||||
- All Phase 1 services running latest digests within 24h of Phase 1 apply.
|
||||
- No CrashLooping persisting >1h.
|
||||
- No more than 2 services pinned-out during the soak week.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Stateless third-party web apps
|
||||
|
||||
**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies: it has an external DB, so an app restart should be safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services; the full list comes from `kubectl get deploy -A`, minus the Phase 1 set and anything assigned to a later phase).
|
||||
|
||||
### Task 2.1: Audit current tags via Diun
|
||||
|
||||
```bash
|
||||
# Diun's REST API or UI exports a "new tags available" report
|
||||
# Use as the per-service decision source
|
||||
```
|
||||
|
||||
For each service, pick floating tag:
|
||||
- `:latest` if upstream publishes it and it's stable.
|
||||
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
|
||||
- `glob` + `ignore_changes` as last resort.
|
||||
|
||||
### Task 2.2: Catch-up PR
|
||||
|
||||
Single combined PR:
|
||||
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
|
||||
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
|
||||
- Append Phase 2 namespaces to Kyverno allowlist.
|
||||
|
||||
Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
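The pacing rule can be sketched as a small helper. This is a minimal illustration, not the project's tooling: `pace_interval_seconds`, `apply_paced` and the `APPLY_CMD` variable are hypothetical names, and the default command only echoes, so the sketch is a dry run until a real per-stack apply command is substituted.

```bash
# Seconds to wait between applies for a given hourly cap (rounded up).
pace_interval_seconds() {
  local max_per_hour=$1
  echo $(( (3600 + max_per_hour - 1) / max_per_hour ))
}

# Runs "$APPLY_CMD <stack>" for each stack, sleeping between consecutive runs.
# APPLY_CMD is a hypothetical override; it defaults to a dry-run echo.
apply_paced() {
  local max_per_hour=$1; shift
  local interval stack i=0
  interval=$(pace_interval_seconds "$max_per_hour")
  for stack in "$@"; do
    if [ "$i" -gt 0 ]; then sleep "$interval"; fi
    ${APPLY_CMD:-echo would-apply} "$stack"
    i=$((i + 1))
  done
}
```

With a cap of 5 per hour the helper waits 720 s between stacks, so a 20-service phase takes a little under four hours to enroll.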
|
||||
|
||||
### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
|
||||
|
||||
---
|
||||
|
||||
## Phases 3–9 — same template
|
||||
|
||||
For each phase, repeat:
|
||||
|
||||
1. Define the set (precise namespace list).
|
||||
2. Audit current tags (Diun + grep).
|
||||
3. Pick floating tag per service.
|
||||
4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
|
||||
5. Apply paced (≤5/hr).
|
||||
6. Soak 1 week. Pin-out any service that breaks repeatedly.
|
||||
|
||||
Set definitions per phase: see design doc Phase Ordering table.
|
||||
|
||||
**Special-handling phases:**
|
||||
|
||||
- **Phase 7 (Operators).** Restarting an operator can interrupt reconciliation of the custom resources it manages. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
|
||||
- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
|
||||
- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
|
||||
|
||||
---
|
||||
|
||||
## Cleanup tasks (after Phase 9 stable)
|
||||
|
||||
### Task C.1: Retire Service Upgrade Agent
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
|
||||
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
|
||||
- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
|
||||
|
||||
### Task C.2: Update CLAUDE.md files
|
||||
|
||||
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
|
||||
- Reverse same in root `/CLAUDE.md` if duplicated.
|
||||
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
|
||||
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
|
||||
|
||||
### Task C.3: Add a runbook
|
||||
|
||||
**Files:**
|
||||
- Create: `docs/runbooks/keel-rollback.md`
|
||||
|
||||
Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
|
||||
|
||||
### Task C.4: Tidy Diun
|
||||
|
||||
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
|
||||
|
||||
---
|
||||
|
||||
## Rollback (whole project)
|
||||
|
||||
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
|
||||
|
||||
1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
|
||||
2. Existing annotations remain on workloads and Keel keeps acting on them, so also disable Keel: scale the `keel` Deployment to 0.
|
||||
3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
|
||||
4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
|
||||
5. Reconsider opt-in approach for next iteration.
|
||||
|
||||
---
|
||||
|
||||
## Success criteria
|
||||
|
||||
- All ~70 services running latest within 8 weeks of Phase 0 completion.
|
||||
- Zero unrolled-back outages caused by Keel.
|
||||
- ≤5 services on the "pin list" (i.e. ≈93% auto-roll success rate across ~70 services).
|
||||
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
|
||||
- Service Upgrade Agent + supporting infra retired.
|
||||
1495
docs/plans/2026-05-17-agent-presence-plan.md
Normal file
File diff suppressed because it is too large
112
docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
Normal file
|
|
@ -0,0 +1,112 @@
|
|||
# MySQL 8.4.8 → 8.4.9 Upgrade — Design
|
||||
|
||||
**Date**: 2026-05-19
|
||||
**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
|
||||
**Beads**: (filed alongside this doc)
|
||||
**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
|
||||
|
||||
## Background
|
||||
|
||||
On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
|
||||
`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
|
||||
dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
|
||||
`mysql.ibd` + redo log after "Server upgrade started", then complete
|
||||
silence — no CPU, no flushes, no errors, no completion. The `boot`
|
||||
thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
|
||||
minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
|
||||
with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
|
||||
completed.
|
||||
|
||||
Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
|
||||
wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
|
||||
downtime: ~25 min. Forgejo + 7 dependent apps offline during that
|
||||
window.
|
||||
|
||||
## Root cause — best evidence
|
||||
|
||||
We never proved this definitively because we couldn't connect to MySQL
|
||||
during the stall, but the strongest hypothesis is **flush starvation
|
||||
during the DD upgrade's mandatory checkpoint**:
|
||||
|
||||
1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
   defs) + dirties pages across the system tablespace.
2. Reaches a point where it must checkpoint before continuing.
3. The page-cleaner thread can't drain dirty pages fast enough because
   `innodb_io_capacity=100` (1.6 MB/s effective flush rate at 16 KiB
   pages; the default was 200 through MySQL 8.0 and is 10000 in 8.4,
   recommended for SSDs is 2000+) combined with
   `innodb_page_cleaners=1`.
4. The `boot` thread waits on a pthread condvar that the flush
   coordinator should signal but never does within probe timeout.
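
The 1.6 MB/s figure follows from treating `innodb_io_capacity` as a budget of page writes per second. A rough sketch of that arithmetic, assuming the default 16 KiB InnoDB page size (an assumption, since the configured `innodb_page_size` is not shown here); `flush_rate_mb_per_s` is a hypothetical helper name:

```bash
# Approximate flush ceiling in MB/s for a given innodb_io_capacity.
# Second argument is the page size in bytes (assumed default: 16384).
flush_rate_mb_per_s() {
  LC_ALL=C awk -v iops="$1" -v page_bytes="${2:-16384}" \
    'BEGIN { printf "%.1f\n", iops * page_bytes / 1000000 }'
}

flush_rate_mb_per_s 100    # current setting  -> 1.6
flush_rate_mb_per_s 2000   # proposed setting -> 32.8
```

This is only an upper-bound estimate of background flushing; it says nothing about what the LUKS2-backed volume can actually sustain.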
|
||||
|
||||
Why we're not 100 % certain:
|
||||
- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
|
||||
contribute its own flush latency.
|
||||
- We didn't capture a stack trace from the stalled `boot` thread
|
||||
(`/proc/1/task/118/stack` was `permission denied`).
|
||||
- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
|
||||
checking the MySQL bug tracker before retry).
|
||||
|
||||
**Organizational root cause** (definitive): the `mysql:8.4` floating
|
||||
tag let Keel auto-bump without testing. Already fixed — image pinned
|
||||
to `mysql:8.4.8` exactly.
|
||||
|
||||
## Decisions
|
||||
|
||||
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly, so no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade; current settings are ~10× too conservative for the workload. |
| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick `mysql-backup-per-db` CronJob manually. |
| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication are all separate decisions. |
|
||||
|
||||
## Verification gates
|
||||
|
||||
Before declaring done:
|
||||
1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
|
||||
2. `SHOW DATABASES;` lists all 20 user databases.
|
||||
3. Table count per schema matches the pre-upgrade snapshot (recorded
|
||||
in step 1 of the plan).
|
||||
4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
|
||||
5. `kubectl get deploy,sts -A` shows no unready workloads.
|
||||
6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
|
||||
better PASS/WARN/FAIL ratio as pre-upgrade.
|
||||
7. Forgejo integrity probe reports 0 failures (manual trigger).
|
||||
8. `RegistryCatalogInaccessible` not firing in Prometheus.
|
||||
|
||||
## Risks + mitigations
|
||||
|
||||
| Risk | Likelihood | Mitigation |
|---|---|---|
| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap; adds 30 min). See plan step P.1. |
| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook: DROP USER + CREATE USER from Vault values immediately after restore. |
| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
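
The "database without a dump" check in the table can be sketched as a pure comparison of two name lists. `missing_dumps` is a hypothetical helper; how the two lists are produced (for example `SHOW DATABASES` with the system schemas filtered out, and a listing of the per-db dump directory) is left to the operator and assumed here.

```bash
# Prints every database name from the first list that has no matching
# entry in the second list. Both arguments are whitespace-separated names.
missing_dumps() {
  local db dir found
  for db in $1; do
    found=0
    for dir in $2; do
      if [ "$db" = "$dir" ]; then found=1; break; fi
    done
    if [ "$found" -eq 0 ]; then echo "$db"; fi
  done
}

# Example with made-up lists: "nextcloud" exists in MySQL but has no dump dir.
missing_dumps "forgejo nextcloud resume" "forgejo resume"
```

An empty result means every live database has a dump directory; any name printed should be dumped manually before cutover.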
|
||||
|
||||
## Why not alternatives
|
||||
|
||||
- **In-place DD upgrade with bumped IO config**: simpler, but if it
|
||||
still stalls we lose 30–60 min waiting + still fall back to
|
||||
wipe+restore. Same data risk; worse expected time. We *would* learn
|
||||
whether the bumped IO settings fix the upgrade, but the fresh init
|
||||
approach makes that knowledge unnecessary.
|
||||
- **Parallel migration (new mysql-standalone-new pod alongside)**:
|
||||
cleanest rollback (instant via service-selector flip), but needs TF
|
||||
surgery to declare two StatefulSets temporarily and isn't worth the
|
||||
complexity when the wipe+restore approach is now proven.
|
||||
- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
|
||||
Acceptable for now (we're pinned), but not a permanent answer.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- A standby/replica MySQL for zero-downtime upgrades (separate
|
||||
initiative — see future planning around CNPG-style HA for MySQL).
|
||||
- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
|
||||
encryption is a security requirement; debugging its flush latency is
|
||||
separate).
|
||||
- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
|
||||
this upgrade).
|
||||
349
docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
Normal file
|
|
@ -0,0 +1,349 @@
|
|||
# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
|
||||
|
||||
**Date**: 2026-05-19
|
||||
**Status**: Drafted, **NOT scheduled**
|
||||
**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
|
||||
**Estimated downtime**: 25–30 min (all MySQL-dependent apps offline)
|
||||
**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
|
||||
|
||||
## Pre-flight (before the maintenance window)
|
||||
|
||||
### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
|
||||
|
||||
In a non-production session, before scheduling the real cutover:
|
||||
|
||||
```bash
|
||||
# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
|
||||
# same image (mysql:8.4.9), same configmap, brand-new PVC.
|
||||
# Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
|
||||
# so it doesn't pollute the real stack.
|
||||
# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
|
||||
# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
|
||||
# 4. Delete the smoketest StatefulSet + PVC.
|
||||
```
|
||||
|
||||
Outcome:
|
||||
- ✅ Init succeeds → proceed with the real upgrade with high confidence.
|
||||
- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
|
||||
|
||||
### P.2 Read the MySQL 8.4.9 release notes + bug tracker
|
||||
|
||||
Specifically look for issues filed since 8.4.9 GA against the DD upgrade
|
||||
path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
|
||||
or 8.5.x, consider waiting.
|
||||
|
||||
### P.3 Confirm backup pipeline is healthy
|
||||
|
||||
```bash
|
||||
# Latest per-db dumps exist for all 20 databases
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
|
||||
'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
|
||||
|
||||
# Pushgateway shows recent success
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
|
||||
```
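
To turn "shows recent success" into a pass/fail check, the timestamp can be compared against a maximum age. A minimal sketch: `backup_is_fresh` is a hypothetical helper, the metric name `backup_last_success_timestamp` comes from the design doc, and the way the value is extracted from the Pushgateway output (shown only as a comment) is an assumption about its label layout.

```bash
# Succeeds when the last-success timestamp is at most max_age seconds old.
# Arguments: last_success_epoch now_epoch max_age_seconds
# awk is used so that float-formatted values such as 1.779e+09 also work.
backup_is_fresh() {
  awk -v ts="$1" -v now="$2" -v max="$3" \
    'BEGIN { age = now - ts; exit !(age >= 0 && age <= max) }'
}

# Assumed usage, with ts taken from the Pushgateway metrics output:
#   ts=$(... | grep '^backup_last_success_timestamp' | grep mysql-backup-per-db | awk '{print $NF}')
#   backup_is_fresh "$ts" "$(date +%s)" 86400 || echo "backup is stale"
```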
|
||||
|
||||
### P.4 Pin maintenance window and notify
|
||||
|
||||
Brief the user. Confirm window. Disable any background scrapers /
|
||||
schedulers / bots that would create noise during the cutover.
|
||||
|
||||
## Execution (inside the maintenance window)
|
||||
|
||||
### Step 1 — Pre-flight snapshot
|
||||
|
||||
```bash
|
||||
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||
|
||||
# Record current state for verification later
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
|
||||
-e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
|
||||
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
|
||||
GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
|
||||
cat /tmp/mysql-pre-upgrade-table-counts.txt
|
||||
```
|
||||
|
||||
### Step 2 — Trigger a fresh per-db dump
|
||||
|
||||
```bash
|
||||
# Capture the job name once so the wait below targets the same job
JOB="pre-upgrade-$(date +%s)"
kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db "$JOB"
# Wait for completion (typically <2 min)
kubectl -n dbaas wait --for=condition=complete --timeout=300s "job/$JOB"
|
||||
```
|
||||
|
||||
Verify all 20 databases dumped:
|
||||
|
||||
```bash
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
|
||||
'for d in $(ls /backup/per-db/); do
|
||||
newest=$(ls -t /backup/per-db/$d/ | head -1)
|
||||
echo "$d: $newest"
|
||||
done'
|
||||
```
|
||||
|
||||
Every entry should have a `dump_<today>_*.sql.gz` listed.
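
Eyeballing 20 lines is error-prone, so the listing can be filtered for stale entries. A sketch under assumptions: `stale_dumps` is a hypothetical helper, it reads the `db: filename` lines printed by the loop above, and the date stamp format inside `dump_<today>_*` is not specified in this plan, so the stamp is passed in as an argument.

```bash
# Reads "db: newest-file" lines on stdin and prints each db whose newest
# file does not match dump_<stamp>_*.sql.gz (including dbs with no file).
stale_dumps() {
  local stamp=$1 db file
  while read -r db file; do
    case "$file" in
      dump_"$stamp"_*.sql.gz) ;;   # fresh, nothing to report
      *) echo "${db%:}" ;;         # stale or missing dump
    esac
  done
}

# Example with made-up names: only "b" is reported.
printf 'a: dump_20260519_1.sql.gz\nb: dump_20260518_1.sql.gz\n' | stale_dumps 20260519
```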
|
||||
|
||||
### Step 3 — Bump InnoDB IO config + image pin in Terraform
|
||||
|
||||
In `stacks/dbaas/modules/dbaas/main.tf`:
|
||||
|
||||
```diff
|
||||
- innodb_io_capacity=100
|
||||
- innodb_io_capacity_max=200
|
||||
- innodb_page_cleaners=1
|
||||
+ innodb_io_capacity=2000
|
||||
+ innodb_io_capacity_max=4000
|
||||
+ innodb_page_cleaners=4
|
||||
```
|
||||
|
||||
```diff
|
||||
- # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
|
||||
- # repeatedly across multiple attempts. ...
|
||||
- image = "mysql:8.4.8"
|
||||
+ # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
|
||||
+ # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
|
||||
+ image = "mysql:8.4.9"
|
||||
```
|
||||
|
||||
Save the edit but **do not apply yet**. The commit and push happen in Step 13, after verification.
|
||||
|
||||
### Step 4 — Stop MySQL
|
||||
|
||||
```bash
|
||||
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
|
||||
# Wait for pod deletion
|
||||
kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
|
||||
```
|
||||
|
||||
### Step 5 — Wipe the PVC
|
||||
|
||||
```bash
|
||||
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
|
||||
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
|
||||
kubectl -n dbaas delete pvc data-mysql-standalone-0
|
||||
# Confirm PV vanishes (CSI cleans up the LV)
|
||||
kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
|
||||
```
|
||||
|
||||
### Step 6 — Apply Terraform (8.4.9 + bumped IO)
|
||||
|
||||
```bash
|
||||
cd stacks/dbaas
|
||||
/home/wizard/code/infra/scripts/tg apply
|
||||
```
|
||||
|
||||
This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
|
||||
takes ~30 s. Verify:
|
||||
|
||||
```bash
|
||||
kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
|
||||
# expect: 8.4.9
|
||||
```
|
||||
|
||||
**If the pod fails to become Ready within 5 min**: this is the
|
||||
"root cause was not flush starvation" failure mode. Abort the upgrade,
|
||||
revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
|
||||
8.4.8 + restore). Total extra downtime ~25 min.
|
||||
|
||||
### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
|
||||
|
||||
```bash
|
||||
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||
|
||||
cat <<YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: mysql-restore-per-db-$(date +%Y-%m-%d)
  namespace: dbaas
spec:
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mysql:8.4.9
          command: ["bash","-c"]
          args:
            - |
              set -euo pipefail
              for db in \$(ls /backup/per-db/); do
                newest=\$(ls -t /backup/per-db/\$db/ | head -1)
                echo "=== Restoring \$db from \$newest ==="
                mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
                  -e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
                gunzip -c "/backup/per-db/\$db/\$newest" | \
                  mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
              done
              echo "=== All databases restored ==="
              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
          volumeMounts:
            - { name: backup, mountPath: /backup, readOnly: true }
      volumes:
        - name: backup
          persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
|
||||
```
|
||||
|
||||
Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
|
||||
Expected time: ~3 min for all 20 databases.
|
||||
|
||||
### Step 8 — Recreate Vault-rotated + static users
|
||||
|
||||
The per-db restore did NOT touch `mysql.user`. Recreate all app users
|
||||
fresh:
|
||||
|
||||
```bash
|
||||
# Static users (forgejo, roundcubemail) from Vault
|
||||
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
|
||||
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
|
||||
|
||||
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
|
||||
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
|
||||
CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
|
||||
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
|
||||
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
|
||||
FLUSH PRIVILEGES;
|
||||
SQL
|
||||
|
||||
# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
|
||||
# user with the current password held in K8s secrets
|
||||
for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
|
||||
echo "Rotating $role"
|
||||
vault write -f "database/rotate-role/$role"
|
||||
done
|
||||
|
||||
# Technitium has a separate password-sync job — kick it
|
||||
kubectl -n technitium create job --from=cronjob/technitium-password-sync \
|
||||
technitium-postupgrade-$(date +%s)
|
||||
```
|
||||
|
||||
### Step 9 — Restart MySQL-dependent apps
|
||||
|
||||
```bash
|
||||
for ns_app in \
|
||||
"forgejo:deploy/forgejo" \
|
||||
"nextcloud:deploy/nextcloud" \
|
||||
"hackmd:deploy/hackmd" \
|
||||
"monitoring:deploy/grafana" \
|
||||
"paperless-ngx:deploy/paperless-ngx" \
|
||||
"uptime-kuma:deploy/uptime-kuma" \
|
||||
"url:deploy/shlink" \
|
||||
"phpipam:deploy/phpipam" \
|
||||
"technitium:sts/technitium" \
|
||||
"vikunja:deploy/vikunja" \
|
||||
"freshrss:deploy/freshrss" \
|
||||
"finance:deploy/finance" \
|
||||
"resume:deploy/resume" \
|
||||
"realestate-crawler:deploy/realestate-crawler-api" \
|
||||
"realestate-crawler:deploy/realestate-crawler-celery" \
|
||||
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
|
||||
"realestate-crawler:deploy/realestate-crawler-ui"; do
|
||||
ns=${ns_app%%:*}; app=${ns_app##*:}
|
||||
kubectl -n "$ns" rollout restart "$app" &
|
||||
done
|
||||
wait
|
||||
```
|
||||
|
||||
Wait for all to become ready:
|
||||
|
||||
```bash
|
||||
until [ "$(kubectl get deploy,sts -A -o json | \
|
||||
jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
|
||||
wc -l)" -eq 0 ]; do
|
||||
sleep 5
|
||||
done
|
||||
echo "All workloads ready"
|
||||
```
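
The loop above spins forever if any workload stays unready, which is awkward inside a timed maintenance window. A bounded variant could look like the sketch below; `wait_until` is a hypothetical helper and the 900 s budget in the usage comment is an arbitrary assumption.

```bash
# Polls a command until it succeeds or the deadline passes.
# Arguments: timeout_seconds interval_seconds command [args...]
wait_until() {
  local timeout=$1 interval=$2; shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Assumed usage: wrap the readiness check from this step in a function
# (here called all_workloads_ready, not defined in this sketch) and run
#   wait_until 900 5 all_workloads_ready || echo "investigate unready workloads"
```

A non-zero return gives the operator an explicit decision point instead of a silent hang.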
|
||||
|
||||
### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
|
||||
|
||||
```bash
|
||||
for ns in chrome-service fire-planner freedify; do
|
||||
kubectl -n "$ns" delete pod --all 2>/dev/null || true
|
||||
done
|
||||
```
|
||||
|
||||
### Step 11 — Clean up failed CronJob pods from the outage window
|
||||
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
|
||||
### Step 12 — Verify (matches design §Verification gates)
|
||||
|
||||
```bash
|
||||
# 1. Version
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
|
||||
# expect: 8.4.9
|
||||
|
||||
# 2-3. Databases + table counts
|
||||
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
|
||||
-e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
|
||||
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
|
||||
GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
|
||||
diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
|
||||
# expect: no diff (or only counts that grew between snapshots)
|
||||
|
||||
# 4. Forgejo
|
||||
kubectl -n forgejo get pod
|
||||
kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
|
||||
# expect: 1/1 Running, "ORM engine initialized"
|
||||
|
||||
# 5. Cluster health
|
||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
|
||||
|
||||
# 6. Registry integrity probe
|
||||
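# Capture the job name once so the wait and logs target the same job
PROBE_JOB="postupgrade-$(date +%s)"
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe "$PROBE_JOB"
kubectl -n monitoring wait --for=condition=complete --timeout=300s "job/$PROBE_JOB"
kubectl -n monitoring logs "job/$PROBE_JOB" --tail=5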
|
||||
# expect: "Probe complete: 0 failures"
|
||||
|
||||
# 7. RegistryCatalogInaccessible not firing
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
|
||||
python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
|
||||
# expect: empty / no RegistryCatalogInaccessible
|
||||
```
|
||||
|
||||
### Step 13 — Commit + push the Terraform change
|
||||
|
||||
```bash
|
||||
git add stacks/dbaas/modules/dbaas/main.tf
|
||||
git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
|
||||
|
||||
Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
|
||||
The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
|
||||
upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
|
||||
"
|
||||
git push
|
||||
```
|
||||
|
||||
## Rollback path (if Step 6 or Step 7 fails catastrophically)
|
||||
|
||||
The wipe at Step 5 is destructive — once executed, the original disk
|
||||
is gone. Rollback is **same procedure, image=8.4.8**:
|
||||
|
||||
1. Edit TF: `image = "mysql:8.4.8"`
2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
3. Wipe the PVC again as in Step 5 (Step 6 created a fresh one on 8.4.9), then `tg apply`
4. Run the Step 7 restore Job again (now on 8.4.8); delete the earlier Job first if it still exists, since its name is date-based
5. Run Steps 8-11
6. Update Terraform comment to reflect retained 8.4.8 pin.
|
||||
|
||||
Extra downtime: ~25 min on top of the existing window.
|
||||
|
||||
## Post-upgrade follow-ups
|
||||
|
||||
- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
|
||||
- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
|
||||
- Re-evaluate whether the new IO config (2000/4) is overkill for the
|
||||
workload after 1-2 weeks — could drop to 1000/2 if needed.
|
||||
- Optional: file a follow-up task to investigate MySQL HA/replication
|
||||
so the next upgrade isn't blocking.
|
||||
178
docs/plans/2026-05-21-ha-control-plane-design.md
Normal file
|
|
@ -0,0 +1,178 @@
|
|||
# HA Control Plane (3 masters) — Design
|
||||
|
||||
**Date**: 2026-05-21 (decisions locked 2026-05-22; **deferred 2026-05-23**)
|
||||
**Status**: **DEFERRED** — design + plan complete, NOT scheduled. Awaiting either PVE host capacity expansion OR a separate right-sizing pass on the existing master before this becomes affordable. Paired plan: `2026-05-21-ha-control-plane-plan.md`.
|
||||
**Beads**: code-n0ow (open, deferred — see `bd show code-n0ow`)
|
||||
**Trigger**: 2026-05-21 k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
|
||||
|
||||
## Why deferred (2026-05-23)
|
||||
|
||||
Measured during the locking pass:
|
||||
|
||||
- **k8s-master uses 4.6 GB of 32 GB allocated** (kube-apiserver 2.6 GB + etcd 660 MB + controller-manager 360 MB + ~1 GB everything else). The 32 GB sizing is ~5-6× oversized vs working set.
|
||||
- **PVE host is already 98% RAM-committed** — 262 GB allocated to VMs against 267 GB physical, with 1.5 GB of active swap. The planned 3 × 32 GB control plane (+64 GB net) would push allocation to 326 GB → OOM on the host.
|
||||
- **Software-only HA on a single PVE host has bounded value** — a hypervisor crash still loses all 3 masters. The big resilience wins (kubeadm upgrades, cert rotation, planned reboots) are real but the disaster-recovery angle is limited until a second PVE host exists.
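
The capacity argument is simple arithmetic on the figures quoted above, sketched here so it can be re-run when sizing changes. `projected_alloc_gb` is a hypothetical helper; the inputs (262 GB allocated, 267 GB physical, two additional 32 GB masters) are the numbers from this section.

```bash
# Projected PVE RAM allocation after adding VMs, in GB.
# Arguments: currently_allocated_gb new_vm_count gb_per_new_vm
projected_alloc_gb() {
  echo $(( $1 + $2 * $3 ))
}

physical_gb=267
projected=$(projected_alloc_gb 262 2 32)
echo "projected: ${projected} GB of ${physical_gb} GB physical"
if [ "$projected" -gt "$physical_gb" ]; then
  echo "over-committed by $(( projected - physical_gb )) GB"
fi
```

With the numbers above this prints a projection of 326 GB and an over-commit of 59 GB.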
|
||||
|
||||
### Revisit triggers — any of:
|
||||
|
||||
1. **Second PVE host added** to the lab. Hardware HA becomes possible; HA control plane becomes the natural follow-up. Spread the 3 masters across 2 hosts (2+1).
|
||||
2. **Cluster-wide right-sizing pass** that frees enough headroom for the original 3 × 32 GB plan, OR pre-agreed amendment to provision 16 GB masters (right-sized to actual usage; 3-4× current working-set headroom).
|
||||
3. **Storm cascade burns enough hours** that the operational cost outweighs the memory cost — track minutes spent manually nursing kubeadm upgrades; if cumulative > ~10h over a few months, revisit.
|
||||
|
||||
### What's still good
|
||||
|
||||
The design + plan in this directory remain authoritative. When we revisit:
|
||||
|
||||
- All 14 locked decisions stand.
|
||||
- Challenger amendments (cloud-init template bump, rbac multi-master refactor, HTTPS `/readyz` health check, expanded blast radius, etcd-backup nodeSelector, chain extension as Phase 7) are baked in.
|
||||
- Only the sizing decision needs revisiting — likely 16 GB per master instead of 32 GB.
|
||||
- Adding `k8s_master_hosts` list-based refactor to the rbac stack (Phase 1.5) is a **standalone win** that could be done independently of HA — it would future-proof the cluster against the day HA lands. Consider lifting that as its own task.
|
||||
|
||||
## Problem statement
|
||||
|
||||
The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
correct end-to-end but **cannot push through the cluster's
single-master architecture**. Each attempted upgrade today rolled
back via the same cascade:

1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
   manifest (etcd → apiserver → controller-manager → scheduler).
2. While a manifest swap is in flight, the affected control-plane
   component is briefly down — for apiserver, that means ~10–60s of
   "connection refused" to `10.96.0.1:443` from every kubelet and
   operator pod in the cluster.
3. **Several operators die during that window** instead of waiting:
   - **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
   - gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
4. Kubelet restarts those pods → image pulls + initial reads → storm
   of disk I/O on master (we observed 563 MB/s from tigera alone).
5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
   hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
   annotation.
6. kubeadm reports that the static-pod hash "did not change after 5m0s",
   **rolls back to the previous manifest**, and exits non-zero.
7. Chain Job retries (backoffLimit=1) → same storm → same failure.
   Chain dead.

The container runtime, the script logic, and the RBAC permissions are all
fine after today's fixes. The **single master is the bottleneck**.
|
||||
## Why HA control plane fixes this
|
||||
|
||||
With 3 masters running etcd quorum + apiserver behind an LB:
|
||||
|
||||
| Failure mode | Single master | 3-master HA |
|---|---|---|
| Master reboot / kubeadm upgrade | Apiserver completely down 10–60s | Other 2 masters serve clients; LB transparently fails over |
| etcd quorum while one master is down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
| kubeadm `static-pod hash` watch | Times out under load (today's bug) | No restart storm, so status sync stays fast |
| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
|
||||
The k8s upgrade chain doesn't need to be aware of *any* of this — the
|
||||
underlying availability of apiserver makes the chain's gates
|
||||
naturally pass on each iteration.
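
The quorum figures in the table follow from majority arithmetic. As a minimal sketch (the function names `quorum` and `tolerated` are illustrative, not part of the repo), the numbers for 1, 3, and 5 members can be computed like this:

```bash
#!/usr/bin/env bash
# Majority quorum for an etcd cluster of N voting members.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can be lost while a majority still exists.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

A single member tolerates no failures, which is why any master restart is a full outage today; three members tolerate one.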
|
||||
|
||||
## Decisions (locked 2026-05-22)
|
||||
|
||||
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2` (VMID 205, 10.0.20.110), `k8s-master-3` (VMID 206, 10.0.20.111). |
| 3 | **Apiserver LB**: **pfSense HAProxy** — new TCP frontend on `10.0.20.99:6443` mirroring the mailserver pattern. Idempotent via `scripts/pfsense-haproxy-bootstrap.php`. | Pros: no per-node moving parts, mirrors existing mailserver layout. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (gateway/DNS/ingress). |
| 4 | **VIP**: `10.0.20.99` (one below current master `.100`, well clear of MetalLB pool `.200-.220`). Internal-only — external API access stays via Cloudflared. | All kubeconfigs + kubelet.conf entries flip from `10.0.20.100:6443` → `10.0.20.99:6443`. |
| 5 | **etcd**: kubeadm-managed stacked; `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend the bash loop in `stacks/kured/main.tf` with a "≥2 control-plane nodes Ready" check between the existing all-nodes-Ready and calico-Ready checks | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: `etcdctl snapshot save` from any member is a consistent point-in-time of the full quorum state — but the existing CronJob is pinned `node_name = "k8s-master"`. Phase 4.5 flips this to a control-plane label + toleration so backups don't silently skip when master-1 is drained. | Snapshot CORRECTNESS unchanged; SCHEDULING needs fixing. |
| 8 | **Migration order**: Phase 0 (retrofit existing cluster) → Phase 1 (LB up, single backend, HTTPS health check) → Phase 1.5 (rbac stack refactor) → Phase 2 (cloud-init bump + master-2 join + add to LB) → Phase 3 (master-3 join + add to LB) → Phase 4 (flip clients + workers to VIP) → Phase 4.5 (etcd-backup CronJob fix) → Phase 5 (kured-sentinel-gate quorum check) → Phase 6 (E2E validation) → Phase 7 (k8s-version-upgrade chain extension) | Each kubeadm join is reversible (`kubeadm reset` + `etcdctl member remove`). |
| 9 | **VM provisioning**: cloud-init via `create-template-vm` module, **but the template needs an apt-source bump first** (v1.32 → v1.34) and a control-plane gate on `k8s_join_command` so master VMs don't auto-join as workers. Existing master stays as the legacy manual VM (not rebuilt). | The repo has zero VMs using cloud-init for provisioning today — we're the first user. Update template first, then use it. |
| 10 | **Cert SAN + controlPlaneEndpoint retrofit**: Phase 0, before any new master joins. Patch `kubeadm-config` via `kubeadm init phase upload-config kubeadm --config <file>` (kubeadm-owned write, future-proof against `kubeadm upgrade apply`), regen `apiserver.crt` via `kubeadm init phase certs apiserver`, restart the kube-apiserver pod (~30s outage on the existing master only). | Standard kubeadm retrofit path; `kubeadm join --control-plane` requires controlPlaneEndpoint to be set. |
| 11 | **Multi-master config propagation (Phase 1.5)**: refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` to loop over a list of master hosts. Apply BEFORE master-2/3 join so they boot with OIDC, audit policy, and etcd tuning already in place. | Today these stacks SSH into a single master and sed into `kube-apiserver.yaml` — if not propagated, Authentik login flaps depending on which master the LB lands on. |
| 12 | **k8s-version-upgrade chain extension (Phase 7)**: extend `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` to discover and iterate over all control-plane nodes (drain → upgrade → uncordon, gated by quorum check). | Without this, chain only upgrades master-1; masters 2/3 drift behind one version per upgrade. Original autonomous-upgrades goal unmet. |
| 13 | **LB health check**: HTTPS `GET /readyz` (with `verify none` for self-signed apiserver cert), NOT plain TCP. | Plain TCP misses apiserver-NotReady states (etcd unreachable, controller-manager flapping). |
| 14 | **VIP DNS name**: add `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` BEFORE Phase 4. Delete stale `kubernetes IN A 10.0.20.100`. Consumers reference the FQDN, not the bare IP — future renumbering is then a single record change. | |
|
||||
## Out of scope
|
||||
|
||||
- HA pfSense itself (separate, much bigger initiative)
|
||||
- Multi-DC failover
|
||||
- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
|
||||
- Rebuilding cluster from scratch — we'll join into the existing one
|
||||
|
||||
## Risk register
|
||||
|
||||
| Risk | Mitigation |
|---|---|
| Phase 0 cert regen on existing master triggers a brief apiserver outage (~30s) | Already a known cluster behaviour during static-pod restart. Schedule during a low-activity window. Tigera/operators will crash-loop briefly but recover — same blast radius as today's k8s upgrade. **Once HA is up, future restarts won't have this surface at all.** |
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master directly (bypass LB) before flipping clients. Keep a kubeconfig pointing at `10.0.20.100:6443` as fallback. |
| Existing kubeconfigs (Woodpecker pipelines, agents, dev VM, in-cluster RBAC default) need updating | Single Terraform apply touches `stacks/rbac/modules/rbac/apiserver-oidc.tf` (default), `.woodpecker/*.yml` (committed kubeconfigs). Worker `kubelet.conf` files patched in Phase 4 via ssh loop. |
| New masters get scheduled workload pods unintentionally | Verify `node-role.kubernetes.io/control-plane:NoSchedule` taint is applied at join time (default with `--control-plane`). |
| Cert rotation propagation | kubeadm join uses the `--certificate-key` from `kubeadm init phase upload-certs` to fetch existing CA materials. Single short-lived secret in `kube-system/kubeadm-certs` (**2h TTL** — Phases 2 + 3 must complete within the window, or re-upload between them). |
| 32GB per master × 3 = 96GB RAM used for control plane alone | PVE host has 272GB total, 176GB allocated to cluster pre-HA. Post-HA: 240GB allocated, 32GB headroom. Sufficient. |
| Pre-existing kubeadm-config does NOT have `controlPlaneEndpoint` set | Phase 0 patches it. Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml \| grep controlPlaneEndpoint` (absent → `10.0.20.99:6443` post-Phase 0). |
| Existing master cert SANs are `[k8s-master, 10.96.0.1, 10.0.20.100]` only — missing VIP | Phase 0 regens with `--apiserver-cert-extra-sans 10.0.20.99` after patching kubeadm-config. |
|
||||
## Verification
|
||||
|
||||
After all 3 masters joined + LB up:
|
||||
|
||||
```bash
# All 3 masters listed
kubectl get nodes -l node-role.kubernetes.io/control-plane=

# etcd quorum healthy
kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
  --endpoints=https://10.0.20.100:2379,https://10.0.20.110:2379,https://10.0.20.111:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# Kubeconfig points at VIP
kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# Expect: https://10.0.20.99:6443

# Worker kubelet.conf points at VIP
for n in k8s-node{1,2,3,4}; do
  ssh wizard@$n.viktorbarzin.lan "sudo grep -E '^\s+server:' /etc/kubernetes/kubelet.conf"
done
# Expect: server: https://10.0.20.99:6443 on every node

# Failover test: drain master-1, reboot it, observe kubectl still works through LB
kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
ssh wizard@k8s-master.viktorbarzin.lan sudo reboot

# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
# Expect: full chain succeeds end-to-end without manual intervention
```
|
||||
## Cost estimate
|
||||
|
||||
- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
|
||||
- ~+128GB disk usage (2× 64GB master disks)
|
||||
- **~5-7 hours of operator time end-to-end** (cloud-init template bump + Phase 0 retrofit + LB + Phase 1.5 rbac refactor + 2× kubeadm join + Phase 4 cutover + Phase 4.5 etcd-backup fix + Phase 5 kured-gate + Phase 6 validation + Phase 7 chain extension). Phases 0–6 can land in one session; Phase 7 can be deferred a few days if needed.
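
The RAM impact can be sanity-checked against the host figures quoted in the risk register (272 GB total, 176 GB allocated pre-HA). This is a rough sketch; `headroom_gb` is a hypothetical helper, and the 16 GB line assumes the existing master keeps its current size and only the two new VMs shrink:

```bash
#!/usr/bin/env bash
# headroom_gb TOTAL_GB ALLOCATED_GB NEW_VM_COUNT GB_PER_NEW_VM
# Prints the RAM left on the host after the new VMs are added.
headroom_gb() {
  local total=$1 allocated=$2 new_vms=$3 per_vm=$4
  echo $(( total - allocated - new_vms * per_vm ))
}

echo "32 GB masters: $(headroom_gb 272 176 2 32) GB headroom"
echo "16 GB masters: $(headroom_gb 272 176 2 16) GB headroom"
```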
|
||||
|
||||
## What's already in place from today's work
|
||||
|
||||
(All these are prerequisites that were fixed during today's
|
||||
investigation — they stay relevant when HA lands.)
|
||||
|
||||
- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
|
||||
`runc: unable to signal init: permission denied` on Ubuntu 26.04)
|
||||
- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
|
||||
`get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
|
||||
- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
|
||||
- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
|
||||
fully fix the storm cascade)
|
||||
- Quiet-baseline threshold 3600s → 600s
|
||||
|
||||
## Reference
|
||||
|
||||
Commits from today's session:
|
||||
- `10b261d2` — first `grep -vE` pipefail
|
||||
- `0c8b46df` — 2 more pipefail sites
|
||||
- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
|
||||
- `2dc7e001` — kubeadm apply 3-attempt retry
|
||||
325
docs/plans/2026-05-21-ha-control-plane-plan.md
Normal file
|
|
@ -0,0 +1,325 @@
|
|||
# HA Control Plane (3 masters) — Plan
|
||||
|
||||
**Date**: 2026-05-21 (locked + revised 2026-05-22 after challenger pass)
|
||||
**Status**: Drafted, awaiting approval
|
||||
**Pairs with**: `2026-05-21-ha-control-plane-design.md`
|
||||
**Beads**: `code-n0ow`
|
||||
|
||||
## Goal
|
||||
|
||||
Migrate the single-master cluster to a 3-master HA control plane behind
|
||||
a pfSense HAProxy VIP (`10.0.20.99:6443`), enabling autonomous k8s
|
||||
upgrades without storm-cascade manual nursing.
|
||||
|
||||
## Topology — before / after
|
||||
|
||||
```
|
||||
Before After
|
||||
┌──────────────────────┐
|
||||
│ pfSense HAProxy │
|
||||
│ 10.0.20.99:6443 │
|
||||
│ TCP, /readyz health │
|
||||
└──┬───────┬───────┬───┘
|
||||
┌───────────────┐ │ │ │
|
||||
│ k8s-master │ ▼ ▼ ▼
|
||||
│ 10.0.20.100 │ ┌──────────────┐ ┌────────────┐ ┌────────────┐
|
||||
│ apiserver+etcd│ │k8s-master │ │k8s-master-2│ │k8s-master-3│
|
||||
│ + workers join│ │10.0.20.100 │ │10.0.20.110 │ │10.0.20.111 │
|
||||
│ directly │ │(VMID 200) │ │(VMID 205) │ │(VMID 206) │
|
||||
└───────────────┘ │apiserver+etcd│ │apiserver+e.│ │apiserver+e.│
|
||||
└──────────────┘ └────────────┘ └────────────┘
|
||||
▲ ▲ ▲
|
||||
└────────────────┼────────────────┘
|
||||
│
|
||||
etcd quorum (3 members, tolerates 1 down)
|
||||
```
|
||||
|
||||
## Research decisions (locked — see design doc for full table)
|
||||
|
||||
| Decision | Value |
|---|---|
| LB strategy | pfSense HAProxy, TCP mode, HTTPS `/readyz` health check |
| VIP | `10.0.20.99` (FQDN `k8s-apiserver.viktorbarzin.lan`) |
| New master IPs | `10.0.20.110`, `10.0.20.111` |
| New master VMIDs | `205`, `206` |
| Master sizing | 8 vCPU, 32 GB RAM, 64 GB disk (matches existing) |
| VM provisioning | cloud-init via `create-template-vm` (template bumped v1.32 → v1.34 first; `k8s_join_command = ""` for masters) |
| etcd | stacked (kubeadm-managed) |
| Multi-master apiserver flags | rbac stack refactored to loop over master list (Phase 1.5) |
| controlPlaneEndpoint + cert SAN retrofit | Phase 0, before any new master joins |
| k8s-version-upgrade chain | extended to multi-master in Phase 7 |
|
||||
## Callers / blast radius
|
||||
|
||||
| Surface | Path | Phase |
|---|---|---|
| Worker `/etc/kubernetes/kubelet.conf` × 4 | nodes 1-4 | 4.2 |
| `/home/wizard/code/infra/config` (root kubeconfig used by every `tg apply`) | repo root | 4.1 |
| `config.tfvars:115` (`kubernetes IN A 10.0.20.100` zone-file record) | repo root | 1.1 (delete) |
| `config.tfvars:231` (`k8s_join_command` for cloud-init template) | repo root | 4.1 (flip to VIP) |
| `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf` | `var.k8s_master_host` defaults | 1.5 (refactor to list) |
| `.woodpecker/{default,drift-detection,renew-tls,provision-user}.yml` (4 files × 2 refs each — kubeconfig `server:` AND `curl` lines) | repo root | 4.1 |
| `stacks/k8s-portal/.../files/src/routes/{download,setup/script}/+server.ts` (`CLUSTER_SERVER` const used to generate user kubeconfigs) | k8s-portal module | 4.1 |
| `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` (hard-coded `k8s-master` in phase_master) | stack | 7.1 |
| `stacks/infra-maintenance/.../main.tf` lines 98 + 218 (`node_name = "k8s-master"` on etcd-backup + defrag-etcd CronJobs) | stack | 4.5 |
| `kured-sentinel-gate` bash loop | `stacks/kured/main.tf` | 5.1 |
| `docs/architecture/compute.md`, `.claude/skills/uptime-kuma/SKILL.md`, runbooks | docs | 6.3 |
| **No-op surfaces** (confirmed clean): Vault (uses `kubernetes.default.svc`), Cloudflared (no apiserver tunnel), in-cluster `kubernetes.default.svc` / `10.96.0.1`, etcd-backup CORRECTNESS (snapshot is cluster-wide), kubeadm-managed etcd peer certs (auto-generated on join) | | — |
|
||||
## Edge cases
|
||||
|
||||
- **Phase 0 apiserver restart (~30s)** = same blast radius as today's k8s upgrade (tigera/cnpg/gpu-operator briefly crash). The LB doesn't help here because the new cert isn't yet trusted by clients. Accept the brief outage. Schedule during a low-activity window.
|
||||
- **`kubeadm-certs` secret TTL = 2h** (NOT 24h as initially stated). Phase 2 + 3 must complete within the window, or re-upload between them.
|
||||
- **pfSense haproxy bootstrap = reset-to-declared-state** on each run (lines 155-158 of the script). Adding master-2 means the apiserver pool is briefly torn down + rebuilt. TCP frontends bounce. Long-poll connections from kubelets break + reconnect. Expect ~2-5s of "kubectl: unable to connect" during pool rewrites.
|
||||
- **TCP health check is too lax** for apiserver (listener up ≠ ready). Phase 1 uses HTTPS `GET /readyz` with `verify none` — catches NotReady (etcd unreachable, controller-manager flapping).
|
||||
- **Worker kubelet.conf flip**: kubelet TLS bootstrap re-auths against new endpoint on restart. Expect 5-10s NotReady per node during the Phase 4.2 loop.
|
||||
- **VIP cannot be the existing master IP**: confirmed `.99` is free (no grep matches, no MetalLB pool conflict — pool is .200-.220).
|
||||
- **pfSense reboot windows**: pre-Phase-4 OK (clients still on direct IP), post-Phase-4 breaks everything. Don't migrate near a pfSense maintenance window.
|
||||
|
||||
## Phased plan
|
||||
|
||||
Reversible up to Phase 4. Phase 4+ reverse via the rollback section.
|
||||
|
||||
### Phase 0 — Retrofit existing cluster (~30 min, ~30s of apiserver outage)
|
||||
|
||||
- [ ] **0.1 Pre-flight**
|
||||
- [ ] Cluster healthy: `kubectl get nodes` (all Ready), `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` empty
|
||||
- [ ] Recent etcd backup valid: `ls -lh /srv/nfs/etcd-backup/ | tail -5`
|
||||
- [ ] Proxmox VM snapshot of `k8s-master`: `ssh root@192.168.1.127 qm snapshot 200 pre-ha-retrofit`
|
||||
- [ ] IPs free: `for ip in 99 110 111; do ping -c1 -W1 10.0.20.$ip && echo "BUSY $ip" || echo "free $ip"; done`
|
||||
- [ ] **0.2 Patch `kubeadm-config` ConfigMap via kubeadm (NOT kubectl apply)**
|
||||
- [ ] On master: `sudo kubeadm config print init-defaults --component-configs=KubeletConfiguration > /tmp/kubeadm-new.yaml`
|
||||
- [ ] Hand-edit /tmp/kubeadm-new.yaml: take the existing CM as base, add `controlPlaneEndpoint: 10.0.20.99:6443` under ClusterConfiguration, add `apiServer.certSANs: [10.0.20.99, k8s-apiserver.viktorbarzin.lan]`
|
||||
- [ ] Apply via kubeadm (kubeadm-owned, future `kubeadm upgrade apply` won't overwrite): `sudo kubeadm init phase upload-config kubeadm --config /tmp/kubeadm-new.yaml`
|
||||
- [ ] Verify: `kubectl -n kube-system get cm kubeadm-config -o yaml | grep -E 'controlPlaneEndpoint|certSANs'`
|
||||
- [ ] **0.3 Regen apiserver cert**
|
||||
- [ ] On master: `sudo mkdir -p /tmp/apiserver-backup && sudo mv /etc/kubernetes/pki/apiserver.{crt,key} /tmp/apiserver-backup/`
|
||||
- [ ] `sudo kubeadm init phase certs apiserver` (reads patched kubeadm-config)
|
||||
- [ ] Verify: `sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A2 'Subject Alternative'` — expect `IP Address:10.0.20.99` PLUS existing SANs (kubeadm adds, doesn't replace)
|
||||
- [ ] **0.4 Restart kube-apiserver static pod**
|
||||
- [ ] On master: `sudo kubectl -n kube-system delete pod kube-apiserver-k8s-master --force --grace-period=0`
|
||||
- [ ] Wait: `kubectl wait --for=condition=Ready pod/kube-apiserver-k8s-master -n kube-system --timeout=180s`
|
||||
- [ ] Verify: `kubectl get nodes` works (apiserver alive on direct IP)
|
||||
- [ ] **0.5 Panic-mode rollback procedure (DOCUMENTED ONLY — only run if 0.4 fails)**
|
||||
- [ ] `sudo cp /tmp/apiserver-backup/apiserver.{crt,key} /etc/kubernetes/pki/`
|
||||
- [ ] `sudo systemctl restart kubelet` (forces static pod re-read)
|
||||
- [ ] Wait for apiserver Ready; revert kubeadm-config edits via the file backup
|
||||
- [ ] **0.6 Verify operators recovered from brief outage**
|
||||
- [ ] `kubectl get pods -n calico-system -l app=tigera-operator -o wide` — Running, restart count incremented by 1 max
|
||||
- [ ] `kubectl get pods -n gpu-operator -o wide` — same
|
||||
- [ ] `kubectl get pods -n cnpg-system -o wide` — same
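
The SAN verification in step 0.3 can be made stricter than an eyeball `grep`. The sketch below assumes the SAN list is a single comma-separated line, as printed under "X509v3 Subject Alternative Name" by `openssl x509 -text`. `has_san` and `missing_sans` are hypothetical helper names, and the sample line is illustrative rather than captured from the cluster:

```bash
#!/usr/bin/env bash
# has_san ENTRY SAN_LINE: succeeds only if ENTRY is an exact SAN entry,
# so "IP Address:10.0.20.9" does not match "IP Address:10.0.20.99".
has_san() {
  local entry=$1 line=$2
  printf '%s\n' "$line" | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -Fxq -- "$entry"
}

# missing_sans SAN_LINE ENTRY...: prints each absent entry, returns 1 if any.
missing_sans() {
  local line=$1 san rc=0
  shift
  for san in "$@"; do
    has_san "$san" "$line" || { echo "missing SAN: $san"; rc=1; }
  done
  return "$rc"
}

# Illustrative SAN line (not real cluster output).
sample='DNS:k8s-master, IP Address:10.96.0.1, IP Address:10.0.20.100, IP Address:10.0.20.99'
missing_sans "$sample" 'IP Address:10.0.20.99' 'DNS:k8s-apiserver.viktorbarzin.lan' \
  || echo "cert is missing expected SANs; re-check kubeadm-config before continuing"
```

On the master, the SAN line would come from the `openssl x509 ... -text` command already used in step 0.3.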
|
||||
|
||||
### Phase 1 — pfSense HAProxy + DNS (~30 min)
|
||||
|
||||
- [ ] **1.1 Reserve VIP `10.0.20.99` + DNS**
|
||||
- [ ] Add Virtual IP on pfSense (Firewall → Virtual IPs → IP Alias on VLAN20, `10.0.20.99/24`)
|
||||
- [ ] Add `k8s-apiserver-vip → 10.0.20.99` host alias (Firewall → Aliases → Hosts)
|
||||
- [ ] phpIPAM: register `10.0.20.99` under section "K8s cluster"
|
||||
- [ ] Add DNS A record `k8s-apiserver IN A 10.0.20.99` to `config.tfvars` (and **delete** stale `kubernetes IN A 10.0.20.100` on line 115)
|
||||
- [ ] `scripts/tg apply -target=module.technitium` — confirm zone reload
|
||||
- [ ] **1.2 Extend `infra/scripts/pfsense-haproxy-bootstrap.php` for apiserver pool with HTTPS health check**
|
||||
- [ ] Add `build_pool_https()` helper variant (or add `$use_https_readyz` param to existing `build_pool()`) that emits `check_type='HTTP'`, `monitor_uri='/readyz'`, `httpchk_method='GET'`, `ssl='yes'`, `sslverify='no'`
|
||||
- [ ] Add `'apiserver_nodes'` to `$POOL_NAMES`; `'apiserver_proxy_6443'` to `$FRONTEND_NAMES`
|
||||
- [ ] `build_pool_https('apiserver_nodes', '6443', [['k8s-master', '10.0.20.100']])`
|
||||
- [ ] `build_frontend('apiserver_proxy_6443', 'K8s apiserver VIP', '10.0.20.99', '6443', 'apiserver_nodes')`
|
||||
- [ ] **1.3 Deploy + validate**
|
||||
- [ ] `scp infra/scripts/pfsense-haproxy-bootstrap.php admin@10.0.20.1:/tmp/ && ssh admin@10.0.20.1 'php /tmp/pfsense-haproxy-bootstrap.php'`
|
||||
- [ ] `ssh admin@10.0.20.1 'sockstat -l | grep 10.0.20.99:6443'` — expect haproxy listening
|
||||
- [ ] `ssh admin@10.0.20.1 "echo 'show servers state' | socat /tmp/haproxy.socket stdio" | grep apiserver` — backend UP (op_state=2)
|
||||
- [ ] **1.4 Smoke via VIP**
|
||||
- [ ] From devvm: `curl --cacert /etc/kubernetes/pki/ca.crt https://10.0.20.99:6443/readyz` — expect `ok`
|
||||
- [ ] Build a transient kubeconfig pointing at VIP, run `kubectl get nodes` — succeeds
|
||||
- [ ] **If TLS validation fails: STOP — Phase 0 cert regen didn't include VIP**, rollback Phase 1 and retry Phase 0
|
||||
|
||||
### Phase 1.5 — Refactor rbac stack for multi-master (~45 min)
|
||||
|
||||
- [ ] **1.5.1 Refactor `stacks/rbac/modules/rbac/{apiserver-oidc,audit-policy,etcd-tuning}.tf`**
|
||||
- [ ] Replace `var.k8s_master_host = "10.0.20.100"` with `var.k8s_master_hosts = list(string)` (default `["10.0.20.100"]`)
|
||||
- [ ] Wrap each `null_resource` / `provisioner "remote-exec"` block in `for_each = toset(var.k8s_master_hosts)` so the same sed runs on every master
|
||||
- [ ] In `stacks/rbac/main.tf` set `k8s_master_hosts = ["10.0.20.100"]` (still single-master in this phase — variable is forward-looking, no behaviour change yet)
|
||||
- [ ] **1.5.2 `scripts/tg apply` rbac stack** — confirm zero diff against today (no-op refactor)
|
||||
- [ ] **1.5.3 Verify** — sanity: `ssh wizard@k8s-master 'sudo grep oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml | wc -l'` — expect `1`. Cluster healthy.
|
||||
|
||||
### Phase 2 — Cloud-init template bump + master-2 (~75 min)
|
||||
|
||||
- [ ] **2.0 Bump cloud-init template (one-time)**
|
||||
- [ ] Edit `infra/modules/create-template-vm/cloud_init.yaml`:
|
||||
- line 49: apt source `pkgs.k8s.io/core:/stable:/v1.32/deb/` → `pkgs.k8s.io/core:/stable:/v1.34/deb/`
|
||||
- line 135: wrap `${k8s_join_command}` in a conditional via cloud-init `if:` template logic, or simpler: add `${k8s_join_command_or_noop}` and let the module pass `""` for masters and the real worker join command for workers (default)
|
||||
- [ ] Update `infra/modules/create-template-vm/main.tf` to add `variable "k8s_join_command" { default = "" }` and a conditional in the templatefile to skip the runcmd line when empty
|
||||
- [ ] Rebuild the template: `scripts/tg apply -target=module.k8s_template` (or whatever the existing template-build target name is in `stacks/infra/main.tf`)
|
||||
- [ ] Verify new template registered in Proxmox at the same template_id
|
||||
- [ ] **2.1 Add master-2 VM to Terraform**
|
||||
- [ ] In `stacks/infra/main.tf`: add `module "k8s-master-2"` using `create-vm` from the (now-v1.34) k8s template, with master sizing (8 vCPU / 32GB / 64GB), VMID 205, IP `10.0.20.110`, unique MAC, `vmbr1/vlan 20`, `use_cloud_init = true`, and explicitly pass `k8s_join_command = ""` (so first-boot does NOT auto-join as worker)
|
||||
- [ ] `scripts/tg apply -target=module.k8s-master-2`
|
||||
- [ ] Verify VM booted: `ssh wizard@k8s-master-2.viktorbarzin.lan uname -a` (expect Ubuntu 26.04 LTS, kernel 7.0.x)
|
||||
- [ ] **2.2 Prep master-2 for kubeadm join**
|
||||
- [ ] Confirm versions: `ssh wizard@k8s-master-2.viktorbarzin.lan 'kubeadm version; containerd --version'` — expect kubeadm v1.34.x, containerd 2.2.2+
|
||||
- [ ] DNS resolves: `getent hosts k8s-master-2.viktorbarzin.lan`
|
||||
- [ ] **2.3 Upload certs on existing master**
|
||||
- [ ] `sudo kubeadm init phase upload-certs --upload-certs` → records `--certificate-key <KEY>`
|
||||
- [ ] **2h TTL** — Phase 2 + 3 must complete within window or re-upload
|
||||
- [ ] **2.4 Generate join command**
|
||||
- [ ] `sudo kubeadm token create --print-join-command` → `kubeadm join 10.0.20.99:6443 --token <T> --discovery-token-ca-cert-hash sha256:<H>`
|
||||
- [ ] Append `--control-plane --certificate-key <KEY>`
|
||||
- [ ] **2.5 Run join on master-2**
|
||||
- [ ] `ssh wizard@k8s-master-2.viktorbarzin.lan` → run sudo join command from 2.4
|
||||
- [ ] Wait for "This node has joined the cluster"
|
||||
- [ ] **2.6 Update rbac stack to include master-2 (propagates OIDC/audit/etcd tuning to it)**
|
||||
- [ ] Edit `stacks/rbac/main.tf`: `k8s_master_hosts = ["10.0.20.100", "10.0.20.110"]`
|
||||
- [ ] `scripts/tg apply` rbac stack
|
||||
- [ ] Verify: `ssh wizard@k8s-master-2 'sudo grep -c oidc-issuer-url /etc/kubernetes/manifests/kube-apiserver.yaml'` — expect `1`
|
||||
- [ ] **2.7 Smoke**
|
||||
- [ ] `kubectl get nodes` — 6 nodes, master-2 Ready control-plane
|
||||
- [ ] `kubectl -n kube-system get pods -o wide | grep k8s-master-2` — 4 static pods Running
|
||||
- [ ] etcd member list shows 2 members
|
||||
- [ ] `kubectl --server=https://10.0.20.110:6443 get nodes` — direct probe works
|
||||
- [ ] **2.8 Add master-2 to LB pool**
|
||||
- [ ] Edit `pfsense-haproxy-bootstrap.php`: pool now `[['k8s-master', '10.0.20.100'], ['k8s-master-2', '10.0.20.110']]`
|
||||
- [ ] Deploy + verify both backends UP
|
||||
|
||||
### Phase 3 — master-3 (~45 min) — same pattern as Phase 2
|
||||
|
||||
- [ ] **3.1 Add `module.k8s-master-3` to Terraform** (VMID 206, IP `10.0.20.111`, same template, `k8s_join_command = ""`)
|
||||
- [ ] **3.2 Prep verify**
|
||||
- [ ] **3.3 Re-upload certs if >2h since Phase 2.3, refresh `--certificate-key`**
|
||||
- [ ] **3.4 Generate fresh join command**
|
||||
- [ ] **3.5 Run join on master-3**
|
||||
- [ ] **3.6 Update rbac stack: `k8s_master_hosts = [".100", ".110", ".111"]`, apply, verify master-3 has OIDC flag**
|
||||
- [ ] **3.7 Smoke (7 nodes, 3 control-plane, etcd quorum 3/3)**
|
||||
- [ ] **3.8 Add master-3 to LB pool — all three backends UP**
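
Steps 2.3 and 3.3 both hinge on the 2h lifetime of the uploaded certificate key. One way to avoid guessing is to record the upload time and compare it before the master-3 join. This is only a sketch: `cert_key_fresh` is a hypothetical helper, and the timestamp would have to be recorded by hand right after step 2.3:

```bash
#!/usr/bin/env bash
CERT_KEY_TTL_SECONDS=$(( 2 * 60 * 60 ))   # 2h TTL from the plan

# cert_key_fresh UPLOAD_EPOCH [NOW_EPOCH]
# Succeeds while the key uploaded at UPLOAD_EPOCH is still inside the TTL.
cert_key_fresh() {
  local uploaded=$1 now=${2:-$(date +%s)}
  [ $(( now - uploaded )) -lt "$CERT_KEY_TTL_SECONDS" ]
}

uploaded_at=$(date +%s)   # in practice: the time step 2.3 was run
if cert_key_fresh "$uploaded_at"; then
  echo "certificate key still valid; reuse it for the next join"
else
  echo "certificate key expired; re-run upload-certs before joining"
fi
```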
|
||||
|
||||
### Phase 4 — Cut over clients and workers to VIP (~45 min)
|
||||
|
||||
- [ ] **4.1 Update in-repo kubeconfig consumers (single commit)**
|
||||
- [ ] `/home/wizard/code/infra/config` — flip `server:` to `https://10.0.20.99:6443`
|
||||
- [ ] `config.tfvars:231` — `k8s_join_command` to `kubeadm join 10.0.20.99:6443 ...`
|
||||
- [ ] `stacks/rbac/modules/rbac/apiserver-oidc.tf` — variable `default = "10.0.20.99"` (or whatever the multi-master refactor needs)
|
||||
- [ ] `.woodpecker/default.yml` — flip server: AND curl URL
|
||||
- [ ] `.woodpecker/drift-detection.yml` — flip server: AND curl URL
|
||||
- [ ] `.woodpecker/renew-tls.yml` — flip curl URL (line 18)
|
||||
- [ ] `.woodpecker/provision-user.yml` — flip curl URL (line 41)
|
||||
- [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/download/+server.ts` — `CLUSTER_SERVER` const
|
||||
- [ ] `stacks/k8s-portal/modules/k8s-portal/files/src/routes/setup/script/+server.ts` — same
|
||||
- [ ] Final sweep: `cd /home/wizard/code/infra && grep -rn '10.0.20.100:6443' --include='*.tf' --include='*.yml' --include='*.yaml' --include='*.ts' --include='*.php' --include='*.sh'` — handle anything remaining
|
||||
- [ ] `scripts/tg apply` for rbac + k8s-portal (and any other stacks touched)
|
||||
- [ ] Commit + push (single conventional commit referencing `code-n0ow`)
|
||||
- [ ] **4.2 Worker `kubelet.conf` flip (one at a time, with 5-10s expected NotReady)**
|
||||
  ```bash
  for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
    echo "=== $n ==="
    ssh wizard@$n.viktorbarzin.lan "sudo sed -i.bak 's|server: https://10.0.20.100:6443|server: https://10.0.20.99:6443|' /etc/kubernetes/kubelet.conf"
    ssh wizard@$n.viktorbarzin.lan "sudo systemctl restart kubelet"
    kubectl wait --for=condition=Ready node/$n --timeout=180s
    echo "$n Ready"
    sleep 15
  done
  ```
- [ ] **4.3 Existing master's `kubelet.conf`** — same sed + restart on `k8s-master`
|
||||
- [ ] **4.4 Verify master-2 + master-3 kubelet.conf already at VIP** (cloud-init join used VIP via controlPlaneEndpoint)
|
||||
- [ ] **4.5 Verify everything**
|
||||
- [ ] `kubectl get nodes` — all 7 Ready
|
||||
- [ ] `kubectl --kubeconfig ~/.kube/config config view --minify -o jsonpath='{.clusters[0].cluster.server}'` → `https://10.0.20.99:6443`
|
||||
- [ ] Worker loop: `for n in k8s-{master,node1,node2,node3,node4,master-2,master-3}; do ssh wizard@$n.viktorbarzin.lan "sudo grep server: /etc/kubernetes/kubelet.conf"; done` — all show VIP
|
||||
- [ ] Trigger a no-op Woodpecker pipeline (commit a typo fix in a runbook) — verify the kubeconfig path through the new VIP
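
Before running the step 4.2 loop against real nodes, the `sed` substitution can be rehearsed on a scratch file. The sketch below wraps the same substitution in a hypothetical `flip_server` helper that also confirms the new endpoint is present afterwards; it never touches a real `kubelet.conf`:

```bash
#!/usr/bin/env bash
# flip_server FILE OLD_ENDPOINT NEW_ENDPOINT
# Rewrites the `server:` line in place (keeping FILE.bak) and fails if the
# new endpoint is not present afterwards.
flip_server() {
  local file=$1 old=$2 new=$3
  sed -i.bak "s|server: https://${old}|server: https://${new}|" "$file"
  grep -q "server: https://${new}\$" "$file"
}

# Dry run against a scratch file.
tmp=$(mktemp)
printf '    server: https://10.0.20.100:6443\n' > "$tmp"
flip_server "$tmp" 10.0.20.100:6443 10.0.20.99:6443 && cat "$tmp"
rm -f "$tmp" "$tmp.bak"
```

The non-zero exit when the endpoint is absent gives the loop a reason to stop before restarting kubelet on a node whose config did not change.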
|
||||
|
||||
### Phase 4.5 — Fix etcd-backup CronJob node pinning (~15 min)
|
||||
|
||||
- [ ] **4.5.1 Edit `stacks/infra-maintenance/modules/infra-maintenance/main.tf`**
|
||||
- [ ] backup-etcd (line 98): replace `node_name = "k8s-master"` with `nodeSelector { "node-role.kubernetes.io/control-plane" = "" }` + the corresponding toleration block
|
||||
- [ ] defrag-etcd (line 218): same change
|
||||
- [ ] **4.5.2 `scripts/tg apply` infra-maintenance**
|
||||
- [ ] **4.5.3 Verify backup runs** — trigger a manual job-from-cronjob, confirm it lands on one of the 3 masters and produces a valid snapshot
|
||||
|
||||
### Phase 5 — kured-sentinel-gate quorum check (~15 min)
|
||||
|
||||
- [ ] **5.1 Edit `infra/stacks/kured/main.tf`** (insert into the bash heredoc in the sentinel-gate ConfigMap, between all-nodes-Ready and calico-Ready checks)
|
||||
  ```bash
  # Check 3b: control-plane quorum safety (HA invariant)
  CP_READY=$(kubectl get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l | tr -d ' ')
  if [ "$CP_READY" -lt 2 ]; then
    echo " BLOCKED: Only $CP_READY control-plane node(s) Ready (need ≥2 for HA)"
    rm -f /host/var-run/gated-reboot-required
    sleep 300
    continue
  fi
  echo " Control-plane quorum safe ($CP_READY Ready)"
  ```
- [ ] **5.2 `scripts/tg apply` kured**
|
||||
- [ ] **5.3 Verify**
|
||||
- [ ] `kubectl -n kured logs ds/kured-sentinel-gate | tail -50` — expect "Control-plane quorum safe (3 Ready)" line
|
||||
- [ ] Negative test: cordon `k8s-master-2`, wait for the gate to re-evaluate, confirm block message. Restore.
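
The counting logic of the gate can be exercised offline before it goes into the ConfigMap. The sketch below is an alternative formulation, not the gate's actual code: it matches the STATUS column exactly with `awk` instead of `grep ' Ready '`, and the node rows are illustrative sample input. `count_ready_cp` and `quorum_gate` are hypothetical names:

```bash
#!/usr/bin/env bash
# count_ready_cp: reads `kubectl get nodes --no-headers` rows on stdin and
# prints how many have a STATUS column of exactly "Ready". Cordoned nodes
# ("Ready,SchedulingDisabled") and NotReady nodes are not counted.
count_ready_cp() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# quorum_gate READY_COUNT [MINIMUM]: mirrors the gate's pass/block decision.
quorum_gate() {
  local ready=$1 min=${2:-2}
  if [ "$ready" -lt "$min" ]; then
    echo "BLOCKED: only $ready control-plane node(s) Ready"
    return 1
  fi
  echo "Control-plane quorum safe ($ready Ready)"
}

# Illustrative rows (not real cluster output).
ready=$(count_ready_cp <<'EOF'
k8s-master     Ready                      control-plane   400d   v1.34.0
k8s-master-2   Ready,SchedulingDisabled   control-plane   2d     v1.34.0
k8s-master-3   NotReady                   control-plane   2d     v1.34.0
EOF
)
quorum_gate "$ready" || echo "gate would sleep and re-evaluate"
```

With this sample only one master counts as Ready, so the gate blocks, which is the behaviour the negative test in 5.3 expects after cordoning `k8s-master-2` if another master is also down.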
|
||||
|
||||
### Phase 6 — E2E validation (~30 min)
|
||||
|
||||
- [ ] **6.1 Failover test**
|
||||
- [ ] `kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets`
|
||||
- [ ] `ssh wizard@k8s-master.viktorbarzin.lan sudo reboot`
|
||||
- [ ] During the 50-90s reboot: tight loop `while true; do kubectl get nodes -o name | wc -l; sleep 2; done` from devvm — line count never drops to 0 (LB transparent)
|
||||
- [ ] After boot: `kubectl uncordon k8s-master`, verify apiserver static pod re-registers in LB pool (op_state=2)
|
||||
- [ ] **6.2 All-masters apiserver flag parity**
|
||||
- [ ] `for h in k8s-master k8s-master-2 k8s-master-3; do echo "=== $h ==="; ssh wizard@$h.viktorbarzin.lan 'sudo grep -E "oidc-issuer-url|audit-policy|auto-compaction-retention|snapshot-count" /etc/kubernetes/manifests/{kube-apiserver,etcd}.yaml | sort'; done`
|
||||
- [ ] Expect identical flag set across all 3 masters
|
||||
- [ ] **6.3 Update documentation**
|
||||
- [ ] Add `docs/architecture/control-plane.md` — HA topology, etcd member list, LB config location
|
||||
- [ ] Update `.claude/reference/proxmox-inventory.md` — add VMIDs 205, 206
|
||||
- [ ] Add `docs/runbooks/control-plane-add-remove-master.md`
|
||||
- [ ] Update `docs/runbooks/restore-etcd.md` to cover 3-member quorum restore (was single-master only)
|
||||
- [ ] Cross-link `docs/runbooks/mailserver-pfsense-haproxy.md` with the new apiserver_proxy_6443 pool
|
||||
|
||||
### Phase 7 — Extend k8s-version-upgrade chain to multi-master (~60 min)
|
||||
|
||||
- [ ] **7.1 Edit `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`**
|
||||
- [ ] phase_master: discover masters dynamically — `MASTERS=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= -o name | sed 's|node/||')`
|
||||
- [ ] Wrap drain → `update_k8s.sh` → uncordon → wait-ready in a `for m in $MASTERS; do ... done` loop
|
||||
  - [ ] Between masters, run a quorum check: `READY=$($KUBECTL get nodes -l node-role.kubernetes.io/control-plane= --no-headers | grep ' Ready ' | wc -l); [ "$READY" -ge 2 ] || { slack "ABORT quorum lost"; exit 1; }`
|
||||
- [ ] Update line 9 + 17 comment block to reflect multi-master phase
|
||||
- [ ] Update line 326-340 containerd-bump section to loop over masters
|
||||
- [ ] **7.2 Edit `phase_preflight` and the master phase pin**
|
||||
- [ ] Line 209-210 (scheduling_block): allow any control-plane node to be the target
|
||||
- [ ] Line 285 (`kubeadm upgrade plan` check): run against the first master in the list, not specifically `k8s-master`
|
||||
- [ ] **7.3 `scripts/tg apply` k8s-version-upgrade**
|
||||
- [ ] **7.4 Dry-run test**
|
||||
- [ ] `kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)` (no actual upgrade pending — chain should noop the upgrade phase but exercise the discovery loop)
|
||||
- [ ] Verify logs show 3 masters discovered in correct order
|
||||
- [ ] **7.5 (Real test on next patch release)** — when 1.34.8 ships:
|
||||
- [ ] Watch the chain execute drain → upgrade → uncordon across all 3 masters in turn
|
||||
- [ ] Confirm no manual intervention needed
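
The discovery loop and the between-masters quorum check from 7.1 can be sketched as below. This is a minimal illustration, not the contents of `upgrade-step.sh`: the function names and the `upgrade_one_master` placeholder are assumptions, and the real script would call drain, `update_k8s.sh`, uncordon and wait-ready where the placeholder sits.

```bash
#!/usr/bin/env bash
# Sketch only: function names and upgrade_one_master are hypothetical
# stand-ins, not the real upgrade-step.sh internals.
KUBECTL="${KUBECTL:-kubectl}"
CP_LABEL="node-role.kubernetes.io/control-plane="

# Print control-plane node names, one per line, without the "node/" prefix.
list_masters() {
  "$KUBECTL" get nodes -l "$CP_LABEL" -o name | sed 's|^node/||'
}

# Count control-plane nodes whose STATUS column is exactly "Ready".
# A cordoned node reports "Ready,SchedulingDisabled" and is not counted.
count_ready_masters() {
  "$KUBECTL" get nodes -l "$CP_LABEL" --no-headers \
    | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Placeholder for drain -> update_k8s.sh -> uncordon -> wait-ready.
upgrade_one_master() {
  echo "would upgrade $1"
}

# Upgrade masters one at a time; stop as soon as fewer than 2 are Ready.
rolling_upgrade() {
  local m ready
  for m in $(list_masters); do
    upgrade_one_master "$m" || return 1
    ready=$(count_ready_masters)
    if [ "$ready" -lt 2 ]; then
      echo "ABORT quorum lost after $m ($ready Ready)" >&2
      return 1
    fi
  done
}
```

Checking quorum after each master rather than once up front means a master that fails to come back stops the chain before the next drain.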
|
||||
|
||||
### Phase 8 — Close out
|
||||
|
||||
- [ ] **8.1 Update beads** — `bd close code-n0ow` once all 6 acceptance criteria met (see below)
|
||||
|
||||
## Rollback plan
|
||||
|
||||
### Before Phase 4 (no clients flipped)
|
||||
|
||||
- **Phase 0**: restore apiserver cert/key from `/tmp/apiserver-backup/`, edit kubeadm-config back, restart kubelet on master.
|
||||
- **Phase 1**: remove `apiserver_proxy_6443` + `apiserver_nodes` from `pfsense-haproxy-bootstrap.php`, re-run; revert DNS A record in config.tfvars.
|
||||
- **Phase 1.5**: revert rbac stack to single `k8s_master_host` var; apply.
|
||||
- **Phase 2/3**: on failed master `sudo kubeadm reset --force`; from a surviving master `etcdctl member remove <id>`; `tg destroy -target=module.k8s-master-N`.
|
||||
|
||||
### After Phase 4 (clients flipped)
|
||||
|
||||
- Revert all the Phase 4.1 file changes (single revert commit).
|
||||
- Reverse the kubelet.conf sed loop (VIP → direct IP) on all 7 nodes.
|
||||
- Phase 0 controlPlaneEndpoint can stay — harmless even on full rollback.
|
||||
|
||||
### Worst case (etcd corruption / multi-master split-brain)
|
||||
|
||||
- Restore from latest etcd snapshot via `etcdctl snapshot restore` to a single master.
|
||||
- Rebuild master VM from the Proxmox snapshot taken in Phase 0.1.
|
||||
- Cluster back to single-master.
|
||||
|
||||
## Acceptance criteria (beads `code-n0ow`)
|
||||
|
||||
- [x] 1. Design doc + plan doc written (this commit)
|
||||
- [ ] 2. Plan approved by user
|
||||
- [ ] 3. 3 masters online, etcd quorum healthy, apiserver LB working
|
||||
- [ ] 4. k8s upgrade chain runs end-to-end across **all 3 masters** without manual intervention (Phase 7)
|
||||
- [ ] 5. kured-sentinel-gate respects quorum (Phase 5)
|
||||
- [ ] 6. etcd backup runs from any control-plane node (Phase 4.5)
|
||||
|
||||
## Open questions
|
||||
|
||||
None — all locked via 2026-05-22 decision pass + challenger amendment pass.
|
||||
269
docs/plans/2026-05-22-openclaw-devvm-access-design.md
Normal file
|
|
@ -0,0 +1,269 @@
|
|||
# OpenClaw devvm access + async task pattern — design
|
||||
|
||||
**Date:** 2026-05-22
|
||||
**Stack:** `infra/stacks/openclaw`
|
||||
**Status:** Approved (in-session, see chat history 2026-05-22)
|
||||
|
||||
## Goal
|
||||
|
||||
Give the OpenClaw pod (running in K8s) two new capabilities:
|
||||
|
||||
1. **Host-tools bundle** — common Linux CLIs the upstream OpenClaw image
|
||||
doesn't ship (`ssh`, `scp`, `vault`, `dig`, `jq`, `yq`, `ripgrep`, `fd`,
|
||||
`gnupg`, `tmux`, etc.). OpenClaw can't `apt install` because the
|
||||
container runs as non-root `node` (uid 1000).
|
||||
2. **devvm async task pattern** — OpenClaw spawns long-running work as
|
||||
`tmux` sessions on devvm, sends prompts via `tmux send-keys`, captures
|
||||
progress via `tmux capture-pane`. Sessions live on devvm, so they
|
||||
survive OpenClaw pod restarts.
|
||||
|
||||
OpenClaw uses this combination as a **trusted fallback** for tasks too
|
||||
expensive, sensitive, or stateful for in-pod execution: Vault lookups,
|
||||
multi-step `claude-code` work, anything needing wizard's full home-lab
|
||||
access.
|
||||
|
||||
## Why now
|
||||
|
||||
- The in-pod sandbox is `security=full` but the container is minimal —
|
||||
no `ssh`, no `vault`, no `dig`, no `tmux`.
|
||||
- The user wants OpenClaw to be a first-line agent that delegates heavy
|
||||
work to the dev VM rather than duplicate that work in a constrained pod.
|
||||
- Long-running work (multi-minute `claude-code` sessions) shouldn't be
|
||||
tied to a single synchronous `claude -p` invocation — needs persistence
|
||||
and pollability.
|
||||
|
||||
## Architecture decision: stay on K8s
|
||||
|
||||
Discussed migrating OpenClaw to run directly on devvm (would obviate the
|
||||
host-tools bundle + most of the SSH setup). Decision: **stay on K8s**.
|
||||
|
||||
Reasons:
|
||||
- Keeps HA (5-node cluster vs single devvm reboot)
|
||||
- Keeps ingress/Authentik/Telegram entry chain intact
|
||||
- Keeps Prometheus scrape + exporter sidecar
|
||||
- Keeps PVC backup pipeline (LVM snapshots + Synology offsite)
|
||||
- Resource isolation — a runaway LLM session can't stress wizard's daily-driver VM
|
||||
- Migration cost is several days; this design is ~150 LoC + an 80-line wrapper
|
||||
|
||||
The mental model — "OpenClaw is sandboxed, delegates to wizard@devvm for
|
||||
trusted heavy lifting" — is a clean security boundary. Worth preserving.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Pod side (`infra/stacks/openclaw/main.tf`)
|
||||
|
||||
Two new init containers added to the OpenClaw Deployment, after the
|
||||
existing four:
|
||||
|
||||
#### Init 5 — `install-host-tools`
|
||||
|
||||
- Image: `debian:bookworm-slim` (matches main container base for glibc compat)
|
||||
- Idempotent: skips if `/tools/host-tools/.installed-v1` exists
|
||||
- `apt-get install --download-only --no-install-recommends` for:
|
||||
`openssh-client dnsutils iputils-ping wget gnupg jq ripgrep fd-find ncdu htop strace tcpdump tmux unzip`
|
||||
- Iterates `.deb` files in `/var/cache/apt/archives/`, `dpkg-deb -x` each
|
||||
into `/tools/host-tools/root/` (preserves `usr/bin`, `usr/sbin`,
|
||||
`usr/lib` layout)
|
||||
- Downloads static binaries to `/tools/host-tools/bin/`:
|
||||
- `vault` (HashiCorp releases, pinned version)
|
||||
- `yq` (mikefarah/yq GitHub releases, pinned version)
|
||||
- Smoke test: invokes `--version` on each bundled binary; fails init if
|
||||
any won't load (catches glibc / shared-lib drift at deploy time, not
|
||||
runtime)
|
||||
- Writes marker file with version
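
A minimal sketch of the idempotent extract step described for `install-host-tools`, under stated assumptions: the function name and the `APT_ARCHIVES` override are hypothetical, and the download, static-binary and smoke-test steps are left out. It only shows the marker check and the `dpkg-deb -x` loop.

```bash
#!/usr/bin/env bash
# Sketch only: install_host_tools and APT_ARCHIVES are hypothetical names;
# the directory layout follows the design above.
install_host_tools() {
  local tools_dir="$1"
  local marker="$tools_dir/.installed-v1"
  local archives="${APT_ARCHIVES:-/var/cache/apt/archives}"
  local deb

  # Idempotency: a previous successful run left the marker behind.
  if [ -f "$marker" ]; then
    echo "skipped"
    return 0
  fi

  mkdir -p "$tools_dir/root" "$tools_dir/bin"
  for deb in "$archives"/*.deb; do
    [ -e "$deb" ] || continue   # glob matched nothing
    # Unpack without installing, preserving the usr/bin, usr/lib layout.
    dpkg-deb -x "$deb" "$tools_dir/root" || return 1
  done

  # Write the marker last so a failed extraction is retried on next start.
  echo "v1" > "$marker"
  echo "installed"
}
```

Writing the marker only after every archive has been extracted keeps a half-finished bundle from being treated as complete.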
|
||||
|
||||
#### Init 6 — `setup-ssh-config`
|
||||
|
||||
- Image: uses the just-installed host-tools (debian:bookworm-slim base
|
||||
with `/tools/host-tools/root/usr/bin` on PATH so `ssh-keyscan` works)
|
||||
- Runs after `install-host-tools`
|
||||
- Idempotent: skips if `/home/node/.openclaw/.ssh/.configured-v1` exists
|
||||
- Creates `/home/node/.openclaw/.ssh/` (uid 1000)
|
||||
- Copies `/ssh/id_rsa` (tmpfs secret mount) → `~/.ssh/id_rsa` with 0600
|
||||
(the secret tmpfs mount has wider perms that openssh rejects)
|
||||
- Writes `~/.ssh/config`:
|
||||
|
||||
```ssh-config
Host devvm
    HostName 10.0.10.10
    User wizard
    IdentityFile ~/.ssh/id_rsa
    UserKnownHostsFile ~/.ssh/known_hosts
    StrictHostKeyChecking yes
```
|
||||
|
||||
PATH handling on the remote side: devvm's sshd uses the default
|
||||
non-interactive PATH (`/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin`)
|
||||
and does NOT load `~/.profile` or `~/.bashrc` (memory id=740). Client-side
|
||||
`SetEnv PATH=…` doesn't help because sshd's `AcceptEnv` is `LANG LC_*` only.
|
||||
Solution: install the binaries openclaw cares about into `/usr/local/bin/`
|
||||
on devvm (see "Devvm side" below).
|
||||
|
||||
- Pre-seeds `~/.ssh/known_hosts` via `ssh-keyscan -H 10.0.10.10`
|
||||
- Writes marker file
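
The file-setup part of `setup-ssh-config` might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the SSH directory and source key are passed in as arguments, and the config uses the passed-in directory rather than `~/.ssh` so it can run anywhere.

```bash
#!/usr/bin/env bash
# Sketch only: write_ssh_config and seed_known_hosts are hypothetical names.
write_ssh_config() {
  local ssh_dir="$1" src_key="$2"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
  # OpenSSH rejects private keys readable by group or other, hence 0600.
  install -m 0600 "$src_key" "$ssh_dir/id_rsa"
  cat > "$ssh_dir/config" <<EOF
Host devvm
    HostName 10.0.10.10
    User wizard
    IdentityFile $ssh_dir/id_rsa
    UserKnownHostsFile $ssh_dir/known_hosts
    StrictHostKeyChecking yes
EOF
  chmod 600 "$ssh_dir/config"
}

# Needs network access to devvm, so it is kept apart from the file setup.
seed_known_hosts() {
  ssh-keyscan -H 10.0.10.10 > "$1/known_hosts"
}
```

Note that `ssh-keyscan` trusts whichever host answers at seeding time, so `StrictHostKeyChecking yes` only protects connections made after the first seed.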
|
||||
|
||||
#### Main container
|
||||
|
||||
- `PATH` env updated: prepend
|
||||
`/tools/host-tools/root/usr/bin:/tools/host-tools/root/usr/sbin:/tools/host-tools/bin`
|
||||
- No other changes to the startup command
|
||||
|
||||
### Devvm side
|
||||
|
||||
#### `/usr/local/bin/openclaw-task` wrapper
|
||||
|
||||
Canonical source: `infra/stacks/openclaw/files/openclaw-task.sh`.
|
||||
Installed to devvm at `/usr/local/bin/openclaw-task` (`sudo cp`, `sudo
|
||||
chmod +x`) so non-interactive SSH finds it on the default PATH without
|
||||
needing `~/.profile`. Updates: re-run the install steps from the
|
||||
canonical source.
|
||||
|
||||
Also: `sudo ln -s /home/wizard/.local/bin/claude /usr/local/bin/claude`
|
||||
so `ssh devvm claude …` works in non-interactive mode. `vault` and `tmux`
|
||||
are already at `/usr/bin/` (system packages) so no symlink needed for
|
||||
those.
|
||||
|
||||
POSIX shell script. Subcommands:
|
||||
|
||||
| Subcommand | Behavior |
|---|---|
| `new <id> <cmd...>` | Spawns detached tmux session `openclaw-task-<id>`, pipes pane output to `~/openclaw-tasks/<id>.log` |
| `claude <id> <prompt>` | Convenience: spawns interactive `claude` in a tmux session, send-keys the prompt + Enter |
| `send <id> <keys...>` | `tmux send-keys -t openclaw-task-<id> "$@"` (caller supplies `Enter` literal if needed) |
| `capture <id> [lines]` | `tmux capture-pane -t … -p -S -<lines>` (default last 1000) |
| `log <id>` | `cat ~/openclaw-tasks/<id>.log` |
| `tail <id>` | `tail -n 100 -f ~/openclaw-tasks/<id>.log` (mainly for human ops) |
| `list` | tmux session list filtered to `openclaw-task-*`, one id per line |
| `status <id>` | `running` if tmux session alive, `ended` otherwise |
| `kill <id>` | `tmux kill-session -t openclaw-task-<id>` (log file is kept) |
| `purge <id>` | `kill` + `rm -f ~/openclaw-tasks/<id>.log` |
|
||||
|
||||
Task state lives entirely on devvm:
|
||||
|
||||
- tmux sessions persist across SSH disconnects and OpenClaw pod restarts
|
||||
- `~/openclaw-tasks/<id>.log` is the durable transcript even after a
|
||||
session is killed
|
||||
- No central database — `tmux list-sessions` is the source of truth for
|
||||
"what's running"
|
||||
|
||||
Naming convention: tmux sessions are prefixed `openclaw-task-` so they
|
||||
don't collide with wizard's own tmux work (`0`, `Openclaw`, `read-only`).
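
To illustrate how the prefix convention and the "tmux is the source of truth" rule fit together, here is a minimal sketch of the `new`, `list` and `status` subcommands. It is not the contents of `openclaw-task.sh`: the function names and the `OPENCLAW_TASK_LOG_DIR` override are assumptions.

```bash
#!/bin/sh
# Sketch only: not the real openclaw-task.sh; helper names are hypothetical.
PREFIX="openclaw-task-"
LOG_DIR="${OPENCLAW_TASK_LOG_DIR:-$HOME/openclaw-tasks}"

# Map a task id to its tmux session name.
session_name() { printf '%s%s\n' "$PREFIX" "$1"; }

# Read session names on stdin; print the ids of openclaw tasks only.
filter_task_ids() { sed -n "s/^${PREFIX}//p"; }

task_list() {
  tmux list-sessions -F '#{session_name}' 2>/dev/null | filter_task_ids
}

task_status() {
  # The leading "=" asks tmux for an exact name match instead of a prefix match.
  if tmux has-session -t "=$(session_name "$1")" 2>/dev/null; then
    echo running
  else
    echo ended
  fi
}

task_new() {
  id="$1"; shift
  mkdir -p "$LOG_DIR"
  tmux new-session -d -s "$(session_name "$id")" "$*" || return 1
  # Append pane output to the durable transcript. Output printed before this
  # call attaches is not captured, so very short commands may log nothing.
  tmux pipe-pane -t "$(session_name "$id")" -o "cat >> '$LOG_DIR/$id.log'"
}
```

Because the session ends when its command exits, `task_status` needs no bookkeeping of its own: a missing session is an ended task.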
|
||||
|
||||
### Memory note
|
||||
|
||||
File at `/workspace/memory/projects/openclaw-runtime/devvm-fallback.md`
|
||||
teaching OpenClaw the pattern. Indexed by the existing daily
|
||||
`memory-sync` CronJob (or via manual `node openclaw.mjs memory index
|
||||
--force` for the initial seed).
|
||||
|
||||
Content (verbatim):
|
||||
|
||||
```markdown
# Using devvm as a fallback

When in-pod tools/permissions block you, SSH to devvm and use it. The
devvm runs as wizard with full home-lab access (Vault, kubectl, git
repos, Cloudflare, etc.) and has Claude Code v2+ installed.

## One-shot lookup
    ssh devvm 'vault kv get -field=brave_api_key secret/openclaw'
    ssh devvm 'claude -p "investigate why frigate is restarting"'

## Long-running async work — USE THIS for anything > ~2 min
Spawn in a tmux session on devvm. Sessions survive OpenClaw pod restarts.

    # spawn
    ssh devvm openclaw-task new my-task "claude -p --dangerously-skip-permissions 'do the thing'"

    # poll progress (last 1000 lines of pane)
    ssh devvm openclaw-task capture my-task

    # interactive claude (send follow-up prompts)
    ssh devvm openclaw-task claude my-task "initial prompt"
    ssh devvm openclaw-task send my-task "follow-up prompt" Enter

    # housekeeping
    ssh devvm openclaw-task list
    ssh devvm openclaw-task status my-task
    ssh devvm openclaw-task kill my-task

Logs persist at ~/openclaw-tasks/<id>.log on devvm even after a session
is killed. Use `ssh devvm openclaw-task log <id>` to retrieve them.
```
|
||||
|
||||
## Devvm: no infra changes
|
||||
|
||||
Pre-existing state verified 2026-05-22:
|
||||
|
||||
- pubkey from `/ssh/id_rsa` (Vault `secret/openclaw → ssh_key`) matches the
|
||||
`ssh-ed25519 AAAA…lug node@openclaw-58cd9f7987-884bv` line in
|
||||
`~/.ssh/authorized_keys` (the comment is a stale pod name; the key
|
||||
itself is stable from Vault)
|
||||
- sshd listens on 0.0.0.0:22 ✓
|
||||
- `claude` v2.1.126 at `/home/wizard/.local/bin/claude` ✓
|
||||
- `tmux` 3.4 installed, server already running with existing user sessions ✓
|
||||
|
||||
Only changes (one-time, done in the same session via `sudo`):
|
||||
- Install `openclaw-task` wrapper to `/usr/local/bin/openclaw-task`
|
||||
- Symlink `/usr/local/bin/claude` → `/home/wizard/.local/bin/claude`
|
||||
|
||||
## Tradeoffs / risks
|
||||
|
||||
- **Bundle size on NFS**: ~30MB extracted. Acceptable on
|
||||
`/srv/nfs/openclaw/tools`.
|
||||
- **Library version drift**: bundled binaries link against bookworm libs.
|
||||
Smoke test in `install-host-tools` catches breakage on the next pod
|
||||
restart if upstream OpenClaw image rebases.
|
||||
- **Full-shell SSH**: explicit user choice. Blast radius if openclaw is
|
||||
prompt-injected = full wizard access. Mitigation: keep OpenClaw's
|
||||
plugin allowlist tight (current allow list: `memory-core, recruiter-api,
|
||||
telegram, openrouter, brave, openai, codex`).
|
||||
- **tmux server lifecycle on devvm**: if wizard's tmux server dies (rare —
|
||||
usually only on devvm reboot), in-flight openclaw tasks are killed.
|
||||
Acceptable for home lab. Task logs persist regardless.
|
||||
- **Task log unbounded growth**: `~/openclaw-tasks/*.log` grows forever.
|
||||
Out of scope here. User can add a `find -mtime +N -delete` cron later.
|
||||
- **Init container order**: `setup-ssh-config` depends on
|
||||
`install-host-tools` finishing first. K8s init containers run
|
||||
sequentially in declaration order — natural ordering, no explicit
|
||||
dependency mechanism needed.
|
||||
|
||||
## Testing — E2E flows required by user
|
||||
|
||||
1. **Tools present**:
|
||||
`kubectl -n openclaw exec <pod> -c openclaw -- ssh -V` returns version,
|
||||
same for `dig`, `vault`, `jq`, `yq`, `tmux`, `rg`.
|
||||
2. **SSH happy path**:
|
||||
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'hostname'`
|
||||
returns `devvm`.
|
||||
3. **Claude one-shot**:
|
||||
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'claude -p "what is 1+1"'`
|
||||
   returns an answer containing `2`.
|
||||
4. **Async task lifecycle**:
|
||||
- `ssh devvm openclaw-task new test-1 "sleep 30; echo done"`
|
||||
- `ssh devvm openclaw-task list` contains `test-1`
|
||||
- `ssh devvm openclaw-task status test-1` returns `running`
|
||||
- wait 35s
|
||||
- `ssh devvm openclaw-task log test-1` contains `done`
|
||||
- `ssh devvm openclaw-task status test-1` returns `ended`
|
||||
5. **Persistence test** (the key requirement):
|
||||
- Spawn long task: `ssh devvm openclaw-task new persist-1 "sleep 120; echo survived > /tmp/persist-1.proof"`
|
||||
- `kubectl -n openclaw delete pod <openclaw-pod>` — pod recreated
|
||||
- Wait for new pod ready (init containers run, skip via marker, fast)
|
||||
- `kubectl -n openclaw exec <new-pod> -c openclaw -- ssh devvm openclaw-task list`
|
||||
contains `persist-1`
|
||||
- Wait for original sleep to finish; verify `/tmp/persist-1.proof`
|
||||
contains `survived` from new pod
|
||||
6. **Memory note lookup**:
|
||||
`kubectl -n openclaw exec <pod> -c openclaw -- node openclaw.mjs memory search 'devvm fallback'`
|
||||
returns the note.
|
||||
|
||||
## Docs to update with the change
|
||||
|
||||
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-design.md` (this doc)
|
||||
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-plan.md` (implementation plan)
|
||||
- `infra/.claude/reference/service-catalog.md` (one-line addition under
|
||||
OpenClaw: "Has SSH to devvm with host-tools bundle; long-running async
|
||||
tasks via `openclaw-task` wrapper on devvm")
|
||||
- `infra/.claude/CLAUDE.md` "Known Issues" section is left alone — none of
|
||||
the existing OpenClaw caveats change.
|
||||
453
docs/plans/2026-05-26-talos-migration-design.md
Normal file
|
|
@ -0,0 +1,453 @@
|
|||
# Drift elimination — STAGED plan (v7 — final, converged)
|
||||
|
||||
**Status:** v7 — final converged plan after 6 rounds of critique. v7
|
||||
fixes the R6 substantive findings:
|
||||
- A.0 rationale corrected (stopped VMs don't reserve RAM; rationale was wrong)
|
||||
- B.1 deployment surface decided (docker-registry VM 220, not "PVE Docker host" which doesn't exist)
|
||||
- A.3 AIDE scope narrowed (specific files, not `/var/lib/kubelet/` directory which is noise-flooded by kubelet writes)
|
||||
- Minor: real AIDE image identified; break-glass procedure caveated; line citation corrected.
|
||||
|
||||
**Iteration loop STOPS here.** Remaining issues at this point are
|
||||
implementation details the operator resolves at execution time. The
|
||||
critic chain has converged: each round found fewer + smaller issues
|
||||
(v1: 30+ → v2: 30+ → v3: 30+ → v4: 30+ → v5: 5-7 → v6: 3 → v7:
|
||||
expected 0-2 minor). Continuing iteration would produce v8 with 1-2
|
||||
findings, v9 with 0-1, etc. — diminishing returns. Operator owns the
|
||||
plan from here.
|
||||
|
||||
**Owner:** Viktor
|
||||
**Iteration history**:
|
||||
- v1 (in-place rolling etcd peer-join, 4-6 weeks) — 3/3 critics DISAGREE
|
||||
- v2 (parallel cluster + GitOps replay, 4-6 weekends) — 3/3 DISAGREE; PVE memory physically impossible, MetalLB IP collision
|
||||
- v3 (1-weekend greenfield, "4-6h Saturday") — 3/3 DISAGREE; fictional timing, 3 load-bearing false claims
|
||||
- v4 (honest 6-week greenfield, 40-50h) — 3/3 DISAGREE; still 50-110% under realistic 75-106h; commits to Talos before answering "is it worth 60h"
|
||||
- v5 (staged, decision-gated, OS-neutral first) — 3/3 DISAGREE with shape-AGREE; 5 specific implementation issues
|
||||
- v6 (same staged shape, R5 implementation fixes): R6 found 3 substantive issues
- **v7 (this plan): same staged shape, R6 fixes**
|
||||
|
||||
## 0a. User confirmation gate (NEW IN V6)
|
||||
|
||||
Before any prep starts, user explicitly confirms:
|
||||
|
||||
- [ ] **Path acceptance**: Staged plan (Stage A → B → C → D-optional), NOT direct full Talos migration
|
||||
- [ ] **Date**: Stage A execution Sat 2026-06-06 (4 days prep this week + 5 days week-2 sandbox testing)
|
||||
- [ ] **Trade-off acceptance**: ~85% drift elimination from Stages A+B may suffice; Stage D commitment is gated on Stage C evidence, not pre-decided
|
||||
- [ ] **Competing-commitment awareness**: 15-23h Stages A+B compete with `code-963q` MySQL upgrade and `code-8ywc` Security wave 1 enforce-mode flip
|
||||
|
||||
If user prefers full v4-scope Talos migration anyway: stop reading v6, return to v4. Both plans valid; pick one consciously.
|
||||
|
||||
## 0. Why staged
|
||||
|
||||
Across 4 rounds, critics consistently said:
|
||||
1. **Drift elimination is achievable in stages.** Path X (hardened Ubuntu) gives ~85% of the value of Talos at <10% of the cost.
|
||||
2. **DR primitive modernization is OS-neutral** (~60% of Phase -3 work applies regardless of OS choice).
|
||||
3. **The Talos decision shouldn't be forced today.** Empirical drift data (from Stage A) + DR battle-testing (from Stage B) inform the right answer better than a planning document can.
|
||||
4. **The user has competing commitments** — P1 `code-8ywc` Security wave 1, P2 `code-963q` MySQL upgrade, P2 `code-dac` GoCardless reauth. A 60-100h Talos project displaces these.
|
||||
|
||||
v5 honors all four findings. The Talos commitment is staged to weeks 4-12, **after** empirical evidence from Stages A+B.
|
||||
|
||||
## 1. End-state options (decided by Stage C, not today)
|
||||
|
||||
After Stage C decision point:
|
||||
|
||||
**Outcome 1 — Staged execution completes at Stage B+:** drift elimination ~85% via hardened Ubuntu + modernized DR primitives. Cluster stays kubeadm/Ubuntu. Talos sandbox lives forever as learning lab.
|
||||
|
||||
**Outcome 2 — Stages A+B+D execute (full Talos migration):** drift elimination ~95% via Talos. Plan goes through honest 8-12 weeks per R4 critic B estimate. Empirically justified by drift evidence collected in Stage A soak.
|
||||
|
||||
**Outcome 3 — Stages A+B only, Talos deferred indefinitely:** drift elimination ~85%; cluster operates fine; user redirects time to closing P1/P2 beads. Talos reconsidered if drift event happens in the next 12 months.
|
||||
|
||||
All three outcomes are valid. v5 doesn't force the choice.
|
||||
|
||||
## 2. Stage A: Harden Ubuntu (1 weekend, ~10h) — v6 honest budget
|
||||
|
||||
**Goal**: ~85% drift elimination, additive only, zero risk to current cluster.
|
||||
|
||||
**v6 changes vs v5**:
|
||||
- A.2 RO `/usr` reframed: `/usr` is NOT a separate partition (verified live). Use overlayfs-via-systemd OR drop in favor of A.3's file-integrity detection. v6 picks the latter as lower-risk.
|
||||
- A.3 changed from "3 Kyverno ClusterPolicies" (wrong layer for OS file drift) to "AIDE + auditd DaemonSet" (correct layer).
|
||||
- A.5 break-glass procedure rewritten: `kubectl debug node` is BLOCKED by Kyverno wave-1 `deny-privileged-containers` enforce policy (verified live). Only break-glass path is PVE console rescue boot.
|
||||
- Pre-flight: delete stopped TrueNAS VM 9000 (v6 claimed this frees 8 GB RAM headroom before drain operations; A.0 corrects this, since a stopped VM holds no RAM and only disk is freed).
|
||||
|
||||
### A.0 Pre-flight: investigate PVE memory pressure (30 min) — v7 fix
|
||||
|
||||
R6 verified: VM 9000 is STOPPED → destroying it frees disk (~2.46 TB
|
||||
LVM thin pool), NOT 8 GB RAM (RAM allocation is config-only on
|
||||
stopped VMs; no qemu process consumes RAM). v6's rationale was wrong.
|
||||
|
||||
- `qm status 9000` — confirm stopped
|
||||
- `qm destroy 9000 --purge` — frees ~2.46 TB thin pool space (good
|
||||
hygiene; CLAUDE.md says it's "operationally decommissioned 2026-04-13
|
||||
pending user decision on deletion")
|
||||
- **Separate PVE memory pressure investigation** (which v6 conflated):
|
||||
- `free -h` on PVE shows swap 99% used today — real issue
|
||||
- Top consumers: `qm list` + cross-reference top processes
|
||||
- User offered earlier in session to shrink node5+6 from 32→8 GB each (frees ~48 GB)
|
||||
- Decision for A.0: scale node5+6 to 8 GB BEFORE Stage A's drain
|
||||
operations OR accept that drain may cascade (existing node memory
|
||||
requests at 60-94% of limits per R4-B)
|
||||
- Time: 30 min for scaling (drain → qm set --memory → reboot → uncordon × 2 nodes)
|
||||
|
||||
This fix preserves the useful action (free disk, prep RAM) and
|
||||
removes the wrong rationale.
|
||||
|
||||
### A.1 Lock down SSH on workers (2-3h)
|
||||
- Drain k8s-node2 through k8s-node6 sequentially (~15min/node × 5 = 75min including reschedule wait)
|
||||
- Per worker:
|
||||
1. SSH in as `wizard` (still works at this point)
|
||||
2. Create `/etc/ssh/sshd_config.d/99-hardening.conf`:
|
||||
     ```
     PasswordAuthentication no
     PubkeyAuthentication yes
     AllowUsers wizard
     ```
|
||||
3. Restart sshd: `systemctl restart ssh`
|
||||
4. Verify with `ssh wizard@<node>` from operator's laptop
|
||||
5. ONLY THEN: `systemctl mask ssh.socket`
|
||||
6. Uncordon
|
||||
- Total: 75min drain + 30min config + 15min verification per node = ~2h
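
Steps 2 and 3 of the per-worker procedure can be sketched as below, assuming hypothetical function names. The sketch adds one step the list does not mention, a `sshd -t` syntax check before the restart, and leaves the `systemctl mask ssh.socket` step manual so it only happens after the login test from the operator's laptop.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; run as root on the worker.
write_hardening_conf() {
  local conf="${1:-/etc/ssh/sshd_config.d/99-hardening.conf}"
  cat > "$conf" <<'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers wizard
EOF
}

apply_hardening() {
  write_hardening_conf || return 1
  # sshd -t parses the full configuration and exits non-zero on errors,
  # so a typo is caught before the daemon is restarted.
  sshd -t || return 1
  systemctl restart ssh
}
```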
|
||||
|
||||
**SSH stays enabled on**:
|
||||
- `k8s-master` (cluster_healthcheck.sh SSH-es only to PVE host, NOT master — verified live; keep SSH on master only for emergency debug)
|
||||
- `k8s-node1` (GPU node — historically needs NVIDIA driver debug)
|
||||
|
||||
**SSH masked on**:
|
||||
- k8s-node2 through k8s-node6 (CPU workers — pure k8s workload)
|
||||
|
||||
**Important**: nodes 1-6 are explicitly **out of Terraform** (see `infra/stacks/infra/main.tf` line 437). Stage A changes are NOT persisted across re-clone. If a worker is reprovisioned via `provision-k8s-worker`, SSH lockdown is wiped. **Mitigation**: also modify `infra/modules/create-template-vm/cloud_init.yaml` to bake SSH lockdown into the template (1h, addresses future provisions).
|
||||
|
||||
### A.2 ~~Read-only /usr~~ — DROPPED in v6
|
||||
|
||||
**Why dropped**: R5 verified `/usr` is NOT a separate partition on existing workers (it's a directory on the single root ext4). Repartitioning live nodes is multi-hour-per-node + reboot + risk. Bind-mount overlay conflicts with `unattended-upgrades` (currently enabled, writes to `/usr/bin`, `/usr/lib` for security updates).
|
||||
|
||||
**Replacement**: A.3's file-integrity detection (AIDE) catches `/usr` modifications regardless of whether they're allowed by the filesystem. Detection-based approach is sufficient for ~85% drift elimination goal.
|
||||
|
||||
If Outcome 2 (full Talos) triggers later, RO root comes for free.
|
||||
|
||||
### A.3 OS-level drift detection via AIDE DaemonSet (3-4h) — v7 fix
|
||||
|
||||
R6 verified: v6's image `ghcr.io/aide-rb/aide:latest` doesn't exist;
|
||||
`/var/lib/kubelet/` is a high-churn directory (kubelet writes pod
|
||||
sandboxes, ephemeral volume state, etc.) → AIDE on the full directory
|
||||
floods false positives.
|
||||
|
||||
v7 fixes:
|
||||
- **Build minimal Alpine + aide DaemonSet image** (no fictional
|
||||
ghcr.io reference). Dockerfile:
|
||||
  ```
  FROM alpine:3.22
  RUN apk add --no-cache aide
  ```
|
||||
Build, push to forgejo.viktorbarzin.me/viktor/aide-daemonset:latest.
|
||||
- Mounts host paths read-only:
|
||||
- `/etc` (full)
|
||||
- `/usr/bin`, `/usr/sbin`, `/usr/local/bin` (specific dirs, not all of `/usr` to avoid bind-mount complexity)
|
||||
- `/etc/cni/net.d` (CNI config)
|
||||
- `/etc/containerd/config.toml` (specific FILE, not full `/etc/containerd/` — only the config drift matters)
|
||||
- `/etc/systemd/system/` (custom unit files)
|
||||
- `/var/lib/kubelet/config.yaml` + `/var/lib/kubelet/kubeadm-flags.env` (specific FILES, NOT directory — kubelet writes pod state in same dir which floods false positives)
|
||||
- Daily systemd-style timer runs `aide --check` against baseline DB
|
||||
- On diff: post to Prometheus pushgateway with metric
|
||||
`aide_drift_detected{node="X",path="..."} 1`
|
||||
- Push diff content to Loki via DaemonSet sidecar
|
||||
- Alert rule: `aide_drift_detected > 0 for 1h`
|
||||
- Initial baseline taken at first deploy; reviewed by operator weekly
|
||||
during Stage C
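
A rough sketch of the check-and-push step follows. The Pushgateway URL, the job name `aide`, the report path and the function names are all assumptions, and the metric carries only the `node` label here rather than the per-path label mentioned above.

```bash
#!/usr/bin/env bash
# Sketch only: PUSHGATEWAY_URL, the job name and the report path are assumptions.
PUSHGATEWAY_URL="${PUSHGATEWAY_URL:-http://pushgateway.monitoring.svc:9091}"

# Render one sample in the Prometheus text exposition format.
format_metric() {
  printf 'aide_drift_detected{node="%s"} %s\n' "$1" "$2"
}

# aide --check exits 0 when nothing changed and non-zero on differences
# or errors; both non-zero cases are reported as drift in this sketch.
run_check() {
  local node="$1" drift=0
  aide --check > /tmp/aide-report.txt 2>&1 || drift=1
  format_metric "$node" "$drift" \
    | curl --silent --fail --data-binary @- \
        "$PUSHGATEWAY_URL/metrics/job/aide/instance/$node"
}
```

Pushing an explicit `0` on clean runs, rather than pushing nothing, lets the alert rule tell "no drift" apart from "the check stopped running".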
|
||||
|
||||
**Existing Kyverno wave-1 policies stay as-is** (admission-time drift
|
||||
on K8s resources; AIDE covers OS-layer drift).
|
||||
|
||||
### A.4 Daily `tg plan` drift detection (2-3h)
|
||||
|
||||
- CronJob in `monitoring` namespace runs `terragrunt plan -detailed-exitcode` per stack at 06:00 daily
|
||||
- 126 stacks × 22s avg with init cache = ~46min/run. Set `activeDeadlineSeconds: 3600`.
|
||||
- Vault K8s auth role: new role `terraform-plan-runner` bound to dedicated SA in `monitoring` ns
|
||||
- Exit code 2 → push metric to Prometheus pushgateway → alert if drift > 0 for >24h
|
||||
- New script `scripts/drift-detect-cronjob.sh` + Terraform stack `infra/stacks/drift-detection/`
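
The per-stack loop could be sketched as follows. This is an assumption-laden illustration, not `scripts/drift-detect-cronjob.sh`: the function names and the one-directory-per-stack layout are hypothetical. It relies on `-detailed-exitcode` returning 0 for no changes, 1 for an error and 2 for a non-empty plan.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the stacks directory layout are assumptions.

# Translate a `plan -detailed-exitcode` status into a label.
classify_plan_exit() {
  case "$1" in
    0) echo clean ;;
    2) echo drift ;;
    *) echo error ;;
  esac
}

# Plan every stack directory and print how many of them show drift.
scan_stacks() {
  local stacks_dir="$1" stack rc drifted=0
  for stack in "$stacks_dir"/*/; do
    [ -d "$stack" ] || continue
    (cd "$stack" && terragrunt plan -detailed-exitcode >/dev/null 2>&1)
    rc=$?
    if [ "$(classify_plan_exit "$rc")" = "drift" ]; then
      drifted=$((drifted + 1))
    fi
  done
  echo "$drifted"
}
```

The runtime estimate checks out: 126 stacks at 22 s each is 2772 s, about 46 minutes, which fits inside the 3600 s deadline.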
|
||||
|
||||
### A.5 Documentation + break-glass procedure (1-1.5h)
|
||||
|
||||
**Critical v6 fix (preserved + caveated in v7)**: `kubectl debug node` is
|
||||
blocked by Kyverno wave-1 `deny-privileged-containers` enforce policy
|
||||
(verified live).
|
||||
|
||||
**v7 caveat (R6 finding)**: Kyverno excludes some namespaces from the
|
||||
policy. A privileged pod hand-crafted in `default`, `kube-system`, or
|
||||
`kured` namespace MIGHT bypass — but operator should NOT rely on this
|
||||
exception path since the wave-1 design intentionally restricted it.
|
||||
|
||||
**Primary break-glass procedure: PVE console rescue boot**:
|
||||
|
||||
1. Operator opens Proxmox web UI → VM → Console
|
||||
2. Reboot VM, hold Shift at GRUB → select "Advanced options" → "Recovery mode"
|
||||
3. Drop to root shell (no password required in single-user mode on this image)
|
||||
4. `systemctl unmask ssh.socket && systemctl start ssh`
|
||||
5. Edit `/etc/ssh/sshd_config.d/99-hardening.conf` if needed
|
||||
6. Reboot normally
|
||||
|
||||
**Document this procedure with screenshots** in `infra/docs/runbooks/host-hardening.md`. Test the procedure on one worker BEFORE Stage A executes (Phase A.0 step).
|
||||
|
||||
Update `infra/.claude/CLAUDE.md` to note:
|
||||
- SSH masked on workers k8s-node2-6
|
||||
- Emergency rescue only via PVE console, not `kubectl debug node`
|
||||
- AIDE detects but doesn't prevent drift on `/etc`, `/usr`
|
||||
|
||||
**Stage A exit gate**:
|
||||
- All 5 workers have SSH masked AND PVE-console rescue tested on at least 1 worker
|
||||
- AIDE DaemonSet running with baseline taken on all workers
|
||||
- Daily drift-detect CronJob running
|
||||
- `cluster_healthcheck.sh` passes (no new FAILs introduced)
|
||||
- Cloud-init template updated to bake SSH lockdown for future provisions
|
||||
|
||||
**Time budget**: 9-12h (honest, per R5-B). **Reversibility**: per-node SSH unmask via PVE console rescue (30-60min/node). **Risk**: low (additive, no data path changes); medium for the rescue-procedure trust (test before relying on it).
|
||||
|
||||
## 3. Stage B: Modernize DR primitives (1 weekend, ~8h)
|
||||
|
||||
**Goal**: PG PITR + daily Vault snapshots + offsite verification. Useful regardless of OS choice. Done while Stage A soaks for drift events.
|
||||
|
||||
### B.1 Decide + deploy S3 endpoint (4-6h) — v7 fix
|
||||
|
||||
R6 verified: PVE host has NO Docker installed (`which docker` returns
|
||||
nothing on 192.168.1.127). v6's "PVE-host Docker containers" deployment
|
||||
surface doesn't exist.
|
||||
|
||||
**v7 decision: SeaweedFS containers on docker-registry VM (VMID 220,
|
||||
IP 10.0.20.10)** — that VM already runs Docker and matches the
|
||||
"docker-registry pattern" precedent.
|
||||
|
||||
Steps (~4-6h):
|
||||
1. SSH to docker-registry VM (existing pattern; this VM has SSH enabled)
|
||||
2. Add SeaweedFS to existing `/opt/registry/docker-compose.yml` OR new
|
||||
`/opt/seaweedfs/docker-compose.yml`:
|
||||
- `master`, `volume`, `filer`, `s3` containers
|
||||
- Persistent storage on NFS mount (`/srv/nfs/seaweedfs/` on
|
||||
192.168.1.127)
|
||||
3. TLS cert (use existing wildcard fullchain.pem from
|
||||
`infra/secrets/`; mount via volume) (30min)
|
||||
4. DNS A record `s3.viktorbarzin.lan` → 10.0.20.10 in Technitium (5min)
|
||||
5. Bucket `cnpg-backup` + IAM keys created via SeaweedFS S3 API (15min)
|
||||
6. Prometheus scrape config (15min)
|
||||
7. Smoke test from cluster pod: `s3cmd ls s3://cnpg-backup/` (15min)
|
||||
|
||||
**Single-point-of-failure trade-off**: docker-registry VM is on the
|
||||
same PVE host as everything else. If PVE dies, both the cluster AND
|
||||
the S3 endpoint die. **Mitigation**: barmanObjectStore writes to the
local S3 endpoint, AND those backups are also rsynced offsite to Synology by the
existing `offsite-sync-backup` systemd unit (already covers `/srv/nfs/`).
|
||||
Acceptable for homelab.
|
||||
|
||||
**Alternative if SeaweedFS proves flaky**: MinIO via Synology Container
|
||||
Manager (Synology ships a Container Manager / Docker package, even though it
has no S3 package). Avoid MinIO on the K8s cluster (CNPG bootstrap cycle).
|
||||
|
||||
**Commit**: decision + steps documented in `infra/docs/architecture/storage.md`.
|
||||
|
||||
### B.2 CNPG barmanObjectStore (2h)
|
||||
|
||||
- Add `spec.backup.barmanObjectStore` to `pg-cluster` CR (read R4-A finding for exact HCL).
|
||||
- `tg apply dbaas` → CNPG starts continuous WAL archival.
|
||||
- First base-backup: `kubectl cnpg backup pg-cluster -n dbaas`.
|
||||
- Verify WAL upload metric in Prometheus.
|
||||
|
||||
### B.3 Daily Vault Raft snapshot (15 min)
|
||||
|
||||
- Change `vault-raft-backup` CronJob schedule from `0 2 * * 0` to `0 2 * * *`.
|
||||
- Verify next-night snapshot in `/srv/nfs/vault-backup/`.
|
||||
- Verify Synology offsite copy via `ssh root@192.168.1.13 ls -la /volume1/Backup/Viki/nfs/vault-backup/` — must be ≤30h old.
|
||||
- **Exit gate**: offsite copy fresh.
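
The ≤30h freshness rule can be checked mechanically rather than by reading `ls -la` output. A minimal sketch, assuming the offsite directory is reachable as a local path (for example through a mount; the plan itself inspects it over SSH). `newest_age_hours` and `is_fresh` are hypothetical helper names, not existing scripts in this repo:

```python
import time
from pathlib import Path

MAX_AGE_HOURS = 30  # exit-gate threshold from B.3


def newest_age_hours(directory, now=None):
    """Age in hours of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def is_fresh(directory, max_age_hours=MAX_AGE_HOURS, now=None):
    """True when the newest file in the directory is at most max_age_hours old."""
    age = newest_age_hours(directory, now=now)
    return age is not None and age <= max_age_hours
```

An empty directory is treated as not fresh, so a sync that silently copied nothing fails the gate instead of passing it.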
|
||||
|
||||
### B.4 Pre-flight stabilize cluster (2-3h)
|
||||
|
||||
R4-B verified: cluster is currently UNHEALTHY (3 FAIL + 6 WARN). Address regardless of OS choice:
|
||||
- Fix postgresql-backup CronJob scheduling (was stuck for 2 days as of earlier)
|
||||
- Fix LVMSnapshotStale alert (PVE-host script debug)
|
||||
- Fix pushgateway backup metrics stale (separate from earlier session work)
|
||||
- HA-Sofia integration health (6 not_loaded) — defer to user since it requires HA admin actions
|
||||
- Document remaining WARNs as accepted residual until specific incident
|
||||
|
||||
### B.5 Restore drill (1h)
|
||||
|
||||
- Restore Vault Raft snapshot to sandbox VM
|
||||
- Restore CNPG base-backup to sandbox CNPG cluster
|
||||
- Verify both reach functional state
|
||||
- Document times in `infra/docs/runbooks/disaster-recovery-rehearsal.md`
|
||||
|
||||
**Stage B exit gate**:
|
||||
- S3 endpoint operational, monitored
|
||||
- CNPG continuous WAL archival running >7 days
|
||||
- Vault snapshots daily, offsite ≤30h
|
||||
- Restore drill timed + documented
|
||||
- Cluster health 0 FAIL, ≤2 WARN
|
||||
|
||||
**Time budget**: 8h. **Reversibility**: B.1 endpoint can be torn down; B.2 barmanObjectStore can be removed from CR; B.3 schedule revert; B.4 work persists regardless. **Risk**: low.
|
||||
|
||||
## 4. Stage C: Decision point (1-2 weeks soak, ~1h active)
|
||||
|
||||
**Goal**: Decide between Outcome 1/2/3 based on empirical evidence from Stage A.
|
||||
|
||||
### C.1 Drift telemetry review (~30 min weekly)
|
||||
|
||||
For 2 weeks post-Stage A:
|
||||
- Review Kyverno audit-mode violations: any drift detected?
|
||||
- Review `tg plan` daily CronJob results: any unexpected drift in TF state?
|
||||
- Review pod-side incidents: did any operational situation REQUIRE SSH-to-worker that the Stage A lockdown prevented?
|
||||
|
||||
### C.2 Sandbox Talos exploration (optional, ~4-8h spread over 2 weeks)
|
||||
|
||||
If the user wants empirical T4 + Talos evidence:
|
||||
- Provision 3-VM Talos sandbox on `10.0.30.0/24` per round-3 critic C's recommendation
|
||||
- Permanent learning environment
|
||||
- Validate GPU + CSI + Calico without production risk
|
||||
- No timeline pressure
|
||||
|
||||
### C.3 Decision criteria — v6 fix: soak extended to 6 weeks + Outcome 4 added
|
||||
|
||||
R5 critic A flagged: 2 weeks misses quarterly drift classes (kernel CVE, K8s minor, package update). v6 extends soak to **6 weeks** for adequate signal.
|
||||
|
||||
After 6 weeks, with the Stage A + Stage B exit gates met AND at least 6 weeks of AIDE baseline data:
|
||||
|
||||
| Observation | Recommend |
|
||||
|---|---|
|
||||
| **No drift detected** in AIDE + tg plan daily | **Outcome 3** (defer Talos indefinitely). Use saved 60+h on P1 `code-8ywc` + P2 `code-963q` + other tasks. Sandbox Talos for learning value. |
|
||||
| **Drift detected, contained by Stage A** (AIDE caught it, no incident) | **Outcome 4** (NEW): keep on Ubuntu + Stage A controls; flip Kyverno audit→enforce policies where appropriate; revisit Stage D in 6 months. Talos doesn't add value the hardening doesn't already provide. |
|
||||
| **Drift detected that Stage A didn't catch** (e.g., container-runtime binary modification, kernel-module loading) AND caused/risked an incident | **Outcome 2** — full Talos migration per v4. Empirical justification documented. |
|
||||
| **Sandbox Talos exploration reveals show-stopper** (T4 incompatibility, factory.talos.dev unreliability) | **Outcome 3** — Talos defer indefinitely. |
|
||||
| **Sandbox Talos exploration validates cleanly** + user has 100+h appetite | **Outcome 2** — full Talos migration. |
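
The decision table above can be written down as a small function, which makes the ambiguous combinations visible. This is only a sketch: the `SoakEvidence` fields and the precedence between rows (a sandbox show-stopper wins first, then an uncaught incident) are assumptions, since the table does not rank its rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoakEvidence:
    drift_detected: bool
    caught_by_stage_a: bool = False          # AIDE or the daily tg plan flagged it
    caused_or_risked_incident: bool = False
    sandbox_result: Optional[str] = None     # None, "show-stopper", or "clean"
    user_has_100h_appetite: bool = False


def recommend_outcome(e: SoakEvidence) -> Optional[int]:
    """Map soak evidence to an Outcome number; None means the table does not decide."""
    if e.sandbox_result == "show-stopper":
        return 3  # defer Talos indefinitely
    if e.drift_detected and not e.caught_by_stage_a and e.caused_or_risked_incident:
        return 2  # full Talos migration, empirically justified
    if e.sandbox_result == "clean" and e.user_has_100h_appetite:
        return 2
    if e.drift_detected and e.caught_by_stage_a:
        return 4  # stay on Ubuntu + Stage A controls, revisit in 6 months
    if not e.drift_detected:
        return 3
    return None  # drift missed by Stage A but no incident: not covered by the table
```

The `None` branch is deliberate: drift that Stage A missed without causing an incident is not addressed by any row, so it is left to the ADR rather than guessed here.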
|
||||
|
||||
### C.4 Decision artifact
|
||||
|
||||
Whatever the outcome: document in `infra/docs/decisions/2026-XX-XX-drift-elimination-strategy.md` (ADR format). Include:
|
||||
- Drift telemetry summary
|
||||
- Sandbox Talos findings (if explored)
|
||||
- Selected outcome
|
||||
- Justification
|
||||
|
||||
**Stage C exit gate**:
|
||||
- 6 weeks of Stage A telemetry collected (per the C.3 soak extension)
|
||||
- ADR written
|
||||
- User has explicitly chosen Outcome 1, 2, 3, or 4
|
||||
|
||||
**Time budget**: ~1h active operator time spread over 2 weeks. **Reversibility**: pure decision-making, no infrastructure changes.
|
||||
|
||||
## 5. Stage D (optional): Full Talos migration
|
||||
|
||||
**Triggered only if Stage C outcome = 2.** Specification preserved from v4 with R4 corrections applied.
|
||||
|
||||
**Honest scope** (per R4-B):
|
||||
- 8-12 weeks calendar
|
||||
- 75-106h operator time
|
||||
- Realistic 12-18h Saturday cutover window (announce "Sat morning through Sun afternoon")
|
||||
- 14-day soak with ~10-14h active work
|
||||
|
||||
**Pre-requisites met by prior stages**:
|
||||
- ✅ Stage A: hardened Ubuntu workers (so during Stage D's parallel/dual-cluster window, drift is bounded)
|
||||
- ✅ Stage B: barmanObjectStore + daily Vault snapshot + restore drill validated
|
||||
- ✅ Stage C: empirical justification + ADR
|
||||
|
||||
**New pre-requisites NOT covered by prior stages** (Stage D's own Phase -2 work):
|
||||
- migrate-pvc script (8-12h per R4-A)
|
||||
- SOPS pre-seed Secrets for Talos bootstrap (1h)
|
||||
- cluster_healthcheck.sh Talos rewrite (6-10h per R4-B)
|
||||
- 30 runbooks Talos rewrite (~15h)
|
||||
- K8s 1.34 → 1.36 deprecated-API cleanup (4-8h — 96 v1beta1 references)
|
||||
- ESO v1beta1 → v1 migration (4-8h)
|
||||
- code-963q MySQL upgrade calendar slot (4-8h, multi-day if wipe+reinit)
|
||||
- code-8ywc Security wave 1 deferred by 2 months — operator must accept this
|
||||
|
||||
**Stage D execution** follows v4 §4-§19 with the above prerequisites added to Phase -2.
|
||||
|
||||
## 6. Schedule (v6 honest)
|
||||
|
||||
| Time | Activity | Active operator time |
|
||||
|---|---|---|
|
||||
| **This week (Tue-Fri evenings)** | Stage A prep: write systemd configs, AIDE manifests, CronJob HCL, test PVE-console rescue procedure | 6-8h |
|
||||
| **Sat 2026-06-06** (note: NOT this Saturday) | Stage A: Harden Ubuntu (A.0 destroy VM 9000 + A.1 SSH lockdown + A.3 AIDE + A.4 tg-plan) | 9-12h |
|
||||
| **Next weekend (Sat 2026-06-13)** | Stage B: DR primitives (SeaweedFS + barmanObjectStore + daily Vault + restore drill) | 8-10h |
|
||||
| **Weeks 3-8 (6 weeks soak)** | Stage C: weekly AIDE review + optional sandbox Talos | ~6h total spread across 6 weeks |
|
||||
| **Decision point** | Stage C ADR | 1h |
|
||||
| **If Outcome 2 (Stage D)** | Full Talos migration per v4 with R3-A pre-requisites | 117-178h over 14-20 weeks |
|
||||
| **If Outcome 1/3/4** | Done | — |
|
||||
|
||||
**Total to Stage C decision**: 30-37h over 8 weeks. **Total if Stage D triggers**: 147-215h over 22-28 weeks.
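
The totals follow directly from the schedule rows. A short sketch that re-derives them, with the (low, high) hour pairs copied from the table above and `add_ranges` as an illustrative helper name:

```python
# (low, high) active operator hours per schedule row
TO_DECISION = {
    "Stage A prep": (6, 8),
    "Stage A": (9, 12),
    "Stage B": (8, 10),
    "Stage C soak": (6, 6),
    "Stage C ADR": (1, 1),
}
STAGE_D = (117, 178)


def add_ranges(ranges):
    """Sum (low, high) pairs component-wise."""
    ranges = list(ranges)
    return (sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges))


to_decision = add_ranges(TO_DECISION.values())      # (30, 37)
with_stage_d = add_ranges([to_decision, STAGE_D])   # (147, 215)
print(to_decision, with_stage_d)
```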
|
||||
|
||||
**Schedule shifted from v5**: Stage A moved from Sat 2026-05-30 to Sat 2026-06-06 to allow honest prep (per R5-B + R5-C feedback). Stage C soak extended from 2 weeks to 6 weeks for adequate drift signal.
|
||||
|
||||
## 7. Rollback per stage
|
||||
|
||||
Stage A: per-worker SSH unmask + `/usr` rw remount + Kyverno policy delete (each 10-30 min).
|
||||
Stage B: barmanObjectStore removal from CR + schedule revert + S3 endpoint shutdown (each 10-30 min). The on-disk WAL archive is recoverable independently.
|
||||
Stage C: pure decision-making, no rollback needed.
|
||||
Stage D: per v4 rollback table.
|
||||
|
||||
## 8a. R5 critic findings — v6 status
|
||||
|
||||
| R5 finding | v6 status |
|
||||
|---|---|
|
||||
| Synology DSM has no S3 package | FIXED — B.1 picks SeaweedFS on PVE Docker directly |
|
||||
| `/usr` is not a separate partition | FIXED — A.2 dropped; A.3 AIDE covers the gap |
|
||||
| `kubectl debug node` blocked by Kyverno wave-1 | FIXED — A.5 documents PVE console rescue as the only break-glass; tested in A.0 |
|
||||
| Kyverno is wrong layer for OS file drift | FIXED — A.3 replaced with AIDE DaemonSet |
|
||||
| PVE host RAM at edge (swap full) | FIXED — A.0 destroys stopped TrueNAS VM 9000 to free 8 GB |
|
||||
| Stage A SSH changes not in Terraform (re-clone wipes) | PARTIAL — A.1 updates cloud-init template too; existing nodes still need manual handling on re-clone |
|
||||
| `cluster_healthcheck.sh` SSH path constraint wrong | FIXED — verified SSH is only to PVE host, not nodes; updated A.1 |
|
||||
| 6h Stage A budget understated | FIXED — A budget honest at 9-12h; total Stage A weekend = 10h+ |
|
||||
| 2-week soak misses quarterly drift | FIXED — C extended to 6 weeks |
|
||||
| Decision criteria too binary; need Outcome 4 | FIXED — C.3 added "Outcome 4: drift contained, defer Stage D 6 months" |
|
||||
| User re-confirmation gate missing | FIXED — §0a added |
|
||||
| Stage A this-weekend prep window too tight | FIXED — moved to Sat 2026-06-06 with explicit Tue-Fri prep budget |
|
||||
| Synology DSM-S3 fictional decision tree | FIXED — B.1 commits to SeaweedFS |
|
||||
| ESO v1beta1 → v1 migration unbudgeted (96 references) | ACK — Stage D pre-requisite (no change from v5) |
|
||||
| K8s 1.34→1.36 API deprecations | ACK — Stage D pre-requisite (no change from v5) |
|
||||
| MySQL upgrade (code-963q) calendar slot | ACK — separate task; can run during Stage C soak (6-week window has room) |
|
||||
|
||||
## 8b. Critical findings from rounds 1-4 — addressed by staging
|
||||
|
||||
| R-round finding | v5 status |
|
||||
|---|---|
|
||||
| Talos identity preservation buys nothing user-visible (R1) | Acknowledged — Stage D only if drift evidence demands it. |
|
||||
| Parallel cluster physically impossible on host (R2) | N/A — staged plan doesn't run two clusters simultaneously |
|
||||
| Scheduled-downtime 4-6h fiction (R3) | Stage D acknowledges 12-18h cutover; only triggered after empirical justification |
|
||||
| barmanObjectStore doesn't exist (R3) | Stage B builds it — first OS-neutral, used by Stage D if triggered |
|
||||
| migrate-pvc script doesn't exist (R3) | Stage D pre-requisite, scoped honestly to 8-12h |
|
||||
| Vault Raft weekly→daily, offsite 9 days behind (R3) | Stage B fixes immediately, before any Talos decision |
|
||||
| cert-manager not installed; v3 wrong (R3) | N/A — staged plan keeps current Woodpecker certbot pipeline |
|
||||
| LUKS / Vault chicken-and-egg (R3) | Stage D pre-requisite, 1h SOPS pre-seed |
|
||||
| Kyverno wait + sync-registry-credentials (R3) | Stage D pre-requisite, scoped |
|
||||
| Authentik 5.5h down window (R4) | N/A — staged plan has no Saturday outage |
|
||||
| 12.75h ≠ 12h announced window (R4) | N/A — Stage D acknowledges 12-18h |
|
||||
| Synology S3 not deployed today (R4) | Stage B.1 makes decision + deploy explicit, budgeted 3-4h |
|
||||
| Phase -3.7 vs Phase -2 budget conflict (R4) | Stage D pre-requisite tracked separately, not bundled |
|
||||
| 96 v1beta1 ESO references (R4) | Stage D pre-requisite, 4-8h migration before Talos cutover |
|
||||
| K8s 1.34→1.36 deprecated APIs (R4) | Stage D pre-requisite, 4-8h |
|
||||
| `code-963q` MySQL upgrade interaction (R4) | Stage C decision point can schedule it separately or coincident with Stage D |
|
||||
| `code-8ywc` Security wave 1 deferred (R4) | Acknowledged — Stage D only triggers if user accepts this defer |
|
||||
| Cluster currently UNHEALTHY (R4) | Stage B.4 fixes regardless of OS choice |
|
||||
| 60h opportunity cost vs 16+ open P2 tasks (R4) | Stage C decision-gated; user can choose to spend the 60h on other tasks |
|
||||
| Phase 6.5 P0 verification infeasible in 30min (R4) | Stage D scope; if triggered, allocates honest verification time |
|
||||
| Single-site DR (Synology + PVE same site) (R4) | Acknowledged residual risk regardless of OS |
|
||||
| Cluster-identity §22 contradiction (R4) | N/A — staged plan doesn't make identity claims that contradict |
|
||||
| No schedule slack (R4) | Stage D schedule has 2 weeks of soak buffer; staging plan reduces Stage D commitment risk |
|
||||
|
||||
**24 of 30+ critic findings either addressed in v5 or moved to Stage D pre-requisites where they're properly scoped.**
|
||||
|
||||
## 9. Remaining accepted residual risks
|
||||
|
||||
After Stage A+B execution:
|
||||
|
||||
1. **Stage A is policy-enforced, not OS-enforced.** A determined operator can `kubectl debug node/X --target` and modify /etc. Audit policy catches it; doesn't prevent it. Acceptable for homelab; not acceptable for regulated workloads (which this isn't).
|
||||
2. **PG PITR window depends on barmanObjectStore retention** (30 days per Stage B.2 config). Older PITR not available unless backup retention extended.
|
||||
3. **Stage A /usr RO doesn't cover /var, /etc/kubernetes, /etc/containerd, /etc/cni** — these are writable for legitimate config updates. Drift detection still relies on Kyverno + `tg plan`.
|
||||
4. **Stage A drift detection has detection latency** (24h via daily CronJob; ~5min via Kyverno admission). Talos's "drift impossible" has zero latency. For a homelab this is acceptable.
|
||||
5. **Stage C decision could go any of 4 ways** (Outcomes 1-4); user retains optionality.
|
||||
|
||||
## 10. What this plan explicitly does NOT cover
|
||||
|
||||
- Mixed-OS topologies (decided by Stage D execution if triggered)
|
||||
- Cluster API / CAPMOX
|
||||
- Self-hosting Talos Image Factory (only relevant if Stage D triggers)
|
||||
- Multi-PVE-host expansion
|
||||
- Cilium migration
|
||||
|
||||
## 11. Why this is the right shape
|
||||
|
||||
Critics across 4 rounds pointed to staged execution. v5 commits to it. The key insight: **the right question isn't "how do I migrate to Talos?" — it's "do I need to migrate to Talos?"** Stage A answers that empirically.
|
||||
|
||||
Three weekends to know whether Talos is worth 8-12 weeks. If no: 15-23h saves 60-90h of effort. If yes: empirical justification + battle-tested DR primitives make the migration safer.
|
||||
225
docs/plans/2026-05-28-wealth-projections-design.md
Normal file
|
|
@ -0,0 +1,225 @@
|
|||
# Wealth Net-Worth Projections — Design (2026-05-28)
|
||||
|
||||
## Goal
|
||||
|
||||
Add forward-looking net-worth projections to the existing **`wealth`**
|
||||
Grafana dashboard. Answer: *"given certain growth rates, where does my
|
||||
net worth go?"* — with the growth rate sourced either from **fixed
|
||||
values** (editable) or from my **own historical return** (derived from
|
||||
the data). Show both pure-compounding and contributing-saver
|
||||
trajectories.
|
||||
|
||||
## Existing state (what we build on)
|
||||
|
||||
- **Dashboard**: `wealth.json` (UID `wealth`, 28 panels, Finance
|
||||
folder), provisioned as a ConfigMap consumed by the Grafana dashboard
|
||||
sidecar. Datasource: **`wealth-pg`** (Postgres, populated by
|
||||
`wealthfolio-sync` ETL). Default time range `now-180d/now`. **No
|
||||
template variables today.**
|
||||
- **Source view `dav_corrected`** (`infra/stacks/wealthfolio/main.tf`):
|
||||
wraps `daily_account_valuation`, correcting `net_contribution` by
|
||||
removing synthetic Fidelity-pension and Schwab-RSU flows so returns
|
||||
aren't distorted. **All return/contribution panels read this view, and
|
||||
so must the projection.**
|
||||
- **Net worth (today)** = `SUM(total_value)` over the *latest-per-account*
|
||||
rows (`DISTINCT ON (account_id) … ORDER BY valuation_date DESC`). This
|
||||
is the projection start point `NW₀`.
|
||||
- **Return methodology already on the dashboard** = **Modified Dietz**:
|
||||
`(nwₑ − nw₀ − flow) / (nw₀ + 0.5·flow)` where `flow = contribₑ −
|
||||
contrib₀`. Used by "12mo return" and "Yearly investment return %". The
|
||||
projection's historical rate reuses this exact formula.
|
||||
- **Complete-days guard**: panels only trust dates where every active
|
||||
account reported (`COUNT(*) per date >= (SELECT COUNT(*) FROM
|
||||
accounts)`), avoiding partial-day skew (witness: memory id=1229, the
|
||||
£88k-vs-£1.03M bug). The projection reuses this guard.
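
For reference, the two building blocks described above (Modified Dietz and the complete-days guard) can be sketched in a few lines. This is an illustration of the formulas as stated, not the dashboard's SQL; the function names are hypothetical, and `rows` is assumed to hold one `(valuation_date, account_id)` pair per account per day:

```python
from collections import Counter


def modified_dietz(nw_start, nw_end, contrib_start, contrib_end):
    """(nw_e - nw_0 - flow) / (nw_0 + 0.5 * flow), with flow = contrib_e - contrib_0."""
    flow = contrib_end - contrib_start
    denominator = nw_start + 0.5 * flow
    if denominator == 0:
        return None  # mirrors NULLIF(..., 0) in SQL
    return (nw_end - nw_start - flow) / denominator


def complete_days(rows, n_accounts):
    """Dates on which at least n_accounts accounts reported a valuation."""
    counts = Counter(day for day, _account in rows)
    return sorted(day for day, c in counts.items() if c >= n_accounts)


# Hypothetical figures: net worth 100k -> 120k while 10k of new money arrived.
print(round(modified_dietz(100_000, 120_000, 50_000, 60_000), 4))  # 0.0952
```

In the example, only 10k of the 20k increase is investment gain, and the 0.5 weight treats the new money as having been invested for half the period on average.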
|
||||
|
||||
## Locked decisions
|
||||
|
||||
| # | Decision | Choice |
|
||||
|---|---|---|
|
||||
| 1 | Compute engine | Pure Postgres SQL on `wealth-pg` (no new service; `fire-planner` Monte Carlo is retirement/withdrawal-oriented and a poor fit for simple growth-rate projection) |
|
||||
| 2 | Display | Multiple scenario lines |
|
||||
| 3 | Historical rate basis | All-time annualized Modified Dietz ("all-time CAGR") |
|
||||
| 4 | Lines | Fixed low/base/high (4/7/10%, editable) **+** a line at the derived historical CAGR |
|
||||
| 5 | Contributions | Support both; draw both at once at the base rate (with-contrib **and** compounding-only) |
|
||||
| 6 | Horizon | 30 years (dashboard variable) |
|
||||
| 7 | Placement | A **collapsed row on the existing `wealth` dashboard** (not a separate dashboard) |
|
||||
|
||||
## The projection panel — "Net worth — 30-year projection"
|
||||
|
||||
A timeseries panel. Every projected line originates from today's net
|
||||
worth `NW₀`. Series:
|
||||
|
||||
| Series | Rate | Contributions | Line style |
|
||||
|---|---|---|---|
|
||||
| Net worth (actual) | — | — | solid (last 3y of real history) |
|
||||
| Low | `$rate_low` (4%) | with | dashed |
|
||||
| Base | `$rate_base` (7%) | with | dashed |
|
||||
| Base — compounding only | `$rate_base` | none | dotted |
|
||||
| High | `$rate_high` (10%) | with | dashed |
|
||||
| Historical | `$hist_cagr` (derived) | with | dashed, legend `Historical (X%)` |
|
||||
|
||||
The visible gap between **Base** and **Base — compounding only** is the
|
||||
contribution boost (how much ongoing saving adds over pure market
|
||||
growth). When `$monthly_contribution = 0` the two lines coincide.
|
||||
|
||||
### Projection math
|
||||
|
||||
Per future month `n = 0 … horizon_years·12`, with monthly rate
|
||||
`rm = (1+r)^(1/12) − 1`:
|
||||
|
||||
- **Compounding only**: `V(n) = NW₀·(1+rm)ⁿ`
|
||||
- **With contributions** (ordinary annuity, end-of-period):
|
||||
`V(n) = NW₀·(1+rm)ⁿ + C·((1+rm)ⁿ − 1)/rm`
|
||||
(guard `rm = 0` → `V(n) = NW₀ + C·n`)
|
||||
|
||||
`C` = monthly contribution (see `$monthly_contribution` below). Future
|
||||
timestamps come from `generate_series` against DB `now()` — **not** the
|
||||
Grafana time picker — so the data always exists; only the axis must be
|
||||
extended to display it (see Placement).
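
The two formulas above, including the `rm = 0` guard, can be sanity-checked outside the database. A minimal sketch in Python (the real implementation is SQL; `monthly_rate` and `project` are illustrative names):

```python
def monthly_rate(annual_rate):
    """rm = (1 + r)^(1/12) - 1, so twelve monthly steps compound back to r."""
    return (1.0 + annual_rate) ** (1.0 / 12.0) - 1.0


def project(nw0, annual_rate, months, monthly_contribution=0.0):
    """Projected value after `months`, contributions paid at the end of each month."""
    rm = monthly_rate(annual_rate)
    if rm == 0.0:
        # Guard: the annuity term divides by rm, so fall back to linear growth.
        return nw0 + monthly_contribution * months
    growth = (1.0 + rm) ** months
    return nw0 * growth + monthly_contribution * (growth - 1.0) / rm


# Compounding only: 12 months at 7% should land at roughly 107,000.
print(project(100_000, 0.07, 12))
# With contributions at 0%: 100,000 + 24 * 500.
print(project(100_000, 0.0, 24, monthly_contribution=500))
```

Passing `monthly_contribution=0` reduces the second formula to the first, which is why the Base and Base (compounding only) lines coincide in that case.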
|
||||
|
||||
### Derived historical rate (`$hist_cagr`)
|
||||
|
||||
Annualized all-time Modified Dietz, computed over the complete-day
|
||||
window from `dav_corrected`:
|
||||
|
||||
```sql
|
||||
-- d0 = earliest complete day, dn = latest complete day
|
||||
R_total = (nwₙ − nw₀ − (cₙ − c₀)) / NULLIF(nw₀ + 0.5·(cₙ − c₀), 0)
|
||||
hist_cagr = (power(1 + R_total, 365.25 / (dn − d0)) − 1) · 100 -- percent
|
||||
```
|
||||
|
||||
This extends the dashboard's existing 12mo/yearly Modified-Dietz formula
|
||||
to the full history, so the projected "Historical" line is consistent
|
||||
with the returns already shown. Exposed as a **hidden query variable
|
||||
`$hist_cagr`** so the projection line *and* its legend label reference
|
||||
the same computed number.
|
||||
|
||||
> Alternative considered: geometric mean of the per-year Modified-Dietz
|
||||
> returns (more robust to flow timing). Rejected for v1 — annualized
|
||||
> all-time MD is the faithful reading of "all-time CAGR" and reuses the
|
||||
> existing formula verbatim. Revisit if the single 0.5 flow-weight
|
||||
> proves too crude over the multi-year window.
|
||||
|
||||
## Template variables (new — dashboard has none today)
|
||||
|
||||
| Variable | Type | Default | Purpose |
|
||||
|---|---|---|---|
|
||||
| `$rate_low` | textbox | `4` | low fixed annual % |
|
||||
| `$rate_base` | textbox | `7` | base fixed annual % |
|
||||
| `$rate_high` | textbox | `10` | high fixed annual % |
|
||||
| `$monthly_contribution` | textbox | `auto` | `auto` → SQL substitutes the trailing-12-complete-month contribution run-rate; or type a number / `0` |
|
||||
| `$horizon_years` | textbox | `30` | projection length |
|
||||
| `$hist_cagr` | query (hidden) | computed | derived historical CAGR %, reused by line + label |
|
||||
|
||||
`auto` contribution run-rate (trailing 12 complete months):
|
||||
`(contrib_now − contrib_12mo_ago) / 12`, read from `dav_corrected`
|
||||
latest-per-account. Note: RSU vests make raw monthly contributions
|
||||
lumpy; the 12-month run-rate smooths this.
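
The `auto` behaviour amounts to a small branch on the textbox value. A sketch of that logic, assuming the two cumulative contribution totals have already been read from `dav_corrected`; `resolve_monthly_contribution` is a hypothetical name and the figures are made up:

```python
def resolve_monthly_contribution(setting, contrib_now, contrib_12mo_ago):
    """Return the monthly contribution C for a textbox value of 'auto' or a number."""
    if setting == "auto":
        # Trailing-12-month run-rate: total new contributions divided by 12.
        return (contrib_now - contrib_12mo_ago) / 12.0
    return float(setting)


print(resolve_monthly_contribution("auto", 240_000, 180_000))  # 5000.0
print(resolve_monthly_contribution("0", 240_000, 180_000))     # 0.0
```

Anything other than `auto` must parse as a number, so a typo in the textbox raises an error instead of silently projecting with a wrong contribution.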
|
||||
|
||||
## Supporting panels (same collapsed row)
|
||||
|
||||
- **Stat cards**: Net worth today · Historical CAGR (`$hist_cagr`) ·
|
||||
Recent monthly contribution (the `auto` value) · Projected NW at
|
||||
horizon @ base · @ historical.
|
||||
- **Text panel** with one-click time-range links (see Placement).
|
||||
- *(Optional)* table "Projected net worth by year" — base & historical
|
||||
columns per year, for exact figures.
|
||||
|
||||
## Placement & the Grafana future-axis constraint
|
||||
|
||||
Grafana's dashboard time range is **shared by all panels**; per-panel
|
||||
overrides ("Relative time", "Time shift") only move a window relative to
|
||||
the picker — neither can set a panel's end to `now+30y` while other
|
||||
panels stay at `now-180d` (verified against Grafana v11.2 docs;
|
||||
dashboard `schemaVersion` 39). So a 30-year future axis cannot coexist
|
||||
on-screen with the 28 history panels without manual time changes.
|
||||
|
||||
Resolution (minimizes the clunk, zero edits to existing panels):
|
||||
|
||||
1. **Collapsed row** "📈 Projections" at the bottom of the dashboard.
|
||||
Collapsed by default → the 28 existing panels are untouched and never
|
||||
show future whitespace.
|
||||
2. **Text panel with time-range links** inside the row:
|
||||
- `Show projection range` → `?from=now-3y&to=now%2B30y` (reloads the
|
||||
dashboard with a future-inclusive axis; projection populates).
|
||||
- `Reset range` → `?from=now-180d&to=now`.
|
||||
3. The dashboard **default time stays `now-180d/now`** — unchanged.
|
||||
4. Projection SQL keys off DB `now()`, independent of the picker, so the
|
||||
actual-history tail (fixed `>= now()::date − interval '3 years'`)
|
||||
plus the 30-year projection both render once the range is extended.
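
The `%2B` in the link matters: a literal `+` in a query string is decoded as a space, which would turn `now+30y` into an invalid range. A short sketch showing that standard URL encoding produces exactly the links above (`range_link` is an illustrative helper, not part of the dashboard):

```python
from urllib.parse import urlencode


def range_link(time_from, time_to):
    """Build a relative dashboard link that overrides the time picker."""
    return "?" + urlencode({"from": time_from, "to": time_to})


print(range_link("now-3y", "now+30y"))   # ?from=now-3y&to=now%2B30y
print(range_link("now-180d", "now"))     # ?from=now-180d&to=now
```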
|
||||
|
||||
This honors "one dashboard, nothing extra to maintain" while making the
|
||||
future-axis switch a single click.
|
||||
|
||||
## Data flow / SQL building blocks
|
||||
|
||||
- **Target A (projection, wide format)**: one row per future month;
|
||||
columns `time, proj_low, proj_base, proj_base_nocontrib, proj_high,
|
||||
proj_hist`. Grafana renders each numeric column as a series. Row `n=0`
|
||||
emits `NW₀` for all columns so lines start exactly at today.
|
||||
- **Target B (actual history)**: `valuation_date, "Net worth (actual)"`
|
||||
over complete days, last 3 years. Grafana merges A+B on the time
|
||||
field; the actual series' final point (~today) meets the projections'
|
||||
`n=0` point.
|
||||
- Both reuse the `latest-per-account` + `complete-days` CTEs verbatim
|
||||
from existing panels, against `dav_corrected`.
|
||||
- Field overrides set line styles (solid/dashed/dotted) and the dynamic
|
||||
`Historical (${hist_cagr}%)` display name.
|
||||
|
||||
## Scope — what does NOT change
|
||||
|
||||
- The 28 existing panels, the `wealth-pg` datasource, the `dav_corrected`
|
||||
view, `wealthfolio-sync`, and the dashboard's default time range.
|
||||
- No new Kubernetes resources, no new service, no `fire-planner` changes.
|
||||
- Only additions to `wealth.json`: 1 collapsed row, ~7 panels, ~6
|
||||
template variables, 2 in-dashboard time-range links.
|
||||
|
||||
## Deployment
|
||||
|
||||
1. Claim presence: `scripts/presence claim stack:monitoring --purpose
|
||||
"wealth dashboard projections"`.
|
||||
2. Edit `infra/stacks/monitoring/modules/monitoring/dashboards/wealth.json`.
|
||||
3. `scripts/tg apply` the `monitoring` stack → ConfigMap updates → the
|
||||
Grafana dashboard sidecar reloads `wealth` (no Grafana restart).
|
||||
4. Verify in Grafana (see below). This is Terraform-managed — no
|
||||
`kubectl apply`/manual edits (infra Terraform-only rule).
|
||||
|
||||
## Verification plan
|
||||
|
||||
Dashboards aren't unit-testable, so verification is data + visual:
|
||||
|
||||
1. **SQL pre-validation** against live `wealth-pg` (psql): run the
|
||||
`$hist_cagr` query and the projection query; sanity-check `NW₀` matches
|
||||
the existing "Net worth (current)" stat, `hist_cagr` is in a plausible
|
||||
band, and `proj_base` at `n=0` equals `NW₀`, growing monotonically.
|
||||
2. **JSON validity**: `python -c "import json; json.load(open('wealth.json'))"` and
|
||||
unique panel `id`s / sane `gridPos`.
|
||||
3. **Visual** (after apply): expand the Projections row, click `Show
|
||||
projection range`, confirm 5 projected lines + actual history flow
|
||||
continuously from today; toggle `$monthly_contribution` between `auto`
|
||||
and `0` and confirm the Base / Base-compounding-only gap opens/closes;
|
||||
confirm `Reset range` restores the normal view and the 28 panels are
|
||||
unaffected.
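
The JSON-validity step can be extended to cover the unique-id check as well. A minimal sketch, assuming the usual Grafana dashboard layout in which a collapsed row nests its child panels under the row's own `panels` key; the function names are hypothetical:

```python
import json
from collections import Counter


def iter_panels(dashboard):
    """Yield top-level panels plus panels nested inside collapsed rows."""
    for panel in dashboard.get("panels", []):
        yield panel
        for child in panel.get("panels") or []:
            yield child


def duplicate_panel_ids(dashboard):
    """Return the sorted list of panel ids that occur more than once."""
    counts = Counter(p["id"] for p in iter_panels(dashboard) if "id" in p)
    return sorted(pid for pid, c in counts.items() if c > 1)


def check_dashboard_file(path):
    """Parse the file (raises json.JSONDecodeError if invalid) and report duplicate ids."""
    with open(path, encoding="utf-8") as fh:
        return duplicate_panel_ids(json.load(fh))
```

An empty list from `check_dashboard_file("wealth.json")` would mean the file parses and no panel id is reused, including inside the new collapsed row.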
|
||||
|
||||
## Risks / edge cases
|
||||
|
||||
- **Rate 0%** → `rm = 0` divide-by-zero — guarded in the annuity term.
|
||||
- **Negative historical CAGR** (portfolio down all-time) → declining
|
||||
projection line; still valid.
|
||||
- **Short history (<1y)** → annualization extrapolates a noisy rate; the
|
||||
`Historical` line is unreliable until ~1y of data. Acceptable; note in
|
||||
panel description.
|
||||
- **Lumpy RSU vests** skew raw monthly contribution → trailing-12-month
|
||||
run-rate smooths it; the user can override the number anytime.
|
||||
- **JSON churn**: must keep `wealth.json` valid and panel ids unique;
|
||||
the row is additive at the end to limit blast radius.
|
||||
- **Docs**: per execution.md §7, update any affected
|
||||
`infra/docs/architecture` / service-catalog references for the wealth
|
||||
dashboard in the same commit (likely none beyond this plan pair).
|
||||
|
||||
## Open questions
|
||||
|
||||
None — all design decisions resolved with the user (architecture,
|
||||
display, historical-rate basis, line composition, contribution
|
||||
rendering, horizon, placement).
|
||||
99
docs/plans/2026-05-28-wealth-projections-plan.md
Normal file
|
|
@ -0,0 +1,99 @@
|
|||
# Wealth Net-Worth Projections — Implementation Plan
|
||||
|
||||
> Pairs with `2026-05-28-wealth-projections-design.md`. Built 2026-06-01.
|
||||
|
||||
**Goal:** Add a collapsed "Projections" row to the `wealth` Grafana dashboard
|
||||
(UID `wealth`) with a 30-year multi-scenario net-worth projection, driven by
|
||||
pure SQL over the (now LOCF-fixed) `dav_corrected` view.
|
||||
|
||||
**Architecture:** Edit `infra/stacks/monitoring/modules/monitoring/dashboards/wealth.json`
|
||||
via a one-off Python builder (reliable JSON construction). Add 6 template
|
||||
variables + 1 collapsed row + projection panels. Deploy via targeted
|
||||
`scripts/tg apply` of the dashboard ConfigMap; Grafana sidecar reloads.
|
||||
|
||||
**Tech stack:** Grafana 11.2 (schemaVersion 39), Postgres datasource `wealth-pg`.
|
||||
|
||||
---
|
||||
|
||||
## Validated inputs (live, 2026-06-01)
|
||||
|
||||
- `nw0` (net worth today) = £1,163,011 — latest-per-account SUM(total_value).
|
||||
- auto monthly contribution run-rate = £15,755/mo (trailing 12 complete months ÷ 12).
|
||||
- Historical return = **trailing-3-full-year** geometric mean of per-year
|
||||
Modified-Dietz returns = **10.43%**.
|
||||
- FV math verified: n=0 = nw0 for every line; base@30y ≈ £27.3M, high ≈ £52.8M.
|
||||
|
||||
### GOTCHA (why not "all-time CAGR")
|
||||
|
||||
The complete-days filter (all 7 accounts present) only reaches back to
|
||||
2026-01-30 because the newest account is recent — so an "all-time CAGR over
|
||||
complete days" annualised a ~4-month window into a nonsense **83.71%**. And
|
||||
the true all-time geomean (17.5%) is dominated by 2021's small-base **+86%**
|
||||
year and would dwarf a 30y chart. Decision (user, 2026-06-01): use the
|
||||
**trailing-3-full-year** geomean (~10.4%) — represents "current returns",
|
||||
chart-sane. Per-year MD returns reuse the existing "Yearly investment return %"
|
||||
methodology (each year uses its own first/last obs; no all-complete requirement).
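
Both effects described above are easy to reproduce with toy numbers. The sketch below uses hypothetical returns, not the live figures: it shows how annualising a short window inflates a rate, and how the geometric mean of per-year returns is formed (the same `exp(avg(ln(1+ret)))` shape as the `hist_cagr` query).

```python
import math


def annualise(total_return, days):
    """Scale a return observed over `days` to a yearly rate: (1 + R)^(365.25 / days) - 1."""
    return (1.0 + total_return) ** (365.25 / days) - 1.0


def geometric_mean_return(yearly_returns):
    """exp(mean(ln(1 + r))) - 1 over the given per-year returns."""
    logs = [math.log(1.0 + r) for r in yearly_returns]
    return math.exp(sum(logs) / len(logs)) - 1.0


# A hypothetical 10% gain over a ~4-month (122-day) window annualises to about 33%.
print(round(annualise(0.10, 122) * 100, 1))
# Three hypothetical full-year returns average out to roughly 10.3% per year.
print(round(geometric_mean_return([0.08, 0.12, 0.11]) * 100, 1))
```

The exponent `365.25 / days` is about 3 for a four-month window, so any noise in the short-window return is roughly cubed, which is why the complete-days window was rejected as a basis.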
|
||||
|
||||
## Template variables (add to `templating.list` — dashboard has none today)
|
||||
|
||||
| name | type | default | hide |
|
||||
|---|---|---|---|
|
||||
| `rate_low` | textbox | `4` | 0 |
|
||||
| `rate_base` | textbox | `7` | 0 |
|
||||
| `rate_high` | textbox | `10` | 0 |
|
||||
| `monthly_contribution` | textbox | `auto` | 0 |
|
||||
| `horizon_years` | textbox | `30` | 0 |
|
||||
| `hist_cagr` | query (datasource wealth-pg) | computed | 2 (hidden) |
|
||||
|
||||
`hist_cagr` query:
|
||||
```sql
|
||||
WITH active_count AS (SELECT COUNT(*) n FROM accounts), mc AS (SELECT MAX(valuation_date) d FROM (SELECT valuation_date, COUNT(*) c FROM dav_corrected GROUP BY valuation_date) x WHERE c >= (SELECT n FROM active_count)), yearly AS (SELECT EXTRACT(YEAR FROM valuation_date)::int yr, valuation_date, SUM(total_value) nw, SUM(net_contribution) contrib FROM dav_corrected WHERE valuation_date <= (SELECT d FROM mc) GROUP BY valuation_date), ep AS (SELECT yr, (array_agg(nw ORDER BY valuation_date))[1] nw_s, (array_agg(nw ORDER BY valuation_date DESC))[1] nw_e, (array_agg(contrib ORDER BY valuation_date))[1] c_s, (array_agg(contrib ORDER BY valuation_date DESC))[1] c_e, COUNT(*) days FROM yearly GROUP BY yr), r3 AS (SELECT (nw_e-nw_s-(c_e-c_s))/NULLIF(nw_s+0.5*(c_e-c_s),0) ret FROM ep WHERE (nw_s+0.5*(c_e-c_s))>0 AND days>=300 ORDER BY yr DESC LIMIT 3) SELECT ROUND((exp(avg(ln(1+ret)))-1)*100,2) FROM r3
|
||||
```
|
||||
|
||||
## Panels (new collapsed row "📈 Projections", at bottom, y=200)
|
||||
|
||||
1. **Text panel** "How to view" with two dashboard links:
|
||||
`[Show projection range](?from=now-3y&to=now%2B30y)` /
|
||||
`[Reset](?from=now-180d&to=now)`. (h=3,w=24)
|
||||
2. **Stat row** (h=4): NW today · Historical return (trailing 3y) ·
|
||||
Monthly contribution (auto) · Projected NW @ base in `$horizon_years`y.
|
||||
3. **Timeseries** "Net worth — `$horizon_years`-year projection" (h=12,w=24),
|
||||
two targets (A wide projection, B actual 3y tail). Field overrides:
|
||||
actual = solid; Low/Base/High/Historical = dashed; "Base, no new
|
||||
contributions" = dotted.
|
||||
|
||||
### Panel 3 Target A (wide projection) — column aliases embed the rate for legends
|
||||
```sql
WITH active_count AS (SELECT COUNT(*) n FROM accounts), mc AS (SELECT MAX(valuation_date) d FROM (SELECT valuation_date, COUNT(*) c FROM dav_corrected GROUP BY valuation_date) x WHERE c >= (SELECT n FROM active_count)), latest AS (SELECT DISTINCT ON (account_id) account_id, total_value, net_contribution FROM dav_corrected WHERE valuation_date <= (SELECT d FROM mc) ORDER BY account_id, valuation_date DESC), agg AS (SELECT SUM(total_value) nw0, SUM(net_contribution) c_now FROM latest), ago AS (SELECT SUM(x.nc) c_ago FROM latest l LEFT JOIN LATERAL (SELECT net_contribution nc FROM dav_corrected d WHERE d.account_id=l.account_id AND d.valuation_date <= (SELECT d FROM mc) - INTERVAL '12 months' ORDER BY d.valuation_date DESC LIMIT 1) x ON true), params AS (SELECT (SELECT nw0 FROM agg) nw0, CASE WHEN '$monthly_contribution'='auto' THEN ((SELECT c_now FROM agg)-(SELECT c_ago FROM ago))/12.0 ELSE '$monthly_contribution'::numeric END cm, ($rate_low::float)/100 rl, ($rate_base::float)/100 rb, ($rate_high::float)/100 rh, ($hist_cagr::float)/100 rhist), m AS (SELECT generate_series(0, ${horizon_years}*12) n) SELECT (now() + (m.n || ' months')::interval) AS "time", round((nw0*power(1+(power(1+rl,1/12.0)-1),m.n) + cm*((power(1+(power(1+rl,1/12.0)-1),m.n)-1)/NULLIF(power(1+rl,1/12.0)-1,0)))::numeric,0) AS "Low ($rate_low%)", round((nw0*power(1+(power(1+rb,1/12.0)-1),m.n) + cm*((power(1+(power(1+rb,1/12.0)-1),m.n)-1)/NULLIF(power(1+rb,1/12.0)-1,0)))::numeric,0) AS "Base ($rate_base%)", round((nw0*power(1+(power(1+rb,1/12.0)-1),m.n))::numeric,0) AS "Base, no new contributions", round((nw0*power(1+(power(1+rh,1/12.0)-1),m.n) + cm*((power(1+(power(1+rh,1/12.0)-1),m.n)-1)/NULLIF(power(1+rh,1/12.0)-1,0)))::numeric,0) AS "High ($rate_high%)", round((nw0*power(1+(power(1+rhist,1/12.0)-1),m.n) + cm*((power(1+(power(1+rhist,1/12.0)-1),m.n)-1)/NULLIF(power(1+rhist,1/12.0)-1,0)))::numeric,0) AS "Historical ($hist_cagr%)" FROM m, params
```
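The per-column formula in Target A is a future value with monthly compounding and level monthly contributions. A minimal Python sketch of that formula follows; the function names are hypothetical. One deliberate difference: at a 0% rate the SQL's `NULLIF` produces NULL for the contribution term, whereas this sketch uses the mathematical limit (contributions simply add up).

```python
def monthly_rate(annual_rate: float) -> float:
    """Convert an annual rate (0.07 for 7%) to the equivalent monthly rate."""
    return (1 + annual_rate) ** (1 / 12.0) - 1


def project(nw0: float, monthly_contribution: float,
            annual_rate: float, months: int) -> float:
    """Net worth after `months`, starting at nw0 with level monthly deposits."""
    rm = monthly_rate(annual_rate)
    growth = (1 + rm) ** months
    if rm == 0:
        annuity = float(months)        # limit of (growth - 1) / rm as rm -> 0
    else:
        annuity = (growth - 1) / rm    # future value of 1 deposited each month
    return nw0 * growth + monthly_contribution * annuity
```

Calling `project(nw0, 0, rate, n)` corresponds to the "Base, no new contributions" column.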
|
||||
|
||||
### Panel 3 Target B (actual history, 3y tail)
|
||||
```sql
WITH active_count AS (SELECT COUNT(*) n FROM accounts), mc AS (SELECT MAX(valuation_date) d FROM (SELECT valuation_date, COUNT(*) c FROM dav_corrected GROUP BY valuation_date) x WHERE c >= (SELECT n FROM active_count)) SELECT valuation_date::timestamp AS "time", SUM(total_value) AS "Net worth (actual)" FROM dav_corrected WHERE valuation_date <= (SELECT d FROM mc) AND valuation_date >= now()::date - INTERVAL '3 years' GROUP BY valuation_date ORDER BY valuation_date
```
|
||||
|
||||
### Stat SQL
|
||||
- NW today: `WITH latest AS (SELECT DISTINCT ON (account_id) total_value FROM dav_corrected d JOIN accounts a ON a.id=d.account_id ORDER BY account_id, valuation_date DESC) SELECT SUM(total_value) FROM latest`
|
||||
- Historical return %: `SELECT $hist_cagr::float`
|
||||
- Monthly contribution (auto): the `agg`/`ago` run-rate `((c_now)-(c_ago))/12.0`
|
||||
- Projected @ base: Target-A base formula evaluated at `n = $horizon_years*12`
|
||||
|
||||
## Build / deploy / verify
|
||||
|
||||
1. **Build:** one-off Python script `/tmp/build_projection.py` (outside repo)
|
||||
loads wealth.json, appends the 6 vars + row + panels, fixes the "Net pay vs
|
||||
market gain — per month" panel (#3) to month-end deltas, writes back.
|
||||
2. **Validate:** `python -c 'import json; json.load(open("wealth.json"))'`; unique panel ids; spot-run Target A/B
|
||||
against live `wealth-pg`.
|
||||
3. **Deploy:** `scripts/tg apply -target=...grafana_dashboards["wealth.json"]`
|
||||
(targeted — monitoring stack has unrelated pre-existing drift).
|
||||
4. **Verify:** ConfigMap carries new content; user expands the row, clicks
|
||||
"Show projection range", confirms 5 projected lines flow from today + the
|
||||
actual tail; toggles `$monthly_contribution`=0 to see the contribution gap.
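
The build and validate steps above could be sketched roughly as follows. This is not the actual `/tmp/build_projection.py`; the helper names are hypothetical, and the textbox variable shape is an assumption based on the usual Grafana dashboard JSON layout (`templating.list`, with collapsed rows nesting their panels under `panels`).

```python
from typing import Any, Dict, Iterator, List


def iter_panels(dashboard: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield top-level panels plus panels nested inside collapsed rows."""
    for panel in dashboard.get("panels", []):
        yield panel
        yield from panel.get("panels", [])


def duplicate_panel_ids(dashboard: Dict[str, Any]) -> List[int]:
    """Return panel ids that occur more than once (empty list means OK)."""
    seen, dups = set(), set()
    for panel in iter_panels(dashboard):
        pid = panel.get("id")
        if pid is None:
            continue
        if pid in seen:
            dups.add(pid)
        seen.add(pid)
    return sorted(dups)


def add_textbox_var(dashboard: Dict[str, Any], name: str, default: str) -> None:
    """Append a textbox template variable unless one with that name exists."""
    variables = dashboard.setdefault("templating", {}).setdefault("list", [])
    if any(v.get("name") == name for v in variables):
        return  # idempotent: re-running the build must not duplicate vars
    variables.append({
        "name": name,
        "type": "textbox",
        "query": default,  # assumed field layout for a textbox variable
        "current": {"text": default, "value": default},
        "hide": 0,
    })
```

Keeping `add_textbox_var` idempotent means the script can be re-run against an already patched `wealth.json` without piling up duplicate variables.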
|
||||
|
||||
## Scope notes
|
||||
|
||||
- Skip the optional "projected NW by year" table (YAGNI; add later if wanted).
|
||||
- #3 ("Net pay vs market gain — per month") aligned to month-end deltas in the
|
||||
same build for monthly-market-gain consistency.
|
||||
- Fidelity growth-timing cosmetic = NOT in scope (user deferred 2026-06-01).
|
||||
110
docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-design.md
Normal file
|
|
@ -0,0 +1,110 @@
|
|||
# Design: Dedicated MetalLB IP for Traefik with externalTrafficPolicy=Local
|
||||
|
||||
**Date:** 2026-05-30
|
||||
**Status:** Draft — for review (no changes applied yet)
|
||||
**Author:** Viktor + Claude
|
||||
|
||||
## Problem
|
||||
|
||||
Two issues share one root cause on the Traefik ingress LoadBalancer:
|
||||
|
||||
1. **CrowdSec is blind to real client IPs on the 24 non-proxied/direct apps.**
|
||||
Traefik logs `10.0.20.103` (k8s-node3's IP) as the client for the
|
||||
overwhelming majority of direct-app requests (measured: 2522 hits vs 3
|
||||
real external IPs). Cause: the Traefik LB is `externalTrafficPolicy:
|
||||
Cluster`, so kube-proxy SNATs every external client to the MetalLB-elected
|
||||
node's IP before Traefik sees it. CrowdSec therefore makes ban decisions
|
||||
against an internal node IP it would never block → **no effective IP-based
|
||||
protection on the direct apps** (immich, forgejo, send, ytdlp, servarr,
|
||||
ebooks, novelapp, freedify, affine, health, f1-stream, kms, k8s-portal,
|
||||
etc. — 24 total).
|
||||
*Proxied apps are unaffected — they arrive via the cloudflared tunnel and
|
||||
get real IPs through Cloudflare's `X-Forwarded-For`.*
|
||||
|
||||
2. **HTTP/3 / QUIC does not complete for the direct apps.** An external probe
|
||||
(`http3check.net`) confirms "QUIC connection could not be established"
|
||||
despite `Alt-Svc: h3` being advertised and UDP 443 reaching Traefik
|
||||
(verified: pfSense NATs UDP 443 → Traefik LB; Traefik binds UDP 8443).
|
||||
Same root cause: `ETP=Cluster` + 3 replicas means kube-proxy SNATs and can
|
||||
spread the UDP flow across pods, which breaks the QUIC handshake.
|
||||
|
||||
Both are fixed by `externalTrafficPolicy: Local` on the Traefik LB (no SNAT →
|
||||
real client IPs preserved → QUIC stays pinned to one pod).
|
||||
|
||||
## Why we can't just flip ETP on the current IP
|
||||
|
||||
Traefik currently shares MetalLB IP **`10.0.20.200`** with **9 other services**
|
||||
via `metallb.io/allow-shared-ip`:
|
||||
|
||||
`dbaas/postgresql-lb` (**Terraform state backend**), `headscale/headscale-server`,
|
||||
`wireguard/wireguard`, `coturn/coturn`, `xray/xray-reality`,
|
||||
`shadowsocks/shadowsocks`, `beads-server/dolt`, `servarr/qbittorrent-torrenting`,
|
||||
`tor-proxy/torrserver-bt`.
|
||||
|
||||
Per MetalLB docs, services sharing an IP **must all use `Cluster`** (or point
|
||||
to identical pods). Mixing `Local` and `Cluster` on a shared IP is **not
|
||||
allowed** and would break the IP allocation — taking down all ingress **and
|
||||
the Terraform state DB** (locking out `terragrunt` itself), plus VPN/DNS path.
|
||||
→ Traefik must move to its **own** IP.
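
The sharing rule can be checked mechanically before any apply. The sketch below is a simplified illustration: it takes `(name, ip, externalTrafficPolicy)` tuples (a hypothetical input shape, for example extracted from `kubectl get svc -A -o json`) and reports shared IPs that carry a non-`Cluster` member. It ignores the "identical pods" exception mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Service = Tuple[str, str, str]  # (namespace/name, LB IP, externalTrafficPolicy)


def mixed_etp_ips(services: Iterable[Service]) -> Dict[str, List[str]]:
    """Map each shared IP to the services on it that are not ETP=Cluster."""
    by_ip: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name, ip, etp in services:
        by_ip[ip].append((name, etp))

    problems: Dict[str, List[str]] = {}
    for ip, members in by_ip.items():
        if len(members) < 2:
            continue  # a dedicated IP may use any policy
        offenders = sorted(n for n, etp in members if etp != "Cluster")
        if offenders:
            problems[ip] = offenders
    return problems
```

An empty result for the planned end state (Traefik alone on `10.0.20.203`, everything on `10.0.20.200` still `Cluster`) is the condition the migration has to preserve.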
|
||||
|
||||
## Target state
|
||||
|
||||
- New dedicated MetalLB IP **`10.0.20.203`** (free; pool is `10.0.20.200-220`),
|
||||
**not** shared, `externalTrafficPolicy: Local`, for the Traefik LB.
|
||||
- `10.0.20.200` keeps the other 9 services unchanged (still all `Cluster`).
|
||||
- Internal split-horizon DNS apex `viktorbarzin.me A` → `10.0.20.203`
|
||||
(currently `10.0.20.200`). All `*.viktorbarzin.me` CNAME → apex, so this one
|
||||
record moves every internal ingress hostname.
|
||||
- pfSense: the WAN 443 (TCP **and** UDP) port-forward target moves from the
|
||||
`<nginx>` alias to a **new pfSense alias** for `10.0.20.203`
|
||||
(per request: define a VIP/alias, do **not** hardcode the IP in rules —
|
||||
matches the existing `<nginx>` / `<k8s_shared_lb>` alias pattern).
|
||||
|
||||
## Key decisions
|
||||
|
||||
- **Dedicated IP, not shared** — forced by the MetalLB mixed-ETP rule above.
|
||||
- **`10.0.20.203`** — first free IP after technitium (.201) and kms (.202).
|
||||
- **pfSense reference by alias, not literal IP** (user requirement) — create
|
||||
alias e.g. `traefik_lb` = `10.0.20.203`, reference it in the rdr + firewall
|
||||
pass rule. One place to change later.
|
||||
- **Cutover style** — two options, decided at review (see plan):
|
||||
- *In-place* (recommended for maintainability): change the Helm Service to
|
||||
the new IP + ETP=Local in one edit; brief cutover window (mitigated by
|
||||
pre-lowering DNS TTL + staging the pfSense change).
|
||||
- *Additive* (zero-downtime): stand up a second LB Service on `.203`
|
||||
(ETP=Local) alongside the existing `.200` one, cut DNS/pfSense over, then
|
||||
retire Traefik from `.200`. More moving parts to maintain.
|
||||
|
||||
## Risks & watch-items
|
||||
|
||||
- **Terraform state backend lives on `.200`** — every phase must verify
|
||||
`dbaas/postgresql-lb:5432` stays reachable. We never touch `.200`'s config,
|
||||
only remove Traefik from it at the end; low risk but explicitly checked.
|
||||
- **Live-firewall edit** (pfSense rdr + alias) — done via the pfSense UI
|
||||
(persisted in config.xml); CLI `pfctl` edits don't persist. Per the
|
||||
network-device rule, this step is operator-driven/confirmed, not automated.
|
||||
- **CrowdSec behavior change** — once it sees *real* public IPs on direct
|
||||
apps, it will start making real ban decisions there. Confirm the security
|
||||
allowlist (source-IP allowlist `10.0.20.0/22`, `192.168.1.0/24`, tailnet;
|
||||
identity `me@viktorbarzin.me`) is correct so family/legit IPs aren't banned.
|
||||
- **MetalLB ETP=Local node election** — `.203` is announced only from a node
|
||||
running a ready Traefik pod. Traefik has 3 replicas (node4, node5, +1) and
|
||||
PDB minAvailable=2, so ≥2 eligible nodes always exist; re-elects on failure.
|
||||
- **Cloudflare-proxied apps** route via the cloudflared tunnel → Traefik
|
||||
ClusterIP, **not** the LB IP, so they are unaffected — verified in plan.
|
||||
- **Cutover window** for the in-place option — keep it short; have rollback
|
||||
staged.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- No change to the 9 services on `10.0.20.200`.
|
||||
- No change to Cloudflare-proxied apps' path.
|
||||
- No re-architecture of the pfSense↔K8s ingress beyond the 443 target move.
|
||||
|
||||
## Affected docs (update on apply)
|
||||
|
||||
- `.claude/CLAUDE.md` (Networking & Resilience / Service-Specific notes)
|
||||
- `docs/architecture/networking.md` (or equivalent — Traefik LB IP, ETP)
|
||||
- `docs/runbooks/` — add a short "Traefik LB IP / ETP" runbook entry
|
||||
- `.claude/reference/service-catalog.md` if it records LB IPs
|
||||
- memory: update the QUIC/ingress entries (ids 3241-3246)
|
||||
230
docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-plan.md
Normal file
|
|
@ -0,0 +1,230 @@
|
|||
# Plan: Migrate Traefik to dedicated IP 10.0.20.203 + ETP=Local
|
||||
|
||||
**Date:** 2026-05-30 · **Pairs with:** `2026-05-30-traefik-dedicated-ip-etp-local-design.md`
|
||||
**Status:** Draft — review required before executing. Nothing applied yet.
|
||||
|
||||
Goal: real client IPs to CrowdSec + working QUIC on the 24 direct apps, by
|
||||
moving Traefik off the shared `10.0.20.200` onto its own `10.0.20.203` with
|
||||
`externalTrafficPolicy: Local`. Shared IP `.200` (incl. the TF state DB) is
|
||||
left untouched until the final cleanup step.
|
||||
|
||||
> Recommended cutover: **in-place** (simplest, most maintainable) inside a short
|
||||
> planned window. Additive/zero-downtime variant noted at the end.
|
||||
|
||||
## Phase 0 — Pre-flight (read-only, ~10 min)
|
||||
|
||||
- [ ] Snapshot current state (already captured in chat; re-confirm at execution):
|
||||
- Traefik svc: IP `10.0.20.200`, `allow-shared-ip=shared`, ETP=Cluster.
|
||||
- `.200` shared by 10 services incl. `dbaas/postgresql-lb:5432` (TF state).
|
||||
- DNS apex `viktorbarzin.me A = 10.0.20.200` (Technitium primary, split-horizon).
|
||||
- pfSense rdr: WAN 443 tcp+udp → alias `<nginx>` (=10.0.20.200); `admin@10.0.20.1`.
|
||||
- Traefik 3 replicas (node4, node5, +1), PDB minAvailable=2.
|
||||
- [ ] Confirm `10.0.20.203` still free in pool `10.0.20.200-220`.
|
||||
- [ ] **Lower DNS TTL** on the apex record to 60s (Technitium) ~30 min ahead of
|
||||
cutover to shrink the window. (Restore to normal afterward.)
|
||||
- [ ] Baseline checks to compare against (run now, save output):
|
||||
- `curl -sI https://immich.viktorbarzin.me` (direct app) → 200/redirect
|
||||
- `curl -sI https://<a-proxied-app>` → 200 (proxied path)
|
||||
- PG state reachable: `nc -vz 10.0.20.200 5432` (or a `terragrunt plan` no-op)
|
||||
- Traefik access log shows `10.0.20.103` for a direct app (the bug we're fixing)
|
||||
- `http3check.net` for immich → QUIC FAILS (baseline)
|
||||
|
||||
## Phase 1 — Terraform: dedicated IP + ETP=Local (reversible)
|
||||
|
||||
Edit `stacks/traefik/modules/traefik/main.tf`, Helm `service` block (~L165-173):
|
||||
|
||||
```hcl
service = {
  type = "LoadBalancer"
  annotations = {
    "metallb.io/loadBalancerIPs" = "10.0.20.203" # was 10.0.20.200
    # allow-shared-ip REMOVED: Traefik no longer shares an IP
  }
  spec = {
    externalTrafficPolicy = "Local" # was Cluster
  }
}
```
|
||||
|
||||
- [ ] `scripts/tg plan` in `stacks/traefik` — review: only the Traefik Service
|
||||
changes (new IP, ETP, annotation removed). No change to other stacks.
|
||||
- [ ] `scripts/tg apply`.
|
||||
- [ ] **Immediately verify** (ingress is briefly broken until DNS+pfSense move):
|
||||
- `kubectl get svc traefik -n traefik` → IP `10.0.20.203`, ETP=Local.
|
||||
- `kubectl get svc -A | grep 10.0.20.200` → the other 9 services still hold `.200`.
|
||||
- **`nc -vz 10.0.20.200 5432`** → TF state DB still reachable (critical).
|
||||
- `curl -sI --resolve <direct-app>:443:10.0.20.203 https://<direct-app>` → 200
|
||||
(proves `.203` serves before DNS moves).
|
||||
|
||||
**Rollback (Phase 1):** revert the three lines → `scripts/tg apply`. Back to `.200`.
|
||||
|
||||
## Phase 2 — Internal DNS cutover (Technitium)
|
||||
|
||||
- [ ] Update split-horizon apex: `viktorbarzin.me A → 10.0.20.203` (primary;
|
||||
AXFR replicates to secondary/tertiary, or kick `technitium-zone-sync`).
|
||||
- [ ] Verify internal resolution: `dig +short immich.viktorbarzin.me` → `10.0.20.203`
|
||||
from a cluster/LAN client; `curl -sI https://immich.viktorbarzin.me` → 200.
|
||||
|
||||
**Rollback (Phase 2):** apex A → `10.0.20.200`.
|
||||
|
||||
## Phase 3 — pfSense (live firewall — operator-driven, alias not literal)
|
||||
|
||||
Per the "create a VIP/alias, don't hardcode" requirement:
|
||||
|
||||
- [ ] **Create a pfSense Firewall Alias** (Firewall ▸ Aliases), type Host:
|
||||
name `traefik_lb`, value `10.0.20.203`. *(This is the correct pfSense
|
||||
object for a NAT-forward target — same kind as the existing `<nginx>`
|
||||
alias. If a CARP/IP-Alias Virtual IP is intended instead, confirm at
|
||||
review; a routed K8s LB IP normally uses an Alias, not a VIP.)*
|
||||
- [ ] **Repoint the 443 forward** (Firewall ▸ NAT ▸ Port Forward): change the
|
||||
existing WAN `https` (TCP **and** UDP) rule's target from `nginx` →
|
||||
`traefik_lb`. Leave the auto firewall rule linked. Do **not** touch the
|
||||
`http-alt`/`7443` rules (those are xray on `<k8s_shared_lb>`).
|
||||
- [ ] Apply pfSense changes.
|
||||
- [ ] Verify externally:
|
||||
- `http3check.net` for immich → **QUIC OK** (h3 established).
|
||||
- External `curl` to a few direct apps → 200.
|
||||
- Traefik access log now shows **real client IPs** for direct apps (not `10.0.20.103`).
|
||||
|
||||
**Rollback (Phase 3):** point the 443 rule's target back to `nginx`.
|
||||
|
||||
## Phase 4 — Verify CrowdSec + the fleet (the real prize)
|
||||
|
||||
- [ ] Traefik logs: real public IPs on direct apps (sample several).
|
||||
- [ ] CrowdSec: confirm it now ingests real IPs (a test decision / metrics);
|
||||
**confirm the source-IP allowlist** (`10.0.20.0/22`, `192.168.1.0/24`,
|
||||
tailnet) is active so family/LAN aren't banned.
|
||||
- [ ] Proxied apps unaffected (spot-check 2-3 — still real IPs via Cloudflare).
|
||||
- [ ] All other `.200` services healthy (PG state, headscale, wireguard, coturn,
|
||||
xray, etc.).
|
||||
- [ ] Restore DNS TTL to normal.
|
||||
|
||||
## Phase 5 — Cleanup / docs
|
||||
|
||||
- [ ] Confirm Traefik no longer answers on `.200` (it shouldn't after Phase 1).
|
||||
- [ ] Update docs (design doc "Affected docs" list): `.claude/CLAUDE.md`,
|
||||
`docs/architecture/networking.md`, service-catalog, memory ids 3241-3246.
|
||||
- [ ] Commit TF + docs.
|
||||
|
||||
## Rollback (full)
|
||||
|
||||
Reverse order: pfSense 443 target → `nginx`; apex A → `.200`; revert the
|
||||
Traefik Service TF (IP `.200`, `allow-shared-ip=shared`, ETP=Cluster) → apply.
|
||||
kubectl/Helm reach the API server directly (not via Traefik), so control is
|
||||
retained even if ingress is down mid-cutover.
|
||||
|
||||
## Additive (zero-downtime) variant — if the window is unacceptable
|
||||
|
||||
Instead of editing the Helm Service in place: add a second raw
|
||||
`kubernetes_service` (type LoadBalancer, IP `.203`, ETP=Local, ports
|
||||
web/80→8000, websecure/443→8443 TCP, websecure-http3/443→8443 UDP, selector =
|
||||
Traefik pod labels). Both `.200` (old) and `.203` (new) serve Traefik. Cut
|
||||
DNS+pfSense to `.203`, verify, then convert the Helm Service to ClusterIP
|
||||
(drops `.200`). More config to carry long-term (a hand-maintained Service
|
||||
duplicating Helm) — weigh against the brief in-place window.
|
||||
|
||||
## Attempt 1 — 2026-05-30 — ROLLED BACK (post-mortem)
|
||||
|
||||
First execution was rolled back to the `.200` baseline; all services restored,
|
||||
TF state reconciled (`No changes`). The cutover **achieved its primary goal
|
||||
mid-flight** (real external client IPs reached CrowdSec — confirmed real IPs
|
||||
like `34.107.119.124` in Traefik logs instead of node `10.0.20.103`), but a
|
||||
**missed dependency took proxied apps down**, forcing rollback. Fix the plan
|
||||
before retrying:
|
||||
|
||||
1. **BLOCKER — cloudflared targets the LB IP.** The `cloudflared` tunnel is
|
||||
**token-based / Cloudflare-dashboard-managed** (`args: [tunnel]` +
|
||||
`TUNNEL_TOKEN`; no local `config.yaml`). Its ingress sends `*.viktorbarzin.me`
|
||||
to the **Traefik LB IP `10.0.20.200`**. Moving Traefik to `.203` left
|
||||
cloudflared pointing at a dead IP → **every proxied app (vault, home, …)
|
||||
went down**. **The retry MUST also repoint the tunnel ingress `.200 → .203`
|
||||
in Cloudflare (API/dashboard)** as part of the same cutover — ideally point
|
||||
cloudflared at the Traefik *ClusterIP/service* so it's IP-independent.
|
||||
2. **Vault-ingress circular dependency.** Fetching the Technitium password from
|
||||
Vault *during* the window failed (Vault's ingress was down). Fix used:
|
||||
pre-fetch all creds before touching Traefik (worked). The DNS step then
|
||||
restored Vault.
|
||||
3. **SIGPIPE → stuck PG state locks.** Piping `scripts/tg` through `head`/`grep`
|
||||
(early pipe close) SIGPIPE-killed terragrunt before it released the PG
|
||||
advisory lock, leaving an idle `terraform_state` connection holding the lock
|
||||
(`force-unlock` can't release another session's advisory lock). **Always run
|
||||
`tg` to a file, never pipe through early-closing filters.** Clear a stuck
|
||||
one by terminating the idle backend: `pg_terminate_backend(<pid>)` for the
|
||||
idle conn holding `pg_locks.objid` of the workspace.
|
||||
4. **ETP=Local + hairpin.** Internal hosts that resolve `*.viktorbarzin.me` via
|
||||
*public* DNS and hairpin (e.g. the devvm) become flaky under ETP=Local.
|
||||
True external clients and internal-direct (`.203`) clients work. Ensure such
|
||||
hosts resolve internally (Technitium split-horizon).
|
||||
5. **QUIC verification.** `http3check.net` was unreliable here (failed on TCP
|
||||
while real clients got 200s) — don't rely on it; confirm from a real device
|
||||
on cellular.
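
Lesson 3 (never pipe `tg` through an early-closing filter) can be enforced with a thin wrapper. The sketch below is an assumption about how one might do it, not an existing script in the repo: the command writes straight to a log file, so no reader can close a pipe under it, and the log is inspected only after the process has exited.

```python
import subprocess
from typing import List, Sequence


def run_to_file(cmd: Sequence[str], log_path: str) -> int:
    """Run cmd to completion with stdout+stderr captured in log_path."""
    with open(log_path, "wb") as log:
        # No pipe is involved, so the child cannot be killed by SIGPIPE
        # when someone only wants to look at the first few lines.
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
    return proc.returncode


def head(log_path: str, n: int = 20) -> List[str]:
    """Read the first n lines of a finished log (safe replacement for `| head`)."""
    with open(log_path, errors="replace") as f:
        return [line.rstrip("\n") for _, line in zip(range(n), f)]
```

A hypothetical call would be `run_to_file(["scripts/tg", "plan"], "/tmp/tg-plan.log")` followed by `head("/tmp/tg-plan.log")`.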
|
||||
|
||||
**Left in place for retry:** pfSense alias `traefik_lb` (=`10.0.20.203`, NAT
|
||||
reverted to `nginx`); pfSense `config.xml` backups `config.xml.bak-traefik-*`.
|
||||
|
||||
## Attempt 2 — 2026-05-30 — SUCCESS
|
||||
|
||||
Live and verified, **no proxied/Vault outage** this time. Key change vs attempt 1:
|
||||
**decouple cloudflared from the LB IP FIRST**, so moving Traefik no longer
|
||||
touches the proxied path or Vault's ingress.
|
||||
|
||||
Executed order (all lessons applied — `tg` always run to a file, creds
|
||||
pre-fetched while Vault up):
|
||||
1. **Cloudflare tunnel ingress repointed** `https://10.0.20.200:443` →
|
||||
`https://traefik.traefik.svc.cluster.local:443` (both `*.viktorbarzin.me`
|
||||
and apex rules; `noTLSVerify` kept; catch-all 404 kept). Done via the
|
||||
**Cloudflare Global API Key** (`secret/platform` → `cloudflare_api_key`,
|
||||
email `vbarzin@gmail.com`, `X-Auth-Email`+`X-Auth-Key` headers — NOT the
|
||||
tunnel token, which is not an API credential). Tunnel: account
|
||||
`02e035473cfc4834fb10c5d35470d8b4`, id `75182cd7-bb91-4310-b961-5d8967da8b41`.
|
||||
→ proxied apps now IP-independent.
|
||||
2. Traefik Service → `10.0.20.203` + `ETP=Local` (single service; `tg apply`).
|
||||
Proxied apps + Vault stayed up (cloudflared → ClusterIP).
|
||||
3. Technitium apex `viktorbarzin.me A` → `10.0.20.203` (ttl 60).
|
||||
4. pfSense 443 (tcp+udp) NAT `nginx` → `traefik_lb` (`.203`); `/etc/rc.filter_configure`.
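
Step 1 amounts to rewriting the `service` target of the matching ingress rules while leaving everything else alone. A minimal sketch of that transformation is below. It assumes the remote tunnel config uses cloudflared's usual `ingress` list shape (`hostname`, `service`, `originRequest`) and deliberately leaves out fetching and pushing the config through the Cloudflare API.

```python
import copy
from typing import Any, Dict, List

Rule = Dict[str, Any]


def repoint_ingress(ingress: List[Rule], old: str, new: str) -> List[Rule]:
    """Return a copy of the ingress rules with `old` service targets replaced."""
    updated = copy.deepcopy(ingress)  # never mutate the fetched config in place
    for rule in updated:
        if rule.get("service") == old:
            rule["service"] = new
        # Other fields (hostname, originRequest.noTLSVerify) are untouched,
        # and the catch-all rule keeps its http_status:404 target.
    return updated
```

Working on a copy keeps the original rules available as the rollback payload.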
|
||||
|
||||
**Verified:** proxied 307/200 throughout; direct apps 200; **real external
|
||||
client IPs now reach Traefik/CrowdSec** (`216.73.217.51`, `54.x`, `52.x` — not
|
||||
node `10.0.20.103`); PG state DB OK; TF state reconciled (`tg apply` exit 0).
|
||||
|
||||
**Notes / follow-ups:**
|
||||
- **Out-of-band (not in TF):** the cloudflared tunnel ingress (remote/dashboard
|
||||
config) and the pfSense `traefik_lb` alias + NAT. Codify the tunnel config in
|
||||
TF (`cloudflare_zero_trust_tunnel_cloudflared_config`) so `→ClusterIP` is
|
||||
declarative — pre-existing gap (tunnel was already remote-managed).
|
||||
- **QUIC:** infra correct (ETP=Local + UDP 443 → `.203` + Traefik h3 listener).
|
||||
`http3check.net` is unreliable here — it hits the IPv6 AAAA
|
||||
(`2001:470:6e:43d::2`, separate HE-tunnel path, unchanged) and fails before
|
||||
reaching Traefik. Confirm QUIC from a real device (Chrome → Protocol `h3`).
|
||||
- pfSense `nginx` alias (=`.200`) is now unused; `traefik_lb` (=`.203`) is live.
|
||||
|
||||
## IPv6 follow-up — 2026-05-30 — DONE (HAProxy bridge, real client IPs)
|
||||
|
||||
The ETP=Local cutover fixed real client IPs + QUIC on the **IPv4** direct path
|
||||
only. The **IPv6** path (HE 6in4 tunnel `2001:470:6e:43d::2` → pfSense) still ran
|
||||
`socat`, which (a) masked every IPv6 client as `10.0.20.1`, and (b) broke
|
||||
outright once Traefik's `proxyProtocol.trustedIPs` started requiring PROXY-v2
|
||||
from `10.0.20.1`. Replaced socat with a **standalone HAProxy bridge** on pfSense
|
||||
using `send-proxy-v2` so real IPv6 client IPs reach Traefik/CrowdSec.
|
||||
|
||||
Executed:
|
||||
1. **Traefik** (TF, `stacks/traefik/.../main.tf`): added
|
||||
`proxyProtocol = { trustedIPs = ["10.0.20.1"] }` to the `web` + `websecure`
|
||||
entrypoints. Bounded risk — only connections *from* `10.0.20.1` (the bridge)
|
||||
are PROXY-parsed; real IPv4 clients (ETP=Local, own source IP) are untouched.
|
||||
Applied; IPv4 + proxied verified 200 immediately after.
|
||||
2. **pfSense HAProxy** (`/usr/local/etc/ipv6-haproxy.cfg`): 6 frontends on
|
||||
`[::2]:{443,80,25,465,587,993}` → Traefik `.203:{443,80}` and mail NodePorts
|
||||
`{30125,30126,30127,30128}` (.101-103), all `send-proxy-v2`, **no `check`**
|
||||
(a plain check would false-DOWN the PROXY-expecting listeners).
|
||||
3. **Persistence**: rewrote `rc.d/ipv6proxy` → manages HAProxy
|
||||
(`service ipv6proxy {start,stop,status}`, graceful `-sf`); rewrote
|
||||
`ipv6_proxy.sh` (config.xml `<shellcmd>` boot entrypoint) to keep the
|
||||
nginx-off-`[::]` patch then `service ipv6proxy onestart`. socat backups kept
|
||||
as `*.socat-bak-*`.
|
||||
|
||||
**Verified:** web over `::2` = 200; Traefik logs show real public IPv6 clients
|
||||
(e.g. `2620:10d:c092:500::6:1eda`), **zero** `10.0.20.1` artifacts; mail-over-IPv6
|
||||
`220` banners on `::2:25/587` + IMAPS connect on `::2:993`; IPv4 direct/proxied
|
||||
+ QUIC (`alt-svc: h3=":443"`) unaffected. **No QUIC over IPv6** (bridge is TCP/h2).
|
||||
Authoritative as-built: `docs/architecture/networking.md` → "IPv6 Ingress".
|
||||
149
docs/plans/2026-06-01-t3-auto-provision-design.md
Normal file
|
|
@ -0,0 +1,149 @@
|
|||
# t3code per-user auto-provisioning — design
|
||||
|
||||
- **Date:** 2026-06-01
|
||||
- **Status:** implemented 2026-06-01 (commits up to e8766756; dispatcher hardened to a dedicated unprivileged user + `t3-mint` wrapper vs the design's run-as-wizard)
|
||||
- **Owner:** Viktor (wizard)
|
||||
- **Builds on:** the multi-user t3 setup shipped earlier 2026-06-01 (commit `ad9472ab`): Authentik forward-auth on `t3.viktorbarzin.me` → in-cluster nginx `t3-dispatch` → per-OS-user `t3 serve` on devvm (wizard→:3773, emo→:3774).
|
||||
|
||||
## Goal
|
||||
|
||||
When an onboarded user logs in via Authentik, they land **straight in their own t3 workspace** — no admin pre-creating per-user systemd units / dispatch entries, and no manual t3 pairing. "Full hands-off," scoped to users who are valid OS accounts on devvm.
|
||||
|
||||
## Constraints (load-bearing)
|
||||
|
||||
1. **t3 is single-owner with no trust-upstream-auth.** No flag/header lets t3 trust an upstream identity and skip its own session (verified against `t3 serve --help` + source: no `trustedHeader`/`REMOTE_USER`/`disableAuth`). So we cannot make t3 zero-auth-after-Authentik the way ttyd is; we **auto-mint + auto-inject** t3's own session instead.
|
||||
2. **t3 users must be valid OS users on devvm.** No auto-creating Linux accounts. Membership = an `/etc/ttyd-user-map` entry (Authentik username → existing OS user), the same map the `terminal` stack already uses.
|
||||
3. **File permissions must be enforced by the OS.** Each user's t3 instance (and every agent/process it spawns) must run as that user's uid, so file access is bounded by Unix permissions — not by t3 app logic.
|
||||
4. **t3's web session is a cookie** (`/api/auth/bootstrap` calls `HttpServerResponse.setCookie`; `t3 auth session list` shows `method: browser-session-cookie`). A proxy can therefore mint and inject it.
|
||||
|
||||
## Source of truth
|
||||
|
||||
`/etc/ttyd-user-map` (already: `vbarzin=wizard`, `emil.barzin=emo`). One file drives both the terminal and t3. A user with no entry → 403 (no shared fallback). Adding a person = one line here (plus they must already be an Authentik identity + OS account — i.e., your existing onboarding).
|
||||
|
||||
## Discovered auth contract
|
||||
|
||||
*(Task 1 discovery spike — confirmed from `pingdotgg/t3code` source AND a live mint→bootstrap→cookie round-trip against wizard's instance on `http://127.0.0.1:3773`, 2026-06-01.)*
|
||||
|
||||
- **Session cookie name: `t3_session`.**
|
||||
- Source: `apps/server/src/auth/utils.ts` — `const SESSION_COOKIE_NAME = "t3_session"`. `resolveSessionCookieName({mode, port})` returns the plain name in **web** mode and `t3_session_<port>` only in **desktop** mode. The server passes `serverConfig.mode`/`serverConfig.port` (`SessionCredentialService.ts`); `t3 serve` runs in `web` mode → plain `t3_session`.
|
||||
- Live `Set-Cookie` from the running instance returned `t3_session=...` (no port suffix) → confirms web mode and cross-checks the source.
|
||||
|
||||
- **Bootstrap request body: `{ "credential": "<TOKEN>" }`** (single field `credential`, a non-empty trimmed string).
|
||||
- Schema: `packages/contracts/src/auth.ts` — `AuthBootstrapInput = Schema.Struct({ credential: TrimmedNonEmptyString })`.
|
||||
- Server: `apps/server/src/auth/http.ts` `authBootstrapRouteLayer` (POST `/api/auth/bootstrap`) decodes `AuthBootstrapInput`, calls `exchangeBootstrapCredential(payload.credential, ...)`, then `HttpServerResponse.setCookie(sessions.cookieName, result.sessionToken, { httpOnly: true, path: "/", sameSite: "lax", expires })`.
|
||||
- Web UI: `apps/web/src/environments/primary/auth.ts` posts `const payload: AuthBootstrapInput = { credential }` with `credentials: "include"`.
|
||||
- A wrong/missing field yields `400 "Invalid bootstrap payload."`.
|
||||
- **The `t3 auth pairing create --json` CLI returns the pairing token under the `credential` key** (not `token`/`pairingToken`) — feed that value straight into the bootstrap body's `credential` field.
|
||||
|
||||
- **Verified curl** (token redacted):
|
||||
|
||||
```bash
TOK=$(sudo -u wizard t3 auth pairing create --base-dir /home/wizard/.t3 --ttl 5m --json | jq -r '.credential')
curl -s -i -XPOST http://127.0.0.1:3773/api/auth/bootstrap \
  -H 'content-type: application/json' \
  -d "{\"credential\":\"$TOK\"}" | grep -iE 'HTTP/|set-cookie'
# HTTP/1.1 200 OK
# set-cookie: t3_session=<JWT>; Path=/; Expires=<+30d>; HttpOnly; SameSite=Lax
```
|
||||
|
||||
The session cookie is a signed JWT (`v:1, kind:session, sid, sub, role, method:"browser-session-cookie", iat, exp`), default TTL 30 days. The dispatch service must inject it `HttpOnly; Path=/; SameSite=Lax` to match t3's own behaviour.
|
||||
|
||||
- **Constants for the dispatch service:** `T3_COOKIE = "t3_session"`; bootstrap endpoint `POST /api/auth/bootstrap`; body `{"credential": "<pairing-token>"}`; success = `200` + `Set-Cookie: t3_session=...`.
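
The contract above reduces to two small pieces the dispatch service needs: building the bootstrap body and pulling `t3_session` out of the returned `Set-Cookie`. A hedged Python sketch using only the standard library follows. The function names are hypothetical, and the HTTP call itself is left out.

```python
import json
from http.cookies import SimpleCookie
from typing import Optional

T3_COOKIE = "t3_session"
BOOTSTRAP_PATH = "/api/auth/bootstrap"


def bootstrap_body(pairing_token: str) -> bytes:
    """JSON body for POST /api/auth/bootstrap: a single `credential` field."""
    token = pairing_token.strip()
    if not token:
        # t3 answers 400 "Invalid bootstrap payload." for an empty credential,
        # so fail early instead of sending it.
        raise ValueError("pairing token is empty")
    return json.dumps({"credential": token}).encode("utf-8")


def session_cookie_from(set_cookie_header: str) -> Optional[str]:
    """Extract the t3_session value from one Set-Cookie header, if present."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    morsel = jar.get(T3_COOKIE)
    return morsel.value if morsel is not None else None
```

In practice the dispatcher can relay the original `Set-Cookie` header unchanged; parsing it is mainly useful to confirm that the bootstrap actually produced a session.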
|
||||
|
||||
## Components
|
||||
|
||||
### 1. Per-user systemd template — `t3-serve@.service` (file-permission enforcement)
|
||||
|
||||
Replaces the bespoke `t3-serve.service` + `t3-serve-emo.service` with one template:
|
||||
|
||||
```ini
[Service]
User=%i
Group=%i
Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
# T3_PORT=37xx (assigned by reconcile); systemd does not allow trailing comments
EnvironmentFile=/etc/t3-serve/%i.env
WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure
RestartSec=5
```
|
||||
|
||||
`User=%i` is the enforcement: `t3-serve@wizard` runs as `wizard`, `t3-serve@emo` as `emo`. t3 and the coding agents it launches inherit the uid, so **emo's instance cannot read/write files wizard's uid owns** unless group/world perms allow. Identical guarantee to the terminal's `sudo -u`. Existing port assignments are preserved (wizard=3773, emo=3774) so live sessions aren't disrupted.
|
||||
|
||||
### 2. Reconcile — `t3-provision-users.sh` (data-driven)
|
||||
|
||||
A devvm script (systemd timer mirroring the `apply-mbps-caps.timer` pattern, `OnBootSec` + hourly + `Persistent=true`, plus on-demand run during onboarding). For each `authentik_user=os_user` in `/etc/ttyd-user-map`:
|
||||
- allocate a stable port if unassigned (t3 instances use the 3773+ band: wizard=3773, emo=3774, subsequent users 3775, 3776, …) → write `/etc/t3-serve/<os_user>.env` (`T3_PORT=`). Allocation is sticky (never re-number an existing user).
|
||||
- `systemctl enable --now t3-serve@<os_user>`.
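
The sticky allocation rule can be sketched as a pure function. The real reconcile is a shell script, so this Python version is only an illustration of the logic. Skipping `3780` is an added assumption, since the dispatch service listens there and a long enough user list would otherwise reach it.

```python
from typing import Dict, Iterable

BASE_PORT = 3773
RESERVED_PORTS = {3780}  # assumption: keep the dispatch service's port free


def allocate_ports(os_users: Iterable[str], existing: Dict[str, int]) -> Dict[str, int]:
    """Return user->port, keeping every existing assignment unchanged."""
    ports = dict(existing)
    used = set(ports.values()) | RESERVED_PORTS
    candidate = BASE_PORT
    for user in os_users:
        if user in ports:
            continue  # sticky: never renumber an existing user
        while candidate in used:
            candidate += 1
        ports[user] = candidate
        used.add(candidate)
    return ports
```

Because existing entries are copied first, re-running the reconcile with the same map is a no-op.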
|
||||
|
||||
Sources versioned in `infra/scripts/` (like `apply-mbps-caps.{sh,service,timer}`), deployed to devvm via `scp` (same pattern as the other host scripts).
|
||||
|
||||
### 3. Dispatch + auto-pair — small devvm service
|
||||
|
||||
Replaces the in-cluster nginx `t3-dispatch` (the session-mint needs `sudo` + local base-dir access, so it must live on devvm anyway; consolidating keeps one source of truth and one place for the privileged logic). Fronted by Traefik(Authentik) → K8s Service+Endpoints → this service on devvm at the fixed `10.0.10.10:3780` (outside the 3773+ instance band).
|
||||
|
||||
Per request (Authentik forward-auth has injected a trustworthy `X-authentik-username`):
|
||||
1. Resolve `X-authentik-username` → OS user via `/etc/ttyd-user-map`. No mapping → **403**.
|
||||
2. **Has a valid t3 session cookie?** → reverse-proxy (incl. WebSocket upgrade) to `127.0.0.1:<T3_PORT>`. (Steady state — the common path.)
|
||||
3. **No cookie** (first visit / expired) → auto-pair:
|
||||
- `sudo -u <os_user> t3 auth pairing create --base-dir /home/<os_user>/.t3 --ttl 5m --json` → one-time token.
|
||||
- exchange it at the instance's `POST /api/auth/bootstrap` → capture the returned `Set-Cookie`.
|
||||
- relay that `Set-Cookie` to the browser + `302 /`. Browser now holds the t3 session cookie → next request is the steady-state path. **Login → straight in.**
|
||||
|
||||
Implementation: a small reverse proxy that supports WebSocket upgrade (Go `httputil.ReverseProxy`, or Python aiohttp) — chosen at plan time.
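
The per-request steps above reduce to a three-way decision, sketched here in Go as a pure function. The names `route`, `entry`, and `action` are illustrative assumptions, and how the service decides that a cookie is "valid" is deliberately left to the caller, since the design does not pin that down.

```go
package main

import "fmt"

// entry is the per-user routing target (hypothetical shape).
type entry struct {
	OSUser string
	Port   int
}

type action int

const (
	deny     action = iota // 403: no mapping for this Authentik user
	proxy                  // reverse-proxy to 127.0.0.1:<port>
	autoPair               // mint token, bootstrap, relay Set-Cookie, 302 /
)

// route maps an Authentik username plus cookie state to the action the
// dispatcher should take, following steps 1-3 in order.
func route(users map[string]entry, username string, hasSessionCookie bool) (action, entry) {
	e, ok := users[username]
	switch {
	case !ok:
		return deny, entry{}
	case hasSessionCookie:
		return proxy, e
	default:
		return autoPair, e
	}
}

func main() {
	users := map[string]entry{"vbarzin": {OSUser: "wizard", Port: 3773}}
	a, e := route(users, "vbarzin", false)
	fmt.Println(a == autoPair, e.Port)
}
```

Keeping the decision separate from the I/O makes the 403 / proxy / auto-pair branches testable without a running t3 instance.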
|
||||
|
||||
### 4. Terraform — `stacks/t3code` shrinks
|
||||
|
||||
- Remove the in-cluster nginx `t3-dispatch` (ConfigMap + Deployment + Service).
|
||||
- Add a `Service` + `Endpoints` → `10.0.10.10:3780` (the devvm dispatch service).
|
||||
- Ingress stays `auth = "required"` (Authentik) + CrowdSec, `service_name` → the new dispatch Service.
|
||||
|
||||
### 5. Sudoers — scoped
|
||||
|
||||
A `/etc/sudoers.d/t3-autopair` granting the dispatch service's user **only**:
|
||||
- `t3 auth pairing create --base-dir /home/*/.t3 *`
|
||||
- `systemctl start t3-serve@*` (if lazy-start is later wanted; reconcile already enables them)
|
||||
|
||||
Modeled on the existing `/etc/sudoers.d/ttyd-users`.
|
||||
|
||||
## Data flow
|
||||
|
||||
```
|
||||
phone/browser
|
||||
→ Cloudflare → Traefik (Authentik forward-auth: 302 to SSO if no session)
|
||||
→ [X-authentik-username injected]
|
||||
→ K8s Service/Endpoints → devvm dispatch+autopair :3780
|
||||
map username→os_user (unmapped → 403)
|
||||
cookie? yes → proxy → 127.0.0.1:<T3_PORT> (t3-serve@<u>)
|
||||
no → mint (sudo -u) → /api/auth/bootstrap → Set-Cookie → 302 /
|
||||
```
|
||||
|
||||
## Security
|
||||
|
||||
- **File isolation:** `User=%i` — OS-enforced, the user's explicit requirement.
|
||||
- **Identity gate:** Authentik SSO at the edge; `X-authentik-username` is trustworthy (forward-auth overwrites client-supplied values; unauth never reaches the backend).
|
||||
- **Privilege:** the dispatch service holds a *narrowly scoped* sudoers entry (mint pairing tokens + start `t3-serve@*` only). Minted tokens are 5-min, one-time.
|
||||
- **Blast radius:** unchanged from today — onboarded users only; no new public surface beyond the existing `t3.viktorbarzin.me`.
|
||||
|
||||
## Reboot / persistence
|
||||
|
||||
Each instance's state (paired devices + 30-day sessions) is on-disk SQLite (`/home/<u>/.t3/userdata/state.sqlite`); template instances are `enabled`, so a **reboot** restarts them and reloads state — no re-pair, auto-pair fires only once per device. A devvm **rebuild** loses it (`~/.t3` is not backed up). Optional follow-up: add `/home/*/.t3` to the backup set if rebuild-survival is wanted.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Native app / `app.t3.codes` (cross-origin bearer clients; blocked by Authentik) — deferred until t3 publishes the native app.
|
||||
- Auto-creating OS accounts / Authentik identities (onboarding stays manual + deliberate).
|
||||
- Backing up t3 state (separate decision).
|
||||
- Lazy stop of idle instances (cheap to keep running at this user count).
|
||||
|
||||
## Testing / verification
|
||||
|
||||
- Reconcile is idempotent: re-run leaves ports + units stable; adding a map line provisions a new instance.
|
||||
- `t3-serve@emo` runs as uid `emo` (`ps -o user`), cannot write a wizard-owned file (negative test).
|
||||
- Dispatch: `X-authentik-username: vbarzin` with no cookie → 302 + `Set-Cookie`; with the cookie → 200 proxied (incl. a WS upgrade). Unmapped → 403.
|
||||
- Live: a real Authentik login in a browser lands in the correct per-user workspace; WS connects; a second device auto-pairs without manual token entry.
|
||||
|
||||
## Rollback
|
||||
|
||||
Revert `stacks/t3code` to the nginx `t3-dispatch` (commit `ad9472ab`); the per-user systemd template + reconcile are additive and can be left running or disabled.
|
||||
360
docs/plans/2026-06-01-t3-auto-provision-plan.md
Normal file
|
|
@ -0,0 +1,360 @@
|
|||
# t3code per-user auto-provisioning — Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** An onboarded Authentik user who opens `t3.viktorbarzin.me` lands straight in their own t3 workspace — instance auto-provisioned, t3 session auto-minted+injected, file access bounded to their OS uid.
|
||||
|
||||
**Architecture:** Per-user `t3 serve` instances via a `t3-serve@<osuser>` systemd template (`User=%i` = OS-enforced file perms). A devvm reconcile turns `/etc/ttyd-user-map` entries into running instances. A small devvm dispatch+auto-pair service (behind Traefik+Authentik) routes `X-authentik-username` to the user's instance and, on first visit, mints + injects t3's session cookie. `stacks/t3code` shrinks to ingress + Endpoints → that service.
|
||||
|
||||
**Tech Stack:** systemd templates, bash (reconcile), Go (dispatch service — single static binary, native WS proxy), Traefik/Authentik, Terraform/Terragrunt.
|
||||
|
||||
**Spec:** `infra/docs/plans/2026-06-01-t3-auto-provision-design.md`
|
||||
|
||||
**Conventions:** devvm host artifacts are versioned under `infra/scripts/` and deployed via `scp` (same as `apply-mbps-caps.{sh,service,timer}`); they are NOT Terraform-managed (like the existing `t3-serve` / terminal-lobby). Only the K8s edge is in `stacks/t3code`. Claim presence (`host:devvm`, `stack:t3code`) before mutating. Verify each task before committing.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Discover the t3 web-auth contract (spike — blocks the dispatch service)
|
||||
|
||||
The auto-pair step must speak t3's exact session protocol. Nail these three unknowns before writing the service.
|
||||
|
||||
**Files:** none (investigation; record findings in this task's checkboxes).
|
||||
|
||||
- [ ] **Step 1: Find the session cookie name.** Read `apps/server/src/auth/Layers/SessionCredentialService.ts` (or `Services/SessionCredentialService.ts`) in the t3code repo for `cookieName`. Cross-check live: pair a browser to a t3 instance, inspect the cookie set on `t3.viktorbarzin.me`.
|
||||
Run: `gh api repos/pingdotgg/t3code/contents/apps/server/src/auth/Services/SessionCredentialService.ts --jq '.content' | base64 -d | grep -niE 'cookieName|cookie'`
|
||||
Record: `T3_COOKIE=<value>`.
|
||||
- [ ] **Step 2: Find the bootstrap request shape.** Read how the web UI exchanges a pairing token for a session: `apps/server/src/auth/http.ts` `authBootstrapRouteLayer` + the `AuthBootstrapInput` schema in `packages/contracts`, and `apps/web/src/components/auth/PairingRouteSurface.tsx` / `hostedPairing.ts`.
|
||||
Run: `gh api repos/pingdotgg/t3code/contents/packages/contracts/src/<auth file>.ts --jq '.content' | base64 -d | grep -niE 'Bootstrap|pairing|token'`
|
||||
Record: the exact `POST /api/auth/bootstrap` JSON body (field name(s) for the pairing token).
|
||||
- [ ] **Step 3: Verify the exchange by hand against a live instance.** On devvm:
|
||||
```bash
|
||||
TOK=$(sudo -u wizard t3 auth pairing create --base-dir /home/wizard/.t3 --ttl 5m --json | jq -r '.token // .pairingToken')
|
||||
curl -s -i -XPOST http://127.0.0.1:3773/api/auth/bootstrap \
|
||||
-H 'content-type: application/json' -d "{\"<field>\":\"$TOK\"}" | grep -iE 'set-cookie|HTTP/'
|
||||
```
|
||||
Expected: `HTTP/.. 200` + a `Set-Cookie: <T3_COOKIE>=...`. This confirms the mint→bootstrap→cookie flow the service will automate.
|
||||
- [ ] **Step 4: Commit findings** as a short note appended to the design doc (`## Discovered auth contract` section): `T3_COOKIE`, bootstrap body shape, and the verified curl. Commit `docs(t3code): record discovered t3 web-auth contract`.
|
||||
|
||||
---
|
||||
|
||||
## Task 2: systemd template `t3-serve@.service` (file-permission enforcement)
|
||||
|
||||
**Files:**
|
||||
- Create: `infra/scripts/t3-serve@.service`
|
||||
- Deploy to devvm: `/etc/systemd/system/t3-serve@.service`
|
||||
- Retire: `/etc/systemd/system/t3-serve.service`, `/etc/systemd/system/t3-serve-emo.service`
|
||||
|
||||
- [ ] **Step 1: Write the template unit** (`infra/scripts/t3-serve@.service`):
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=T3 Code server for %i (t3 serve, per-user)
|
||||
Documentation=https://github.com/pingdotgg/t3code
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=%i
|
||||
Group=%i
|
||||
Environment=HOME=/home/%i
|
||||
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
|
||||
Environment=NODE_ENV=production
|
||||
EnvironmentFile=/etc/t3-serve/%i.env
|
||||
WorkingDirectory=/home/%i
|
||||
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Stage the existing users' env files (preserve live ports):**
|
||||
```bash
|
||||
sudo install -d -m 0755 /etc/t3-serve
|
||||
echo 'T3_PORT=3773' | sudo tee /etc/t3-serve/wizard.env
|
||||
echo 'T3_PORT=3774' | sudo tee /etc/t3-serve/emo.env
|
||||
```
|
||||
- [ ] **Step 3: Deploy the template + migrate wizard, then emo (one at a time to limit blast radius):**
|
||||
```bash
|
||||
sudo cp infra/scripts/t3-serve@.service /etc/systemd/system/t3-serve@.service
|
||||
sudo systemctl daemon-reload
|
||||
# wizard: stop old bespoke unit, start template instance
|
||||
sudo systemctl disable --now t3-serve.service
|
||||
sudo systemctl enable --now t3-serve@wizard.service
|
||||
```
|
||||
- [ ] **Step 4: Verify wizard instance runs as wizard on :3773 and serves:**
|
||||
Run:
|
||||
```bash
|
||||
ps -o user= -C t3 | sort -u # expect: wizard (and emo after next step)
|
||||
ss -ltn | grep ':3773'
|
||||
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3773/ # expect 200
|
||||
```
|
||||
Expected: process owned by `wizard`, listening, 200. (wizard's existing pairings persist — same `~/.t3`.)
|
||||
- [ ] **Step 5: Migrate emo the same way:**
|
||||
```bash
|
||||
sudo systemctl disable --now t3-serve-emo.service
|
||||
sudo systemctl enable --now t3-serve@emo.service
|
||||
```
|
||||
- [ ] **Step 6: Verify emo instance runs as emo + file-permission negative test:**
|
||||
Run:
|
||||
```bash
|
||||
ss -ltn | grep ':3774'
|
||||
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3774/ # expect 200
|
||||
sudo -u emo test -w /home/wizard/.t3 && echo WRITABLE || echo "denied (correct)" # expect denied
|
||||
```
|
||||
Expected: 200; emo cannot write wizard's private `~/.t3` (file-perm enforcement proven).
|
||||
- [ ] **Step 7: Commit** `infra/scripts/t3-serve@.service`: `t3code: per-user t3-serve@ systemd template (User=%i file isolation)`.
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Reconcile script + timer (data-driven from `/etc/ttyd-user-map`)
|
||||
|
||||
**Files:**
|
||||
- Create: `infra/scripts/t3-provision-users.sh`, `infra/scripts/t3-provision-users.service`, `infra/scripts/t3-provision-users.timer`
|
||||
- Deploy to devvm: `/usr/local/bin/t3-provision-users`, `/etc/systemd/system/t3-provision-users.{service,timer}`
|
||||
- Writes: `/etc/t3-serve/<u>.env`, `/etc/t3-serve/dispatch.json`
|
||||
|
||||
- [ ] **Step 1: Write `infra/scripts/t3-provision-users.sh`** (idempotent; allocates sticky ports from 3773+, ensures `t3-serve@<u>`, emits the dispatcher map):
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
# Reconcile per-user t3 instances from /etc/ttyd-user-map.
|
||||
# Each "authentik_user=os_user" line → an enabled t3-serve@<os_user> on a
|
||||
# sticky port, plus /etc/t3-serve/dispatch.json (authentik_user → {os_user,port})
|
||||
# consumed by t3-dispatch.
|
||||
set -euo pipefail
|
||||
MAP=/etc/ttyd-user-map
|
||||
ENVDIR=/etc/t3-serve
|
||||
BASE_PORT=3773
|
||||
install -d -m 0755 "$ENVDIR"
|
||||
|
||||
next_port() { # lowest free port >= BASE_PORT not already assigned
|
||||
local used p
|
||||
used=$(grep -hoE 'T3_PORT=[0-9]+' "$ENVDIR"/*.env 2>/dev/null | cut -d= -f2 | sort -n)
|
||||
p=$BASE_PORT
|
||||
while [[ $p -eq 3780 ]] || grep -qx "$p" <<<"$used"; do p=$((p+1)); done  # 3780 is reserved for t3-dispatch
|
||||
echo "$p"
|
||||
}
|
||||
|
||||
declare -A DISPATCH
|
||||
while IFS='=' read -r ak os; do
|
||||
[[ -z "${ak// }" || "$ak" =~ ^[[:space:]]*# ]] && continue
|
||||
ak=$(echo "$ak" | xargs); os=$(echo "$os" | xargs)
|
||||
id "$os" >/dev/null 2>&1 || { logger -t t3-provision "skip $ak: no OS user $os"; continue; }
|
||||
envf="$ENVDIR/$os.env"
|
||||
if [[ ! -f "$envf" ]]; then echo "T3_PORT=$(next_port)" > "$envf"; fi
|
||||
port=$(grep -oE 'T3_PORT=[0-9]+' "$envf" | cut -d= -f2)  # value only; a bare [0-9]+ also matches the "3" in T3_PORT
|
||||
systemctl enable --now "t3-serve@$os.service" >/dev/null 2>&1 || true
|
||||
DISPATCH[$ak]="{\"os_user\":\"$os\",\"port\":$port}"
|
||||
done < "$MAP"
|
||||
|
||||
{ printf '{'; first=1
|
||||
for ak in "${!DISPATCH[@]}"; do
|
||||
[[ $first -eq 0 ]] && printf ','; first=0
|
||||
printf '"%s":%s' "$ak" "${DISPATCH[$ak]}"
|
||||
done; printf '}\n'; } > "$ENVDIR/dispatch.json"
|
||||
logger -t t3-provision "reconcile complete: $(wc -c < "$ENVDIR/dispatch.json") bytes"
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Write the timer + service** (`infra/scripts/t3-provision-users.service`):
|
||||
```ini
|
||||
[Unit]
|
||||
Description=Reconcile per-user t3 instances from /etc/ttyd-user-map
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart=/usr/local/bin/t3-provision-users
|
||||
```
|
||||
`infra/scripts/t3-provision-users.timer`:
|
||||
```ini
|
||||
[Unit]
|
||||
Description=Periodic t3 per-user reconcile
|
||||
[Timer]
|
||||
OnBootSec=2min
|
||||
OnCalendar=hourly
|
||||
Persistent=true
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
```
|
||||
- [ ] **Step 3: Deploy + run once:**
|
||||
```bash
|
||||
sudo install -m 0755 infra/scripts/t3-provision-users.sh /usr/local/bin/t3-provision-users
|
||||
sudo cp infra/scripts/t3-provision-users.service infra/scripts/t3-provision-users.timer /etc/systemd/system/
|
||||
sudo systemctl daemon-reload && sudo systemctl enable --now t3-provision-users.timer
|
||||
sudo /usr/local/bin/t3-provision-users
|
||||
```
|
||||
- [ ] **Step 4: Verify idempotency + output:**
|
||||
Run:
|
||||
```bash
|
||||
cat /etc/t3-serve/dispatch.json | jq . # expect {"vbarzin":{"os_user":"wizard","port":3773},"emil.barzin":{"os_user":"emo","port":3774}}
|
||||
sudo /usr/local/bin/t3-provision-users # run again
|
||||
cat /etc/t3-serve/wizard.env # expect T3_PORT=3773 (unchanged — sticky)
|
||||
```
|
||||
Expected: dispatch.json correct; re-run changes nothing (ports stable).
|
||||
- [ ] **Step 5: Commit** the three files: `t3code: reconcile per-user t3 instances from ttyd-user-map`.
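
The `dispatch.json` check from Step 4 could also be scripted rather than eyeballed. The Go sketch below parses the file content and rejects entries that would confuse the dispatcher. `checkDispatch` is a hypothetical helper, and the specific rules it applies (non-empty `os_user`, port at or above 3773, not 3780, no duplicates) are assumptions derived from the port conventions in this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type entry struct {
	OSUser string `json:"os_user"`
	Port   int    `json:"port"`
}

// checkDispatch parses dispatch.json content and rejects empty OS users,
// ports below the instance band, the dispatcher's own port, and duplicates.
func checkDispatch(raw []byte) (map[string]entry, error) {
	table := map[string]entry{}
	if err := json.Unmarshal(raw, &table); err != nil {
		return nil, fmt.Errorf("parse: %w", err)
	}
	owner := map[int]string{}
	for user, e := range table {
		if e.OSUser == "" || e.Port < 3773 || e.Port == 3780 {
			return nil, fmt.Errorf("%s: invalid entry %+v", user, e)
		}
		if prev, dup := owner[e.Port]; dup {
			return nil, fmt.Errorf("port %d assigned to both %s and %s", e.Port, prev, user)
		}
		owner[e.Port] = user
	}
	return table, nil
}

func main() {
	raw := []byte(`{"vbarzin":{"os_user":"wizard","port":3773},"emil.barzin":{"os_user":"emo","port":3774}}`)
	table, err := checkDispatch(raw)
	fmt.Println(len(table), err)
}
```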
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Dispatch + auto-pair service (Go, devvm)
|
||||
|
||||
**Files:**
|
||||
- Create: `t3-dispatch/main.go`, `t3-dispatch/go.mod`
|
||||
- Create: `infra/scripts/t3-dispatch.service` → `/etc/systemd/system/t3-dispatch.service`
|
||||
- Deploy binary to devvm: `/usr/local/bin/t3-dispatch`
|
||||
|
||||
Uses `T3_COOKIE` + bootstrap body from Task 1. Listens `:3780`. Reads `/etc/t3-serve/dispatch.json`.
|
||||
|
||||
- [ ] **Step 1: Write `t3-dispatch/main.go`** (substitute `<T3_COOKIE>` and the bootstrap field from Task 1):
|
||||
|
||||
```go
|
||||
package main
|
||||
|
||||
import (
|
||||
"bytes"; "encoding/json"; "fmt"; "log"; "net/http"; "net/http/httputil"
|
||||
"net/url"; "os"; "os/exec"; "sync"; "time"
|
||||
)
|
||||
|
||||
type entry struct{ OsUser string `json:"os_user"`; Port int `json:"port"` }
|
||||
const cookieName = "<T3_COOKIE>" // from Task 1
|
||||
const listenAddr = ":3780"
|
||||
const dispatchFile = "/etc/t3-serve/dispatch.json"
|
||||
|
||||
var mu sync.RWMutex
|
||||
var table map[string]entry
|
||||
|
||||
func loadTable() error {
|
||||
b, err := os.ReadFile(dispatchFile); if err != nil { return err }
|
||||
m := map[string]entry{}; if err := json.Unmarshal(b, &m); err != nil { return err }
|
||||
mu.Lock(); table = m; mu.Unlock(); return nil
|
||||
}
|
||||
|
||||
func lookup(ak string) (entry, bool) { mu.RLock(); defer mu.RUnlock(); e, ok := table[ak]; return e, ok }
|
||||
|
||||
func proxyTo(port int) *httputil.ReverseProxy { // ReverseProxy handles WS upgrade transparently
|
||||
u, _ := url.Parse(fmt.Sprintf("http://127.0.0.1:%d", port)); return httputil.NewSingleHostReverseProxy(u)
|
||||
}
|
||||
|
||||
func autoPair(e entry, w http.ResponseWriter, r *http.Request) {
|
||||
out, err := exec.Command("sudo", "-n", "-u", e.OsUser, "t3", "auth", "pairing", "create",
|
||||
"--base-dir", "/home/"+e.OsUser+"/.t3", "--ttl", "5m", "--json").Output()
|
||||
if err != nil { http.Error(w, "pairing mint failed", 500); log.Printf("mint %s: %v", e.OsUser, err); return }
|
||||
var pc struct{ Token string `json:"token"` } // adjust field per Task 1
|
||||
if json.Unmarshal(out, &pc) != nil || pc.Token == "" { http.Error(w, "bad pairing output", 500); return }
|
||||
body, _ := json.Marshal(map[string]string{"token": pc.Token}) // adjust body per Task 1
|
||||
resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d/api/auth/bootstrap", e.Port),
|
||||
"application/json", bytes.NewReader(body))
|
||||
if err != nil { http.Error(w, "bootstrap failed", 502); return }
|
||||
defer resp.Body.Close()
|
||||
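if resp.StatusCode != http.StatusOK || len(resp.Cookies()) == 0 { http.Error(w, "bootstrap rejected", 502); return } // no cookie: do not 302 into a loop
for _, c := range resp.Cookies() { http.SetCookie(w, c) }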
|
||||
http.Redirect(w, r, "/", http.StatusFound)
|
||||
}
|
||||
|
||||
func handler(w http.ResponseWriter, r *http.Request) {
|
||||
ak := r.Header.Get("X-authentik-username")
|
||||
e, ok := lookup(ak)
|
||||
if !ok { http.Error(w, "no t3 instance for this user", http.StatusForbidden); return }
|
||||
if _, err := r.Cookie(cookieName); err != nil { autoPair(e, w, r); return }
|
||||
proxyTo(e.Port).ServeHTTP(w, r)
|
||||
}
|
||||
|
||||
func main() {
|
||||
if err := loadTable(); err != nil { log.Fatalf("load %s: %v", dispatchFile, err) }
|
||||
go func() { for range time.Tick(60 * time.Second) { _ = loadTable() } }() // pick up reconcile changes
|
||||
http.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request){ w.Write([]byte("ok")) })
|
||||
http.HandleFunc("/", handler)
|
||||
log.Printf("t3-dispatch on %s", listenAddr)
|
||||
log.Fatal(http.ListenAndServe(listenAddr, nil))
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2: `t3-dispatch/go.mod`:** `module t3-dispatch` / `go 1.22`. Build: `cd t3-dispatch && GOOS=linux GOARCH=amd64 go build -o t3-dispatch .`
|
||||
- [ ] **Step 3: Write `infra/scripts/t3-dispatch.service`:**
|
||||
```ini
|
||||
[Unit]
|
||||
Description=t3 per-user dispatch + auto-pair (X-authentik-username → user instance)
|
||||
After=network.target
|
||||
[Service]
|
||||
Type=simple
|
||||
User=wizard
|
||||
ExecStart=/usr/local/bin/t3-dispatch
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
|
||||
(Runs as `wizard`; the scoped sudoers in Task 5 lets it mint per-user tokens.)
|
||||
- [ ] **Step 4: Deploy + start:**
|
||||
```bash
|
||||
scp t3-dispatch/t3-dispatch wizard@10.0.10.10:/tmp/t3-dispatch
|
||||
sudo install -m 0755 /tmp/t3-dispatch /usr/local/bin/t3-dispatch
|
||||
sudo cp infra/scripts/t3-dispatch.service /etc/systemd/system/
|
||||
sudo systemctl daemon-reload && sudo systemctl enable --now t3-dispatch.service
|
||||
```
|
||||
- [ ] **Step 5: Verify routing + auto-pair locally (before Task 5 sudoers, expect mint to 500; after Task 5, 302):**
|
||||
Run:
|
||||
```bash
|
||||
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3780/healthz # 200
|
||||
curl -s -o /dev/null -w '%{http_code}\n' -H 'X-authentik-username: nobody' http://localhost:3780/ # 403
|
||||
curl -s -o /dev/null -w '%{http_code}\n' -H 'X-authentik-username: vbarzin' http://localhost:3780/ # 302 (after Task 5)
|
||||
```
|
||||
Expected (post-Task-5): healthz 200, unmapped 403, mapped-no-cookie 302 with Set-Cookie.
|
||||
- [ ] **Step 6: Commit** `t3-dispatch/` + `infra/scripts/t3-dispatch.service`: `t3code: devvm dispatch + auto-pair service`.
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Scoped sudoers
|
||||
|
||||
**Files:** Create `infra/scripts/sudoers-t3-autopair` → deploy `/etc/sudoers.d/t3-autopair` (mode 0440).
|
||||
|
||||
- [ ] **Step 1: Write the sudoers fragment** (modeled on `/etc/sudoers.d/ttyd-users`):
|
||||
```
|
||||
# t3-dispatch (runs as wizard) may mint per-user t3 pairing tokens only.
|
||||
wizard ALL=(wizard,emo) NOPASSWD: /usr/bin/t3 auth pairing create --base-dir /home/*/.t3 --ttl 5m --json
|
||||
```
|
||||
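(sudoers does not expand systemd's `%i`; a leading `%` in a run-as list names a Unix group, so the run-as users have to be enumerated as `(wizard,emo)` or collected in a `Runas_Alias`, and extended whenever a user is onboarded.)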
|
||||
- [ ] **Step 2: Deploy + validate syntax:**
|
||||
```bash
|
||||
sudo install -m 0440 infra/scripts/sudoers-t3-autopair /etc/sudoers.d/t3-autopair
|
||||
sudo visudo -cf /etc/sudoers.d/t3-autopair # expect: parsed OK
|
||||
```
|
||||
- [ ] **Step 3: Verify the dispatch service can now mint (re-run Task 4 Step 5 mapped case):** expect `vbarzin` → 302 + `Set-Cookie`.
|
||||
- [ ] **Step 4: Commit** `infra/scripts/sudoers-t3-autopair`: `t3code: scoped sudoers for dispatch auto-pair`.
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Terraform — repoint `stacks/t3code` at the devvm dispatcher
|
||||
|
||||
**Files:** Modify `stacks/t3code/main.tf`.
|
||||
|
||||
- [ ] **Step 1: Remove** the in-cluster nginx (`kubernetes_config_map_v1.t3_dispatch`, `kubernetes_deployment_v1.t3_dispatch`, `kubernetes_service_v1.t3_dispatch`, the `locals.t3_dispatch_nginx_conf`). **Add** a `kubernetes_service` `t3` (port 80) + `kubernetes_endpoints` `t3` → `10.0.10.10:3780`. Keep `module.ingress` `auth = "required"`, `service_name = "t3"`.
|
||||
- [ ] **Step 2: Plan** — expect: 3 nginx resources destroyed, service+endpoints created, ingress backend `t3-dispatch`→`t3`:
|
||||
```bash
|
||||
cd stacks/t3code && ../../scripts/tg plan 2>&1 | grep -E 'will be|^Plan:'
|
||||
```
|
||||
- [ ] **Step 3: Claim presence + apply:**
|
||||
```bash
|
||||
~/code/scripts/presence claim stack:t3code --purpose "repoint t3 ingress at devvm dispatch+autopair"
|
||||
cd stacks/t3code && ../../scripts/tg apply --non-interactive
|
||||
```
|
||||
- [ ] **Step 4: Verify live end-to-end:**
|
||||
```bash
|
||||
curl -sk -o /dev/null -w '%{http_code}\n' https://t3.viktorbarzin.me/ # 302 → Authentik (gate intact)
|
||||
```
|
||||
Then a real browser login as Viktor → lands in wizard's workspace, WS connects, no manual pairing. (Cannot be fully curl-tested without an Authentik session — confirm in-browser.)
|
||||
- [ ] **Step 5: Commit** `stacks/t3code/main.tf`: `t3code: ingress → devvm dispatch+autopair (retire in-cluster nginx)`.
|
||||
|
||||
---
|
||||
|
||||
## Task 7: Docs, memory, push
|
||||
|
||||
- [ ] **Step 1:** Update `.claude/reference/service-catalog.md` t3code row: dispatcher is now the devvm `t3-dispatch` service (+ auto-pair); add-a-user = one `/etc/ttyd-user-map` line → reconcile.
|
||||
- [ ] **Step 2:** Update design doc status → `implemented`. Append the Task 1 discovered auth-contract note if not already.
|
||||
- [ ] **Step 3:** `memory_update` id 3085 (dispatcher: in-cluster nginx → devvm t3-dispatch + auto-pair + reconcile).
|
||||
- [ ] **Step 4:** Commit docs; push all commits to `origin/master`. If the shared working tree is dirty from another session, push via the git-crypt-disabled detached worktree (see memory ids 3533-3535). Wait for Woodpecker CI on `stacks/t3code`.
|
||||
|
||||
---
|
||||
|
||||
## Self-review notes
|
||||
- **Spec coverage:** source-of-truth (T3) ✓ Task 3; file-perm enforcement (User=%i) ✓ Task 2; reconcile ✓ Task 3; dispatch+auto-pair ✓ Task 4; sudoers ✓ Task 5; TF shrink ✓ Task 6; reboot persistence (units enabled) ✓ Task 2/3. Out-of-scope items not implemented (correct).
|
||||
- **Discovery-dependent:** the dispatch service's `cookieName` + bootstrap body are placeholders resolved in Task 1 before Task 4 coding — flagged inline, not left vague.
|
||||
- **Ports:** instances 3773+ (sticky), dispatcher fixed 3780 — consistent across Tasks 2/3/4/6.
|
||||
156
docs/plans/2026-06-01-topolvm-evaluation.md
Normal file
|
|
@ -0,0 +1,156 @@
|
|||
# TopoLVM Migration Evaluation
|
||||
|
||||
**Date**: 2026-06-01
|
||||
**Status**: ❌ NOT ADOPTED — superseded 2026-06-05.
|
||||
**Decision**: **Rejected in favour of option ① (harden proxmox-csi + NFS)** — TopoLVM pins PVCs to a node, which loses the cross-node pod mobility Viktor requires (a node going down must let pods reschedule elsewhere), and Option C's hardware spend was declined. Longhorn was also rejected (replication is 2× write-amplification on the single shared sdc HDD, with no DR benefit on a single host). See `2026-06-05-block-storage-harden-nfs-design.md` for the chosen path and full rationale. This doc is retained for its analysis — the LUN-cap mechanics, the three disk-layout options, and the effort estimate remain accurate reference if a second physical host is ever added (which would revive the Longhorn/replication option).
|
||||
|
||||
## Problem statement
|
||||
|
||||
The cluster's block storage hits a **hardcoded 29-PVC-per-VM ceiling** in `sergelogvinov/proxmox-csi-plugin` (`pkg/csi/utils.go:394`, `for lun = 1; lun < 30; lun++`). The plugin scans Proxmox SCSI indices `scsi1..scsi29`; when all are taken, `ControllerPublishVolume` returns `Internal desc = no free lun found`. We hit this on 2026-05-26 with 4 stuck PVCs on k8s-node1 and responded by scaling from 4 → 6 worker VMs.
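
To make the ceiling concrete, here is a minimal Go sketch of a first-free-slot scan using the same loop bounds as the line quoted above. It is not the plugin's code: `firstFreeLUN` and the `used` map are hypothetical, and only the `1..29` range and the error text are taken from this document.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoFreeLUN = errors.New("no free lun found")

// firstFreeLUN tries scsi1..scsi29 (lun < 30) and gives up when all 29
// indices are occupied, mirroring the loop bounds quoted above.
func firstFreeLUN(used map[int]bool) (int, error) {
	for lun := 1; lun < 30; lun++ {
		if !used[lun] {
			return lun, nil
		}
	}
	return 0, errNoFreeLUN
}

func main() {
	used := map[int]bool{}
	attached := 0
	for {
		lun, err := firstFreeLUN(used)
		if err != nil {
			break // the 30th attach attempt fails here
		}
		used[lun] = true
		attached++
	}
	fmt.Println("volumes attached before failure:", attached) // prints 29
}
```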
|
||||
|
||||
Path 1 (patch the plugin to `lun < 31`) buys +1 slot per VM. Path 2 (NFS-migrate non-DB workloads) buys 20-30 PVCs of headroom. Both are tactical. This doc evaluates **Path 3 — replace the CSI driver with TopoLVM**, which removes the cap permanently by changing the storage architecture from "PVE-host LVM-thin + SCSI hotplug" to "per-VM LVM-thin + local provisioning".
|
||||
|
||||
## What TopoLVM is
|
||||
|
||||
CSI driver from cybozu-go. Each K8s node runs an `lvmd` daemon managing one or more LVM volume groups. The CSI controller creates `LogicalVolume` CRDs; `topolvm-node` on the target node reconciles them by asking `lvmd` to `lvcreate` an LV in the chosen VG. The LV is mounted directly on the node (no virtio-scsi hotplug). PVCs are LV slices, not separate SCSI devices — there is no per-VM cap beyond kernel LV count limits (effectively thousands).
|
||||
|
||||
Mature project, used in production by Cybozu and others. Supports:
|
||||
- Thin provisioning (`type: thin` device class with overprovision ratio)
|
||||
- Multiple device classes per node (e.g., one for SSD, one for HDD)
|
||||
- CSI VolumeSnapshot CRDs (thin-provisioned volumes only; restore pinned to source node)
|
||||
- Online volume expansion (ext4, xfs, btrfs)
|
||||
- Striping and RAID via `lvcreate-options`
|
||||
|
||||
## The big architectural trade-off — read this first
|
||||
|
||||
| Aspect | proxmox-csi (today) | TopoLVM |
|
||||
|---|---|---|
|
||||
| Storage location | PVE-host thin pool (sdc) | Per-VM thin pool on a dedicated disk |
|
||||
| Per-VM PVC cap | **29** (plugin source) | None (kernel LV limits, thousands) |
|
||||
| **PVC mobility** | **Migrates between VMs** — CSI re-attaches LV to wherever the pod schedules | **Pinned to one node** via the `topology.topolvm.io/node` label (`topology.topolvm.cybozu.com/node` in older releases) |
|
||||
| Failure recovery | Pod reschedules to another VM, PVC follows | Pod can only restart on the same node; if the node dies, data is on the dead node |
|
||||
| IO contention | All VMs share sdc thin pool | Each VM's pool is on its own disk (which may still share underlying physical media) |
|
||||
| Snapshot mechanism | PVE-host `lvm-pvc-snapshot` script (custom) | CSI VolumeSnapshot CRDs (standard) |
|
||||
| Encryption | LUKS via Proxmox CSI `extraParameters` + ESO-synced secret | LUKS via `csi.storage.k8s.io/{node-stage,node-expand}-secret` — same pattern, different secret target |
|
||||
| Backup pipeline | sda → Synology via `daily-backup` script that mounts LVM snapshots on PVE | Same idea but snapshots live inside K8s VMs; backup script would need to run on each VM (or use CSI snapshot → object store) |
|
||||
| Operational model | "Storage is a shared pool, VMs are cattle" | "Storage is per-node, like local-path with LVM features" |
|
||||
|
||||
**Data mobility is the most important difference.** Today, when k8s-node1 is drained for maintenance, all its PVC pods reschedule to other nodes and the proxmox-csi controller detaches/re-attaches the LVs accordingly. With TopoLVM, draining a node means **the PVC data is still on that node's local disk** — pods cannot start elsewhere until either (a) the data is migrated, or (b) the node returns.
|
||||
|
||||
For Viktor's setup specifically:
|
||||
- **Pro**: the underlying PVE host is a single point of failure anyway (192.168.1.127). If the host dies, all VMs and all storage die together. The "mobility" of proxmox-csi is partially illusory at the homelab scale — the data isn't actually mobile across physical machines.
|
||||
- **Con**: VM-level failures (kernel panic, OOM, manual qm shutdown for maintenance) DO happen routinely. Today, the pod just reschedules; with TopoLVM, you wait for the VM to recover or you accept downtime.
|
||||
- **Mitigation**: For services that already have replication built in (CNPG Postgres cluster has 3 replicas, Redis-v2 has 3, Vault has 3-node Raft), the data-locality penalty is minimal — one replica's local LV being unavailable triggers a re-replication elsewhere. The PAIN is concentrated in single-replica stateful services: MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, all the SQLite-backed services.
|
||||
|
||||
## Disk layout — three options
|
||||
|
||||
TopoLVM needs a dedicated LVM VG per node. Three ways to provision it:
|
||||
|
||||
### Option A — Carve from sdc (HDD), one VG per VM
|
||||
|
||||
Add a second virtual disk to each K8s VM, sized for its expected PVC load. The disk lives on the existing sdc thin pool. Format as LVM PV → its own VG → TopoLVM thin pool.
|
||||
|
||||
- **Sizing**: rough math from session-1 audit: 1.2 TB total LV allocation across 76 PVCs. Add 30% headroom = 1.6 TB. Distribute by current node placement:
|
||||
- node1: Prometheus (433G) + others ≈ 600-700 GiB → **768 GiB disk**
|
||||
- node2: Loki (50G) + smaller DBs ≈ 200 GiB → **256 GiB disk**
|
||||
- node3: MySQL standalone + Immich PG + several DBs ≈ 200 GiB → **256 GiB disk**
|
||||
- node4: smaller → **256 GiB disk**
|
||||
- node5: smaller → **256 GiB disk**
|
||||
- node6: Nextcloud + Vaultwarden + mailserver + small DBs ≈ 200 GiB → **256 GiB disk**
|
||||
- **Total: ~2 TiB** carved from sdc thin pool (currently 66% used, 3.5 TiB free)
|
||||
- **Pro**: simplest physical change, no hardware needed, just `qm set <vmid> --scsiN local-lvm:NNN`
|
||||
- **Con**: IO contention on sdc unchanged. The 6 thin pools all sit on the same HDD physical layer. Storms hit harder because there's no inter-pool isolation at the LVM level.
|
||||
|
||||
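The Option A sizing arithmetic can be sanity-checked with a short sketch. The figures are copied from the sizing list; the helper names are illustrative, and treating the audit figure as decimal TB is an assumption the source does not state.

```python
# Figures copied from the Option A sizing list; helper names are illustrative.
ALLOCATED_TB = 1.2   # total LV allocation reported by the audit (assumed decimal TB)
HEADROOM = 0.30      # 30% growth margin

# Proposed per-VM data disk sizes, in GiB.
DISK_GIB = {
    "node1": 768,
    "node2": 256,
    "node3": 256,
    "node4": 256,
    "node5": 256,
    "node6": 256,
}


def required_tb(allocated_tb: float, headroom: float) -> float:
    """Capacity needed once headroom is added to the current allocation."""
    return allocated_tb * (1 + headroom)


def total_disk_gib(disks: dict[str, int]) -> int:
    """Sum of all per-VM data disks."""
    return sum(disks.values())


def gib_to_tb(gib: float) -> float:
    """Convert binary GiB to decimal TB so both sides use the same unit."""
    return gib * 2**30 / 10**12


if __name__ == "__main__":
    need = required_tb(ALLOCATED_TB, HEADROOM)
    have = gib_to_tb(total_disk_gib(DISK_GIB))
    print(f"required: {need:.2f} TB, provisioned: {have:.2f} TB")
```

Under these assumptions the six disks add up to 2048 GiB (2 TiB), which covers the 1.56 TB requirement and fits inside the 3.5 TiB the list reports as free on the sdc thin pool.
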
### Option B — Move hot workloads to sdb (SSD), keep cold on sdc

Use a hybrid layout:
- Per-VM SSD disk (sdb, 931 GB total, ~675 GB free) for hot DBs
- Per-VM HDD disk (sdc) for cold/bulk

TopoLVM supports multiple device classes per node — each VM would have an `ssd-thin` and `hdd-thin` class.

- **Pro**: separates hot/cold IO; SSD-backed DBs are dramatically faster; partial IO-contention relief on sdc
- **Con**: 675 GB SSD has to host DBs across 6 VMs (~112 GB each, tight). Need to identify which PVCs are hot. The encrypted PVCs (45 currently) are mostly DBs and would be the SSD candidates.

### Option C — Add a second physical disk for storage

Add a real SSD (e.g., a 2 TB NVMe) to the PVE host. Carve per-VM disks from it for TopoLVM. Keep sdc for VM root + nfs-data only.

- **Pro**: cleanest physical isolation. Solves both LUN cap AND IO contention (the underlying beads `code-oflt` task).
- **Con**: hardware investment. ~£200 for a 2 TB NVMe. Requires PVE host downtime to install. Existing PVE has 2 SATA ports used (sda + sdb) + M.2 slot (might be in use, need to check). LVM/thin pool setup is straightforward.

## Migration approach

Same pattern as the 2026-05-26 Wave 1 NFS migration, multiplied across more PVCs:

1. **Install TopoLVM alongside proxmox-csi** — both run in parallel; new StorageClass `topolvm-provisioner` and `topolvm-provisioner-encrypted` created without touching existing PVCs
2. **Per-VM data disk provisioning** — `qm set <vmid> --scsi8 local-lvm:NNN`, add `vgcreate` + `lvcreate` per VM (one-time)
3. **lvmd config per node** — Helm values point to the right VG per node
4. **Pilot migration** — pick a small, low-criticality PVC (e.g., a single-replica config-only service). Run the same scale-to-0 → rsync helper → swap claim_name → apply pattern from Wave 1. Validate.
5. **Phased rollout** — migrate PVCs in batches by criticality:
   - Wave A: regenerable / cache (5-10 PVCs, low risk)
   - Wave B: app config PVCs with SQLite (15-20 PVCs, blip per service)
   - Wave C: medium DBs (Postgres, MySQL, Redis with replicas) (10-15 PVCs)
   - Wave D: critical singletons (Vaultwarden, Nextcloud, mailserver, MySQL standalone) (5-10 PVCs)
   - Wave E: huge ones (Prometheus, Loki, Forgejo) (3-5 PVCs)
6. **Rewrite backup pipeline** — current `daily-backup` mounts LVM snapshots on PVE host; new flow needs to either (a) run snapshot logic inside each K8s VM via DaemonSet, or (b) use CSI VolumeSnapshot CRDs + an external-snapshotter → restic/borg backend
7. **Deprecate proxmox-csi** — once all PVCs migrated, remove the Helm release and the `proxmox-lvm` / `proxmox-lvm-encrypted` StorageClasses
8. **Update docs** — `docs/architecture/storage.md`, `CLAUDE.md`, ingress factory references, several runbooks

## Effort estimate

| Phase | Time | Notes |
|-------|------|-------|
| Decision + Option A/B/C pick | 1 day | Includes any hardware ordering for Option C |
| TopoLVM install + lvmd config | 1 day | Helm chart, secrets, RBAC, test on one node first |
| Per-VM data disk provisioning | 0.5 day | Six VMs; coordinate with kubelet restart |
| Encrypted PVC LUKS plumbing | 1 day | Verify the ExternalSecret pattern works with TopoLVM's secret refs |
| Pilot migration (1 PVC) | 0.5 day | Includes rollback rehearsal |
| Waves A-D migrations (~45 PVCs) | 5-7 days | ~20 min per PVC like Wave 1, plus verification |
| Wave E (huge PVCs) | 2-3 days | Prometheus 433 GiB will take hours to rsync; needs careful staging |
| Backup pipeline rewrite | 2-3 days | Snapshot-driven backup is a different model; testing |
| Deprecation + cleanup | 1 day | Remove proxmox-csi, update SCs, update docs |
| Docs + runbook updates | 1 day | storage.md, scale runbook, CLAUDE.md, post-mortems for incidents during migration |

**Total: ~2.5-3 weeks of focused infra time.** Could stretch over a quarter if done alongside other work.

## Risks

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Data loss during PVC migration | Low | Rsync with `--checksum`, verify before deleting source, keep proxmox-csi running until each migration validates |
| Data-locality penalty during VM reboot | High | Reboot one VM at a time; multi-replica services handle it; single-replica = brief downtime (same as today for kured-driven reboots, but more frequent in TopoLVM model) |
| LUKS encryption plumbing different from current | Medium | Pilot encrypted PVC migration before committing |
| Backup pipeline regression | High | Keep old `daily-backup` running until new pipeline proven for ≥2 weeks |
| Snapshot semantics change (restore pinned to source node) | Medium | Document; not a blocker for normal use but matters for cross-VM restore scenarios |
| TopoLVM does not solve IO contention | Certain (unless Option C) | Beads `code-oflt` remains open as a separate task |
| Migration window for huge PVCs (Prometheus 433G) | Medium | Stage during low-traffic period; use rsync with checkpoint resumption |
| Surprise incompatibility (Kyverno policy, Authentik, etc.) | Low | Pilot catches most |
| Reverse migration if we change our mind | Medium | Always possible via the same rsync pattern, but tedious |

## Decision criteria

Pick TopoLVM (any option) if:
- We hit the LUN cap repeatedly (≥2 incidents in 6 months)
- We want to fix IO contention at the same time (then Option C only)
- We're comfortable with single-node data locality

Stay on proxmox-csi if:
- The Path 1 + 2 combo gives us enough headroom for the foreseeable future
- We value data mobility (any-pod-can-run-anywhere) over architectural cleanliness
- The migration cost (3 weeks) outweighs the LUN-cap risk over the next year

## Recommended next steps if pursuing

1. **Run a small pilot first** — install TopoLVM on one node (k8s-node5 or node6 since they're newest and have less critical workloads), provision a 50 GB data disk, create a test PVC, migrate one tiny non-critical PVC, verify the operational pattern works end-to-end before committing to full migration
2. **Pick Option A or C** — Option B is too SSD-constrained for the encrypted PVC volume we have
3. **Order hardware if Option C** — NVMe + a hot-swap caddy or M.2 adapter; verify PVE host has the slot
4. **Schedule a 3-week window** — partition the migration waves around other infra commitments; flag in beads as a P1

## Related

- `docs/architecture/storage.md` — current storage architecture
- `docs/runbooks/scale-k8s-cluster.md` — current scaling playbook (Path 1+2 alternative)
- `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md` — IO contention is the related-but-separate concern
- Beads `code-oflt` — IO isolation long-term fix (Option C would close this)
- Remote memory id=2788 — proxmox-csi-plugin LUN cap explanation

113
docs/plans/2026-06-01-wealth-dashboard-consolidation-design.md
Normal file
@ -0,0 +1,113 @@
# Wealth Dashboard Consolidation — Design (2026-06-01)

## Goal

The `wealth` Grafana dashboard (UID `wealth`) has grown to **36 panels** with
heavy duplication. Consolidate to **~17 panels with ZERO metric loss** by
merging redundant panels, and fix the projection's empty-by-default problem.
Philosophy (user-locked): *merge duplicates, keep every metric* — no metric the
user tracks today is removed.

## Current state — 36 panels, duplication clusters

| Cluster | Panels today | Issue |
|---|---|---|
| **1. NW/contribution/growth over time** | "Net worth — total over time", "Net contribution vs market value", "Growth (market value − contribution) over time" | All restate `NW = contribution + growth` |
| **2. Returns/deltas stat cards** | "12mo return/contrib/gain" (3) + "Δ 1d/7d/30d/90d" × (all/mkt) (8) = 11 cards | Same idea, many windows |
| **3. Net pay vs market gain** | "…cumulative", "…per year", "…per month" (3) | Same comparison, 3 grains |
| **4. Yearly bars** | "Yearly investment return %" + "Annual change decomposition" (2) | Same yearly data, two encodings |
| Projection row (5) | text + 3 stats + projection chart | Stats duplicate Overview; chart empty by default (shared time-range) |

## Target layout — collapsed rows

### Row: Overview (expanded by default)
- **Keep** 4 snapshot stats: Net worth · Net contribution · Growth · ROI%.
- **NEW "Returns" table** ← merges cluster 2 (11 cards). `table` panel: one row
  per window (1d / 7d / 30d / 90d / 12mo), columns **Δ all £ · Δ market £ ·
  return %**. Reuses the existing per-window latest-vs-N-days-ago SQL, UNION'd
  into 5 rows. Preserves every value (12mo contrib = Δall − Δmkt) and adds
  return-% for the short windows.

### Row: Net worth over time
- **NEW merged timeseries** ← cluster 1: two lines — `net_contribution` and
  `total_value` (market value) — with the **growth gap shaded** (fillBelowTo /
  area between). Optionally a 3rd faint "growth" line (= total_value −
  net_contribution). Reuses the "Net contribution vs market value" query.
- **Keep** "Per-account stacked — total value" · "Cash vs invested (stacked)".

### Row: Returns & contributions
- **NEW yearly combo** ← cluster 4: timeseries panel, `contributions` +
  `market_gain` as **bars** (drawStyle=bars via per-series override) + a
  **`return_pct` line on a right Y-axis**. One query returns
  `year, contributions, market_gain, return_pct` (merges the two existing
  yearly queries — both already share the `yearly`/`ep` CTEs).
- **Keep** "Monthly contributions vs market gain" · "Per-account ROI %".

### Row: Income vs market
- **NEW merged "Net pay vs market gain"** ← cluster 3: one timeseries + a
  **`$grain` custom variable** (`cumulative` / `yearly` / `monthly`). The rawSql
  switches bucketing on `$grain`. Default `cumulative`.

### Row: Holdings — **Keep** Positions · Activity log
### Row: RSUs (META) — **Keep** vest cadence · realized PNL

### Row: Projections (rebuilt)
- **Rebuild the projection chart as a Trend panel** (`type: trend`): numeric
  x-axis = **years from today** (0…`$horizon_years`), y = Low / Base / High /
  Historical / "Base, no new contributions". The Trend panel renders smooth
  multi-series lines on a numeric x — **independent of the dashboard time
  range** — so it is ALWAYS visible (fixes empty-by-default). SQL: same FV math
  as today, but emit `m.n/12.0 AS years_from_now` instead of a timestamp; format
  `table`; panel `xField = years_from_now`. Carry over the dashed/dotted line
  overrides + GBP unit.
- **Drop** the 3 projection-row stat cards (NW today / Historical return /
  Monthly contribution) — already in Overview (return table + snapshot). **Keep**
  the "How to view" text panel only if still useful (with Trend it's no longer
  needed — drop it too). **Keep** the 5 template vars (rate_low/base/high,
  monthly_contribution, horizon_years).

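The row shape the Trend panel needs (a numeric `years_from_now` x value plus one y value per month) can be sketched in Python. The dashboard's actual FV SQL is not shown in this doc, so the compounding rule below (monthly compounding at `annual_rate / 12`, contribution added at the end of each month) is an assumption, and `projection_rows` is a hypothetical helper name.

```python
def projection_rows(
    nw_today: float,
    monthly_contribution: float,
    annual_rate: float,
    horizon_years: int,
) -> list[tuple[float, float]]:
    """Return (years_from_now, projected_value) pairs, one per month.

    Assumption: monthly compounding at annual_rate / 12 with the
    contribution added at month end. The real dashboard query may differ.
    """
    monthly_rate = annual_rate / 12
    value = nw_today
    rows = []
    for n in range(horizon_years * 12 + 1):
        # Mirrors `m.n/12.0 AS years_from_now`: month index on a numeric axis.
        rows.append((n / 12.0, value))
        value = value * (1 + monthly_rate) + monthly_contribution
    return rows


if __name__ == "__main__":
    for years, value in projection_rows(100_000.0, 1_000.0, 0.05, 2)[::12]:
        print(f"{years:.0f}y  £{value:,.0f}")
```

Because the x value is a plain number rather than a timestamp, nothing in the result depends on the dashboard time picker, which is the property the rebuilt panel relies on.
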
## Panel count: 36 → ~17
4 snapshot + returns table + nw-over-time + per-account + cash-vs-invested +
yearly-combo + monthly-contrib + per-account-ROI + net-pay(merged) + positions +
activity-log + meta-cadence + meta-pnl + projection-trend = **~17**.

## Merge SQL notes (validate each against live wealth-pg before deploy)
- **Returns table**: 5 `SELECT`s (one per window) UNION ALL, each computing
  `Δall = nw_now − nw_{ago}`, `Δmkt = Δall − (contrib_now − contrib_{ago})`,
  `ret% = Δmkt / (nw_{ago} + 0.5·Δcontrib)·100` (Modified Dietz, the existing
  formula). Window→interval: 1d/7d/30d/90d/12mo.
- **Yearly combo**: extend the "Annual change decomposition" query (already has
  `contributions`, `market_gain` per year) to also emit `return_pct` (the
  "Yearly investment return %" formula) — same `ep` CTE.
- **Net-pay `$grain`**: one query; `cumulative` = running sums, `yearly`/`monthly`
  = period-end deltas (reuse the month-end/year-end delta pattern shipped today).

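The per-window arithmetic for the Returns table can be written out as a small Python sketch, useful for cross-checking the SQL against hand-picked snapshots. `Snapshot` and `window_returns` are illustrative names, not objects in the repo. The fixed 0.5 weight treats all contributions as arriving mid-window, which is the midpoint form of the Dietz method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    net_worth: float          # nw at the snapshot date
    net_contribution: float   # cumulative contributions at the snapshot date


def window_returns(now: Snapshot, ago: Snapshot) -> dict[str, Optional[float]]:
    """Compute the three Returns-table columns for one window."""
    delta_all = now.net_worth - ago.net_worth
    delta_contrib = now.net_contribution - ago.net_contribution
    delta_mkt = delta_all - delta_contrib
    # Denominator: starting value plus half of the net flow in the window.
    denominator = ago.net_worth + 0.5 * delta_contrib
    ret_pct = delta_mkt / denominator * 100 if denominator else None
    return {"delta_all": delta_all, "delta_mkt": delta_mkt, "ret_pct": ret_pct}


if __name__ == "__main__":
    print(window_returns(Snapshot(110_000, 83_000), Snapshot(100_000, 80_000)))
```

Note that the contribution for a window is recoverable as `delta_all - delta_mkt`, which is how the 12mo contribution value survives the merge.
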
## Build / deploy / verify
1. One-off Python builder (`/tmp`, outside repo) loads `wealth.json`: removes the
   merged-away panels by title, adds the new merged panels + `$grain` var,
   rebuilds the projection as a Trend panel, wraps everything in collapsed rows,
   assigns unique ids + clean gridPos. Clone existing panels for schema-39
   fidelity where possible.
2. Validate: `json.load`; unique ids; spot-run every new/merged target's SQL
   against live `wealth-pg` (the pg-sync sidecar) with default var values.
3. Deploy: `scripts/tg apply -target='module.monitoring.kubernetes_config_map.grafana_dashboards["wealth.json"]'`
   (targeted — monitoring stack carries unrelated drift). `git rebase --autostash
   forgejo/master` before push (shared repo).
4. Verify: ConfigMap == local file; user eyeballs each row in Grafana (esp. the
   Trend projection renders without touching the time picker, and the returns
   table + merged panels show the right numbers).

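The structural half of the validation step (parse, unique ids, no overlapping gridPos) could look roughly like this. It is a sketch, not the actual builder: the function names are hypothetical, and it assumes the usual Grafana dashboard JSON shape where collapsed rows carry their children in a nested `panels` list.

```python
import json
from itertools import combinations


def load_dashboard(path: str) -> dict:
    """json.load doubles as the syntax check."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def iter_panels(dashboard: dict):
    """Yield top-level panels and the children of collapsed rows."""
    for panel in dashboard.get("panels", []):
        yield panel
        yield from panel.get("panels", [])


def duplicate_ids(dashboard: dict) -> list[int]:
    """Return panel ids that appear more than once."""
    seen, dupes = set(), set()
    for panel in iter_panels(dashboard):
        pid = panel.get("id")
        if pid is None:
            continue
        if pid in seen:
            dupes.add(pid)
        seen.add(pid)
    return sorted(dupes)


def _overlaps(a: dict, b: dict) -> bool:
    """True when two gridPos rectangles share any area."""
    return (
        a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
    )


def overlapping_panels(panels: list[dict]) -> list[tuple[int, int]]:
    """Return id pairs whose gridPos rectangles overlap (one nesting level)."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(panels, 2)
        if _overlaps(a["gridPos"], b["gridPos"])
    ]
```

Running `duplicate_ids(load_dashboard("wealth.json"))` and `overlapping_panels(...)` on the top-level panel list should both return empty lists before deploy; the SQL spot-checks still have to be run against the live database separately.
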
## Risks
- **Trend panel** is flagged experimental (since v10.0) but available in v11.2;
  confirm `xField` + query `format=table` at build time.
- **Bars + line on one timeseries** (yearly combo) needs per-series `drawStyle`
  overrides + a second Y-axis override — verify rendering.
- **`$grain` net-pay** SQL is the fiddliest merge; validate all 3 grains.
- Reorganizing into rows reshuffles gridPos for the whole dashboard — the
  builder must lay out rows top-to-bottom without overlaps.
- Keep the contribution-correctness fixes (LOCF view, month-end deltas) intact —
  the merged panels read the same `dav_corrected` view.

## Out of scope
- The `dav_corrected` view + the Fidelity growth-timing cosmetic (separate).
- No new metrics — pure consolidation.

103
docs/plans/2026-06-03-lb-ip-hygiene-design.md
Normal file
@ -0,0 +1,103 @@
# L4 LoadBalancer IP review & pfSense hygiene — design + decisions

**Date:** 2026-06-03
**Status:** repo changes implemented; pfSense DHCP shrink pending live-change approval
**Trigger:** "Review the L4 LB IPs we give away, consolidate, and use pfSense Virtual IPs instead of hardcoding IPs in rules."

## TL;DR

The headline ask — **consolidate to fewer MetalLB IPs** — is a verified dead end. The
real, worthwhile outcome is a **single source of truth (this doc + the renumber
checklist in `architecture/networking.md`) plus two stale-reference fixes**. We
deliberately did **not** reduce the IP count and did **not** do the high-risk pfSense
mail-VIP surgery.

## Current state (verified live, 2026-06-03)

MetalLB L2, pool `10.0.20.200-220` (21 IPs, **17 free**). Four in use:

| IP | ETP | What | Why dedicated |
|----|-----|------|---------------|
| `.200` | Cluster (shared) | ~9 svcs: postgresql-lb (TF state), dolt, coturn, headscale, wireguard, qbittorrent, shadowsocks, torrserver, xray | already maximally consolidated (the 2026-03 "5→1" merge) |
| `.201` | Local | technitium-dns | real client IP → network-scoped split-horizon |
| `.202` | Local | windows-kms | real client IP → notifier source labeling |
| `.203` | Local | traefik | real client IP (CrowdSec) + QUIC/HTTP3 (UDP) |

## Why consolidation fails (the core finding)

MetalLB L2 only lets multiple `ETP=Local` services **share** an IP if they have
**identical pod selectors** (so traffic to the single announcing node lands on the
right pods). Traefik / KMS / Technitium have disjoint selectors and disjoint pods,
and a shared `ETP=Local` IP announces from one node (stateless `node+VIP` hash) —
blackholing any service whose pods aren't there. Refs:
[MetalLB L2 concepts](https://metallb.universe.tf/concepts/layer2/),
[Usage / IP sharing](https://metallb.universe.tf/usage/),
[issue #271](https://github.com/metallb/metallb/issues/271).

Consequences:
- **Traefik can never leave a dedicated `ETP=Local` IP** — QUIC's UDP listener needs it; QUIC can't traverse pfSense HAProxy either.
- The trio is **fewer-IPs XOR client-IP preservation** — not both. The only "both" is making all three DaemonSets (breaks Technitium's primary/secondary AXFR design; burns resources to save 2 of 17 free IPs). Not worth it.

**Decision (user, 2026-06-03):** keep all 4 dedicated, preserve client IPs everywhere. No MetalLB changes.

## Why a doc registry instead of a `config.tfvars`/Terraform IP variable

The cascade risk is **consumers that hardcode another service's IP and get forgotten**
(the 2026-05-30 Traefik `.200→.203` move broke cloudflared, woodpecker, containerd,
and the `.lan`+`.me` zones). A Terraform-var single-source was considered and rejected:

1. Editing `terragrunt.hcl`/`config.tfvars` triggers the CI "global change → apply ALL
   ~37 platform stacks" path (`.woodpecker/default.yml`) — a 37-stack apply for what
   are no-op refactors (rendered IPs unchanged), risking unrelated drift surfacing.
2. It can't cover the **out-of-band** consumers (cloudflared via CF-API, containerd
   `hosts.toml` on each node) — which were half the 2026-05-30 breakage.
3. Bootstrap-critical literals (PG state in `scripts/tg`, node DNS) must stay literals
   (DNS chicken-and-egg) regardless.

A **documentation registry** (the "LB-IP renumber checklist" in
`architecture/networking.md`) covers *all* consumers — in-band and OOB — at zero
apply-risk, and is the complete pre-move checklist. That is the single source of truth.

## Changes made (minimal-hygiene scope)

1. **`architecture/networking.md`** — rewrote the stale MetalLB section into an accurate
   registry (it had KMS on `.200`, mailserver on a LB IP, "two dedicated" — all wrong)
   + added the **renumber checklist**.
2. **woodpecker** (`stacks/woodpecker/main.tf`) — the `forgejo.viktorbarzin.me`
   hostAlias hardcoded the **dead** `10.0.20.200` (Traefik moved to `.203`; `.200:443`
   refuses TLS). Now reads the Traefik **ClusterIP dynamically** (`data
   "kubernetes_service" "traefik"`) so it can't rot on a future renumber and avoids the
   ETP=Local hairpin trap. (Real fix — the next woodpecker apply would otherwise
   re-pin the dead IP and break pipeline creation.)
3. **monitoring** (`prometheus_chart_values.tpl`) — `ViktorBarzinApexDrift` alert
   summary said "expected 10.0.20.200" (stale post-Traefik-move) → `.203`. Cosmetic
   (alert logic was already correct) but prevents a misleading incident message.
4. **`backend.tf`** — 72 stale generated copies were tracked in git with a plaintext
   (Vault-rotated, ~expired) PG password + `.200` literal, despite already being in
   `.gitignore`. `git rm --cached` (they regenerate from `PG_CONN_STR`). History scrub
   deferred (creds rotate weekly → low urgency).
5. **pfSense DHCP range** (`opt1`/K8s VLAN) — `.200-.254` overlaps the MetalLB pool
   `.200-.220` (latent IP-conflict: DHCP could hand out a live LB IP). Plan: shrink to
   start at `.221`. Verified zero leases/statics in the band. **PENDING** — live
   pfSense change, applied separately after explicit approval (live network device).

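The DHCP/MetalLB overlap in item 5 is easy to confirm with the standard-library `ipaddress` module. The ranges are taken from this doc (with the `10.0.20.` prefix from the pool definition); the helper names are illustrative.

```python
import ipaddress


def ip_range(first: str, last: str) -> range:
    """Inclusive IPv4 range expressed as a range of integers."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    return range(start, end + 1)


def overlap(a: range, b: range) -> list[str]:
    """Addresses that appear in both ranges, as dotted-quad strings."""
    lo, hi = max(a.start, b.start), min(a.stop, b.stop)
    return [str(ipaddress.IPv4Address(i)) for i in range(lo, hi)]


METALLB_POOL = ip_range("10.0.20.200", "10.0.20.220")
DHCP_CURRENT = ip_range("10.0.20.200", "10.0.20.254")
DHCP_PLANNED = ip_range("10.0.20.221", "10.0.20.254")

if __name__ == "__main__":
    print("current overlap:", len(overlap(DHCP_CURRENT, METALLB_POOL)), "addresses")
    print("planned overlap:", len(overlap(DHCP_PLANNED, METALLB_POOL)), "addresses")
```

With the current range every one of the 21 pool addresses is also leasable by DHCP; starting the DHCP range at `.221` removes the overlap entirely.
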
## Explicitly NOT done (rationale)

- **No MetalLB IP merging** — infeasible without losing client-IP/QUIC/HA (above).
- **No mail Virtual IP** — mail binds pfSense's own `10.0.20.1`, the most stable IP in
  the system; the 2026-06-02 incident was a *DNS split-horizon* bug, not an IP move.
  A mail VIP is 4 NAT + 5 filter + HAProxy cutover on the live mail path for marginal
  "identity" benefit. Skipped.
- **No `nginx`-alias delete / NAT literal→alias** — pfSense rule cosmetics; left for a
  later pfSense-focused pass (would also need the web filter F#2/F#3 `nginx`→`traefik_lb`
  repoint to avoid breaking 80/443).
- **No Terraform IP variable** — see registry rationale above.

## Known latent items (documented, not fixed here)

- pfSense web filter rules F#2/F#3 reference `nginx` (.200) while their NAT targets
  `traefik_lb` (.203) — inconsistent but currently passing; fix in a pfSense pass.
- pfSense NAT 53 hardcodes literal `10.0.20.201` instead of the `technitium_dns` alias.
- In-cluster `*.viktorbarzin.me` split-horizon still resolves some hosts to the dead
  `.200` (beads `code-yh33`) — the woodpecker hostAlias is the per-app workaround.
- CrowdSec syslog `remoteserver` doc/config drift (`.200` vs comment `.202`).

75
docs/plans/2026-06-04-f1-stream-extraction-design.md
Normal file
@ -0,0 +1,75 @@
# f1-stream extraction + productionization — design (2026-06-04)

## Problem

The actively-developed f1-stream codebase (FastAPI backend serving a SvelteKit
SPA; ~19 pluggable stream extractors + a Playwright/chrome-service playback
verifier) lived **inside** the infra monorepo at
`infra/stacks/f1-stream/files/`. It had no standalone repo, no real CI (only a
manual `redeploy.sh` doing a local `docker buildx` push), no tests, a loose
unpinned `requirements.txt`, and no semver.

**Key gotcha (the source-of-truth confusion):** there is ALSO an older
`github.com/ViktorBarzin/f1-stream` (`main`, last commit 2026-03-29, 14
extractors, no verifier) — and the *currently-deployed* image
(`viktorbarzin/f1-stream:<sha>`, Keel-managed) is built from THAT github repo,
not from `files/`. So the `files/` copy was the newer, richer, but
**never-properly-deployed** version. Viktor confirmed (2026-06-05) the
`files/` version is the one to ship; this extraction makes it the canonical
repo AND finally deploys it (changing the live app from the stale March build
to the current code).

## Goal

Extract `files/` into its own Forgejo repo `viktor/f1-stream`, productionize it
(Poetry, ruff, mypy, pragmatic tests, README, semver `v2.0.1`, Woodpecker CI),
point the infra Terraform stack at the Forgejo image, and remove `files/`.

## Decisions

- **Registry → Forgejo private** (`forgejo.viktorbarzin.me/viktor/f1-stream`).
  Deployment gets `image_pull_secrets { registry-credentials }`.
- **Packaging → Poetry + ruff + mypy** (Poetry 2.1.3, lock committed). Python
  **package stays `backend`** (imports + `uvicorn backend.main:app`). **Python
  3.13** kept.
- **Tests → pragmatic pure-logic only**: m3u8_rewriter, the proxy HLS parsers,
  schedule parsing/status, extractor registry. 63 tests; ruff + mypy clean.
- **CI → single `.woodpecker.yml`**: lint+type+test → buildx push to Forgejo
  (tags `latest` + `<short-sha>`) → `kubectl set image` + rollout. Keel stays
  enrolled as a redundant net. (No Slack step — the `environment:{from_secret}`
  form is rejected by this Woodpecker version's decoder.)
- **Dockerfile → no bundled Chromium.** In-cluster the verifier drives the
  shared chrome-service over CDP and never launches a local browser. Bundling
  Chromium broke the in-cluster buildkit build (`playwright install chromium`
  times out fetching ~165MB from the Azure CDN through cluster egress). The
  `playwright` pip package stays for the CDP client.
- **Versioning → first git tag `v2.0.1`** (continuity with the existing image
  lineage), deviating deliberately from the `v0.1.0` precedent.
- **Runtime stays root** (matching the prior working image) to avoid an NFS /
  Chromium-cache regression.

## Terraform delta (only infra change)

`stacks/f1-stream/main.tf`: image → `forgejo.../viktor/f1-stream:${var.image_tag}`
(new `var.image_tag`, default `latest`) + `image_pull_secrets`; remove `files/`
and `redeploy.sh`. Image field stays in `ignore_changes` (KEEL_IGNORE_IMAGE);
the running tag is managed by CI/Keel/`kubectl set image`, not Terraform.
Everything else (Anubis, ingress, ExternalSecrets, NFS, chrome-service +
Discord env) unchanged.

## Operational notes / known rough edges (2026-06-05)

- The Woodpecker repo (id 166) was registered via the JWT-mint script and its
  config-fetch user association is currently broken (`user does not exist
  [uid:0]`) — pipelines error. Until re-enabled via the Woodpecker UI OAuth,
  the image is built+pushed manually from the devvm.
- The infra repo's `origin` (github) and `forgejo` (CI-canonical) remotes are
  diverged; this change is applied via `scripts/tg apply` locally and committed;
  landing it on `forgejo/master` for CI durability depends on the normal
  origin↔forgejo reconciliation.

## Blast radius

The `f1-stream` K8s service is the only consumer; no `.tf` references `files/`.
Switching the live image to the Forgejo build is the intended, user-approved
behavior change (stale March build → current code).

39
docs/plans/2026-06-04-f1-stream-extraction-plan.md
Normal file
@ -0,0 +1,39 @@
# f1-stream extraction + productionization — plan (2026-06-04)

Companion to `2026-06-04-f1-stream-extraction-design.md`.

## Steps

1. Scaffold `/home/wizard/code/f1-stream/` from `infra/stacks/f1-stream/files/`
   (backend/, frontend/, Dockerfile by name; add README, .gitignore). ✅
2. Poetry conversion (pyproject v2.0.1, `packages=[{include="backend"}]`, lock,
   ruff/mypy/pytest config; E501 per-file-ignored on the JS/scraper modules). ✅
3. 63 pytest unit tests over the pure-logic core; ruff + mypy clean. ✅
4. Dockerfile: Poetry multi-stage, **no bundled Chromium** (CDP-only). ✅
5. `.woodpecker.yml`: lint+test → buildx push to Forgejo → kubectl set image. ✅
6. Create Forgejo repo `viktor/f1-stream` (private), push `master`, tag
   `v2.0.1`. ✅
7. Build + push the image to the Forgejo registry (manual from devvm, since the
   Woodpecker repo's config-fetch user is broken):
   `forgejo.viktorbarzin.me/viktor/f1-stream:24857a82` + `:latest`. ✅
8. Repoint `stacks/f1-stream/main.tf` (Forgejo image + `var.image_tag` +
   `image_pull_secrets`); `tg apply`. ✅
9. `kubectl set image deployment/f1-stream f1-stream=…:24857a82` + rollout. ▶
10. Remove `stacks/f1-stream/files/`; add `/f1-stream/` to the monorepo root
    `.gitignore`. ✅ (infra side)
11. Verify: pod on the Forgejo image, `/health` 200, ingress through Anubis. ▶

## Follow-ups (need Viktor / coordination)

- **Re-enable `viktor/f1-stream` in the Woodpecker UI** (proper OAuth) so CI
  builds run on push (the API-registered repo has a broken config-fetch user).
- **Land this infra commit on `forgejo/master`** (CI-canonical) once the
  origin↔forgejo divergence is reconciled, so a future `forgejo` apply doesn't
  revert `imagePullSecrets`.

## Rollback

DockerHub `viktorbarzin/f1-stream` tags still exist:
`kubectl -n f1-stream set image deployment/f1-stream
f1-stream=viktorbarzin/f1-stream:06276544` + restore the `main.tf` image
string. The standalone repo + Forgejo image are additive.

320
docs/plans/2026-06-04-k8s-dashboard-sso-design.md
Normal file
@ -0,0 +1,320 @@
# K8s Dashboard SSO via Authentik (oauth2-proxy) — Design

**Date:** 2026-06-04
**Status:** Approved (design)
**Author:** Viktor + Claude
**Scope:** Let namespace-owner users (e.g. gheorghe / `vabbit81`) open the
Kubernetes Dashboard at `https://k8s.viktorbarzin.me`, authenticate once with
their Authentik account, and manage their own namespace (inspect/edit
Deployments, read logs) with read-only visibility elsewhere — no second login,
no manually-pasted token.

---

## 1. Goal & Non-Goals

### Goal
A user in an Authentik `kubernetes-*` group browses to the dashboard, completes
the normal Authentik SSO redirect, and lands in a dashboard session whose
permissions are **their own** K8s RBAC (scoped by the OIDC `email`/`groups`
claims the apiserver already trusts). gheorghe gets full control of namespace
`vabbit81` and read-only cluster visibility.

### Non-Goals
- No change to the kube-apiserver OIDC flags (already configured).
- No change to the RBAC module or the `k8s_users` schema (gheorghe is already
  onboarded; his RoleBindings already exist).
- No change to the existing **CLI** OIDC flow (kubelogin against the public
  `kubernetes` client) used by viktor/anca — it must keep working untouched.
- Not replacing the dashboard (Headlamp et al. were considered and rejected).
- Not hardening/removing the existing static cluster-admin SA (tracked as an
  optional future item in §9, deliberately out of scope here to keep blast
  radius minimal and break-glass intact).

---

## 2. Current State (verified)

| Fact | Evidence |
|---|---|
| Dashboard deployed via Helm chart `kubernetes-dashboard` 7.12.0 in ns `kubernetes-dashboard`; fronted by `kubernetes-dashboard-kong-proxy` (HTTPS 443) | `stacks/k8s-dashboard/main.tf` |
| Ingress `k8s.viktorbarzin.me` currently `auth = "required"` (Authentik forward-auth) → kong-proxy. After auth, the dashboard only had a **static cluster-admin SA token** to talk to the apiserver — every authenticated user was effectively cluster-admin | `stacks/k8s-dashboard/main.tf` (ingress + `admin-user` CRB) |
| kube-apiserver already OIDC-configured: `--oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/kubernetes/`, `--oidc-client-id=kubernetes`, `--oidc-username-claim=email`, `--oidc-groups-claim=groups` | `stacks/rbac/modules/rbac/apiserver-oidc.tf` |
| Per-user RBAC already created from `k8s_users` (Vault `secret/platform`): namespace-owners get a `RoleBinding` to ClusterRole `admin` in their namespace + a cluster-wide read-only ClusterRoleBinding, keyed on **email** | `stacks/rbac/modules/rbac/main.tf` |
| gheorghe = `vabbit81`, email `vabbit81@gmail.com`, namespace `vabbit81`, role `namespace-owner`. Bindings `namespace-owner-vabbit81` + `oidc-ns-owner-readonly-vabbit81` exist | Vault `secret/platform → k8s_users` |
| Authentik is Terraform-managed (provider adopted 2026-04-18, Wave 6a). Proxy providers/outposts/guest flow live in `stacks/authentik/*.tf`. The `goauthentik/authentik` provider is available to every stack via central `terragrunt.hcl` | `stacks/authentik/authentik_provider.tf`, `guest.tf` |
| The existing `kubernetes` OIDC **application** is still UI-managed (no `client_id="kubernetes"` in the repo) — must not be disturbed | repo grep (no match) |
| `ingress_factory` `auth` enum + comment-convention guard (`scripts/check-ingress-auth-comments.py`) require a `# auth = "<tier>": …` comment above any `auth = "none"/"app"` line | `modules/kubernetes/ingress_factory/main.tf`, `infra/.claude/CLAUDE.md` |

The missing link is purely **token injection**: nothing today gives the
dashboard the *user's own* OIDC id_token, so the apiserver can't apply the
per-user RBAC that already exists.

---

## 3. Architecture
|
||||
|
||||
```
|
||||
Browser
|
||||
→ Cloudflare (proxied)
|
||||
→ Traefik (ingress auth = "none"; oauth2-proxy is now the gate)
|
||||
→ oauth2-proxy.kubernetes-dashboard.svc:4180
|
||||
├─ no session → 302 → Authentik OIDC code-flow (+PKCE) → /oauth2/callback
|
||||
│ gated by a group policy: only kubernetes-{admins,power-users,namespace-owners}
|
||||
└─ session valid → proxies upstream + sets `Authorization: Bearer <id_token>`
|
||||
→ kubernetes-dashboard-kong-proxy.svc:443 (UNCHANGED)
|
||||
→ dashboard `api` → kube-apiserver (Bearer token)
|
||||
→ OIDC validates: iss ✓, aud ⊇ {kubernetes} ✓, sig ✓
|
||||
→ username = email, groups = groups claim
|
||||
→ RBAC: namespace-owner-<user> (admin in their ns) + cluster read-only
|
||||
```
|
||||
|
||||
**The entire change is additive + one ingress repoint.** New objects:
|
||||
oauth2-proxy Deployment/Service, an Authentik OIDC application/provider + scope
|
||||
mapping + group policy, and an ESO-synced secret. The ingress backend flips
|
||||
from kong-proxy → oauth2-proxy and `auth` flips `required → none`.
|
||||
|
||||
---
|
||||
|
||||
## 4. Components
|
||||
|
||||
### 4.1 Authentik (Terraform, in `stacks/k8s-dashboard/`)
|
||||
|
||||
Follows the `stacks/authentik/guest.tf` pattern (provider block reads
|
||||
`secret/authentik → tf_api_token` from Vault).
|
||||
|
||||
- `authentik_provider_oauth2 "k8s_dashboard"`:
|
||||
- `client_type = "confidential"`, `client_id = "k8s-dashboard"`,
|
||||
`client_secret` from Vault.
|
||||
- `allowed_redirect_uris = [{ matching_mode="strict",
|
||||
url="https://k8s.viktorbarzin.me/oauth2/callback" }]`.
|
||||
- `authorization_flow` = default implicit-consent (data source);
|
||||
`invalidation_flow` = default (data source).
|
||||
- `access_token_validity = "hours=1"`, `refresh_token_validity = "days=30"`.
|
||||
- `include_claims_in_id_token = true` (so the id_token carries `email` +
|
||||
`groups`; the apiserver reads the `email` claim for username regardless of
|
||||
`sub_mode`).
|
||||
- `property_mappings` = the default OIDC scope mappings (`openid`, `email`,
|
||||
`profile`) **plus** the goauthentik `groups` scope mapping **plus** the
|
||||
custom audience mapping below. (Resolved via `data
|
||||
authentik_property_mapping_provider_scope` lookups so we don't drop the
|
||||
standard claims.)
|
||||
- `authentik_property_mapping_provider_scope "k8s_dashboard_aud"`:
|
||||
- `scope_name = "k8s-dashboard-audience"`,
|
||||
`expression = return {"aud": ["kubernetes", "k8s-dashboard"]}`.
|
||||
- Emits **both** audiences so the apiserver (`kubernetes`) and oauth2-proxy
|
||||
(`k8s-dashboard`) each find their own client id in `aud`.
|
||||
- `authentik_application "k8s_dashboard"`: slug `k8s-dashboard`,
|
||||
`protocol_provider` = the oauth2 provider, `policy_engine_mode = "any"`.
|
||||
- Group gate: `authentik_policy_expression "k8s_dashboard_groups"` →
|
||||
`return ak_is_group_member(request.user, name="kubernetes-admins") or
|
||||
ak_is_group_member(request.user, name="kubernetes-power-users") or
|
||||
ak_is_group_member(request.user, name="kubernetes-namespace-owners")`,
|
||||
bound to the application via `authentik_policy_binding`.
|
||||
|
||||
### 4.2 Vault + ESO
|
||||
|
||||
- Vault `secret/k8s-dashboard` (new): `oauth2_proxy_client_id`,
|
||||
`oauth2_proxy_client_secret`, `oauth2_proxy_cookie_secret` (32 random bytes,
|
||||
base64). The Authentik provider reads the same client id/secret so the two
|
||||
sides match.
|
||||
- `ExternalSecret` in `stacks/k8s-dashboard/main.tf` → K8s Secret
|
||||
`oauth2-proxy` in ns `kubernetes-dashboard`. First-apply: target the
|
||||
ExternalSecret before the full apply (documented plan-time gotcha).
|
||||
|
||||
### 4.3 oauth2-proxy (Terraform, in `stacks/k8s-dashboard/`)
|
||||
|
||||
- Image `quay.io/oauth2-proxy/oauth2-proxy:v7.x` (pin a concrete tag; SHA-tag
|
||||
convention). `linux/amd64`.
|
||||
- Deployment: 2 replicas (HA path), Recreate not needed (stateless), readiness
|
||||
`/ping`, requests 25m CPU / 64Mi memory, memory limit 128Mi. Standard `dns_config`
|
||||
`ignore_changes` (KYVERNO_LIFECYCLE_V1).
|
||||
- Config (env `OAUTH2_PROXY_*` or args):
|
||||
- `provider=oidc`,
|
||||
`oidc_issuer_url=https://authentik.viktorbarzin.me/application/o/k8s-dashboard/`
|
||||
- `client_id`/`client_secret`/`cookie_secret` from the `oauth2-proxy` Secret
|
||||
- `redirect_url=https://k8s.viktorbarzin.me/oauth2/callback`
|
||||
- `upstreams=https://kubernetes-dashboard-kong-proxy.kubernetes-dashboard.svc.cluster.local:443`
|
||||
- `ssl_upstream_insecure_skip_verify=true` (kong self-signed cert)
|
||||
- `pass_authorization_header=true` (passes id_token to the dashboard)
|
||||
- `set_authorization_header=true` (belt-and-suspenders)
|
||||
- `oidc_extra_audience=kubernetes` (accept the apiserver audience too)
|
||||
- `scope=openid email profile groups offline_access k8s-dashboard-audience` (the audience mapping from §4.1 only applies when its scope is requested)
|
||||
- `email_domains=*` (the Authentik group policy is the real gate)
|
||||
- `cookie_secure=true`, `cookie_domains=k8s.viktorbarzin.me`,
|
||||
`whitelist_domains=k8s.viktorbarzin.me`, `cookie_refresh=30m`,
|
||||
`cookie_expire=168h`
|
||||
- `reverse_proxy=true`, `skip_provider_button=true`
|
||||
- Service `oauth2-proxy` (port 4180).
|
||||
|
||||
### 4.4 Ingress (edit existing in `stacks/k8s-dashboard/main.tf`)
|
||||
|
||||
- `service_name = "oauth2-proxy"`, `backend_protocol = "HTTP"`, `port = 4180`.
|
||||
- `auth = "none"` with the mandatory comment:
|
||||
`# auth = "none": oauth2-proxy is the gate — it runs the Authentik OIDC
|
||||
code-flow and injects the user's id_token as Bearer for dashboard→apiserver
|
||||
auth. Group policy on the Authentik app restricts to kubernetes-* groups.`
|
||||
- Keep `dns_type = "proxied"` and the homepage annotations.
|
||||
|
||||
---
|
||||
|
||||
## 5. The Audience Strategy (the crux) + Fallback
|
||||
|
||||
The apiserver is pinned to a single legacy `--oidc-client-id=kubernetes`; the
|
||||
CLI uses the public `kubernetes` client. oauth2-proxy must be its own
|
||||
confidential client (`k8s-dashboard`). A token with `aud=k8s-dashboard` alone
|
||||
would be rejected by the apiserver; a token with `aud=kubernetes` alone would
|
||||
be rejected by oauth2-proxy. **Resolution:** the Authentik scope mapping emits
|
||||
`aud = ["kubernetes", "k8s-dashboard"]`; both validators do a membership check
|
||||
and each finds its own id. `oidc_extra_audience=kubernetes` on oauth2-proxy is
|
||||
an extra safety net in case Authentik's scope mapping *overrides* `aud` to a
|
||||
single value rather than appending.
|
||||
|
||||
**Apply-time verification (blocking):** decode a freshly issued id_token and
|
||||
assert `aud` contains both `kubernetes` and `k8s-dashboard`, and that `groups`
|
||||
is present. If Authentik refuses to emit a multi-valued `aud` via scope
|
||||
mapping, **fallback** (documented, not preferred): repoint oauth2-proxy at the
|
||||
existing `kubernetes` client made confidential and add `--oidc-client-secret`
|
||||
to the kubelogin setup script — this unifies the audience at the cost of
|
||||
touching the CLI flow. We try the additive multi-aud path first precisely to
|
||||
avoid that.
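
The blocking check above can be scripted. What follows is a minimal sketch, not tooling that exists in the repo: the helper names `jwt_payload` and `check_dashboard_token` are hypothetical, and it assumes GNU `base64` and `jq` are installed and that a freshly issued id_token has already been captured into a shell variable.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the repo.

# Print the decoded JSON payload (middle segment) of a JWT.
jwt_payload() {
  local seg="${1#*.}"   # drop the header segment
  seg="${seg%%.*}"      # drop the signature segment
  # JWT segments are unpadded URL-safe base64; map back to the standard alphabet.
  seg="$(printf '%s' "$seg" | tr '_-' '/+')"
  # Restore the '=' padding that base64 -d expects.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Succeed only if aud holds both client ids and a groups claim is present.
# aud may be a single string or an array, so normalise it to an array first.
check_dashboard_token() {
  jwt_payload "$1" | jq -e '
    (.aud | if type == "array" then . else [.] end) as $aud
    | any($aud[]; . == "kubernetes")
      and any($aud[]; . == "k8s-dashboard")
      and has("groups")' >/dev/null
}
```

Usage would look like `check_dashboard_token "$ID_TOKEN" && echo "token shape OK"`. The sketch only inspects claims; it does not verify the signature, which remains the job of oauth2-proxy and the apiserver.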
|
||||
|
||||
---
|
||||
|
||||
## 6. Data Flow — gheorghe end-to-end
|
||||
|
||||
1. `https://k8s.viktorbarzin.me` → Cloudflare → Traefik (auth=none) →
|
||||
oauth2-proxy.
|
||||
2. No session → 302 to Authentik (`k8s-dashboard` client, code+PKCE).
|
||||
3. Authentik runs the group policy: `vabbit81 ∈ kubernetes-namespace-owners` ✓.
|
||||
Issues id_token: `email=vabbit81@gmail.com`,
|
||||
`groups=[…,kubernetes-namespace-owners]`, `aud=[kubernetes,k8s-dashboard]`.
|
||||
4. oauth2-proxy validates token (`k8s-dashboard ∈ aud`), sets session cookie,
|
||||
proxies to kong-proxy with `Authorization: Bearer <id_token>`.
|
||||
5. kong-proxy → dashboard `api` → kube-apiserver with the Bearer token.
|
||||
6. apiserver OIDC: `username=vabbit81@gmail.com`, groups from claim.
|
||||
7. RBAC: `RoleBinding namespace-owner-vabbit81` (ClusterRole `admin`) in ns
|
||||
`vabbit81` + `ClusterRoleBinding oidc-ns-owner-readonly-vabbit81`.
|
||||
8. Dashboard shows full control of `vabbit81`, read-only elsewhere — the goal.
|
||||
|
||||
No RBAC changes required; bindings already key on his email.
|
||||
|
||||
---
|
||||
|
||||
## 7. Testing
|
||||
|
||||
1. **Token shape (blocking):** decode an issued id_token; assert
|
||||
`aud ⊇ {kubernetes,k8s-dashboard}` and `groups` present.
|
||||
2. **Admin:** viktor logs in → sees/edits everything (cluster-admin group).
|
||||
3. **Namespace-owner:** gheorghe logs in → can edit Deployments / read logs in
|
||||
`vabbit81`; gets Forbidden creating resources in other namespaces; can view
|
||||
(read-only) cluster resources.
|
||||
4. **No regression:** viktor/anca CLI `kubectl` (kubelogin → public
|
||||
`kubernetes` client) still works.
|
||||
5. Browser checks driven via Playwright MCP; screenshot on failure.
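
Before the browser checks, the RBAC half of test 3 can be approximated from the CLI with `kubectl auth can-i` and impersonation. This is a sketch under assumptions: the helper name `expect_can_i` is hypothetical, the caller needs impersonation rights (for example a cluster-admin kubeconfig), and `default` merely stands in for "some other namespace". It exercises the existing bindings only, not the OIDC path through oauth2-proxy.

```bash
#!/usr/bin/env bash
# Sketch only: expect_can_i is a hypothetical helper.

# expect_can_i <yes|no> <kubectl auth can-i arguments...>
# Succeeds when kubectl's answer matches the expected one.
expect_can_i() {
  local want="$1"; shift
  local got
  # can-i exits non-zero on "no", so ignore the exit code and compare the text.
  got="$(kubectl auth can-i "$@" 2>/dev/null || true)"
  [ "$got" = "$want" ]
}
```

A run for gheorghe might be `expect_can_i yes create deployments -n vabbit81 --as vabbit81@gmail.com` followed by `expect_can_i no create deployments -n default --as vabbit81@gmail.com`.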
|
||||
|
||||
---
|
||||
|
||||
## 8. Rollback
|
||||
|
||||
Single-commit revert of the ingress edit restores `service_name=kong-proxy`,
|
||||
`backend_protocol=HTTPS`, `port=443`, `auth=required` → instant return to
|
||||
Authentik forward-auth gating. oauth2-proxy + Authentik objects are additive
|
||||
and inert once the ingress no longer points at them; they can be destroyed in a
|
||||
follow-up. No apiserver, RBAC, data, or CLI changes to unwind.
|
||||
|
||||
---
|
||||
|
||||
## 9. Security Notes & Out-of-Scope Hardening
|
||||
|
||||
- The browser never sees the id_token in clear text (it travels as a Bearer header oauth2-proxy→kong);
|
||||
the browser holds an opaque oauth2-proxy session cookie
|
||||
(secure/httponly/samesite-lax, scoped to `k8s.viktorbarzin.me`).
|
||||
- Two gates: Authentik **group policy** (only `kubernetes-*` groups complete
|
||||
the flow) and apiserver **RBAC** (per-user, by email). Defense in depth.
|
||||
- The `authentik_walloff` blackbox guard is for `auth=none` carve-outs that
|
||||
must NOT redirect to Authentik. The dashboard intentionally **does** redirect
|
||||
(via oauth2-proxy), so it is **not** added to that guard.
|
||||
- **Out of scope (optional future hardening):** the existing static
|
||||
`admin-user` cluster-admin ClusterRoleBinding + SA remain. They predate this
|
||||
change and provide break-glass. Removing them (admins would rely on SSO +
|
||||
Vault `kubernetes/creds/dashboard-admin` + kubelogin) is a separate, reversible
|
||||
security decision the user can request later. Not done here to keep blast
|
||||
radius minimal.
|
||||
|
||||
---
|
||||
|
||||
## 10. Monitoring & Docs
|
||||
|
||||
- Uptime-Kuma external monitor auto-created (`dns_type=proxied` →
|
||||
`ingress_factory` adds the `external-monitor` label). No manual monitor.
|
||||
- oauth2-proxy readiness probe on `/ping`.
|
||||
- Docs updated in the **same commit**: `docs/architecture/authentication.md`
|
||||
(new OIDC app + dashboard SSO flow), `docs/architecture/multi-tenancy.md`
|
||||
(dashboard access path for namespace-owners),
|
||||
`.claude/reference/authentik-state.md` (new app/provider/scope-mapping),
|
||||
`.claude/reference/service-catalog.md` (k8s-dashboard auth posture), and the
|
||||
companion `2026-06-04-k8s-dashboard-sso-plan.md`.
|
||||
|
||||
---
|
||||
|
||||
## 11. Open Risks
|
||||
|
||||
| Risk | Mitigation |
|
||||
|---|---|
|
||||
| Authentik scope mapping overrides rather than appends `aud` | `oidc_extra_audience=kubernetes` + blocking apply-time token decode; fallback in §5 |
|
||||
| Dashboard v7 ignores a pre-set Authorization header (known friction: kubernetes/dashboard #5105, #1213) | `pass_authorization_header` + `set_authorization_header`; validate in §7; kong forwards headers by default |
|
||||
| ESO first-apply ordering | `terragrunt apply -target` the ExternalSecret first (documented plan-time pattern) |
|
||||
| Single-master apiserver assumption (memory id=2484) | We don't touch apiserver flags; no new exposure |
|
||||
|
||||
---
|
||||
|
||||
## 12. ADDENDUM (2026-06-04) — As-built pivoted to Option B (apiserver multi-issuer)
|
||||
|
||||
Sections 4–5 above describe the *original* plan: a separate `k8s-dashboard`
|
||||
confidential client whose token carries a dual `aud` so the apiserver (pinned
|
||||
to `--oidc-client-id=kubernetes`) would accept it **without** an apiserver
|
||||
change. **That approach does not work**, for a reason discovered during
|
||||
implementation:
|
||||
|
||||
1. **The issuer is the binding constraint, not the audience.** Every Authentik
|
||||
OAuth2 application has its own per-slug issuer. A token from the
|
||||
`k8s-dashboard` app has `iss=…/o/k8s-dashboard/`, but the apiserver does an
|
||||
**exact issuer-string match** against its single configured issuer
|
||||
(`…/o/kubernetes/`). The dual-`aud` scope mapping is irrelevant — the token
|
||||
is rejected on issuer before audience is even considered.
|
||||
|
||||
2. **Apiserver OIDC was already silently broken.** Inspecting the live
|
||||
`kube-apiserver` static-pod manifest showed **no `--oidc-*` flags at all** —
|
||||
the kubeadm v1.34 upgrade had regenerated the manifest and dropped the
|
||||
flags the `rbac` stack's `null_resource` had injected (its content-hash
|
||||
trigger never re-fired). So OIDC apiserver auth was off cluster-wide.
|
||||
|
||||
3. **Reusing the `kubernetes` app (make it confidential) — rejected.** It would
|
||||
force distributing the now-confidential client secret to every CLI user via
|
||||
the **public** k8s-portal `/setup/script` endpoint (a leak), plus
|
||||
re-onboarding existing CLI users. Too invasive.
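
Point 1 is easy to confirm by comparing issuers directly. The sketch below is illustrative: the helper names are hypothetical, and `discovered_issuer` assumes `curl` and `jq` are available and that Authentik serves the standard OIDC discovery document under each application's issuer path.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.
AUTHENTIK_BASE="https://authentik.viktorbarzin.me"

# Issuer URL for an Authentik application slug (trailing slash included,
# because the apiserver compares the issuer string exactly).
issuer_for_slug() {
  printf '%s/application/o/%s/\n' "$AUTHENTIK_BASE" "$1"
}

# Issuer actually advertised by the application's discovery document.
discovered_issuer() {
  curl -fsS "$(issuer_for_slug "$1").well-known/openid-configuration" | jq -r '.issuer'
}
```

Comparing `issuer_for_slug kubernetes` with `issuer_for_slug k8s-dashboard` shows two different strings, which is why a `k8s-dashboard` token cannot pass a single-issuer apiserver whatever its `aud` contains.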
|
||||
|
||||
**As-built = Option B: structured `AuthenticationConfiguration` on the
|
||||
apiserver trusting BOTH issuers.** `stacks/rbac/modules/rbac/apiserver-oidc.tf`
|
||||
now writes `/etc/kubernetes/pki/auth-config.yaml`
|
||||
(`apiserver.config.k8s.io/v1`) with two `jwt` issuers — `kubernetes`
|
||||
(audience `kubernetes`, for the kubelogin CLI) and `k8s-dashboard` (audience
|
||||
`k8s-dashboard`, for oauth2-proxy) — each mapping `username<-email` and
|
||||
`groups<-groups` with empty prefixes (to match existing RBAC subjects). The
|
||||
legacy `--oidc-*` flags are replaced by `--authentication-config=…`. The remote
|
||||
script health-gates `/livez` and **auto-rolls-back** the manifest if the
|
||||
single-master apiserver doesn't recover. The oauth2-proxy + `k8s-dashboard`
|
||||
Authentik app from §4 are reused unchanged (the dual-`aud` mapping is now
|
||||
harmless — issuer2 only requires `k8s-dashboard ∈ aud`).
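
For orientation, the shape of such a two-issuer file is sketched below. This is a reconstruction from the description above, not a copy of what `apiserver-oidc.tf` renders: the function name `write_auth_config` and the default output path are hypothetical, and the real file may set additional fields.

```bash
#!/usr/bin/env bash
# Sketch only: write_auth_config is hypothetical; the real file is written
# by the rbac stack to /etc/kubernetes/pki/auth-config.yaml.
write_auth_config() {
  local out="${1:-./auth-config.yaml}"
  cat > "$out" <<'EOF'
apiVersion: apiserver.config.k8s.io/v1
kind: AuthenticationConfiguration
jwt:
  # Issuer 1: kubelogin CLI (public `kubernetes` client)
  - issuer:
      url: https://authentik.viktorbarzin.me/application/o/kubernetes/
      audiences: ["kubernetes"]
    claimMappings:
      username: { claim: email, prefix: "" }
      groups: { claim: groups, prefix: "" }
  # Issuer 2: dashboard via oauth2-proxy (`k8s-dashboard` client)
  - issuer:
      url: https://authentik.viktorbarzin.me/application/o/k8s-dashboard/
      audiences: ["k8s-dashboard"]
    claimMappings:
      username: { claim: email, prefix: "" }
      groups: { claim: groups, prefix: "" }
EOF
}
```

The empty `prefix` values matter: without them the apiserver would prepend a prefix to usernames and groups, and the existing email-keyed RBAC subjects would no longer match.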
|
||||
|
||||
This keeps the CLI flow 100% untouched (its own `kubernetes` issuer is one of
|
||||
the two trusted issuers) and restores the apiserver OIDC that the kubeadm
|
||||
upgrade had broken.
|
||||
|
||||
**Known drift (carried forward):** a future `kubeadm upgrade` will again
|
||||
regenerate the manifest and drop `--authentication-config`. The
|
||||
content-hash trigger won't auto-detect this. **Operational mitigation:
|
||||
re-apply the `rbac` stack after every k8s control-plane upgrade** (add to the
|
||||
upgrade runbook). The `rbac` provisioner needs `TF_VAR_ssh_private_key` (an SSH
|
||||
key authorized on the master) — it is not wired from Vault yet.
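
Until the trigger is fixed, the drift can at least be detected with a one-line check on the master after each upgrade. A minimal sketch, assuming the standard kubeadm static-pod manifest path; the function name `manifest_has_auth_config` is hypothetical and nothing like it exists in the repo today.

```bash
#!/usr/bin/env bash
# Sketch only: manifest_has_auth_config is a hypothetical helper.

# Succeed if the kube-apiserver manifest still carries --authentication-config.
manifest_has_auth_config() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  grep -q -- '--authentication-config=' "$manifest"
}
```

In an upgrade runbook this could read `manifest_has_auth_config || echo "drift: re-apply the rbac stack"`.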
|
||||
574
docs/plans/2026-06-04-k8s-dashboard-sso-plan.md
Normal file
|
|
@ -0,0 +1,574 @@
|
|||
# K8s Dashboard SSO via Authentik (oauth2-proxy) — Implementation Plan
|
||||
|
||||
> **⚠️ AS-BUILT DIVERGED (2026-06-04).** Tasks 2–3 (oauth2-proxy + `k8s-dashboard`
|
||||
> Authentik app) shipped as written, but the audience strategy here is WRONG: the
|
||||
> apiserver matches the token **issuer** exactly, and a separate app has a
|
||||
> different per-slug issuer — so the dual-`aud` trick can't avoid an apiserver
|
||||
> change. The implementation pivoted to **Option B**: a structured multi-issuer
|
||||
> `AuthenticationConfiguration` on the apiserver (`stacks/rbac`). See the
|
||||
> **ADDENDUM (§12)** in `2026-06-04-k8s-dashboard-sso-design.md` for the as-built.
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Let namespace-owner users (e.g. gheorghe / `vabbit81`) open `https://k8s.viktorbarzin.me`, log in once with Authentik, and manage their own namespace in the Kubernetes Dashboard under their existing per-user RBAC.
|
||||
|
||||
**Architecture:** Deploy oauth2-proxy in the `kubernetes-dashboard` namespace in front of the existing `kong-proxy`. It runs the Authentik OIDC code-flow and injects the user's id_token as a `Bearer` header so the apiserver applies the per-user RBAC that already exists. A new confidential Authentik OIDC client (`k8s-dashboard`) plus a custom scope mapping emits `aud = ["kubernetes","k8s-dashboard"]`, satisfying both the apiserver and oauth2-proxy without touching the existing CLI (`kubernetes` public client). The change is additive; the only mutation to existing state is one ingress repoint, instantly revertible.
|
||||
|
||||
**Tech Stack:** Terraform/Terragrunt, Authentik (`goauthentik/authentik` TF provider), oauth2-proxy v7, External Secrets Operator, Vault KV, Kubernetes Dashboard v7 (Kong).
|
||||
|
||||
**Design doc:** `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`
|
||||
|
||||
---
|
||||
|
||||
## Conventions for every apply step
|
||||
|
||||
- **Auth first:** `vault login -method=oidc` (humans) before any `scripts/tg`.
|
||||
- **Presence claim before each apply** (CLAUDE.md mandatory rule):
|
||||
`~/code/scripts/presence claim service:k8s-dashboard --purpose "dashboard SSO via oauth2-proxy"`
|
||||
and `~/code/scripts/presence claim stack:k8s-dashboard --purpose "..."`. Release on completion.
|
||||
- **Apply wrapper:** run from inside the stack dir: `cd stacks/k8s-dashboard && ../../scripts/tg <plan|apply>`. `scripts/tg` handles PG-backend creds and runs the ingress-auth comment guard.
|
||||
- **Never** `kubectl apply/edit` as final state — Terraform only.
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
|
||||
| File | Action | Responsibility |
|
||||
|---|---|---|
|
||||
| `stacks/k8s-dashboard/authentik.tf` | **Create** | `provider "authentik"` block + OIDC provider, custom audience scope mapping, application, group-restriction policy + binding |
|
||||
| `stacks/k8s-dashboard/oauth2_proxy.tf` | **Create** | ExternalSecret → `oauth2-proxy` Secret, oauth2-proxy Deployment + Service |
|
||||
| `stacks/k8s-dashboard/main.tf` | **Modify** | Repoint the dashboard ingress from `kong-proxy` → `oauth2-proxy`, flip `auth` to `none` |
|
||||
| `docs/architecture/authentication.md` | **Modify** | Document the new OIDC app + dashboard SSO flow |
|
||||
| `docs/architecture/multi-tenancy.md` | **Modify** | Document dashboard access path for namespace-owners |
|
||||
| `.claude/reference/authentik-state.md` | **Modify** | Record the new app/provider/scope-mapping |
|
||||
| `.claude/reference/service-catalog.md` | **Modify** | Update k8s-dashboard auth posture |
|
||||
|
||||
The `authentik` provider is already in every stack's generated `required_providers` (root `terragrunt.hcl` → `generate "k8s_providers"`). We only add the `provider "authentik"` config block (reads the API token from Vault `secret/authentik → tf_api_token`).
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Vault secret for oauth2-proxy + Authentik client
|
||||
|
||||
**Files:** none (Vault KV state).
|
||||
|
||||
- [ ] **Step 1: Authenticate to Vault**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
vault login -method=oidc
|
||||
```
|
||||
Expected: `Success! You are now authenticated.`
|
||||
|
||||
- [ ] **Step 2: Generate the three secret values**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
CLIENT_ID="k8s-dashboard"
|
||||
CLIENT_SECRET="$(openssl rand -hex 32)"
|
||||
COOKIE_SECRET="$(openssl rand -base64 32 | tr -d '\n' | tr -- '+/' '-_')" # 32 random bytes as URL-safe base64; oauth2-proxy decodes it into a 16/24/32-byte AES key
|
||||
echo "client_id=$CLIENT_ID"; echo "client_secret set"; echo "cookie_secret set"
|
||||
```
|
||||
Expected: prints `client_id=k8s-dashboard` and confirmations. `COOKIE_SECRET` must decode to exactly 16, 24, or 32 bytes (`openssl rand -base64 32` emits 44 base64 characters, which decode to 32 bytes ✓).
|
||||
|
||||
- [ ] **Step 3: Write the secret to Vault**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
VAULT_ADDR=https://vault.viktorbarzin.me vault kv put secret/k8s-dashboard \
|
||||
oauth2_proxy_client_id="$CLIENT_ID" \
|
||||
oauth2_proxy_client_secret="$CLIENT_SECRET" \
|
||||
oauth2_proxy_cookie_secret="$COOKIE_SECRET"
|
||||
```
|
||||
Expected: `Success! Data written to: secret/data/k8s-dashboard`.
|
||||
|
||||
- [ ] **Step 4: Verify**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
VAULT_ADDR=https://vault.viktorbarzin.me vault kv get -field=oauth2_proxy_client_id secret/k8s-dashboard
|
||||
```
|
||||
Expected: `k8s-dashboard`.
|
||||
|
||||
No commit (Vault state, not git).
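
The length requirement from Step 2 can be double-checked against the value stored in Vault. The following is a sketch with hypothetical helper names (`cookie_secret_bytes`, `cookie_secret_ok`); it assumes GNU `base64` and accepts either the standard or the URL-safe base64 alphabet.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print how many bytes a base64 (standard or URL-safe) secret decodes to.
cookie_secret_bytes() {
  local s="${1%%=*}"                          # strip any '=' padding
  s="$(printf '%s' "$s" | tr '_-' '/+')"      # URL-safe -> standard alphabet
  while [ $(( ${#s} % 4 )) -ne 0 ]; do s="${s}="; done
  printf '%s' "$s" | base64 -d | wc -c | tr -d ' '
}

# Succeed only for the AES key sizes named in Step 2.
cookie_secret_ok() {
  case "$(cookie_secret_bytes "$1")" in
    16|24|32) return 0 ;;
    *)        return 1 ;;
  esac
}
```

For example: `cookie_secret_ok "$(vault kv get -field=oauth2_proxy_cookie_secret secret/k8s-dashboard)" && echo "cookie secret length OK"`.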
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Authentik OIDC application (additive — no user impact)
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/k8s-dashboard/authentik.tf`
|
||||
|
||||
- [ ] **Step 1: Create `stacks/k8s-dashboard/authentik.tf`**
|
||||
|
||||
```hcl
|
||||
# -----------------------------------------------------------------------------
|
||||
# Authentik OIDC application for the Kubernetes Dashboard (via oauth2-proxy).
|
||||
#
|
||||
# Confidential client `k8s-dashboard`. A custom scope mapping emits
|
||||
# aud = ["kubernetes","k8s-dashboard"] so BOTH the kube-apiserver
|
||||
# (--oidc-client-id=kubernetes) and oauth2-proxy (client_id=k8s-dashboard)
|
||||
# accept the id_token. The existing UI-managed `kubernetes` public client
|
||||
# used by the kubelogin CLI is untouched.
|
||||
#
|
||||
# Provider token: Vault secret/authentik -> tf_api_token (same as
|
||||
# stacks/authentik/authentik_provider.tf).
|
||||
# -----------------------------------------------------------------------------
|
||||
|
||||
data "vault_kv_secret_v2" "authentik_tf" {
|
||||
mount = "secret"
|
||||
name = "authentik"
|
||||
}
|
||||
|
||||
provider "authentik" {
|
||||
url = "https://authentik.viktorbarzin.me"
|
||||
token = data.vault_kv_secret_v2.authentik_tf.data["tf_api_token"]
|
||||
}
|
||||
|
||||
data "vault_kv_secret_v2" "k8s_dashboard" {
|
||||
mount = "secret"
|
||||
name = "k8s-dashboard"
|
||||
}
|
||||
|
||||
data "authentik_flow" "default_authorization_implicit_consent" {
|
||||
slug = "default-provider-authorization-implicit-consent"
|
||||
}
|
||||
|
||||
data "authentik_flow" "default_provider_invalidation" {
|
||||
slug = "default-provider-invalidation-flow"
|
||||
}
|
||||
|
||||
# Default OIDC scope mappings. `profile` carries the `groups` claim in
|
||||
# Authentik's default expression, which the apiserver reads via
|
||||
# --oidc-groups-claim=groups. offline_access enables refresh tokens.
|
||||
data "authentik_property_mapping_provider_scope" "defaults" {
|
||||
managed_list = [
|
||||
"goauthentik.io/providers/oauth2/scope-openid",
|
||||
"goauthentik.io/providers/oauth2/scope-email",
|
||||
"goauthentik.io/providers/oauth2/scope-profile",
|
||||
"goauthentik.io/providers/oauth2/scope-offline_access",
|
||||
]
|
||||
}
|
||||
|
||||
# Custom scope mapping that overrides the audience. It only fires when the
|
||||
# client REQUESTS this scope, so oauth2-proxy must include
|
||||
# `k8s-dashboard-audience` in its --scope (see oauth2_proxy.tf).
|
||||
resource "authentik_property_mapping_provider_scope" "k8s_dashboard_aud" {
|
||||
name = "k8s-dashboard audience"
|
||||
scope_name = "k8s-dashboard-audience"
|
||||
expression = "return {\"aud\": [\"kubernetes\", \"k8s-dashboard\"]}"
|
||||
}
|
||||
|
||||
resource "authentik_provider_oauth2" "k8s_dashboard" {
|
||||
name = "k8s-dashboard"
|
||||
client_id = data.vault_kv_secret_v2.k8s_dashboard.data["oauth2_proxy_client_id"]
|
||||
client_secret = data.vault_kv_secret_v2.k8s_dashboard.data["oauth2_proxy_client_secret"]
|
||||
client_type = "confidential"
|
||||
|
||||
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
|
||||
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
|
||||
|
||||
allowed_redirect_uris = [
|
||||
{
|
||||
matching_mode = "strict"
|
||||
url = "https://k8s.viktorbarzin.me/oauth2/callback"
|
||||
},
|
||||
]
|
||||
|
||||
access_token_validity = "hours=1"
|
||||
refresh_token_validity = "days=30"
|
||||
include_claims_in_id_token = true
|
||||
|
||||
property_mappings = concat(
|
||||
data.authentik_property_mapping_provider_scope.defaults.ids,
|
||||
[authentik_property_mapping_provider_scope.k8s_dashboard_aud.id],
|
||||
)
|
||||
}
|
||||
|
||||
resource "authentik_application" "k8s_dashboard" {
|
||||
name = "Kubernetes Dashboard"
|
||||
slug = "k8s-dashboard"
|
||||
protocol_provider = authentik_provider_oauth2.k8s_dashboard.id
|
||||
meta_launch_url = "https://k8s.viktorbarzin.me"
|
||||
policy_engine_mode = "any"
|
||||
}
|
||||
|
||||
# Restrict who can complete the OIDC flow to the K8s RBAC groups.
|
||||
resource "authentik_policy_expression" "k8s_dashboard_groups" {
|
||||
name = "k8s-dashboard-group-access"
|
||||
expression = <<-EOT
|
||||
return (
|
||||
ak_is_group_member(request.user, name="kubernetes-admins")
|
||||
or ak_is_group_member(request.user, name="kubernetes-power-users")
|
||||
or ak_is_group_member(request.user, name="kubernetes-namespace-owners")
|
||||
)
|
||||
EOT
|
||||
}
|
||||
|
||||
resource "authentik_policy_binding" "k8s_dashboard_groups" {
|
||||
target = authentik_application.k8s_dashboard.uuid
|
||||
policy = authentik_policy_expression.k8s_dashboard_groups.id
|
||||
order = 0
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Plan**
|
||||
|
||||
Run: `cd stacks/k8s-dashboard && ../../scripts/tg plan`
|
||||
Expected: plan adds `authentik_property_mapping_provider_scope.k8s_dashboard_aud`, `authentik_provider_oauth2.k8s_dashboard`, `authentik_application.k8s_dashboard`, `authentik_policy_expression.k8s_dashboard_groups`, `authentik_policy_binding.k8s_dashboard_groups`. **No changes to existing resources** (kong-proxy, ingress untouched).
|
||||
|
||||
- [ ] **Step 3: Claim presence, then apply**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
~/code/scripts/presence claim stack:k8s-dashboard --purpose "add Authentik OIDC app for dashboard SSO"
|
||||
cd stacks/k8s-dashboard && ../../scripts/tg apply --non-interactive
|
||||
```
|
||||
Expected: 5 resources added, 0 changed, 0 destroyed.
|
||||
|
||||
- [ ] **Step 4: Verify the application exists in Authentik**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $(VAULT_ADDR=https://vault.viktorbarzin.me vault kv get -field=tf_api_token secret/authentik)" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/?slug=k8s-dashboard" | jq '.results[].slug'
|
||||
```
|
||||
Expected: `"k8s-dashboard"`.
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/k8s-dashboard/authentik.tf
|
||||
git commit -m "feat(k8s-dashboard): add Authentik OIDC app for dashboard SSO"
|
||||
git push origin master
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: oauth2-proxy Deployment + Service (additive — still no cutover)
|
||||
|
||||
**Files:**
|
||||
- Create: `stacks/k8s-dashboard/oauth2_proxy.tf`
|
||||
|
||||
- [ ] **Step 1: Create `stacks/k8s-dashboard/oauth2_proxy.tf`**
|
||||
|
||||
```hcl
|
||||
# -----------------------------------------------------------------------------
|
||||
# oauth2-proxy: runs the Authentik OIDC code-flow and injects the user's
|
||||
# id_token as `Authorization: Bearer` upstream to kong-proxy, so the dashboard
|
||||
# talks to the apiserver AS THE USER (per-user RBAC applies).
|
||||
# -----------------------------------------------------------------------------
|
||||
|
||||
resource "kubernetes_manifest" "oauth2_proxy_externalsecret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "oauth2-proxy"
|
||||
namespace = kubernetes_namespace.k8s-dashboard.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "1h"
|
||||
secretStoreRef = { name = "vault-kv", kind = "ClusterSecretStore" }
|
||||
target = { name = "oauth2-proxy", creationPolicy = "Owner" }
|
||||
data = [
|
||||
{ secretKey = "client-id", remoteRef = { key = "k8s-dashboard", property = "oauth2_proxy_client_id" } },
|
||||
{ secretKey = "client-secret", remoteRef = { key = "k8s-dashboard", property = "oauth2_proxy_client_secret" } },
|
||||
{ secretKey = "cookie-secret", remoteRef = { key = "k8s-dashboard", property = "oauth2_proxy_cookie_secret" } },
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
locals {
|
||||
oauth2_proxy_upstream = "https://kubernetes-dashboard-kong-proxy.kubernetes-dashboard.svc.cluster.local:443"
|
||||
}
|
||||
|
||||
resource "kubernetes_deployment" "oauth2_proxy" {
|
||||
metadata {
|
||||
name = "oauth2-proxy"
|
||||
namespace = kubernetes_namespace.k8s-dashboard.metadata[0].name
|
||||
labels = { app = "oauth2-proxy" }
|
||||
}
|
||||
|
||||
spec {
|
||||
replicas = 2
|
||||
selector { match_labels = { app = "oauth2-proxy" } }
|
||||
|
||||
template {
|
||||
metadata { labels = { app = "oauth2-proxy" } }
|
||||
spec {
|
||||
container {
|
||||
name = "oauth2-proxy"
|
||||
image = "quay.io/oauth2-proxy/oauth2-proxy:v7.7.1"
|
||||
args = [
|
||||
"--http-address=0.0.0.0:4180",
|
||||
"--provider=oidc",
|
||||
"--oidc-issuer-url=https://authentik.viktorbarzin.me/application/o/k8s-dashboard/",
|
||||
"--redirect-url=https://k8s.viktorbarzin.me/oauth2/callback",
|
||||
"--upstream=${local.oauth2_proxy_upstream}",
|
||||
"--ssl-upstream-insecure-skip-verify=true",
|
||||
"--scope=openid email profile offline_access k8s-dashboard-audience",
|
||||
"--oidc-extra-audience=kubernetes",
|
||||
"--pass-authorization-header=true",
|
||||
"--set-authorization-header=true",
|
||||
"--pass-access-token=true",
|
||||
"--email-domain=*",
|
||||
"--insecure-oidc-allow-unverified-email=true",
|
||||
"--cookie-secure=true",
|
||||
"--cookie-domain=k8s.viktorbarzin.me",
|
||||
"--whitelist-domain=k8s.viktorbarzin.me",
|
||||
"--cookie-refresh=30m",
|
||||
"--cookie-expire=168h",
|
||||
"--code-challenge-method=S256",
|
||||
"--reverse-proxy=true",
|
||||
"--skip-provider-button=true",
|
||||
]
|
||||
env {
|
||||
name = "OAUTH2_PROXY_CLIENT_ID"
|
||||
value_from { secret_key_ref { name = "oauth2-proxy", key = "client-id" } }
|
||||
}
|
||||
env {
|
||||
name = "OAUTH2_PROXY_CLIENT_SECRET"
|
||||
value_from { secret_key_ref { name = "oauth2-proxy", key = "client-secret" } }
|
||||
}
|
||||
env {
|
||||
name = "OAUTH2_PROXY_COOKIE_SECRET"
|
||||
value_from { secret_key_ref { name = "oauth2-proxy", key = "cookie-secret" } }
|
||||
}
|
||||
port { container_port = 4180 }
|
||||
readiness_probe {
|
||||
http_get {
|
||||
path = "/ping"
|
||||
port = 4180
|
||||
}
|
||||
initial_delay_seconds = 5
|
||||
period_seconds = 10
|
||||
}
|
||||
resources {
|
||||
requests = { cpu = "25m", memory = "64Mi" }
|
||||
limits = { memory = "128Mi" }
|
||||
}
|
||||
}
|
||||
dns_config {
|
||||
option {
|
||||
name = "ndots"
|
||||
value = "2"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "oauth2_proxy" {
|
||||
metadata {
|
||||
name = "oauth2-proxy"
|
||||
namespace = kubernetes_namespace.k8s-dashboard.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
selector = { app = "oauth2-proxy" }
|
||||
port {
|
||||
port = 4180
|
||||
target_port = 4180
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2: First-apply the ExternalSecret only (plan-time-secret gotcha)**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
cd stacks/k8s-dashboard && ../../scripts/tg apply --non-interactive \
|
||||
-target=kubernetes_manifest.oauth2_proxy_externalsecret
|
||||
```
|
||||
Expected: 1 resource added.
|
||||
|
||||
- [ ] **Step 3: Verify the K8s Secret materialized**
|
||||
|
||||
Run:
|
||||
```bash
|
||||
kubectl get secret oauth2-proxy -n kubernetes-dashboard -o jsonpath='{.data.client-id}' | base64 -d
|
||||
```
|
||||
Expected: `k8s-dashboard`.
|
||||
|
||||
- [ ] **Step 4: Full apply (deployment + service)**
|
||||
|
||||
Run: `cd stacks/k8s-dashboard && ../../scripts/tg apply --non-interactive`
|
||||
Expected: `kubernetes_deployment.oauth2_proxy` + `kubernetes_service.oauth2_proxy` added, 0 changed.
|
||||
|
||||
- [ ] **Step 5: Verify oauth2-proxy is healthy** (background watch, no `sleep`)
|
||||
|
||||
Run:
|
||||
```bash
|
||||
kubectl get pods -n kubernetes-dashboard -l app=oauth2-proxy -w
|
||||
```
|
||||
Expected: both replicas `Running` and `READY 1/1` (readiness passing). Then check logs for clean OIDC discovery:
|
||||
```bash
|
||||
kubectl logs -n kubernetes-dashboard -l app=oauth2-proxy --tail=30
|
||||
```
|
||||
Expected: `OAuthProxy configured for OpenID Connect Client ID: k8s-dashboard` and no discovery errors. (The ingress still points at kong-proxy; nothing user-facing changed yet.)
|
||||
|
||||
- [ ] **Step 6: Commit**
|
||||
|
||||
```bash
|
||||
git add stacks/k8s-dashboard/oauth2_proxy.tf
|
||||
git commit -m "feat(k8s-dashboard): deploy oauth2-proxy (not yet wired to ingress)"
|
||||
git push origin master
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Cutover — repoint ingress to oauth2-proxy
|
||||
|
||||
This is the only step that changes existing behavior. Rollback = revert this commit.
|
||||
|
||||
**Files:**
|
||||
- Modify: `stacks/k8s-dashboard/main.tf` (the `module "ingress"` block, currently around `main.tf:92-111`)
|
||||
|
||||
- [ ] **Step 1: Edit the ingress module block**
|
||||
|
||||
Replace the existing `module "ingress"` block in `stacks/k8s-dashboard/main.tf` with:
|
||||
|
||||
```hcl
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
namespace = kubernetes_namespace.k8s-dashboard.metadata[0].name
|
||||
name = "kubernetes-dashboard"
|
||||
service_name = "oauth2-proxy"
|
||||
host = "k8s"
|
||||
dns_type = "proxied"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# auth = "none": oauth2-proxy is the gate — it runs the Authentik OIDC
|
||||
# code-flow and injects the user's id_token as Bearer for dashboard->apiserver
|
||||
# auth. A group policy on the Authentik app restricts access to the
|
||||
# kubernetes-* RBAC groups. See docs/plans/2026-06-04-k8s-dashboard-sso-design.md.
|
||||
auth = "none"
|
||||
backend_protocol = "HTTP"
|
||||
port = 4180
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Kubernetes Dashboard"
|
||||
"gethomepage.dev/description" = "Cluster dashboard"
|
||||
"gethomepage.dev/icon" = "kubernetes-dashboard.png"
|
||||
"gethomepage.dev/group" = "Core Platform"
|
||||
"gethomepage.dev/pod-selector" = ""
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- [ ] **Step 2: Plan (the comment guard runs here)**
|
||||
|
||||
Run: `cd stacks/k8s-dashboard && ../../scripts/tg plan`
|
||||
Expected: the ingress `Service`/middleware updates in place (kong-proxy→oauth2-proxy, drops the Authentik forward-auth middleware). `scripts/check-ingress-auth-comments.py` passes (the `# auth = "none": …` comment is present). No resource destroyed/recreated beyond the ingress objects.
|
||||
|
||||
- [ ] **Step 3: Apply the cutover**
|
||||
|
||||
Run: `cd stacks/k8s-dashboard && ../../scripts/tg apply --non-interactive`
|
||||
Expected: ingress resources updated; apply succeeds.
|
||||
|
||||
- [ ] **Step 4: VERIFY THE AUDIENCE (blocking gate)**
|
||||
|
||||
Log in once in a browser to `https://k8s.viktorbarzin.me` as `viktor`. Capture the
|
||||
id_token the dashboard sends to the apiserver: open browser devtools → Network →
|
||||
click any `/api/v1/...` request → Request Headers → copy the value of
|
||||
`Authorization` (the part after `Bearer `). Decode its claims:
|
||||
```bash
|
||||
JWT='<paste-token-here>'
|
||||
echo "$JWT" | cut -d. -f2 | tr '_-' '/+' | base64 -d 2>/dev/null | jq '{aud, email, groups}'
|
||||
```
|
||||
Expected: `aud` contains **both** `"kubernetes"` and `"k8s-dashboard"`, `email` is set
|
||||
(e.g. `viktor@viktorbarzin.me`), and `groups` is a non-empty list.
|
||||
|
||||
**If `aud` does NOT contain `kubernetes`** → the scope-mapping audience override didn't
|
||||
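take. STOP and roll back so the dashboard returns to forward-auth (nothing is committed yet at this point, so restore the previous `module "ingress"` block and re-apply; Step 7 covers the post-commit revert), then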
|
||||
apply the §5 design fallback (reuse the `kubernetes` client as confidential +
|
||||
add `--oidc-client-secret` to the kubelogin setup script). Do not leave the cutover
|
||||
live with a broken audience — users would get apiserver 401s.
|
||||
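The one-liner above relies on `base64 -d` tolerating the padding that base64url segments omit. A more defensive sketch follows, assuming GNU coreutils `base64` and `jq` are installed; the helper names `jwt_payload` and `has_audience` are hypothetical and not part of this plan.

```bash
# Hypothetical helpers; assumes GNU coreutils base64 and jq.

# Print the decoded payload (second dot-separated segment) of a JWT.
jwt_payload() (
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it so strict decoders accept the input.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
)

# Succeed if the token's aud claim (string or array) contains the given value.
has_audience() (
  jwt_payload "$1" |
    jq -e --arg a "$2" '.aud | if type == "array" then . else [.] end | any(. == $a)' >/dev/null
)
```

Possible use at the gate: `has_audience "$JWT" kubernetes && has_audience "$JWT" k8s-dashboard || echo "audience check failed"`.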
|
||||
- [ ] **Step 5: VERIFY end-to-end RBAC** (Playwright MCP; screenshot on failure)
|
||||
|
||||
- As **viktor** (admin): dashboard lists all namespaces; can view/edit any Deployment. ✅
|
||||
- As **gheorghe** (`vabbit81`): switch to namespace `vabbit81` → can view + edit Deployments, read pod logs; attempting to create a resource in another namespace returns `Forbidden`. ✅
|
||||
- Unauthenticated/other user: Authentik denies at the group policy (no dashboard session issued). ✅
|
||||
|
||||
- [ ] **Step 6: VERIFY CLI regression (must still work, untouched)**
|
||||
|
||||
Run (as viktor, existing kubeconfig):
|
||||
```bash
|
||||
kubectl --context=<oidc-context> get nodes
|
||||
```
|
||||
Expected: succeeds exactly as before (the public `kubernetes` client + kubelogin path is unchanged).
|
||||
|
||||
- [ ] **Step 7: Commit** (rollback = `git revert` this commit + re-apply)
|
||||
|
||||
```bash
|
||||
git add stacks/k8s-dashboard/main.tf
|
||||
git commit -m "feat(k8s-dashboard): cut over ingress to oauth2-proxy SSO
|
||||
|
||||
Dashboard now authenticates users via Authentik (oauth2-proxy) and applies
|
||||
each user's own RBAC. Rollback: revert this commit + scripts/tg apply."
|
||||
git push origin master
|
||||
```
|
||||
|
||||
- [ ] **Step 8: Release presence claims**
|
||||
|
||||
```bash
|
||||
~/code/scripts/presence release stack:k8s-dashboard
|
||||
~/code/scripts/presence release service:k8s-dashboard
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Documentation (same logical change set)
|
||||
|
||||
**Files:**
|
||||
- Modify: `docs/architecture/authentication.md`
|
||||
- Modify: `docs/architecture/multi-tenancy.md`
|
||||
- Modify: `.claude/reference/authentik-state.md`
|
||||
- Modify: `.claude/reference/service-catalog.md`
|
||||
|
||||
- [ ] **Step 1: `docs/architecture/authentication.md`**
|
||||
|
||||
- In the "OIDC Applications" table, add a row: `Kubernetes Dashboard | OIDC (confidential, via oauth2-proxy) | Dashboard SSO with per-user RBAC`.
|
||||
- Add a subsection "Kubernetes Dashboard SSO" describing the oauth2-proxy → kong-proxy → apiserver flow, the dual-audience (`kubernetes` + `k8s-dashboard`) scope mapping, and the group-restriction policy. Note the dashboard ingress is `auth = "none"` because oauth2-proxy is the gate (not a regression of the forward-auth default).
|
||||
|
||||
- [ ] **Step 2: `docs/architecture/multi-tenancy.md`**
|
||||
|
||||
- In "User Setup (Self-Service)", add that namespace-owners can also use the **web dashboard** at `k8s.viktorbarzin.me` (Authentik SSO → their namespace RBAC), in addition to kubectl.
|
||||
|
||||
- [ ] **Step 3: `.claude/reference/authentik-state.md`**
|
||||
|
||||
- Record the new application `Kubernetes Dashboard` (slug `k8s-dashboard`), confidential provider `k8s-dashboard`, custom scope mapping `k8s-dashboard audience` (scope `k8s-dashboard-audience`, sets dual `aud`), and the group-access policy/binding. Note these are TF-managed in `stacks/k8s-dashboard/authentik.tf`.
|
||||
|
||||
- [ ] **Step 4: `.claude/reference/service-catalog.md`**
|
||||
|
||||
- Update the k8s-dashboard entry: auth posture is now oauth2-proxy OIDC SSO (was Authentik forward-auth + static cluster-admin SA), per-user RBAC.
|
||||
|
||||
- [ ] **Step 5: Commit**
|
||||
|
||||
```bash
|
||||
git add docs/architecture/authentication.md docs/architecture/multi-tenancy.md \
|
||||
.claude/reference/authentik-state.md .claude/reference/service-catalog.md
|
||||
git commit -m "docs(k8s-dashboard): document dashboard SSO + per-user RBAC [ci skip]"
|
||||
git push origin master
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Self-Review notes (coverage vs. design)
|
||||
|
||||
- Design §4.1 Authentik app → Task 2. §4.2 Vault+ESO → Task 1 + Task 3 Step 1-3. §4.3 oauth2-proxy → Task 3. §4.4 ingress → Task 4.
|
||||
- Design §5 audience strategy + apply-time verification + fallback → Task 4 Step 4 (blocking gate).
|
||||
- Design §7 testing → Task 4 Steps 5-6. §8 rollback → Task 4 Step 7. §10 docs → Task 5.
|
||||
- Out-of-scope (design §9): static cluster-admin SA intentionally NOT touched — no task, by design.
|
||||
|
||||
## Known integration risks (watch during Task 4)
|
||||
|
||||
- **Dashboard v7 ignoring a pre-set Authorization header** (kubernetes/dashboard #5105, #1213): if the dashboard still shows its token-login page after SSO, confirm `--pass-authorization-header=true` and that kong forwards the header; the dashboard `api` component uses the bearer for apiserver calls. Validate in Task 4 Step 5.
|
||||
- **Scope mapping audience override** (primary risk): mitigated by the blocking decode in Task 4 Step 4 + the documented fallback.
|
||||
152
docs/plans/2026-06-04-pve-fan-control-design.md
Normal file
|
|
@ -0,0 +1,152 @@
|
|||
# PVE R730 presence-aware fan control — design
|
||||
|
||||
**Date:** 2026-06-04
|
||||
**Status:** implemented
|
||||
**Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh`
|
||||
**Runbook:** `infra/docs/runbooks/fan-control.md`
|
||||
|
||||
## Problem
|
||||
|
||||
The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72–77°C under normal
|
||||
cluster load. That is safe (firmware warning at 88°C, critical 93°C) but the
|
||||
iDRAC's stock fan curve optimises for quiet, not cool — it pins the fans at the
|
||||
~7080 RPM floor even at 72°C / load 30 and only ramps near ~80°C. We want the
|
||||
CPU to run cooler when it costs nothing (the box is in the garage, usually
|
||||
empty) while staying quiet when someone is physically in the garage.
|
||||
|
||||
## Measured fan/temp relationship (manual IPMI sweep, 2026-06-04)
|
||||
|
||||
At a comparable CPU load (~45–53 % busy):
|
||||
|
||||
| Fan setting | Fan RPM | CPU temp |
|
||||
|-------------|---------|----------|
|
||||
| Auto (floor) | 7,080 | 71–72°C |
|
||||
| 50 % | 9,360 | 65–66°C |
|
||||
| 70 % | 12,800 | 60–61°C |
|
||||
| 100 % | 17,000 | 55–56°C |
|
||||
|
||||
Best °C-per-RPM is the first step; beyond ~70 % it is mostly noise. ~16°C of
|
||||
swing is available.
|
||||
|
||||
## Power characterization (sweep 2026-06-05)
|
||||
|
||||
Averaged wall power (iDRAC DCMI) + temp at each fan setting:
|
||||
|
||||
| Fan | RPM | Power | CPU | load |
|
||||
|-----|-----|-------|-----|------|
|
||||
| auto | 7,080 | 296 W | 68°C | 21 |
|
||||
| 20 % | 4,800 | 281 W | 73°C | 20 |
|
||||
| 30 % | 6,360 | 288 W | 72°C | 19 |
|
||||
| 50 % | 9,360 | 299 W | 65°C | 18 |
|
||||
| 60 % | 11,040 | 303 W | 61°C | 17 |
|
||||
| 70 % | 12,720 | 324 W | 59°C | 16 |
|
||||
| 100 % | 16,920 | 378 W | 59°C | 17 |
|
||||
|
||||
**The cooling-per-watt knee is ~60 %.** Fan power follows ~RPM³: 60→70 % costs
|
||||
+21 W for −2°C; 70→100 % costs **+54 W for 0°C** (the CPU floors ~59°C at cluster
|
||||
load — more airflow does nothing). Full speed draws ~97 W (~850 kWh/yr) over the
|
||||
floor and buys nothing past 60 %.
|
||||
|
||||
**Decision (2026-06-05):** the COOL curve caps its normal band at 60 % (~303 W,
|
||||
~61°C) — capturing essentially all achievable cooling while avoiding the wasteful
|
||||
80–100 % zone, now reserved as a high-load safety ramp (≥73/79°C) before the 83°C
|
||||
ceiling. QUIET is unchanged (already at the low-power floor: 20 % / 4,800 RPM /
|
||||
281 W). Verified live after re-tune: 63°C, 60 %, ~267 W.
|
||||
|
||||
## Decisions
|
||||
|
||||
1. **Custom bash daemon + systemd service**, deployed to the PVE host the same
|
||||
way as `apply-mbps-caps` / `daily-backup` (source in `infra/scripts/`, scp to
|
||||
`/usr/local/bin`). It cannot be Terraform/k8s — it runs on the bare host where
|
||||
IPMI lives. (OSS `tigerblue77/Dell-iDRAC-fan-controller` was considered;
|
||||
rejected — it is a Docker container, off-pattern here, and unaware of our
|
||||
constraints.)
|
||||
2. **CPU temperature is the only control input.** The Tesla T4 has its own
|
||||
always-on fan (owner-confirmed), so it self-cools and does not depend on
|
||||
chassis airflow — no GPU coupling needed.
|
||||
3. **Presence = the garage door**, because the server is *in the garage*
|
||||
(memory id=1723); noise only matters to people physically there. Signal:
|
||||
ha-sofia `sensor.garage_door_state_bg`. Open now, or last changed within
|
||||
`HOLD_SECS` (15 min) ⇒ someone's around ⇒ QUIET; otherwise COOL.
|
||||
`house_mode` was rejected — it tracks *apartment* occupancy, irrelevant to
|
||||
garage noise.
|
||||
4. **Two continuous LINEAR curves**, picked by presence. (Originally discrete
|
||||
step-bands; replaced 2026-06-05 — the bands flapped at edges, e.g. 45↔65%.
|
||||
Web research: a linear curve + 2–3°C hysteresis is the homelab standard; PID
|
||||
is overkill for this slow thermal loop and even PID projects "only lower, don't
|
||||
chase a setpoint".) fan% interpolates between per-mode anchors, clamped flat
|
||||
outside; both reach 100% right at the 83°C ceiling:
|
||||
|
||||
| Mode | T_LO → P_LO | T_HI → P_HI | slope |
|
||||
|------|-------------|-------------|-------|
|
||||
| COOL (garage empty) | 50°C → 30% | 83°C → 100% | ~2.1%/°C (≈51% at the ~60°C equilibrium) |
|
||||
| QUIET (occupied) | 68°C → 20% | 83°C → 100% | ~5.3%/°C (near-silent until ~70°C) |
|
||||
|
||||
Anchors are env-tunable (`COOL_T_LO/P_LO/T_HI/P_HI`, `QUIET_*`). Under normal
|
||||
load the COOL equilibrium (~60°C → ~51%) sits near the measured ~60% power
|
||||
knee; the ramp toward 100% only engages at genuinely high temp (safety).
|
||||
Anti-oscillation: asymmetric hysteresis (ramp up immediately; ease down only
|
||||
once the curve, evaluated 3°C hotter, still asks for less) **plus** a `MIN_STEP`
|
||||
(3%) minimum-change threshold so 1–2% wiggles don't churn IPMI writes.
|
||||
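Decisions 3 and 4 can be sketched in integer shell arithmetic. This is an illustration only, not the contents of `fan-control.sh`: the function names, the `HYST` variable, and the literal door state `open` are assumptions, and the ease-down rule is one reading of the hysteresis described above (follow the curve as if it were `HYST` °C hotter).

```bash
# Sketch only: function names, HYST, and the "open" state string are assumptions.
HOLD_SECS=${HOLD_SECS:-900}   # 15 min, from the design
HYST=${HYST:-3}               # degrees C of ease-down hysteresis
MIN_STEP=${MIN_STEP:-3}       # minimum % change worth an IPMI write

# pick_mode DOOR_STATE SECS_SINCE_LAST_CHANGE -> QUIET | COOL
pick_mode() (
  if [ "$1" = "open" ] || [ "$2" -lt "$HOLD_SECS" ]; then echo QUIET; else echo COOL; fi
)

# curve_pct TEMP T_LO P_LO T_HI P_HI -> integer fan %, clamped flat outside the anchors
curve_pct() (
  t=$1 t_lo=$2 p_lo=$3 t_hi=$4 p_hi=$5
  if [ "$t" -le "$t_lo" ]; then
    echo "$p_lo"
  elif [ "$t" -ge "$t_hi" ]; then
    echo "$p_hi"
  else
    span=$(( t_hi - t_lo ))
    # Linear interpolation, rounded to the nearest integer percent.
    echo $(( p_lo + ((t - t_lo) * (p_hi - p_lo) + span / 2) / span ))
  fi
)

# next_pct CURRENT_PCT TEMP T_LO P_LO T_HI P_HI -> fan % to hold for this loop
next_pct() (
  cur=$1 t=$2
  shift 2
  want=$(curve_pct "$t" "$@")
  if [ "$want" -gt "$cur" ]; then
    target=$want                               # ramp up immediately
  else
    hot=$(curve_pct $(( t + HYST )) "$@")      # curve as if HYST degrees hotter
    if [ "$hot" -lt "$cur" ]; then target=$hot; else target=$cur; fi
  fi
  diff=$(( target - cur ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  # Ignore changes smaller than MIN_STEP to avoid churning IPMI writes.
  if [ "$diff" -lt "$MIN_STEP" ]; then echo "$cur"; else echo "$target"; fi
)
```

With the COOL anchors from the table, `curve_pct 60 50 30 83 100` lands on the ~51% equilibrium quoted there, and a drop from 62% only happens once the reading is cool enough that the curve at +3°C is also below 62%.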
|
||||
## Safety
|
||||
|
||||
Manual fan mode bypasses the iDRAC's own protection, so it is backstopped:
|
||||
|
||||
- **Daemon exit/crash/stop** → bash `EXIT` trap + systemd `ExecStopPost` both
|
||||
run `ipmitool raw 0x30 0x30 0x01 0x01` (restore Dell auto). `Restart=on-failure`.
|
||||
- **CPU ≥ `CEILING` (83°C)** → hand back to Dell auto until temp holds below
|
||||
`RESUME_BELOW` (75°C) for `RESUME_STABLE` (120 s), then resume manual.
|
||||
- **IPMI read failures ≥ `MAX_IPMI_FAILS`** → restore Dell auto.
|
||||
- **ha-sofia unreachable** → keep the last good presence decision; default COOL
|
||||
at cold start (thermally safe).
|
||||
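The ceiling backstop is a small two-state machine. A hedged sketch follows, using the variable names from the list above; the function name `ceiling_state` and the idea of passing in "seconds spent below `RESUME_BELOW`" as an argument are assumptions about how the daemon might track it.

```bash
# Sketch only: ceiling_state and its argument layout are hypothetical.
CEILING=${CEILING:-83}
RESUME_BELOW=${RESUME_BELOW:-75}
RESUME_STABLE=${RESUME_STABLE:-120}

# ceiling_state PREV_STATE TEMP SECS_BELOW_RESUME -> manual | fallback
ceiling_state() (
  prev=$1 t=$2 below=$3
  if [ "$t" -ge "$CEILING" ]; then
    echo fallback                 # hand control back to Dell auto
  elif [ "$prev" = fallback ] &&
       { [ "$t" -ge "$RESUME_BELOW" ] || [ "$below" -lt "$RESUME_STABLE" ]; }; then
    echo fallback                 # not cool for long enough yet
  else
    echo manual                   # safe to resume the curve
  fi
)
```

The gap between 83°C (enter) and 75°C held for 120 s (leave) is what keeps the daemon from flapping between manual and auto near the ceiling.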
|
||||
## Observability
|
||||
|
||||
Pushes to the Pushgateway (`http://10.0.20.100:30091`, job `fan_control`):
|
||||
`pve_fan_control_cpu_temp_celsius`, `_fan_percent`, `_mode` (1 quiet / 2 cool /
|
||||
3 manual / 0 fallback), `_ha_reachable`, `_fallback`, `_fan_rpm`, and
|
||||
`_fan_watts_est`.
|
||||
|
||||
**Fan power is ESTIMATED** — the iDRAC exposes only total DCMI watts + RPM (no
|
||||
per-fan power), so `_fan_watts_est` models it from RPM via the fan affinity law
|
||||
(power ∝ RPM³), calibrated to the 2026-06-05 sweep: `fan_W ≈ 0.0205·(RPM/1000)³`
|
||||
(≈2 W at the floor → ~99 W at full; fits the sweep within ~3 W). Surfaced in HA
|
||||
as `sensor.r730_fan_power_est` + a "Fan Power (est)" card on the dashboard-it
|
||||
Server view, next to total power (`sensor.r730_power_consumption`, redfish) — so
|
||||
the fan tax of the control curve is visible. The existing CPU-temp alert is
|
||||
unaffected.
|
||||
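The estimate is simple enough to reproduce by hand. A minimal sketch of the cubic model quoted above, assuming `awk` is available; the function name `fan_watts_est` is borrowed from the metric suffix and is not necessarily what the daemon calls it.

```bash
# fan_watts_est RPM -> estimated fan power in watts (one decimal),
# using the 2026-06-05 calibration: fan_W = 0.0205 * (RPM / 1000)^3
fan_watts_est() {
  awk -v rpm="$1" 'BEGIN { printf "%.1f\n", 0.0205 * (rpm / 1000) ^ 3 }'
}
```

Plugging in the sweep endpoints (4,800 and 16,920 RPM) reproduces the "≈2 W at the floor, ~99 W at full" range stated above.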
|
||||
## Testing
|
||||
|
||||
`test-fan-control.sh` sources the script (main is guarded by a `BASH_SOURCE`
|
||||
check) and unit-tests the pure functions: both curves, hysteresis up/down,
|
||||
presence open/recent/stale, temperature parsing, jq-free JSON field extraction,
|
||||
and percent→hex. 36 assertions, no hardware needed. The daemon also supports
|
||||
`DRY_RUN=1` and `RUN_ONCE=1` for integration checks.
|
||||
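As an illustration of the kind of pure function the suite covers, here is a guess at a percent→hex helper. The name `pct_to_hex` and the clamping to 0..100 are assumptions; the source does not show the set-speed command, so the commonly cited Dell form in the comment is likewise unconfirmed for this host.

```bash
# pct_to_hex PERCENT -> 0xNN byte, clamped to the 0..100 range
pct_to_hex() (
  pct=$1
  [ "$pct" -lt 0 ] && pct=0
  [ "$pct" -gt 100 ] && pct=100
  printf '0x%02x\n' "$pct"
)

# Commonly cited community usage (assumption, not taken from this design):
#   ipmitool raw 0x30 0x30 0x02 0xff "$(pct_to_hex 60)"
```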
|
||||
## HA control (added 2026-06-05, on the host daemon)
|
||||
|
||||
Delivered ahead of the cron migration (which is Vault-gated) by teaching the
|
||||
**host daemon** to poll two ha-sofia helpers each loop (`fc_resolve`):
|
||||
`input_select.r730_fan_mode` (auto/cool/quiet/manual) +
|
||||
`input_number.r730_fan_manual_pct`. `auto` = the garage-presence curve above;
|
||||
cool/quiet force that curve; manual holds a fixed %; `CEILING` still overrides.
|
||||
The **simplified dashboard (2026-06-05)** exposes just three things — fan speed
|
||||
(%/RPM), an **Override %** slider, and a **Lock** toggle. Lock = "freeze current
|
||||
speed / algo off": `automation.r730_fan_lock_freeze_current_speed_resume_algo`
|
||||
snapshots the live target % into Override and sets `mode=manual` on lock-ON, and
|
||||
`mode=auto` on lock-OFF — the daemon needs no change, the toggle just drives the
|
||||
mode. `cool`/`quiet` stay reachable via the entity but are off the dashboard. The
|
||||
60-min `automation.r730_fan_mode_auto_revert` is retained as a dormant safety net
|
||||
(manual mode now occurs only while locked, a state the automation skips). The daemon just polls and
|
||||
actuates.
|
||||
Monitoring + control live on the dashboard-it "Server" view (REST sensors: fan
|
||||
RPM from the redfish exporter; mode/target-% from the Pushgateway). The same
|
||||
logic already exists in the Python controller (`r730-fan-control/`) for the
|
||||
eventual in-cluster CronJob; when that deploys it supersedes the host daemon.
|
||||
|
||||
## Rollback
|
||||
|
||||
`systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01` on the
|
||||
host returns the box to stock firmware fan control. See the runbook.
|
||||
144
docs/plans/2026-06-05-block-storage-harden-nfs-design.md
Normal file
|
|
@ -0,0 +1,144 @@
|
|||
# Block-Storage Scaling — Harden proxmox-csi + NFS (Decision + Design)
|
||||
|
||||
**Date**: 2026-06-05
|
||||
**Status**: Decided — supersedes the recommendation in `2026-06-01-topolvm-evaluation.md`
|
||||
**Decision owner**: Viktor
|
||||
|
||||
## TL;DR
|
||||
|
||||
We keep the **proxmox-csi** block-storage model (which already gives cross-node
|
||||
PVC mobility) and **harden** it, rather than re-architecting to TopoLVM or
|
||||
Longhorn. The 29-PVC/node cap is made *unreachable* (not removed) by shrinking
|
||||
the block footprint via NFS migration of non-DB workloads; the ghost-disk
|
||||
doom-loop is *prevented* (not just detected); and node placement is rebalanced.
|
||||
**£0, no new hardware, mobility preserved.**
|
||||
|
||||
## Why this, not TopoLVM / Longhorn
|
||||
|
||||
Hard constraints set by Viktor (2026-06-05): **(a)** must keep the ability to
|
||||
move pods across VM nodes if one goes down (mobility), **(b)** no new hardware,
|
||||
**(c)** sdc IO contention is acceptable / not worth spending on.
|
||||
|
||||
Key architectural insight that drove the decision:
|
||||
|
||||
- **Mobility and the LUN cap are two sides of the same mechanism.** proxmox-csi
|
||||
gives mobility *because* it hot-plugs each PVC as a Proxmox virtio-scsi disk
|
||||
that re-attaches wherever the pod lands — and that hot-plug is exactly what
|
||||
imposes the `lun < 30` cap and spawns the query-pci ghost-disk loop.
|
||||
- **TopoLVM** removes the cap by killing the hot-plug — which is *why* it pins a
|
||||
PVC to one node. Rejected: violates constraint (a).
|
||||
- **Longhorn** keeps mobility via replication, but mobility-via-replication
|
||||
costs **≥2× writes** (1 replica = no failover). On a single PVE host both
|
||||
replicas land on the same sdc HDD — you pay double the write IO for redundancy
|
||||
that dies with the host anyway (host = SPOF). Longhorn's own docs say "use a
|
||||
dedicated disk, not the root disk." Rejected: wasteful on a single host;
|
||||
reconsider only if a 2nd physical host is added.
|
||||
- proxmox-csi already provides mobility at **1× write** (centralized LV
|
||||
re-attaches) — strictly more IO-efficient than replication on one host. The
|
||||
cap and ghost-loop are *warts on a good model*, not reasons to replace it.
|
||||
|
||||
| Option | 29-cap | Ghost loop | Mobility | sdc IO | Hardware | Verdict |
|
||||
|---|---|---|---|---|---|---|
|
||||
| **① Harden proxmox-csi + NFS** | managed (far off) | prevented | ✅ kept (1×) | same/better | £0 | **CHOSEN** |
|
||||
| TopoLVM (A/C) | removed | eliminated | ❌ pinned | A: same / C: better | £0 / £200 | rejected — loses mobility |
|
||||
| Longhorn | removed | eliminated | ✅ (2×) | worse | £0 | rejected — replication wasted on 1 host |
|
||||
|
||||
## Live state at decision time (2026-06-05)
|
||||
|
||||
- 6 workers (VMID 201–206), proxmox-csi `CSINode.allocatable.count = 28`/node →
|
||||
**168 slots**; **69 used (41%)**; **0 PVCs Pending**.
|
||||
- **Imbalance is the live risk, not aggregate capacity**: node6 **21/28** (hot),
|
||||
node5 **3/28**. node1=9, node2/3/4=12.
|
||||
- **Ghost-disk drift = 0** (the 2026-06-04 cleanup held; `qm config` scsi counts
|
||||
match tracked VolumeAttachments). Prevention still open (beads `code-dfjn`).
|
||||
Retained `unusedN` LVs: node1=6, node2=9, node3=6 (harmless to the cap).
|
||||
- Block PVCs: **74** (44 encrypted + 30 plain). NFS: 64. local-path: 9.
|
||||
- PVE host RAM **222/267 GiB used, swap in use** → adding more worker VMs is
|
||||
memory-bound (the May 4→6 escape hatch is mostly spent).
|
||||
- sdc thin pool `data`: 69.67% data / 15.89% meta. `nfs-data` LV 74% of 4 TiB.
|
||||
VG `pve` raw free <16 GiB; VG `ssd` free 475 GiB.
|
||||
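The slot arithmetic in the first two bullets can be reproduced from per-node counts. A small sketch, assuming `awk`; the function name `slot_usage` and the "node count" input format are made up for illustration.

```bash
# slot_usage ALLOC_PER_NODE < "node used" lines -> "used/total slots used (pct%)"
slot_usage() (
  awk -v alloc="$1" '
    { used += $2; nodes++ }
    END {
      total = nodes * alloc
      if (total == 0) exit 1
      printf "%d/%d slots used (%d%%)\n", used, total, int(used * 100 / total + 0.5)
    }'
)
```

Feeding it the six per-node counts listed above with `alloc=28` gives the 69/168 (41%) figure, which makes the point of the second bullet: the aggregate looks fine while node6 alone sits at 21/28.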
|
||||
## NFS-migration candidates (embedded-DB preflight is mandatory)
|
||||
|
||||
Rule: embedded transactional stores (SQLite/LevelDB/RocksDB/H2/LMDB/ClickHouse)
|
||||
corrupt on NFS; sensitive `-encrypted` PVCs lose LUKS-at-rest on NFS. Only
|
||||
non-DB, non-sensitive (or app-encrypted) workloads qualify.
|
||||
|
||||
**Verified NFS-safe (preflighted 2026-06-05, no embedded DB):**
|
||||
|
||||
| PVC | Node | SC | Evidence |
|
||||
|---|---|---|---|
|
||||
| `tandoor/tandoor-data-proxmox` | node6 | proxmox-lvm | `/opt/recipes/mediafiles` = media + bundled static; PG-backed |
|
||||
| `speedtest/speedtest-config-proxmox` | node6 | proxmox-lvm | `/config` = logs (383 MB `laravel.log`) + config; MySQL-backed |
|
||||
| `hackmd/hackmd-data-encrypted` | node6 | encrypted | `/…/public/uploads` = PNG uploads (4.5 MB); MySQL-backed |
|
||||
| `changedetection/changedetection-data-proxmox` | node6 | proxmox-lvm | `/datastore` = JSON + brotli snapshots; no DB |
|
||||
| `send/send-data-proxmox` | node2 | proxmox-lvm | `/uploads` = encrypted blobs; Redis metadata |
|
||||
|
||||
**Phase-1 candidates (preflight before migrating):** instagram-poster,
|
||||
insta2spotify, novelapp, openclaw/openlobster, servarr/qbittorrent, postiz
|
||||
(scaled-0), priority-pass-uploads*, tripit-personal-documents* (*app-encrypted /
|
||||
sensitive — keep app-layer crypto, confirm before moving).
|
||||
|
||||
**Must stay on block** (embedded DB or fsync-critical): vaultwarden, ntfy,
|
||||
uptime-kuma, navidrome, actualbudget×3, openclaw×2, servarr arr-apps, freshrss
|
||||
(SQLite); stirling-pdf (H2); rybbit (ClickHouse); beads/dolt; all CNPG
|
||||
pg-cluster, mysql-standalone, immich-pg, redis; prometheus, alertmanager, loki,
|
||||
vault×3, technitium×3, mailserver, paperless, forgejo, matrix, n8n.
|
||||
|
||||
## Plan
|
||||
|
||||
### Phase 0 — Tactical relief (now): migrate the 5 verified-safe PVCs
|
||||
Per service, following the proven 2026-05-26 Wave-1 pattern (reversible — source
|
||||
block PVC kept until the NFS copy is verified):
|
||||
1. `presence claim service:<svc>`.
|
||||
2. Create NFS export dir on PVE host + add to git-managed
|
||||
`infra/scripts/pve-nfs-exports`; `exportfs -ra`.
|
||||
3. Add `module "nfs_<svc>"` (`modules/kubernetes/nfs_volume`) to the stack;
|
||||
`scripts/tg apply` to create the static NFS PV/PVC.
|
||||
4. Scale the workload to 0 (RWO → must release the block PVC).
|
||||
5. rsync block→NFS with `--checksum` (exclude cruft: changedetection
|
||||
`test-direct`/`test-seq`/`lost+found`; speedtest can drop `log/`).
|
||||
6. Swap the workload's `claim_name` to the NFS PVC; `tg apply`; scale up.
|
||||
7. Verify app health + data intact.
|
||||
8. Delete the old block PVC → frees the LUN slot; confirm with check #47.
|
||||
9. Commit + push per service; wait for CI/Woodpecker.
|
||||
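Step 5 plus the "verify before deleting the source" rule might look like the sketch below. It assumes `rsync` on the host doing the copy; the function names and any paths passed in are hypothetical, and the second dry-run pass is an added suggestion rather than part of the Wave-1 pattern as documented.

```bash
# exclude_args NAME... -> one --exclude=NAME argument per line
exclude_args() {
  for name in "$@"; do printf '%s%s\n' '--exclude=' "$name"; done
}

# copy_block_to_nfs SRC_DIR DST_DIR [EXCLUDE...]
copy_block_to_nfs() (
  src=$1 dst=$2
  shift 2
  # Word splitting of $ex is intentional; exclude names with whitespace are unsupported.
  ex=$(exclude_args "$@")
  rsync -a --checksum $ex "$src"/ "$dst"/ || exit 1
  # Second pass as a dry run: any itemized output means the two trees still differ.
  drift=$(rsync -a --checksum --dry-run --itemize-changes $ex "$src"/ "$dst"/)
  [ -z "$drift" ]
)
```

For changedetection this would be called with the cruft names from step 5, e.g. `copy_block_to_nfs "$SRC" "$DST" test-direct test-seq lost+found`, where `$SRC` and `$DST` are the mounted block and NFS paths.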
|
||||
Result: node6 **21 → 17**, node2 **12 → 11**.
|
||||
hackmd note: confirm the LUKS→NFS downgrade is acceptable (low-sensitivity doc
|
||||
images) or leave hackmd on encrypted block and accept 21→18.
|
||||
|
||||
### Phase 1 — Broader NFS sweep (this session if smooth, else tracked)
|
||||
Preflight + migrate the Phase-1 candidates above. Goal: leave **only true DBs +
|
||||
fsync-critical** services on block, so per-node block counts stay well under the
|
||||
cap with years of runway.
|
||||
|
||||
### Phase 2 — Ghost-loop prevention (beads `code-dfjn`; design separately)
|
||||
The structural half of "harden". This is substantial work, so the proposal is to
|
||||
design and plan it separately rather than rush it:
|
||||
- Soft-cap block PVCs/node below the query-pci failure threshold (observed safe
|
||||
≤24, fails ≥25) — alert + scheduler hint.
|
||||
- Raise the proxmox-csi controller's QMP/query-pci timeout (and/or QEMU side).
|
||||
- Auto-reconcile CronJob: detect drift (check #47 logic) → safe
|
||||
`qm set <vmid> --delete scsiN` (detach-only, retains LV).
|
||||
- Rebalance residual node6 block PVCs → node5, one at a time, check #47 watching.
|
||||
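The soft-cap alert in the first bullet reduces to a per-node count. A hedged sketch follows: `over_soft_cap` is a hypothetical name, the default of 24 is taken from the "observed safe ≤24" note, and counting every VolumeAttachment assumes that only block PVCs create them in this cluster.

```bash
# over_soft_cap [CAP] < one node name per attachment -> "node count" for nodes at or over CAP
over_soft_cap() (
  awk -v cap="${1:-24}" '
    { c[$1]++ }
    END { for (n in c) if (c[n] >= cap) print n, c[n] }' | sort
)

# Example feed (not run here):
#   kubectl get volumeattachments --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | over_soft_cap 24
```

Empty output means every node is under the cap, which makes the function easy to wire into an alert or a CronJob exit code.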
|
||||
### Phase 3 — Docs
|
||||
Update `storage.md` (Wave-2 NFS migration + the ① decision), `scale-k8s-cluster.md`,
|
||||
`.claude/reference/service-catalog.md`; add a "Decided: ①" banner to
|
||||
`2026-06-01-topolvm-evaluation.md` pointing here.
|
||||
|
||||
## Risks
|
||||
- Data loss during migration → mitigated by rsync `--checksum` + keep source
|
||||
until verified + the workload is scaled-0 during copy.
|
||||
- LUKS-at-rest dropped for `-encrypted` PVCs moved to NFS → only migrate
|
||||
low-sensitivity or app-encrypted ones; flag each.
|
||||
- NFS soft-mount semantics → only non-DB workloads (preflighted); `nfsvers=4`,
|
||||
`soft,timeo=30,retrans=3` per the `nfs_volume` module defaults.
|
||||
- Block rebalance (Phase 2) re-introduces detach/reattach ghost risk → one at a
|
||||
time with check #47.
|
||||
|
||||
## Related
|
||||
- `2026-06-01-topolvm-evaluation.md` (superseded recommendation)
|
||||
- `docs/architecture/storage.md` § "Per-VM SCSI-LUN cap"
|
||||
- `docs/runbooks/scale-k8s-cluster.md`
|
||||
- beads `code-dfjn` (ghost prevention), `code-oflt` (IO isolation — not pursued here)
|
||||
116
docs/plans/2026-06-05-idrac-snmp-migration-design.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
# iDRAC monitoring: Redfish → SNMP migration (design)
|
||||
|
||||
**Date:** 2026-06-05
|
||||
**Status:** approved (Viktor) — SNMP primary + thin Redfish remnant
|
||||
**Stack:** `stacks/monitoring`
|
||||
|
||||
## Problem
|
||||
|
||||
The R730 iDRAC Redfish exporter (`idrac-redfish-exporter`, mrlhansen
|
||||
`idrac_exporter`, image `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix`)
|
||||
is configured `metrics: all: true`. It collects on-demand and walks every
|
||||
Redfish subtree, making dozens of sequential ~1–2 s requests to a slow BMC.
|
||||
|
||||
Measured live (Prometheus `scrape_duration_seconds{job="redfish-idrac"}`, 24 h):
|
||||
- **avg 18.5 s, peak 28.3 s**, occasional fast-fail 0.085 s.
|
||||
- Pinned to a **3 m interval / 45 s timeout** because it cannot run at the 2 m
|
||||
global cadence.
|
||||
|
||||
The cost is dominated by walks that feed **dashboard-only** panels (`memory`
|
||||
10 DIMMs, `network`, `events`/SEL); the operationally important metrics (fan
|
||||
speed, temps, power, voltage) come from cheap single-request collectors.
|
||||
|
||||
## Decision
|
||||
|
||||
Make **SNMP the fast primary source** and keep a **thin, slow Redfish remnant**
|
||||
for the few things SNMP cannot serve. SNMP walks are fast (the `snmp-ups` job
|
||||
runs at 30 s); the iDRAC SNMP agent is already enabled and reachable.
|
||||
|
||||
Rejected alternatives: (1) pure collector-trim of Redfish — still BMC-bound and
|
||||
slow; (2) pure SNMP / retire Redfish — would require re-pointing the **external
|
||||
ha-sofia** `sensor.r730_fan_speed` REST sensor (collides with a live session
|
||||
editing the fan dashboard) and would drop two cosmetic panels.
|
||||
|
||||
## Key findings (ground-truthed)
|
||||
|
||||
- **The `snmp-idrac` job was dead.** It specified **no `module`** param, so
|
||||
`snmp_exporter` defaulted to `if_mib` and returned only the iDRAC NIC's
|
||||
interface counters — zero health/power/thermal. Both iDRAC jobs relabel to
|
||||
`r730_idrac_*`, which hid this. The alert `iDRACSNMPMetricsMissing` is
|
||||
**misnamed** — its expr `absent(r730_idrac_idrac_system_health)` checks a
|
||||
*Redfish* metric.
|
||||
- **A generated `dell_idrac` module already exists**, unmounted, in
|
||||
`prometheus_snmp_chart_values.yaml` (~lines 79–1628). The mounted config is
|
||||
`ups_snmp_values.yaml` (huawei/if_mib/ip_mib only). iDRAC SNMP = v2c,
|
||||
community `Public0` (already the `public_v2` auth in `ups_snmp_values.yaml`).
|
||||
- **Live snmpwalk (Public0, 192.168.1.4) confirms** these return real data:
|
||||
fan RPM `coolingDeviceReading` (.4.700.12.1.6 = 7080 RPM), temps
|
||||
`temperatureProbeReading` (.700.20.1.6, tenths-°C), system watts
|
||||
`amperageProbeReading` (.600.30.1.6 = 252 W), PSU input voltage
|
||||
`powerSupplyCurrentInputVoltage` (.600.12.1.16), PSU watts/health, global
|
||||
health `globalSystemStatus` (.5.2.1), `systemState*` rollups (.200),
|
||||
`physicalDisk*` status, `memoryDevice*` size/status/type/speed (.1100),
|
||||
`networkDevice*` status/connection (.1100.90), BIOS `2.19.0` (.300.50.1.8),
|
||||
model/service-tag (.5.1.3).
|
||||
- **Genuine SNMP gaps — but inert or cosmetic today:**
|
||||
- SSD life-left % (`physicalDiskRemainingRatedWriteEndurance` .49) → returns
|
||||
`255` (N/A) for every drive incl. the Samsung SSD. **Redfish today reports
|
||||
`0`** on the one drive that has it, and the SSD-wear alerts guard on `> 0`,
|
||||
so they **already never fire** → no functional loss.
|
||||
- SEL event log (`5.5.2`) → `NoSuchObject`. The `idrac_events_log_entry`
|
||||
metric is **already empty in Prometheus** today → no loss.
|
||||
- Indicator LED (`5.1.4`) → absent. Cosmetic ("Off") panel.
|
||||
- NIC link-speed Mbps → no OID (health + up/down preserved). Cosmetic.
|
||||
- Average watts → no native OID; reconstruct via PromQL `avg_over_time()`.
|
||||
|
||||
Conclusion: **every metric with real, used data today has an SNMP equivalent.**
|
||||
|
||||
## Naming / enum strategy
|
||||
|
||||
`snmp_exporter` names metrics after MIB objects (`temperatureProbeReading`,
|
||||
`coolingDeviceReading`, `globalSystemStatus`, …) → after the `r730_idrac_`
|
||||
relabel they are `r730_idrac_<mibName>`, different from today's
|
||||
`r730_idrac_idrac_*` / `r730_idrac_redfish_*`. **Re-point consumers** (not
|
||||
alias): aliasing via `metric_relabel_configs` only renames `__name__` and
|
||||
cannot fix the label-set mismatch (Redfish `member_id`/`name` vs SNMP numeric
|
||||
indexes) nor the **enum-value mismatch** (DellStatus `3=OK` vs Redfish `1`;
|
||||
`systemPowerState 4=on` vs Redfish `2`). Alert exprs must change regardless, so
|
||||
re-pointing is the honest path. The module adds `lookups:` so SNMP series carry
|
||||
human labels (probe/fan location, disk display name) like today.
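
To make the enum mismatch concrete, here is a minimal Python sketch. Only the four enum values (`3=OK` vs `1`, `4=on` vs `2`) come from this design; the function names and the "fire when not OK" shape of the alert condition are illustrative assumptions, not the actual alert exprs.

```python
# Enum values quoted in this design; all names below are illustrative.
SNMP_STATUS_OK = 3      # DellStatus: 3 = OK
REDFISH_HEALTH_OK = 1   # value today's Redfish-based alerts treat as OK
SNMP_POWER_ON = 4       # systemPowerState: 4 = on
REDFISH_POWER_ON = 2    # value today's Redfish-based alerts treat as on


def unhealthy_redfish(value: float) -> bool:
    """Old-style condition: fire when health is not the Redfish OK value."""
    return value != REDFISH_HEALTH_OK


def unhealthy_snmp(value: float) -> bool:
    """Re-pointed condition: fire when status is not the DellStatus OK value."""
    return value != SNMP_STATUS_OK


# A healthy SNMP sample (3) fed to the old condition fires spuriously,
# which is why renaming __name__ alone cannot stand in for re-pointing.
healthy_snmp_sample = 3
spurious = unhealthy_redfish(healthy_snmp_sample)
correct = unhealthy_snmp(healthy_snmp_sample)
```

The same reasoning applies to the power-state pair (`4` vs `2`): the comparison constant has to change together with the metric name.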
|
||||
|
||||
## Consumed-metric → SNMP mapping (DIRECT / REGEN / remnant)
|
||||
|
||||
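DIRECT = already covered by the generated `dell_idrac` module. REGEN = OID returns data but must be added to the module walk. remnant = stays on the slow Redfish exporter.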
|
||||
|
||||
| Consumed (today) | Source after migration |
|
||||
|---|---|
|
||||
| fan health | REGEN `coolingDeviceStatus` .700.12.1.5 |
|
||||
| consumed watts | DIRECT `amperageProbeReading` (System Board Pwr Consumption) |
|
||||
| system health rollup | DIRECT `globalSystemStatus` .5.2.1 |
|
||||
| PSU health | DIRECT `powerSupplyStatus`/`powerSupplySensorState` |
|
||||
| memory health | DIRECT `systemStateMemoryDeviceStatusCombined` .200.10.1.27 |
|
||||
| storage drive health | DIRECT `physicalDiskComponentStatus` .5.5.1.20.130.4.1.24 |
|
||||
| **SSD life %** | **remnant** (SNMP=255 N/A; already inert) |
|
||||
| system power state | DIRECT `systemPowerState` .5.2.4 (enum 4=on) |
|
||||
| PSU input voltage | DIRECT `powerSupplyCurrentInputVoltage` .600.12.1.16 |
|
||||
| system health (absent-probe) | DIRECT `globalSystemStatus` |
|
||||
| **fan speed RPM (HA)** | **remnant** for HA (HA reads the Redfish exporter directly); SNMP REGEN `coolingDeviceReading` for Grafana |
|
||||
| temperature | DIRECT `temperatureProbeReading` .700.20.1.6 (÷10) |
|
||||
| avg watts | PromQL `avg_over_time(amperageProbeReading)` |
|
||||
| **SEL log** | **remnant** (already empty) |
|
||||
| machine/bios info | REGEN model/svctag .5.1.3, BIOS .300.50.1.8 |
|
||||
| memory size / cpu count | DIRECT `memoryDeviceSize` (sum) / `processorDeviceStatus` (count) |
|
||||
| **indicator LED** | **remnant** (cosmetic) |
|
||||
| storage drive info/health/capacity | DIRECT `physicalDisk*` |
|
||||
| memory module info/health/cap/speed | DIRECT(size) + REGEN(status/type/location/speed .1100.50.1.{5,7,8,15}) |
|
||||
| network port health/link / **Mbps** | REGEN `networkDevice*` (.1100.90); **Mbps → remnant/drop** |
|
||||
| PSU output/input/capacity watts | DIRECT `powerSupplyOutputWatts`/`RatedInputWattage` |
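
Two rows in the table involve a derived value rather than a raw reading: temperature (tenths of a degree, ÷10) and average watts (no native OID). The Python sketch below shows both calculations under stated assumptions: the helper names are hypothetical, and apart from the 252 W reading quoted earlier the sample values are made up.

```python
from statistics import fmean


def tenths_to_celsius(raw: int) -> float:
    """temperatureProbeReading is reported in tenths of a degree Celsius."""
    return raw / 10


def average_watts(samples: list[float]) -> float:
    """Arithmetic mean of the samples in a window, the same quantity
    PromQL's avg_over_time() returns for a range vector."""
    if not samples:
        raise ValueError("no samples in window")
    return fmean(samples)


# Hypothetical inputs: a raw probe value of 230 and three watt samples
# (252 W is the live reading quoted in the findings; the others are invented).
inlet_c = tenths_to_celsius(230)
avg_w = average_watts([252.0, 248.0, 256.0])
```

In the dashboards the division and the averaging would be done in PromQL; the Python only spells out the arithmetic.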
|
||||
|
||||
## Remnant role
|
||||
|
||||
The Redfish exporter stays alive (so the external ha-sofia
|
||||
`sensor.r730_fan_speed` REST poll is **unchanged** — no ha-sofia edit, no
|
||||
collision). It is trimmed to `sensors,system,network,storage,events` and its
|
||||
Prometheus scrape slows to 10 m, keeping **only** the gap metrics (indicator
|
||||
LED, NIC Mbps, SSD-life, SEL) via `metric_relabel_configs` to avoid duplicate
|
||||
series with SNMP.
|
||||
53
docs/plans/2026-06-05-idrac-snmp-migration-plan.md
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
# iDRAC Redfish → SNMP migration (plan)
|
||||
|
||||
Companion to `2026-06-05-idrac-snmp-migration-design.md`. Execute in order;
|
||||
applies are staged so the safe/additive work lands and is verified before any
|
||||
consumer re-pointing.
|
||||
|
||||
Files:
|
||||
- `stacks/monitoring/modules/monitoring/ups_snmp_values.yaml` (merge target)
|
||||
- `stacks/monitoring/modules/monitoring/prometheus_snmp_chart_values.yaml` (dell_idrac source, ~79–1628)
|
||||
- `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (scrape jobs ~3150/3170, alerts ~811–1186)
|
||||
- `stacks/monitoring/modules/monitoring/idrac.tf` (Redfish exporter / remnant)
|
||||
- `stacks/monitoring/modules/monitoring/dashboards/idrac.json`, `cluster_health.json`
|
||||
|
||||
## Phase A — additive SNMP source (low risk)
|
||||
|
||||
- [ ] A1. Extract `dell_idrac` (lines 79–1628) from `prometheus_snmp_chart_values.yaml`; **strip its embedded `auth:`/`version:`** (the merge target uses the split `auths:` format) and append the module under `modules:` in `ups_snmp_values.yaml`.
|
||||
- [ ] A2. Hand-add to dell_idrac `walk:` + `metrics:` (with `lookups:` for labels):
|
||||
- `coolingDeviceReading` .4.700.12.1.6 (fan RPM, gauge, idx chassis+device, lookup `coolingDeviceLocationName` .8)
|
||||
- `coolingDeviceStatus` .4.700.12.1.5 (fan health, enum)
|
||||
- `networkDeviceStatus` / `networkDeviceConnectionStatus` (.1100.90.1.{3,17})
|
||||
- `systemBIOSVersionName` .300.50.1.8; system model .5.1.3.12 + service-tag .5.1.3.2
|
||||
- DIMM `.1100.50.1.{5 status, 7 type, 8 location, 15 speed}`
|
||||
- `physicalDiskRemainingRatedWriteEndurance` .5.5.1.20.130.4.1.49 (so remnant isn't needed for SSD if it ever populates; harmless 255 today)
|
||||
- [ ] A3. `snmp-idrac` job (`prometheus_chart_values.tpl` ~3150): add `params: { module: [dell_idrac], auth: [public_v2] }`, `scrape_interval: 1m`, `scrape_timeout: 30s`. Keep the `r730_idrac_` relabel.
|
||||
- [ ] A4. **Validate before any repoint:** apply monitoring stack; `curl 'http://snmp-exporter.monitoring.svc:9116/snmp?module=dell_idrac&auth=public_v2&target=192.168.1.4:161'` returns all REGEN/DIRECT metrics with readable labels; `scrape_duration_seconds{job="snmp-idrac"}` < 5 s; confirm exact emitted metric names + label keys (feeds B/C).
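
The A4 check ("returns all REGEN/DIRECT metrics") could be scripted roughly as follows. This is a sketch, not part of the plan: it assumes the exporter emits unprefixed MIB names (with the `r730_idrac_` prefix added later by the Prometheus relabel), the `EXPECTED` set is only a partial list drawn from A2 and the design's mapping table, and the sample text and its `Fan1` label value are invented.

```python
import re

# Partial list of names this plan expects from the dell_idrac module;
# extend it once A4 confirms the exact emitted names.
EXPECTED = {
    "coolingDeviceReading",
    "coolingDeviceStatus",
    "temperatureProbeReading",
    "amperageProbeReading",
    "globalSystemStatus",
    "powerSupplyCurrentInputVoltage",
}

# A sample line starts with the metric name, followed by '{' or whitespace.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition: str, expected: set[str] = EXPECTED) -> set[str]:
    """Return the expected names that the scrape did not emit."""
    return expected - metric_names(exposition)


# Made-up sample standing in for the curl output.
sample = (
    "# TYPE coolingDeviceReading gauge\n"
    'coolingDeviceReading{coolingDeviceLocationName="Fan1"} 7080\n'
    "globalSystemStatus 3\n"
)
missing = missing_metrics(sample)
```

Feeding the real curl output to `missing_metrics` and requiring an empty result would give a pass/fail gate before Phase B starts.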
|
||||
|
||||
## Phase B — re-point consumers to verified SNMP names (riskier)
|
||||
|
||||
- [ ] B1. Rewrite ~12 alert exprs (`prometheus_chart_values.tpl` 811–1186) to SNMP names + **SNMP enums** (`3=OK` not `1`; power `4=on` not `2`). Re-target absent-probes: `iDRACRedfishMetricsMissing`→`absent(r730_idrac_powerSupplyCurrentInputVoltage)`; `iDRACSNMPMetricsMissing`→`absent(r730_idrac_globalSystemStatus)` (also fixes the misnomer).
|
||||
- [ ] B2. Re-point ~26 panels in `idrac.json` + `cluster_health.json` to SNMP names/labels; avg-watts → `avg_over_time(...amperageProbeReading...[$__interval])`.
|
||||
- [ ] B3. Add any new SNMP metric names to the Prometheus keep-rules whitelist if present (grep `prometheus-server` configmap / `prometheus_chart_values.tpl` keep rules) so they aren't silently dropped.
|
||||
- [ ] B4. Apply; verify each re-pointed alert has data (no spurious `absent` firing) and panels render.
|
||||
|
||||
## Phase C — thin Redfish remnant
|
||||
|
||||
- [ ] C1. `idrac.tf` config map: `metrics: all: false` + enable only `sensors, system, network, storage, events` (drop power/memory/processors/manager/extra — now SNMP). (HA reads `sensors` directly — unchanged.)
|
||||
- [ ] C2. `redfish-idrac` job: `scrape_interval: 10m`; add `metric_relabel_configs` to **keep only** the gap series (indicator LED, NIC Mbps, SSD-life, SEL) → avoids duplicate series with SNMP.
|
||||
- [ ] C3. Apply; verify HA `sensor.r730_fan_speed` still updates, gap panels render, fan-control daemon unaffected (it uses IPMI, not this exporter — should be untouched).
|
||||
|
||||
## Phase D — docs + ship
|
||||
|
||||
- [ ] D1. Update `docs/architecture/monitoring.md` (iDRAC now SNMP-primary; remnant role), note the fixed alert misnomer, any runbook.
|
||||
- [ ] D2. Update this plan's checkboxes; commit (named files) + push; wait for CI/deploy.
|
||||
|
||||
## Rollback
|
||||
|
||||
All Terraform-managed. Revert the monitoring-stack commit + `scripts/tg apply`
|
||||
restores the Redfish-primary state. Phase A is additive (safe to leave even if
|
||||
B/C are reverted).
|
||||
|
||||
## Presence
|
||||
|
||||
Claim `stack:monitoring` + `service:idrac-redfish-exporter` before each apply.
|
||||
187
docs/plans/2026-06-07-multi-user-workstation-design.md
Normal file
|
|
@ -0,0 +1,187 @@
|
|||
# Multi-User Workstation — Design
|
||||
|
||||
- **Date:** 2026-06-07
|
||||
- **Status:** designed (grilled extensively); not yet implemented
|
||||
- **Owner:** Viktor (wizard)
|
||||
- **Builds on:** the t3code multi-user setup (`docs/plans/2026-06-01-t3-auto-provision-*`), the `k8s_users` multi-tenancy (`docs/architecture/multi-tenancy.md`), and the cloud-init VM-reproducibility decision (memory id=1575).
|
||||
- **Glossary:** see `infra/CONTEXT.md` → "Workstation (multi-user devvm)" for the canonical terms used here (devvm, Workstation, RBAC tier, Workstation profile, Config inheritance, Config base, Infra visibility).
|
||||
|
||||
## Goal
|
||||
|
||||
Let any onboarded person get a fully-configured Claude Code **Workstation** on the devvm that **inherits Viktor's config live** (his edits propagate with no per-user sync), bounded by **their own permissions** (read infra code + RBAC-scoped cluster view, never secrets), provisioned by **one declarative roster + one idempotent script**, and **reproducible from git** so the VM can be rebuilt from a template.
|
||||
|
||||
## How we got here (so the rationale isn't re-litigated)
|
||||
|
||||
This was stress-tested down several branches before landing:
|
||||
|
||||
1. **Adopt a CDE?** Researched Coder / Gitpod-Ona / Eclipse Che / DevPod / OpenHands (2026-06-07). The category consolidated to "Coder or Che, or build it." Coder is architecturally a great fit but the **role model we need is Premium-gated** (groups + OIDC group→role sync + template ACLs are all paid), its agent UI is mid-transition (Tasks→Agents, Sept 2026), and it still needs custom glue. ~80% of the hard parts are already solved by our stack. → **Build on the existing stack** (ADR-0001).
|
||||
2. **K8s ephemeral pods vs devvm OS users?** Ephemeral pods are maximally declarative but, at ~3-4 trusted users, re-platforming the agent + per-pod persistence is **overkill**; the devvm model already runs and config-push is *easier* on one host. → **devvm Linux users** (ADR-0002).
|
||||
3. **Config inheritance — sync vs live?** A periodic sync/seed was rejected; the requirement is **live inheritance** ("I edit, everyone has it"). Realized via **each subsystem's native machine-wide layer + a per-user layer on top** (ADR-0003) — not OverlayFS (kernel disallows live lowerdir edits), not Nix (rebuild, not live), not bespoke symlink-only (clumsy per-item override).
|
||||
|
||||
## Core model
|
||||
|
||||
A person's **RBAC tier** drives one **Workstation profile**. **Inheritance**: `wizard` authors a **Config base** once; every child user (emo, anca, gheorghe) inherits it **live** through native machine-wide layers and may add their own on top. What differs per tier is **Infra visibility** and cluster scope — never the inherited config. Onboarding is **declarative**: a git roster + an idempotent provisioner.
|
||||
|
||||
## Components
|
||||
|
||||
### 1. Roster — the SINGLE source of truth (in git, full lifecycle)
|
||||
|
||||
A git-committed map keyed by **`os_user`**; it drives the **entire lifecycle** (onboard → reconcile → offboard). It carries the multiple identifiers a person actually has (verified live 2026-06-08 — they differ!):
|
||||
|
||||
```yaml
|
||||
# infra/scripts/workstation/roster.yaml — THE source of truth
|
||||
# os_user (key) → authentik_user (login local-part) · k8s_user (k8s_users key) · tier · namespaces
|
||||
users:
|
||||
emo: { authentik_user: emil.barzin, k8s_user: emo, tier: power-user }
|
||||
# NET-NEW cluster identity — emo is NOT in k8s_users today
|
||||
ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] }
|
||||
# ALREADY a namespace-owner — preserve plotting-book; do NOT re-provision
|
||||
# gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] }
|
||||
# already a cluster namespace-owner; uncomment when he wants a devvm workstation
|
||||
# wizard (admin) is the base author; not provisioned as a child.
|
||||
```
|
||||
|
||||
**Single source of truth (SSoT):** the roster is authoritative; everything else is **derived or validated against it** — never hand-maintained in parallel:
|
||||
- `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` are **regenerated** from the roster each reconcile (not appended).
|
||||
- The Authentik **`T3 Users`** group membership is reconciled from the roster (a member ⇔ a roster entry).
|
||||
- The reconcile **validates** `roster.tier` against the live `k8s_users` role and **fails loud on mismatch** (e.g. roster says `power-user` but `k8s_users` says `namespace-owner`) — so the workstation tier and the cluster tier can't silently diverge. `k8s_user`/`namespaces` are reconciled into `k8s_users` (or asserted to match for pre-existing users).
|
||||
|
||||
`os_user` is the pinned key (no email→username derivation — avoids the `ancaelena98`-vs-`ancamilea` trap). Onboard = add an entry + reconcile; **offboard = remove the entry** (see "User lifecycle").
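
The "derived or validated, never hand-maintained" rule can be sketched in Python as below. The roster entries mirror `roster.yaml` above, already parsed into a dict; the function names and the shape of the live `k8s_users` lookup (`k8s_user` to role) are assumptions made for illustration, and `dispatch.json` is left out because its format is not described here.

```python
def render_ttyd_user_map(roster: dict) -> str:
    """Regenerate the map from scratch: one '<authentik_user>=<os_user>' line per entry."""
    lines = [
        f"{entry['authentik_user']}={os_user}"
        for os_user, entry in sorted(roster.items())
    ]
    return "\n".join(lines) + "\n"


def tier_mismatches(roster: dict, k8s_roles: dict) -> list[str]:
    """Compare roster tiers with live k8s_users roles.

    A user with no live entry (a net-new grant) is not a mismatch;
    any returned string should abort the reconcile loudly.
    """
    problems = []
    for os_user, entry in sorted(roster.items()):
        live = k8s_roles.get(entry["k8s_user"])
        if live is not None and live != entry["tier"]:
            problems.append(f"{os_user}: roster={entry['tier']} k8s_users={live}")
    return problems


roster = {
    "emo": {"authentik_user": "emil.barzin", "k8s_user": "emo", "tier": "power-user"},
    "ancamilea": {
        "authentik_user": "ancaelena98",
        "k8s_user": "anca",
        "tier": "namespace-owner",
        "namespaces": ["plotting-book"],
    },
}
k8s_roles = {"anca": "namespace-owner"}  # emo has no k8s_users entry today

user_map = render_ttyd_user_map(roster)
problems = tier_mismatches(roster, k8s_roles)
```

Because the map is rendered from the whole roster each time, removing an entry removes its line on the next reconcile, with no separate delete path to maintain.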
|
||||
|
||||
### 2. Eligibility gate (Authentik group, edge-enforced)
|
||||
|
||||
A `T3 Users` Authentik group gates `t3.viktorbarzin.me` at the edge via a one-branch addition to the existing `stacks/authentik/admin-services-restriction.tf` expression policy (`if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`). Non-members 302→login, never reach the box. Verified earlier: `X-authentik-groups` already reaches the dispatcher (it's in the forward-auth middleware `authResponseHeaders`), so a dispatcher-side second check is possible but the edge gate is the primary.
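
The one-branch addition can be exercised outside Authentik with a small stand-in. In the sketch below, `ak_is_group_member` is a local stub for the helper the policy runtime provides, `existing_policy` is a placeholder for the rest of `admin-services-restriction.tf` (not shown in this doc), and the dict-shaped user is purely illustrative.

```python
T3_HOST = "t3.viktorbarzin.me"


def ak_is_group_member(user: dict, name: str) -> bool:
    """Local stand-in for the Authentik helper of the same name."""
    return name in user.get("groups", [])


def existing_policy(host: str, user: dict) -> bool:
    """Placeholder for the existing restriction logic (assumed to allow here)."""
    return True


def gate(host: str, user: dict) -> bool:
    """The new branch: only 'T3 Users' members may reach the t3 host."""
    if host == T3_HOST:
        return ak_is_group_member(user, name="T3 Users")
    return existing_policy(host, user)


member = {"groups": ["T3 Users"]}
outsider = {"groups": []}
```

Keeping the new check as an early return leaves every other host on the existing code path, which is what makes it a one-branch change.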
|
||||
|
||||
### 3. Provisioning (idempotent script + roster)
|
||||
|
||||
Extend the existing root reconcile (`infra/scripts/t3-provision-users.sh`) to read `roster.yaml` and, per entry, converge:
|
||||
- `useradd` the OS account if missing — **constrained** per tier (see §6);
|
||||
- assign per-tier groups;
|
||||
- drop the per-user identity-scoped kubeconfig + Vault helper;
|
||||
- regenerate `/etc/ttyd-user-map` with one `<authentik_user>=<os_user>` line per roster entry (regenerated, not appended; see §1);
|
||||
- `systemctl enable --now t3-serve@<os_user>`;
|
||||
- provision a writable git-crypt-locked clone at `~/code` for non-admins **only if absent** (§5; never replaces an existing `~/code`).
|
||||
|
||||
Run via the existing systemd timer (OnBoot + periodic) for self-healing, plus on-demand after a roster edit. Account creation is the one new privileged step; it lives only in this root reconcile.
|
||||
|
||||
### 4. Config inheritance (native machine-wide layers — ADR-0003)
|
||||
|
||||
`wizard` authors the **Config base** (a git checkout of the dotfiles/config-base repo on the devvm). It materializes into the OS's native machine-wide layers, which every user inherits live:
|
||||
|
||||
**Verified 2026-06-08:** t3 is itself built on `@anthropic-ai/claude-agent-sdk` and opts into `settingSources: [user, project, local]`; the SDK also reads `/etc/claude-code/managed-settings.json` independently. So the managed layer + `~/.claude` reach **both** surfaces — the t3 web UI *and* a terminal `claude`. Two caveats: it's **Claude-specific** (a t3 user who picks Codex/OpenCode won't inherit Claude config), and `rules/` loads via the per-user `user` source (so Task 1.1's "managed-`claudeMd` vs per-user symlink" question stays real).
|
||||
|
||||
| What inherits | Layer (machine-wide) | Native mechanism (live) | Notes |
|
||||
|---|---|---|---|
|
||||
| **Org guidance** (enforced) | `/etc/claude-code/managed-settings.json` → `claudeMd` | top precedence, every session, non-overridable | NO secrets; **spike-confirmed on claude 2.1.168** |
|
||||
| **Skills / rules / agents / commands** | per-user `~/.claude/{skills,rules,…}` **symlinks** → Config base | loaded from the `user` source; symlink ⇒ base edits are live | there is **NO** managed-skills key — symlinks ARE the mechanism (the proven emo pattern) |
|
||||
| Shell (zsh/aliases/env) | `/etc/profile.d/*.sh`, `/etc/skel` | sourced at login; skel seeds new homes | `~/.zshrc` layers on top |
|
||||
| Tools/binaries | system-wide `/usr/local` + apt manifest | one host → shared `/usr` | `pip install --user` in `~` |
|
||||
|
||||
`wizard` edits the base → commit → every child inherits on next prompt/login. **No copy, no mirror, no drift** (this replaces today's hand-mirrored per-user setup — the documented emo-drift pain, memory id=3205/4015). Per-user *mutable* state (`~/.claude.json`, `.credentials.json`, `projects/`, history) is never shared — local only. *(Resolved 2026-06-08, spike GO: skills/rules/agents are delivered via per-user `~/.claude/*` symlinks to the base — seeded in `/etc/skel/.claude/` (a symlink there is copied **as a symlink** by `useradd -m`) and reinforced by the provisioner; the managed `claudeMd` carries enforced org guidance. Base = wizard's chezmoi-versioned `~/.claude` (override via `WORKSTATION_CONFIG_BASE`). This replaces the old `start-claude.sh: cd /home/wizard/code` hack — config now comes from the managed layer + symlinks regardless of CWD, so a new user's launcher just `cd ~/code`.)* **Secret leak found+fixed 2026-06-08:** `~/.claude/settings.json` was `0664`, exposing `MEMORY_API_KEY` to every devvm user → `0600` (the chezmoi source is non-private, so it needs a `private_` prefix + the key templated out to persist).
|
||||
|
||||
### 5. Infra access (per-user writable locked clone — changes NOT gated)
|
||||
|
||||
Each non-admin gets their **own writable**, git-crypt-**locked** clone of the monorepo at `~/code`:
|
||||
- A **keyless** clone (`filter.git-crypt.smudge=cat`): all code/docs are plaintext; the git-crypt'd secret files (`infra/secrets/`, `infra/terraform.tfvars`) stay `\0GITCRYPT\0` ciphertext blobs. They read the code, never the secrets (the repo is public anyway; only git-crypt'd files are sensitive).
|
||||
- **Writable + ungated:** they edit, commit, and `git push` to Forgejo **freely** — no read-only mount, no PR gate. Safe because **pushing infra master does NOT auto-apply** (infra is applied *manually* via `scripts/tg apply`; memory id=4355). Per-user clones also remove the old shared-tree commit-entanglement hazard.
|
||||
- **The real boundary is apply-time, not the repo:** a non-admin can change code but cannot make it take effect — `scripts/tg apply` needs a write-capable Vault token (`vault login -method=oidc` → vault-admin) + cluster RBAC their tier lacks.
|
||||
- **Trade vs the earlier live mirror:** the infra repo's own `CLAUDE.md`/code now updates via `git pull` (standard dev flow), not instantly. The high-value live inheritance — Viktor's skills/prompts/rules/global `CLAUDE.md` — is **unaffected** (it flows through the machine-wide managed layer in §4, not the repo).
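
A provisioner could assert that a fresh clone really is keyless by checking for the `\0GITCRYPT\0` header on the secret paths. The Python sketch below works on an in-memory mapping of path to bytes so it stays self-contained; the function names are hypothetical, and `infra/secrets/example.yaml` is an invented file name used only for the example.

```python
GITCRYPT_MAGIC = b"\x00GITCRYPT\x00"  # header at the start of git-crypt ciphertext

# Secret locations named in this section.
SECRET_PREFIXES = ("infra/secrets/", "infra/terraform.tfvars")


def is_locked_blob(data: bytes) -> bool:
    """True when the file contents are still git-crypt ciphertext."""
    return data.startswith(GITCRYPT_MAGIC)


def readable_secrets(files: dict[str, bytes]) -> list[str]:
    """List secret paths whose contents are NOT ciphertext.

    The result should be empty for a correctly locked clone.
    """
    return sorted(
        path
        for path, data in files.items()
        if path.startswith(SECRET_PREFIXES) and not is_locked_blob(data)
    )


# Invented contents standing in for a locked clone's working tree.
locked_clone = {
    "infra/terraform.tfvars": GITCRYPT_MAGIC + b"\x8f\x12ciphertext",
    "infra/secrets/example.yaml": GITCRYPT_MAGIC + b"\x01\x02",
    "infra/scripts/tg": b"#!/bin/sh\n",
}
leaked_clone = {**locked_clone, "infra/terraform.tfvars": b"token = 1\n"}
```

A non-empty result would mean the clone was unlocked or the smudge filter was not set to `cat`, so the provisioner should stop rather than hand the tree to the user.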
|
||||
|
||||
### 6. Permission model
|
||||
|
||||
| Tier | OS account | sudo / docker | code-shared + git-crypt | infra repo | kubectl (own OIDC, per tier) | Vault (own OIDC) |
|
||||
|---|---|---|---|---|---|---|
|
||||
| **admin** (Viktor) | wizard | ✅ / ✅ | ✅ (unlocked) | unlocked R/W tree; can `tg apply` | cluster-admin | vault-admin |
|
||||
| **power-user** (Emo) | emo | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **cluster-wide read-only, no Secrets** | scoped read |
|
||||
| **namespace-owner** (Anca) | ancamilea | ❌ / ❌ | ❌ | own **writable locked** clone (push free; no secrets; can't apply) | **admin in own namespace** (full R/W in-ns) + namespace/node LIST only | own-namespace paths |
|
||||
|
||||
Layers: Authentik group (eligibility) → OS account `0700` home + per-tier groups (no sudo/docker for non-admins; rootless podman if containers needed) → **per-user OIDC kubeconfig + Vault** so each session acts as *its own* identity, never Viktor's. **kubectl is enabled per tier** — the provisioner installs each user's kubeconfig at the scope above (admin = cluster-admin; power-user = cluster-wide read-only, no Secrets; namespace-owner = admin in their own namespace), reusing the existing `k8s_users` / dashboard-SA machinery (memory id=4042). **Changing infra is never gated at the repo; it's gated at apply** — only admin can `scripts/tg apply` (write Vault + cluster RBAC). Per-user creds live in each `0700` home; wizard's `~/.vault-token` (`0600`) is unreadable to others.
|
||||
|
||||
**Cluster-RBAC reality (verified 2026-06-08) — two corrections + identity facts:**
|
||||
- **power-user role:** the existing `oidc-power-user` ClusterRole grants cluster-wide **read+write+Secrets** and is currently *unbound* — NOT the read-only-no-Secrets tier ADR-0005 wants. So power-user needs a **NEW** `oidc-power-user-readonly` ClusterRole (get/list/watch on non-secret resources cluster-wide, NO `secrets`), bound to emo's OIDC email. Do not reuse the existing role.
|
||||
- **kubeconfig is OIDC, not SA-token:** the apiserver carries live `--oidc-*` flags for the `kubernetes` audience and accepts Authentik OIDC; the "apiserver rejects OIDC" note in `dashboard-sa.tf` is dashboard-audience-specific (the multi-issuer `authentication-config` isn't live). Install `kubelogin`, smoke-test the OIDC path first, and fall back to the per-user SA-token (dashboard) pattern only if it fails.
|
||||
- **identity reality:** emo has **no `k8s_users` entry** today → power-user is a NET-NEW grant; anca is already namespace-owner of `plotting-book` and gheorghe (`vabbit81`) of `vabbit81` — preserve, don't re-provision.
|
||||
|
||||
**Shared-host caveat:** a multi-user host is a softer boundary than pods — it relies on standard Linux hardening. Appropriate because these are trusted people. If a user must ever be *untrusted*, that's the signal to revisit K8s pods. Note: non-admins' Claude/t3 runs `--dangerously-skip-permissions` (autonomous tool execution as their uid) — bounded by the `0700` home + no-sudo/no-docker sandbox, but a conscious accepted trade.
|
||||
|
||||
### 7. Secrets & auth (per-user, injected — never in the Config base)
|
||||
|
||||
The Config base / machine-wide managed layer is **secret-free**. Everything carrying a token/auth is **per-user**, in the user's own `0600` files, and **never machine-wide** — per the Google-Workspace-MCP precedent (id=4553: *"do NOT move a secret-bearing MCP server into machine-wide config"*; one user literally can't read another's `~/.claude.json`).
|
||||
|
||||
| Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) |
|
||||
|---|---|---|
|
||||
| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own |
|
||||
| **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED — not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write — NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. |
|
||||
| **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file) — only if HA-eligible |
|
||||
| **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret |
|
||||
| **`context7`** | plugin-provided | non-secret (plugins layer) |
|
||||
|
||||
The root provisioner READS these from Vault and writes them into a **new** user's home — **if-absent, never clobbering** an existing user's working config. Minting a new per-user memory key needs an admin Vault write (`vault login -method=oidc`; the agent token can't write KV — id=4181) → an admin onboarding step. **emo's existing MCP/auth is untouched** (additive-only): `managed-settings.json` carries NO `env` secrets, so his `MEMORY_API_KEY` and his `~/.claude.json` MCP servers keep working exactly as today.
|
||||
|
||||
**beads (`bd`) credential — gap found 2026-06-08:** a per-user infra clone does NOT include the Dolt credential (`.beads-credential-key` is git-ignored), so the provisioner must drop it (or set `DOLT_REMOTE_PASSWORD`) into the user's `~/code/.beads/` — else `bd` resolves the central server (`10.0.20.200:3306`) but fails auth. `bd` does **not** depend on `code-shared` (it's server-mode against the central Dolt), so the emo cutover doesn't break `bd` *if* his credential is provisioned.
|
||||
|
||||
## Capacity & prerequisites
|
||||
|
||||
**The devvm is the binding constraint — address before onboarding active users.** Verified 2026-06-08: devvm has **24 GB RAM** (the `proxmox-inventory.md` "8 GB" is STALE → fix that doc), ~8 GB free, **0 swap**; wizard alone already runs ~20 sessions (~10 GB RSS). Each interactive Claude session is ~300–700 MB; each user adds one persistent `t3-serve` daemon (~430 MB). 3–5 active users × several sessions would exhaust RAM → with **0 swap the failure mode is OOM-kill of live sessions** (everyone's), not graceful slowdown — also a `~/.claude.json` corruption trigger (id=2320/2321: multi-session writes + disk pressure).
|
||||
|
||||
**Prerequisites (do FIRST):** (1) **add swap** to the devvm (OOM-kill → graceful pressure); (2) optionally bump RAM (PVE-side — devvm is NOT TF-managed, id=1575); (3) set a per-user RAM budget + a **max-concurrent-active-users** ceiling; (4) memory/disk-pressure monitoring on the devvm. CPU (16 cores, ~7%) and disk (`/` ~28 GB free) are fine for now.
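
The RAM arithmetic behind these numbers can be worked through in a few lines of Python. The per-session range, the `t3-serve` footprint and the ~8 GB free figure are taken from this section; reading "several sessions" as three per user is an assumption, as are the function names.

```python
T3_SERVE_MB = 430    # one persistent t3-serve daemon per user
FREE_MB = 8 * 1024   # roughly 8 GB free on the devvm today


def added_ram_mb(users: int, sessions_per_user: int, session_mb: int) -> int:
    """Extra RAM needed for the given number of active users."""
    return users * (T3_SERVE_MB + sessions_per_user * session_mb)


def max_active_users(free_mb: int, sessions_per_user: int, session_mb: int) -> int:
    """Largest user count whose added footprint still fits in free_mb."""
    return free_mb // (T3_SERVE_MB + sessions_per_user * session_mb)


# Best case: 3 users, 3 sessions each at 300 MB.
low = added_ram_mb(users=3, sessions_per_user=3, session_mb=300)
# Worst case: 5 users, 3 sessions each at 700 MB.
high = added_ram_mb(users=5, sessions_per_user=3, session_mb=700)
# Ceiling if every session sits at the top of the range.
ceiling = max_active_users(FREE_MB, sessions_per_user=3, session_mb=700)
```

Under these assumptions the best case (3990 MB) fits in free RAM while the worst case (12650 MB) does not, and a conservative max-concurrent-active-users ceiling comes out at three.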
|
||||
|
||||
## User lifecycle (onboard → reconcile → offboard) — the roster drives all of it
|
||||
|
||||
The roster is the SSoT for the **whole** lifecycle, not just creation:
|
||||
|
||||
- **Onboard:** add a roster entry (the reconcile also adds them to the `T3 Users` Authentik group). The reconcile creates the constrained account, seeds config inheritance, provisions the per-user OIDC kubeconfig + locked clone + MCP/auth (+ the `bd` Dolt credential), starts `t3-serve@<u>`.
|
||||
- **Reconcile (routine, additive-only):** converges *missing* state UP; never strips an existing user (the don't-break-emo guarantee). Safe to run anytime.
|
||||
- **Offboard (REMOVE the roster entry):** the destructive half — gated + staged, NOT the routine timer:
|
||||
1. **Reversible cut (on roster removal):** stop+disable `t3-serve@<u>`; drop the user from `/etc/ttyd-user-map` + `dispatch.json` (regenerated → 403 at the dispatcher); remove from the `T3 Users` Authentik group (edge-blocked); `passwd -l <u>`. Access fully cut; nothing deleted.
|
||||
2. **Cluster revoke:** remove their `k8s_users` entry + apply (drops RBAC binding + kubeconfig validity) + revoke shared-token / memory creds.
|
||||
3. **Destructive (explicit, separate, never auto):** archive `~<u>` (tar → backup), then `userdel -r`. Irreversible — requires explicit go-ahead.
|
||||
- Write `docs/runbooks/offboard-user.md` (the link in `multi-tenancy.md` currently dead-ends). Rollback of step 1/2 = re-add the roster entry + reconcile.
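
The three offboard stages and their gating can be modelled as data, which makes the "destructive step never runs automatically" rule easy to test. This is a sketch only: the `Stage` type, its field names and `runnable_stages` are hypothetical, and the reversibility flags follow the rollback note above (steps 1 and 2 are undone by re-adding the roster entry).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    reversible: bool
    needs_go_ahead: bool


OFFBOARD_STAGES = (
    Stage("reversible cut", reversible=True, needs_go_ahead=False),
    Stage("cluster revoke", reversible=True, needs_go_ahead=False),
    Stage("archive home + userdel -r", reversible=False, needs_go_ahead=True),
)


def runnable_stages(go_ahead: bool) -> list[str]:
    """Stages allowed to run, in order; the destructive one needs an explicit go-ahead."""
    return [s.name for s in OFFBOARD_STAGES if go_ahead or not s.needs_go_ahead]


without_approval = runnable_stages(go_ahead=False)
with_approval = runnable_stages(go_ahead=True)
```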
|
||||
|
||||
## Incrementality & migration (don't break emo)
|
||||
|
||||
emo has a **working** setup that must not break: his `t3-serve@emo` (port 3774) + ~4 concurrent live Claude sessions (id=2320); his own `~/.claude` + `~/.claude.json` (MCP servers incl. `ha` token-in-URL and his `MEMORY_API_KEY`); his `~/code` symlink into wizard's tree; `code-shared` + `docker` membership; tmux/playwright units. Hard guarantees:
|
||||
|
||||
- **The idempotent reconcile is ADDITIVE-ONLY.** It creates *missing* accounts/config/instances and *adds* a user's tier-appropriate access, but it **never removes** an existing user's groups, **never replaces** an existing `~/code` (skip-if-exists), and **never writes into** an existing `~/.claude` / `~/.claude.json`. Running `provision-users.sh` at any time is therefore a no-op on emo's existing state — safe to run repeatedly.
|
||||
- **Every destructive/tightening step is SEPARATE, explicit, idle-gated, and reversible** — never part of the routine reconcile.
|
||||
- **Phases 0–4 are additive and verified non-breaking.** After each, confirm emo's live sessions, his `~/.claude`/MCP, his `~/code`, and his groups are unchanged.
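
The additive-only guarantee amounts to planning only create-if-missing steps. A minimal Python sketch, assuming a per-user state dict with hypothetical keys (`account_exists`, `code_exists`, `kubeconfig_exists`, `t3_serve_enabled`) that the real reconcile would have to probe from the host:

```python
def plan_actions(os_user: str, state: dict) -> list[str]:
    """Return only the create-if-missing steps; never a removal or an overwrite."""
    actions = []
    if not state.get("account_exists"):
        actions.append(f"useradd {os_user}")
    if not state.get("code_exists"):
        actions.append(f"clone locked repo to ~{os_user}/code")
    if not state.get("kubeconfig_exists"):
        actions.append(f"install kubeconfig for {os_user}")
    if not state.get("t3_serve_enabled"):
        actions.append(f"enable t3-serve@{os_user}")
    return actions


# emo already has everything, so the plan is empty (a no-op run).
emo_state = {
    "account_exists": True,
    "code_exists": True,  # his existing ~/code symlink counts as present
    "kubeconfig_exists": True,
    "t3_serve_enabled": True,
}
emo_plan = plan_actions("emo", emo_state)

# A brand-new user gets every step.
new_plan = plan_actions("gheorghe", {})
```

Because the planner has no branch that emits a removal, tightening steps such as the emo cutover have to live in a separate, explicitly invoked path.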
|
||||
|
||||
Rollout order:
|
||||
1. **Config base + machine-wide managed layer** → wizard + emo *inherit* wizard's skills/prompts. Additive: the managed layer only ADDS; it must not set keys/hooks that override emo's working `~/.claude` / `MEMORY_API_KEY` / MCP servers. **Verify emo's existing sessions + MCP still work.**
|
||||
2. **Roster + provisioner** alongside the current `/etc/ttyd-user-map` (idempotent; ancamilea already provisioned; emo's instance untouched).
|
||||
3. **Per-user writable locked clones** provisioned **only for users without an existing `~/code`** — emo's symlink is left intact (skip-if-exists).
|
||||
4. **Per-tier kubeconfig** installed **only if absent** (existing `~/.kube/config` backed up, never clobbered) — emo's current kube access untouched.
|
||||
5. **emo cutover — the ONLY step that changes emo; opt-in + reversible, never auto-run:** (a) record rollback state (`readlink ~emo/code`, `id emo`, copy of `start-claude.sh`); (b) idle-gate (id=3201); (c) replace his `~/code` symlink with his own writable locked clone, **point his `start-claude.sh` at `cd ~/code`** (today it hardcodes `cd /home/wizard/code` — *that* is the actual reason his Claude lands in wizard's unlocked tree, so swapping the symlink alone is NOT enough), drop the now-redundant `~/.claude/{rules,skills/file-issue}` symlinks into wizard's home (the managed layer / shared base delivers them now), and `gpasswd -d emo code-shared`. He keeps full edit/commit/push (ungated); loses only secret-read + apply. **Rollback (seconds):** restore the symlink + `start-claude.sh` + the `~/.claude` symlinks + `gpasswd -a emo code-shared`. A `t3-serve@emo` restart only blips his WebSocket (id=3308). Requires explicit go-ahead.
|
||||
6. **Authentik `T3 Users` group + edge gate** last (once instances exist), so no one is locked out mid-migration.
|
||||
|
||||
New users (gheorghe; and ancamilea's enhancement) are born into the new model — no migration needed.
|
||||
|
||||
## Template-readiness ("VM as a template" — future)
|
||||
|
||||
Design principle: **every bit of devvm setup is an idempotent git script**; nothing lives only as hand-typed host state. Three pieces live in `infra/scripts/workstation/`: two scripts, `setup-devvm.sh` (package manifest + managed config + config-base clone) and `provision-users.sh` (roster loop), plus the roster + manifest data files. When the template is wanted: the devvm becomes a cloud-init Proxmox template (the estate's existing reproducibility pattern, id=1575) that clones the infra repo + runs both scripts → identical devvm. Per-user **home data** is the only non-template state → add `/home` to the 3-2-1 backup set, or users re-clone + re-pair on a fresh box.
|
||||
|
||||
## Key decisions (ADR candidates)
|
||||
|
||||
- **ADR-0001 — Build on the existing stack, not a CDE.** Coder/Che/etc. researched; the role model is Premium-gated or the platform lacks the agent layer, and the homelab scale doesn't justify it. Hard to reverse, surprising ("why not Coder?"), real trade-off.
|
||||
- **ADR-0002 — devvm Linux users, not K8s ephemeral pods.** Re-platforming is overkill at this scale; config-push is easier on one host.
|
||||
- **ADR-0003 — Config inheritance via native machine-wide layers + per-user override.** Rejected: periodic sync, OverlayFS (no live lowerdir edits), Nix (rebuild not live).
|
||||
- **ADR-0004 — Infra access via per-user writable git-crypt-locked clones (changes ungated).** Each non-admin gets their own writable, keyless (locked) clone — read + edit + push freely, no PR gate. Safe because infra apply is manual + admin-only (push ≠ apply, id=4355) and the clone can't decrypt secrets. Rejected: the shared read-only mirror (gated changes) and the shared unlocked tree (secret leak + commit entanglement). Trade: repo-local CLAUDE.md updates via pull, not live (global config inheritance stays live via §4).
|
||||
- **ADR-0005 — Power-user = cluster-wide read-only (no Secrets), via a NEW dedicated ClusterRole.** Re-widens cross-tenant READ for the trusted power-user tier only — but via a NEW `oidc-power-user-readonly` ClusterRole (get/list/watch, NO `secrets`), NOT the existing `oidc-power-user` (which grants read+write+Secrets and is unbound). Bound to the user's OIDC identity (kubelogin) — the apiserver accepts Authentik OIDC for the `kubernetes` audience; the dashboard's SA-token pattern is for the dashboard UI only.
|
||||
- **ADR-0006 — The roster is the single source of truth for the FULL lifecycle.** `roster.yaml` drives onboard *and* offboard; `/etc/ttyd-user-map`, `dispatch.json`, and Authentik `T3 Users` membership are *derived* from it, and tier is *validated* against `k8s_users` (fail-loud on mismatch). Rejected: hand-maintaining the four membership lists in parallel (guaranteed drift). Offboarding is first-class + staged (reversible cut → cluster revoke → gated `userdel`), not an afterthought.
|
||||
- **ADR-0007 — Add swap + a capacity budget to the devvm before onboarding active users.** A shared 24 GB / **0-swap** host OOM-kills live sessions under multi-user load (wizard alone runs ~20). Swap + a max-concurrent ceiling are prerequisites, not follow-ups.
|
||||
|
||||
## Out of scope / deferred
|
||||
|
||||
- Zero-touch auto-provision on first Authentik login (admin runs the provisioner / the timer converges — simpler at this scale).
|
||||
- K8s per-user pods (revisit only if a user must be untrusted, or scale grows large).
|
||||
- The actual cloud-init template conversion (design for it now; do it when wanted).
|
||||
- **Per-user memory isolation** (own namespace / service-side `_key_to_user` map + redeploy) — **deferred; not a risk now** (Viktor, 2026-06-08). Revisit if memory cross-read becomes a concern.
|
||||
|
||||
## Verification (acceptance)
|
||||
|
||||
- A new roster entry + `provision-users.sh` → the user can log into `t3.viktorbarzin.me` and lands in a configured Workstation with Viktor's skills/prompts.
|
||||
- wizard edits a skill/CLAUDE.md in the base → a child's next prompt sees it (no pull).
|
||||
- A child's `kubectl`/`vault` is bounded by their tier (kubectl enabled per tier: power-user = cluster-wide read-only; namespace-owner = read/write in own ns only); a non-admin cannot read git-crypt secrets nor escalate.
|
||||
- A non-admin can edit + commit + push their infra clone **freely**, but cannot `scripts/tg apply` (no write Vault / cluster RBAC) — changes don't take effect until an admin applies.
|
||||
- Re-running the provisioner is idempotent (no changes on a converged host).
|
||||
- `provision-users.sh` + `setup-devvm.sh` reproduce the setup on a fresh host from git.
|
||||
223
docs/plans/2026-06-07-multi-user-workstation-plan.md
Normal file
|
|
@ -0,0 +1,223 @@
|
|||
# Multi-User Workstation — Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement task-by-task. Steps use `- [ ]` for tracking. This is **infra** work — "verify" means an idempotent re-run + a smoke check with expected output (not pytest). Honor the Terraform-only rule for cluster changes; devvm host scripts are the accepted exception (versioned in `infra/scripts/`, deployed via the provisioner). Claim `host:devvm` before mutating the devvm; gate `t3-serve@<user>` restarts on user idle (memory id=3201). **INCREMENTALITY (don't break emo):** every phase is additive; the idempotent reconcile is **additive-only** — it NEVER removes an existing user's groups, NEVER replaces an existing `~/code` (skip-if-exists), and NEVER writes into an existing `~/.claude`/`~/.claude.json`. The emo cutover (Phase 5) is the ONLY destructive step — explicit, idle-gated, reversible, never auto-run. After each of Phases 1–4, **verify emo's live sessions, `~/.claude`/MCP, `~/code`, and groups are unchanged.**
|
||||
|
||||
**Goal:** A declarative roster + idempotent scripts that provision per-user Claude Code Workstations on the devvm, inheriting Viktor's config live via native machine-wide layers, scoped by RBAC tier, reproducible from git.
|
||||
|
||||
**Architecture:** Config base (machine-wide managed Claude config + system shell files + apt manifest) authored by wizard → all users inherit live. `roster.yaml` + `provision-users.sh` create constrained OS accounts + per-user OIDC kubeconfig (per tier) + per-user writable git-crypt-locked infra clone + `t3-serve@<u>`. Authentik `T3 Users` group gates the edge.
|
||||
|
||||
**Tech Stack:** Bash (idempotent host scripts), systemd template units + timer, Claude Code managed-settings, git-crypt, Authentik expression policy (Terraform), the existing `k8s_users` per-user Vault/RBAC.
|
||||
|
||||
**Design:** `infra/docs/plans/2026-06-07-multi-user-workstation-design.md`. **Glossary:** `infra/CONTEXT.md`.
|
||||
|
||||
---
|
||||
|
||||
## File structure
|
||||
|
||||
- Create: `infra/scripts/workstation/roster.yaml` — the source-of-truth roster
|
||||
- Create: `infra/scripts/workstation/packages.txt` — declared host apt/global toolset
|
||||
- Create: `infra/scripts/workstation/setup-devvm.sh` — host base: packages + managed Claude config + config-base clone (idempotent)
|
||||
- Create: `infra/scripts/workstation/managed-settings.json` — the machine-wide Claude base (settings + `claudeMd`)
|
||||
- Modify: `infra/scripts/t3-provision-users.sh` — read `roster.yaml`; create constrained accounts; per-tier groups + kubeconfig; repoint `~/code`
|
||||
- Modify: `infra/scripts/t3-provision-users.sh` — also provision each non-admin's own writable git-crypt-locked clone at `~/code` (no separate mirror service)
|
||||
- Modify: `infra/stacks/authentik/admin-services-restriction.tf` — add the `t3.viktorbarzin.me` → `T3 Users` branch
|
||||
- Create: `infra/stacks/authentik/` group resource (or document the UI-created group) for `T3 Users`
|
||||
- Docs: update `infra/docs/architecture/multi-tenancy.md` (add the Workstation section) + `.claude/reference/service-catalog.md` (t3code row) in the same commits
|
||||
|
||||
---
|
||||
|
||||
## Phase −1 — Prerequisites (do FIRST)
|
||||
|
||||
### Task −1.1: devvm capacity (P0 — verified 2026-06-08: 24 GB RAM, 0 swap, wizard ~20 sessions)
|
||||
|
||||
- [ ] **Step 1:** Add **swap** to the devvm (swapfile, e.g. 8–16 GB) — turns multi-user OOM-kill into graceful pressure. Verify `free -h` shows `Swap` > 0.
|
||||
- [ ] **Step 2:** Document a per-user RAM budget + a **max-concurrent-active-users** ceiling; add memory/disk-pressure monitoring on the devvm. (Optionally bump RAM PVE-side — devvm is NOT TF-managed, id=1575.)
|
||||
- [ ] **Step 3:** Fix the stale `infra/.claude/reference/proxmox-inventory.md` devvm RAM (says 8 GB; live = 24 GB). Commit `[ci skip]`.
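
The swap check in Step 1 can be scripted rather than read off `free -h` by eye. A minimal sketch, assuming the standard Linux `/proc/meminfo` layout; the function names `swap_total_kb` and `has_swap` and the optional file argument are illustrative, not part of the plan:

```bash
# Hypothetical helper: print SwapTotal in kB from a meminfo-style file.
# Defaults to /proc/meminfo; the argument exists so it can be tested offline.
swap_total_kb() {
  awk '$1 == "SwapTotal:" { print $2; found = 1 } END { if (!found) print 0 }' \
    "${1:-/proc/meminfo}"
}

# Succeeds only when some swap is configured (the Step 1 acceptance check).
has_swap() {
  [ "$(swap_total_kb "${1:-/proc/meminfo}")" -gt 0 ]
}
```

On the devvm, `has_swap || echo "no swap configured" >&2` could then gate the later onboarding phases.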
|
||||
|
||||
### Task −1.2: tooling
|
||||
|
||||
- [ ] **Step 1:** Install `kubelogin` (`kubectl-oidc_login`) on the devvm and add it to `packages.txt` — the per-user OIDC kubeconfig (Task 2.2) needs it; it is NOT installed today.
|
||||
|
||||
---
|
||||
|
||||
## Phase 0 — Roster + config base in git (no host changes)
|
||||
|
||||
### Task 0.1: Create the roster
|
||||
|
||||
**Files:** Create `infra/scripts/workstation/roster.yaml`
|
||||
|
||||
- [ ] **Step 1:** Write the roster with the current children: two active entries, plus gheorghe commented out until he needs a devvm workstation (wizard is the base author, not listed):
|
||||
|
||||
```yaml
|
||||
# THE single source of truth for the devvm Workstation lifecycle (onboard → offboard).
|
||||
# os_user (key) → authentik_user · k8s_user · tier · namespaces. Identifiers differ per person (verified 2026-06-08).
|
||||
users:
|
||||
emo: { authentik_user: emil.barzin, k8s_user: emo, tier: power-user } # NET-NEW cluster identity (not in k8s_users today)
|
||||
ancamilea: { authentik_user: ancaelena98, k8s_user: anca, tier: namespace-owner, namespaces: [plotting-book] } # ALREADY provisioned — preserve, don't re-create
|
||||
# gheorghe: { authentik_user: vabbit81, k8s_user: vabbit81, tier: namespace-owner, namespaces: [vabbit81] } # already a cluster ns-owner; uncomment for a devvm workstation
|
||||
```
|
||||
(`os_user` is the pinned key — no email→username derivation. Note the three distinct IDs per person.)
|
||||
|
||||
- [ ] **Step 2: Verify** it parses: `python3 -c "import yaml,sys; print(yaml.safe_load(open('infra/scripts/workstation/roster.yaml')))"` → Expected: a dict with `users.emo.tier == power-user`.
|
||||
- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/roster.yaml && git commit -m "workstation: add roster source-of-truth [ci skip]"`
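
Beyond "it parses", a provisioner will likely want to reject unknown tiers early. The sketch below assumes the one-line flow style used in the roster above and the three tier names that appear in this plan (`admin`, `power-user`, `namespace-owner`); `roster_tiers` and `validate_tiers` are hypothetical names, and a real implementation should probably use a YAML parser rather than `sed`:

```bash
# Hypothetical helper: list "os_user tier" pairs from a roster written in the
# one-line flow style. Commented-out entries are skipped.
# Assumption: each user sits on a single line.
roster_tiers() {
  sed -nE 's/^[[:space:]]+([A-Za-z0-9_-]+):[[:space:]]*\{.*tier:[[:space:]]*([a-z-]+).*$/\1 \2/p' "$1"
}

# Fail loudly on any tier outside the three named in this plan.
validate_tiers() {
  roster_tiers "$1" | while read -r user tier; do
    case "$tier" in
      admin|power-user|namespace-owner) ;;
      *) echo "roster: $user has unknown tier '$tier'" >&2; exit 1 ;;
    esac
  done
}
```

This only checks the tier vocabulary; the cross-check against the live `k8s_users` role (Task 2.1) is a separate step.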
|
||||
|
||||
### Task 0.2: Declare the host toolset
|
||||
|
||||
**Files:** Create `infra/scripts/workstation/packages.txt`
|
||||
|
||||
- [ ] **Step 1:** List the shared tools (one per line, comments allowed): `git`, `zsh`, `tmux`, `ripgrep`, `jq`, `python3`, `nodejs`, `kubectl`, `vault`, `podman` (rootless). Claude Code is installed via npm global in `setup-devvm.sh` (Task 1.2), not apt.
|
||||
- [ ] **Step 2: Verify:** `grep -vE '^\s*(#|$)' infra/scripts/workstation/packages.txt` lists the expected packages.
|
||||
- [ ] **Step 3: Commit:** `git add infra/scripts/workstation/packages.txt && git commit -m "workstation: declare host package manifest [ci skip]"`
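
Since the manifest allows comments, the script that consumes it needs a tolerant reader. A minimal sketch, assuming every entry is an apt package name; `read_manifest` and `missing_packages` are illustrative names, not existing functions:

```bash
# Hypothetical helper: print one package per line from a manifest, dropping
# blank lines, full-line comments, and trailing "# ..." comments.
read_manifest() {
  sed -E 's/[[:space:]]*#.*$//; s/^[[:space:]]+//; s/[[:space:]]+$//' "$1" | grep -v '^$'
}

# Report manifest entries that dpkg does not consider installed.
# Assumption: every manifest entry is an apt package name.
missing_packages() {
  read_manifest "$1" | while read -r pkg; do
    dpkg -s "$pkg" >/dev/null 2>&1 || echo "$pkg"
  done
}
```

Installing only what `missing_packages` reports is one way to keep a second run of the setup script quiet.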
|
||||
|
||||
### Task 0.3: Build the Config base (secret-free, curated — it doesn't exist yet)
|
||||
|
||||
**Files:** chezmoi dotfiles repo (`github.com/ViktorBarzin/dot_files`, `dot_claude/`) + `infra/scripts/workstation/managed-settings.json`
|
||||
|
||||
- [ ] **Step 1:** Create/refresh the **Config base** = the secret-free curated set the managed layer + `/etc/skel` deploy from: skills/agents/rules/commands/hooks/`CLAUDE.md` + shell (`zshrc`/`profile.d`) + the `start-claude.sh` launcher (`cd "$HOME/code"`). Sanitize OUT all secrets (`.credentials.json`, `~/.claude.json`, `settings.json` `env`); resolve any `~/.agents/skills` symlinks to real files.
|
||||
- [ ] **Step 2:** Reconcile launcher ownership: the current `start-claude.sh` is deployed by the SEPARATE `viktor/terminal-lobby` repo (its own `deploy.sh`). Decide whether the workstation base or terminal-lobby owns it — not both (avoid two competing launchers).
|
||||
- [ ] **Step 3: Verify:** secret-scan the base (`grep -rEi 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password'` → only docs/placeholders) + no dangling symlinks.
|
||||
- [ ] **Step 4: Commit/push** the refreshed dotfiles repo.
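
The Step 3 check can be wrapped so it fails a pre-push hook or CI job. This is a sketch under stated assumptions: it reuses the grep pattern from Step 3, relies on GNU `find -xtype l` for dangling symlinks, and the function name `scan_config_base` is hypothetical:

```bash
# Hypothetical check for the Config base directory.
# Exit status 0 means "clean"; matching files are printed for manual triage,
# since docs and placeholders can legitimately trip the pattern.
scan_config_base() {
  local base=$1 status=0
  # Same pattern as Step 3 above (-l lists matching files only).
  if grep -rEil 'sk-ant|oat01|BEGIN .*PRIVATE|api[_-]?key|password' "$base"; then
    status=1
  fi
  # GNU find: -xtype l matches symlinks whose target is missing.
  if [ -n "$(find "$base" -xtype l -print)" ]; then
    echo "dangling symlink(s) under $base" >&2
    status=1
  fi
  return "$status"
}
```

A non-zero exit here means "look at the listed files", not necessarily "a secret leaked".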
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Config base + machine-wide inheritance (additive; verify wizard+emo inherit)
|
||||
|
||||
### Task 1.1: Pin the exact Claude managed-skills mechanism (discovery spike)
|
||||
|
||||
**Why:** the managed `settings.json` + `claudeMd` paths are confirmed (`/etc/claude-code/managed-settings.json`), but the exact **managed skills** deployment path needs confirming on the installed Claude Code version before we rely on it for skill inheritance.
|
||||
|
||||
- [ ] **Step 1:** On the devvm, check the installed version: `claude --version`.
|
||||
- [ ] **Step 2:** Confirm the managed location is read: create a throwaway `/etc/claude-code/managed-settings.json` with a benign `claudeMd` string, start a fresh `claude` session as a NON-wizard test user, and confirm the injected guidance appears. Expected: the `claudeMd` text is present in context.
|
||||
- [ ] **Step 3:** Determine the managed-skills path (managed-settings `skills`/skill-source key, or a managed skills dir) **AND how the bespoke `~/.claude/rules/*.md` + `agents/` are delivered machine-wide** — the managed layer covers settings/skills/`claudeMd`, NOT an arbitrary `rules/` dir, so rules land either (a) folded into the managed `claudeMd`, or (b) a per-user symlink to the shared Config base (replacing today's live `~/.claude/rules → /home/wizard/.claude/rules` symlink). Record the verified mechanism in the design doc's §4 + a memory.
|
||||
- [ ] **Step 3b — Plan-B (go/no-go):** if managed *skills* aren't supported on the installed Claude Code version, FALL BACK to per-user symlinks of `~/.claude/{skills,agents,rules}` → the shared Config base. The verified `settingSources:[user,…]` (2026-06-08) means both t3 and `claude` read the per-user `user` layer, so symlinks are a complete fallback. Make this an explicit branch, not a silent assumption.
|
||||
- [ ] **Step 4: Commit** the design-doc update: `git commit -am "workstation: pin verified managed-skills mechanism [ci skip]"`
|
||||
|
||||
### Task 1.2: `setup-devvm.sh` — host base (idempotent)
|
||||
|
||||
**Files:** Create `infra/scripts/workstation/setup-devvm.sh`, `infra/scripts/workstation/managed-settings.json`
|
||||
|
||||
- [ ] **Step 1:** Write `managed-settings.json` — the machine-wide Claude base: the `claudeMd` org guidance + any enforced hooks/permissions, **no secrets** (per-user memory keys etc. stay per-user).
|
||||
- [ ] **Step 2:** Write `setup-devvm.sh` (run as root, idempotent): (a) `apt-get install -y $(grep -vE '^\s*(#|$)' packages.txt)`; (b) `npm install -g @anthropic-ai/claude-code` if missing; (c) `install -m 0644 managed-settings.json /etc/claude-code/managed-settings.json`; (d) materialize managed skills from the config-base checkout per the Task 1.1 mechanism; (e) lay down `/etc/profile.d/00-workstation.sh` + `/etc/zsh/zshrc.d/` base shell config + seed `/etc/skel` — **incl. a `start-claude.sh` that `cd "$HOME/code"` and a `.tmux.conf` with `default-command "$HOME/start-claude.sh"`, so a new account auto-launches Claude in ITS OWN clone (never a hardcoded `/home/wizard/code`)**; (f) clone/refresh the config-base repo to a shared path.
|
||||
- [ ] **Step 3: Verify (inheritance):** as `emo` (idle-gated if a session is live), `sudo -u emo -i claude` shows wizard's managed `claudeMd` + a base skill in `/skills`, with no per-emo copy. Expected: base skill present.
|
||||
- [ ] **Step 4: Verify (idempotent):** re-run `setup-devvm.sh`; Expected: exit 0, no changes on second run.
|
||||
- [ ] **Step 5: Commit:** `git add infra/scripts/workstation/setup-devvm.sh infra/scripts/workstation/managed-settings.json && git commit -m "workstation: host base + machine-wide Claude config inheritance"`
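
The "no changes on second run" expectation in Step 4 is easiest to meet when every file write is compare-then-copy. A minimal sketch of that pattern, assuming GNU coreutils (`install -D`); `install_if_changed` is an illustrative name, not something `setup-devvm.sh` is known to contain:

```bash
# Hypothetical helper: copy src to dst only when the content differs, so a
# second run of the setup script reports no changes.
# Prints "changed" or "unchanged" so callers can count modifications.
install_if_changed() {
  local src=$1 dst=$2 mode=${3:-0644}
  if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
    echo unchanged
    return 0
  fi
  install -D -m "$mode" "$src" "$dst"   # -D creates missing parent directories
  echo changed
}
```

For example, `install_if_changed managed-settings.json /etc/claude-code/managed-settings.json` would print `changed` on the first run and `unchanged` afterwards. Note that this sketch compares content only, not file mode.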
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Provisioner (additive; create constrained accounts from roster)
|
||||
|
||||
### Task 2.1: Extend `t3-provision-users.sh` to read the roster + create accounts
|
||||
|
||||
**Files:** Modify `infra/scripts/t3-provision-users.sh`
|
||||
|
||||
- [ ] **Step 1:** Add a roster-read + per-entry loop. For each `os_user`: if the account is **absent**, `useradd -m -s /bin/zsh "$os_user"` + `passwd -l "$os_user"` (SSO/t3 only) + `chmod 700 ~`. `set_tier_groups` is **ADD-ONLY** — it `gpasswd -a`'s the tier's groups (admin → `sudo,docker,code-shared`; power-user/namespace-owner → none beyond their own) but **NEVER removes** a group from an existing account (so a routine reconcile can't strip emo's current `code-shared`/`docker` — removal is the Phase-5 cutover only). Do **not** `passwd -l` or re-`chmod` an already-existing account.
|
||||
- [ ] **Step 2 (SSoT — derive, don't append):** **Regenerate** `/etc/ttyd-user-map` + `/etc/t3-serve/dispatch.json` from the roster each run (so a removed roster entry DISAPPEARS — this is what makes offboarding's reversible-cut work), allocate sticky ports, `systemctl enable --now t3-serve@<os_user>`. Reconcile the `T3 Users` Authentik group membership from the roster. **Validate** each entry's `tier` against the live `k8s_users` role and **abort with a clear error on mismatch** (workstation tier and cluster tier must not silently diverge).
|
||||
- [ ] **Step 3: Verify (idempotent + non-breaking):** run as root; Expected: emo + ancamilea instances `active`, dispatch.json unchanged, **AND** `id emo` still shows `code-shared`+`docker` (NOT stripped), emo's `~/code` symlink intact, his live sessions unaffected.
|
||||
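- [ ] **Step 4: Verify (constrained account):** for a **newly created** non-admin account (e.g. a throwaway test user, not emo, whose existing groups Step 3 preserves until the Phase-5 cutover), `id <new_user>` shows no `sudo`/`docker`/`code-shared`; `sudo -u <new_user> sudo -n true` fails (no sudo).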
|
||||
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: roster-driven account creation + per-tier groups"`
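
The add-only rule for `set_tier_groups` can be made structural by computing only a "to add" set and never a "to remove" set. The following is a sketch, not the script's actual implementation; `groups_to_add` is a hypothetical helper and both arguments are comma-separated group lists:

```bash
# Hypothetical helper: print the groups from "wanted" that are absent from
# "current" (both comma-separated). It never prints a group to remove, which
# is the add-only property Step 1 requires.
groups_to_add() {
  local current=",$1," g
  printf '%s\n' "$2" | tr ',' '\n' | while read -r g; do
    [ -n "$g" ] || continue
    case "$current" in
      *",$g,"*) ;;            # already a member: nothing to do
      *) echo "$g" ;;
    esac
  done
}

# Sketch of the reconcile call for an account (needs root).
set_tier_groups() {
  local user=$1 wanted=$2 g
  groups_to_add "$(id -nG "$user" | tr ' ' ',')" "$wanted" | while read -r g; do
    gpasswd -a "$user" "$g"
  done
}
```

With an empty `wanted` list (the power-user and namespace-owner tiers), the loop does nothing, so an existing account's `code-shared`/`docker` membership is left as it is.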
|
||||
|
||||
### Task 2.2: Per-user identity-scoped kubeconfig + Vault helper
|
||||
|
||||
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_identity`)
|
||||
|
||||
- [ ] **Step 1:** For each non-admin, write `~$os_user/.kube/config` as a **per-user OIDC kubeconfig** (`kubelogin`/`oidc-login`) bound to THEIR email — the apiserver accepts Authentik OIDC for the `kubernetes` audience (verified 2026-06-08; the dashboard SA-token pattern is for the dashboard UI, NOT kubectl). Tier → a ClusterRole bound to their OIDC `User`: namespace-owner → admin in their own namespace via the existing `oidc-ns-owner-*` bindings (for anca that's the EXISTING `plotting-book` — assert, don't re-provision); power-user → a **NEW `oidc-power-user-readonly`** ClusterRole (get/list/watch cluster-wide, **NO `secrets`**), NOT the existing `oidc-power-user` (read+write+Secrets). Owned by the user, `0600`. **Install only if `~/.kube/config` is absent;** else back up to `.bak-<ts>` and skip (never clobber).
|
||||
- [ ] **Step 2:** Drop a `~/.zshrc.d/vault.sh` that sets `VAULT_ADDR=https://vault.viktorbarzin.me` and documents `vault login -method=oidc` (their own identity). Do NOT seed wizard's token.
|
||||
- [ ] **Step 3: Verify (OIDC works, then scoping):** FIRST smoke-test the OIDC path — a non-admin `kubectl` via kubelogin actually authenticates (it's currently unexercised by any human; if it fails like the dashboard audience did, fall back to a per-user SA-token kubeconfig). THEN: as emo, `kubectl get pods -A` works (read) but `kubectl get secret -A` is forbidden and `kubectl delete` anything is forbidden; as ancamilea, only `plotting-book` is visible.
|
||||
- [ ] **Step 4: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user identity-scoped kubeconfig + vault helper"`
|
||||
|
||||
*(Prereq: add a **NEW `oidc-power-user-readonly`** ClusterRole + email binding to `stacks/rbac` via `scripts/tg apply` — do NOT reuse the existing `oidc-power-user` (read+write+Secrets, currently unbound). emo also needs a NEW `k8s_users` entry as `power-user` (net-new); anca/gheorghe already exist — assert, don't re-create. Terraform-managed, separate commit.)*
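
The "install only if absent, else back up and skip" rule from Step 1 might look like the sketch below. Everything site-specific is an assumption: the cluster name `homelab`, the server URL, the issuer URL, and the client id are placeholders passed in by the caller, and the cluster CA is omitted for brevity. The `exec` stanza follows the kubelogin (`kubectl oidc-login get-token`) convention:

```bash
# Hypothetical sketch of the never-clobber kubeconfig install.
install_kubeconfig() {
  local path=$1 server=$2 issuer=$3 client_id=$4
  if [ -e "$path" ]; then
    # Never clobber: keep a timestamped backup and leave the original alone.
    cp -p "$path" "$path.bak-$(date +%Y%m%d%H%M%S)"
    echo skipped
    return 0
  fi
  mkdir -p "$(dirname "$path")"
  ( umask 077   # new file is created 0600
    cat > "$path" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: homelab
  cluster:
    server: $server
contexts:
- name: homelab
  context: { cluster: homelab, user: oidc }
current-context: homelab
users:
- name: oidc
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      interactiveMode: IfAvailable
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=$issuer
      - --oidc-client-id=$client_id
EOF
  )
  echo installed
}
```

The provisioner would still need to `chown` the file to the target user; that step is left out here because it requires root.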
|
||||
|
||||
### Task 2.3: Inject per-user MCP + auth secrets (new users only; never clobber)
|
||||
|
||||
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_secrets`)
|
||||
|
||||
- [ ] **Step 1:** For each non-admin **without** an existing `~/.claude.json` (NEW users only — NEVER touch an existing one): write `~/.claude.json` with `playwright-shared` (localhost), `ha` (shared `ha_sofia_mcp_url` from Vault `secret/openclaw`) if HA-eligible, and `claude_memory` using a **shared/simple key (per-user memory isolation is DEFERRED — not a risk now)**. Seed `~/.claude/.credentials.json` with the shared Claude token (Vault) **or** leave absent for interactive login. **Drop the beads Dolt credential** into `~/code/.beads/` (`.beads-credential-key`, from Vault, or set `DOLT_REMOTE_PASSWORD`) so `bd` authenticates — it's git-ignored, so a fresh clone lacks it. All `0600`, owned by the user. Per-user `playwright-mcp` systemd unit on its own port (existing pattern, id=4015).
|
||||
- [ ] **Step 2 (DEFERRED — not now):** Per-user memory isolation is NOT built (Viktor, 2026-06-08): a new user shares/omits memory for now. When wanted, it needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78) **and** a Vault key — not just a Vault write (id=413/4181).
|
||||
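- [ ] **Step 3: Verify (new user gets working auth):** as the test user, `claude mcp list` shows their servers `Connected`. Because per-user memory isolation is deferred (Step 2), `memory_recall` is only expected to respond via the shared/simple key; the check that it returns THEIR namespace, not Viktor's, applies once isolation is built.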
|
||||
- [ ] **Step 4: Verify (emo untouched):** `~emo/.claude.json`, `~emo/.claude/.credentials.json`, `~emo/.claude/settings.json` are **byte-identical** to before the run (`sha256sum` before/after); `claude mcp list` as emo still shows ha/claude_memory/playwright `Connected`.
|
||||
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user MCP + auth injection (new users only, if-absent)"`
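
The byte-identical check in Step 4 is a before/after checksum comparison. A minimal sketch, assuming GNU `sha256sum`; the function names and the snapshot file location are illustrative:

```bash
# Hypothetical before/after guard: snapshot checksums of files that must stay
# untouched, run the provisioner, then verify.
snapshot_hashes() {           # usage: snapshot_hashes <out-file> <file>...
  local out=$1; shift
  sha256sum "$@" > "$out"
}

verify_hashes() {             # exit 0 only if every file is byte-identical
  sha256sum --check --quiet "$1"
}
```

A run could look like `snapshot_hashes /root/emo-before.sha256 /home/emo/.claude.json /home/emo/.claude/.credentials.json /home/emo/.claude/settings.json`, then the provisioner, then `verify_hashes /root/emo-before.sha256`.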
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Per-user writable locked infra clone (code view; changes ungated)
|
||||
|
||||
### Task 3.1: Provision each non-admin's own writable git-crypt-locked `~/code`
|
||||
|
||||
**Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_infra_clone`)
|
||||
|
||||
- [ ] **Step 1:** For each non-admin, **only if `~$os_user/code` does not exist at all** (no symlink, no directory — NEVER touch an existing `~/code`, so emo's symlink stays intact), clone the same repo wizard uses, as that user: `REPO=$(git -C /home/wizard/code config --get remote.origin.url); sudo -u "$os_user" git clone "$REPO" "/home/$os_user/code"` (use the explicit path: a bare `~/code` would expand to the invoking root's home before `sudo` runs). Then in the clone set three pass-through filter keys, `git config filter.git-crypt.smudge cat`, `git config filter.git-crypt.clean cat`, and `git config filter.git-crypt.required false`, and `git checkout master`. **No git-crypt key is installed** → secret files stay ciphertext, code/docs are plaintext (memory id=3665/3666). Owned by the user, writable.
|
||||
- [ ] **Step 2:** Leave it writable with a normal `origin` remote (Forgejo) — no read-only mount, no PR gate; they may edit/commit/push freely. (Optional: `git config push.default current` so a bare `git push` targets their own branch.)
|
||||
- [ ] **Step 3: Verify (locked + writable):** as emo, `head -c 9 ~/code/infra/terraform.tfvars` shows the `GITCRYPT` magic (ciphertext); `cat ~/code/CLAUDE.md` is plaintext; `echo x >> ~/code/README.md && git -C ~/code commit -am wip` **succeeds** (writable, ungated).
|
||||
- [ ] **Step 4: Verify (apply-gated, not repo-gated):** as emo, `cd ~/code/infra && scripts/tg apply <a-stack>` **fails** (no write Vault token / cluster RBAC); `vault login -method=oidc` as emo cannot obtain vault-admin. Pushing to Forgejo does NOT trigger an apply (id=4355). So his edits can't take effect without an admin apply.
|
||||
- [ ] **Step 5: Commit:** `git add infra/scripts/t3-provision-users.sh && git commit -m "workstation: per-user writable git-crypt-locked infra clone"`
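
The clone-if-absent logic and the Step 3 lock check could be sketched as below. This is not the actual `install_infra_clone`; it assumes the repo URL is passed in and that home directories are resolved with `getent`. `is_gitcrypt_ciphertext` is a hypothetical helper that looks for the git-crypt file header:

```bash
# Hypothetical sketch of install_infra_clone (needs root and sudo).
install_infra_clone() {
  local os_user=$1 repo=$2 home dest
  home=$(getent passwd "$os_user" | cut -d: -f6)
  dest="$home/code"
  # Skip if anything is already there, including a (possibly dangling) symlink.
  if [ -e "$dest" ] || [ -L "$dest" ]; then
    echo "skip: $dest exists"
    return 0
  fi
  sudo -u "$os_user" git clone "$repo" "$dest"
  # Pass-through filters: with no git-crypt key, encrypted files stay ciphertext.
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.smudge cat
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.clean cat
  sudo -u "$os_user" git -C "$dest" config filter.git-crypt.required false
}

# True when a file still starts with the git-crypt header (i.e. is locked).
is_gitcrypt_ciphertext() {
  [ "$(head -c 9 "$1" | tr -d '\0')" = "GITCRYPT" ]
}
```

The `-L` test matters for emo: a symlink whose target is unreadable to the caller can fail `-e` and would otherwise be overwritten.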
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Eligibility gate (Authentik group + edge)
|
||||
|
||||
### Task 4.1: Create the `T3 Users` group + edge restriction
|
||||
|
||||
**Files:** Modify `infra/stacks/authentik/admin-services-restriction.tf`; add the group resource
|
||||
|
||||
- [ ] **Step 1:** Add `resource "authentik_group" "t3_users" { name = "T3 Users" }` (pattern: `stacks/authentik/guest.tf:53`). Add emo/ancamilea (and wizard) as members.
|
||||
- [ ] **Step 2:** In the expression policy, add a dedicated branch BEFORE the final return: `if host == "t3.viktorbarzin.me": return ak_is_group_member(request.user, name="T3 Users")`.
|
||||
- [ ] **Step 3: Apply:** `vault login -method=oidc` then `scripts/tg apply` in `stacks/authentik` (claim `stack:authentik` first).
|
||||
- [ ] **Step 4: Verify (gate):** `curl -sI` an unauthenticated request to `t3.viktorbarzin.me` → 302 to Authentik; a member login → reaches their instance; a logged-in NON-member → denied. Confirm the `authentik-walloff` probe stays green for any public carve-outs.
|
||||
- [ ] **Step 5: Commit:** `git add infra/stacks/authentik/*.tf && git commit -m "workstation: gate t3.viktorbarzin.me to T3 Users group"`
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Migrate existing users (idle-gated, low-disruption)
|
||||
|
||||
### Task 5.1: Cut emo over to his own writable locked clone (opt-in, reversible)
|
||||
|
||||
**Files:** none (host state; an explicit one-time action — NOT the routine reconcile)
|
||||
|
||||
- [ ] **Step 1: Prereqs.** Confirm emo inherits config (Phase 1) + has his scoped kubeconfig (Phase 2). (Phase 3 deliberately SKIPPED emo — his clone is created *here*.)
|
||||
- [ ] **Step 2: Record rollback state.** Save `readlink -f ~emo/code` (symlink target), `id emo` (groups), a copy of `/home/emo/start-claude.sh`, and the `~/.claude/{rules,skills/file-issue}` symlink targets. This is the instant-rollback snapshot.
|
||||
- [ ] **Step 3: Idle-gate + go-ahead.** Confirm emo's sessions are keystroke-idle ≥20 min (id=3201); if ambiguous, ASK. Opt-in — never auto-run by the reconcile.
|
||||
- [ ] **Step 4: Cutover.** (a) `mv ~emo/code ~emo/code.symlink.bak`; provision his own writable locked clone at `~emo/code` (Phase-3 `install_infra_clone`, run explicitly for emo). (b) **Repoint his launcher (REQUIRED):** back up `/home/emo/start-claude.sh`, then change its `cd /home/wizard/code` → `cd "$HOME/code"`. The hardcoded `cd` is the *actual* mechanism landing him in wizard's tree — the symlink swap alone is insufficient. (c) Remove the now-redundant `~/.claude/rules` and `~/.claude/skills/file-issue` symlinks into wizard's home (managed layer / shared base delivers them now). (d) `gpasswd -d emo code-shared`.
|
||||
- [ ] **Step 5: Verify.** As emo: `cat ~/code/CLAUDE.md` works (his clone); `head -c 9 ~/code/infra/terraform.tfvars` shows `GITCRYPT` ciphertext (locked); he can still `git -C ~/code commit` (ungated) but can no longer read wizard's unlocked secrets nor `scripts/tg apply`. emo's live t3 session still works (only a WS blip if `t3-serve@emo` was restarted).
|
||||
- [ ] **Step 6: Rollback (seconds, if anything's off):** restore the `~emo/code` symlink (`rm -rf ~emo/code && ln -sfn <saved-target> ~emo/code`), restore `start-claude.sh` from its backup, recreate the `~/.claude/{rules,skills/file-issue}` symlinks, and `gpasswd -a emo code-shared` → emo back to his exact prior state. Otherwise record the cutover in a memory.
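
Because the cutover is the one destructive step, it may be worth rehearsing the snapshot, launcher edit, and rollback on a scratch directory first. The helpers below are a sketch with hypothetical names; paths are passed in so nothing is hardcoded to emo's home, and `sed -i` assumes GNU sed:

```bash
# Hypothetical helpers for the Step 2 snapshot, Step 4(b) launcher fix,
# and the Step 6 symlink rollback.
save_symlink_target() {       # usage: save_symlink_target <link> <state-file>
  readlink -f "$1" > "$2"
}

restore_symlink() {           # usage: restore_symlink <link> <state-file>
  rm -rf "$1"                 # removes the clone (or link) at that path
  ln -sfn "$(cat "$2")" "$1"
}

repoint_launcher() {          # usage: repoint_launcher <start-claude.sh>
  cp -p "$1" "$1.bak"
  # Replace the hardcoded path with the invoking user's own clone.
  sed -i 's|cd /home/wizard/code|cd "$HOME/code"|' "$1"
}
```

Note that `restore_symlink` deletes whatever sits at the link path, so any unpushed commits in the new clone would be lost on rollback.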
|
||||
|
||||
### Task 5.2: Confirm ancamilea + a fresh test user end-to-end
|
||||
|
||||
- [ ] **Step 1:** Confirm ancamilea logs into `t3.viktorbarzin.me` → her instance, inherits config, own-namespace kubectl only.
|
||||
- [ ] **Step 2:** Add a throwaway roster entry, run `provision-users.sh`, confirm the account+instance appear and login works; then remove it + `userdel` and confirm clean teardown.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — Template-readiness (design-for-now; convert when wanted)
|
||||
|
||||
### Task 6.1: Verify reproducibility from git (no cloud-init yet)
|
||||
|
||||
- [ ] **Step 1:** On a scratch VM (or a container), clone the infra repo and run `setup-devvm.sh` + `provision-users.sh`; confirm the toolset + managed config + users reproduce.
|
||||
- [ ] **Step 2 (promote out of deferred — do in the main rollout):** Add per-user home data to the 3-2-1 backup set NOW: at minimum `~/.t3` (pairings + 30-day sessions) + `~/.claude` (mutable state), ideally all of `/home`. A devvm rebuild otherwise silently loses every user's pairings + session state.
|
||||
- [ ] **Step 3 (deferred):** When the template is wanted, wrap `setup-devvm.sh` + `provision-users.sh` in cloud-init (the `modules/create-template-vm` pattern, memory id=1575) and snapshot the devvm as a Proxmox template. File a beads task; do not build now.
|
||||
|
||||
---
|
||||
|
||||
## Phase 7 — Offboarding (deprovision; staged, gated)
|
||||
|
||||
Removing a user = delete their `roster.yaml` entry, then:
|
||||
|
||||
### Task 7.1: Reversible cut (driven by roster removal)
|
||||
|
||||
- [ ] **Step 1:** On reconcile after the entry is gone: `systemctl disable --now t3-serve@<u>`; regenerate `/etc/ttyd-user-map` + `dispatch.json` (user absent → dispatcher 403s); remove them from the `T3 Users` Authentik group (edge-blocked); `passwd -l <u>`. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302→login, then denied) and can't log in. Nothing deleted yet.
|
||||
- [ ] **Step 2 (cluster revoke):** remove their `k8s_users` entry + `scripts/tg apply` (drops their RBAC binding; OIDC kubeconfig stops authorizing); revoke any individually-held token/memory key.
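
Roster-driven offboarding needs a way to find users that exist on the host but are no longer in the roster. A minimal sketch, assuming both sides can be reduced to files with one `os_user` per line; `users_to_offboard` is a hypothetical name:

```bash
# Hypothetical helper: print users present in the current (derived) state
# but absent from the roster; these are candidates for the reversible cut.
users_to_offboard() {
  local roster=$1 current=$2 a b
  a=$(mktemp) b=$(mktemp)
  sort -u "$roster" > "$a"
  sort -u "$current" > "$b"
  comm -13 "$a" "$b"          # lines only in the current state
  rm -f "$a" "$b"
}
```

Each name it prints would then go through the Step 1 sequence (`systemctl disable --now "t3-serve@$u"`, map regeneration, group removal, `passwd -l "$u"`), with nothing deleted.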
|
||||
|
||||
### Task 7.2: Destructive removal (explicit, separate, NEVER auto)
|
||||
|
||||
- [ ] **Step 1:** Archive `~<u>` → backup: `tar czf /mnt/backup/offboard/<u>-<ts>.tar.gz /home/<u>`.
|
||||
- [ ] **Step 2:** `userdel -r <u>` (removes home + spool). **Irreversible — requires explicit go-ahead.**
|
||||
- [ ] **Step 3: Rollback:** before 7.2, re-add the roster entry + reconcile restores everything; after 7.2, restore from the archive.
|
||||
- [ ] **Step 4:** Write + commit `infra/docs/runbooks/offboard-user.md` (the `multi-tenancy.md` link to it is currently a dead end).
|
||||
|
||||
---
|
||||
|
||||
## Self-review
|
||||
|
||||
- **Spec coverage:** prerequisites/capacity + kubelogin (Ph−1), roster SSoT + config-base build (Ph0), config inheritance (Ph1), provisioning + per-tier OIDC kubectl + SSoT-derive/validate + secrets/auth + beads-cred (Ph2), infra code access via writable locked clone (Ph3), Authentik gate (Ph4), incremental non-breaking migration (Ph5), reproducibility/template + per-user backups (Ph6), **offboarding / full lifecycle (Ph7)** — all mapped. Per-user **memory isolation DEFERRED** (not a risk now).
|
||||
- **Open verification carried as a task, not a placeholder:** the exact managed-skills path (Task 1.1) is a discovery spike with a concrete acceptance check.
|
||||
- **Terraform-only respected:** the only cluster changes (Authentik group/policy, the power-user ClusterRole) go through `scripts/tg apply`; devvm host scripts are the accepted exception.
|
||||
- **Docs:** multi-tenancy.md + service-catalog.md updates folded into the relevant commits (per the update-docs rule).
|
||||
52
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-design.md
Normal file
|
|
@ -0,0 +1,52 @@
|
|||
# Matrix: Synapse → tuwunel migration — Design
|
||||
|
||||
**Date:** 2026-06-08
|
||||
**Status:** Implemented
|
||||
**Stack:** `stacks/matrix` (+ `stacks/vault` cleanup)
|
||||
|
||||
## Context
|
||||
|
||||
The `matrix` homeserver ran **Synapse** (`matrixdotorg/synapse:v1.151.0`) on a
|
||||
cramped `256Mi/512Mi` allocation. Synapse (Python) wants 1–2 GB; at 512Mi it was
|
||||
starved. During a Slack-vs-Discord-vs-Matrix evaluation Viktor confirmed Slack
|
||||
stays his primary hub, but wanted a **working, federated Matrix server kept
|
||||
available "in case I need it."** The resource pain was Synapse-specific — not
|
||||
inherent to Matrix — so the fix was to swap the homeserver implementation, not
|
||||
abandon Matrix.
|
||||
|
||||
## Decision
|
||||
|
||||
Replace Synapse with **tuwunel v1.7.1** (Rust, RocksDB) — the
|
||||
enterprise/Swiss-government-backed official successor to the (archived 2026-01-19)
|
||||
conduwuit.
|
||||
|
||||
| Choice | Decision | Rationale |
|---|---|---|
| Homeserver | **tuwunel** (vs continuwuity) | Corporate-backed, full-time staff → best longevity for a set-and-forget server |
| Data | **Fresh start** (no migration) | No supported Synapse(Postgres)→RocksDB path; Viktor confirmed old rooms/messages disposable |
| Federation | **ON** | A backup server is only useful if it can reach the wider Matrix network |
| `server_name` | **unchanged** (`matrix.viktorbarzin.me`) | Element clients keep pointing at the same place; only a re-login needed |
| Database | **embedded RocksDB** on the existing encrypted PVC | Drops the entire CNPG dependency; local-SSD LUKS2 suits RocksDB's small writes (NFS would be wrong) |
| Registration | token-gated, then **disabled** | First user = admin; locked down after registering `@viktor` |
| Auth | **native password** | tuwunel OIDC SSO not wired — Authentik Matrix OAuth app is now orphaned (harmless) |
| Media cap | **50 MiB** | Kept under Cloudflare's 100 MB proxied-request ceiling |
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Keep Synapse, bump to 2 GB** — zero-migration, but stays the heavy Python
|
||||
server; rejected in favour of the lightweight Rust target Viktor asked for.
|
||||
- **continuwuity** — the community continuation; viable, though backed by a smaller community, and
|
||||
tuwunel's corporate backing won on longevity.
|
||||
- **Synapse → tuwunel data migration** — not possible (different storage
|
||||
engines); fresh start is the only path.
|
||||
|
||||
## As-built
|
||||
|
||||
- Fully env-var configured (`TUWUNEL_*`, `__` for nested) — no TOML ConfigMap.
|
||||
- tuwunel serves its own `.well-known/matrix/{client,server}` → federation
|
||||
resolves to Cloudflare-proxied `:443` (no 8448 / SRV needed).
|
||||
- Ingress unchanged: `auth = "none"` (Matrix uses bearer/signed requests),
|
||||
`dns_type = "proxied"`.
|
||||
- Pod `securityContext` `runAsUser/runAsGroup/fsGroup = 1000` so uid 1000 can
|
||||
write the encrypted RocksDB PVC.
|
||||
- Image kept under Keel + diun semver management (`^v\d+\.\d+\.\d+$`).
|
||||
92
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-plan.md
Normal file
|
|
@ -0,0 +1,92 @@
|
|||
# Matrix: Synapse → tuwunel migration — Plan (executed)
|
||||
|
||||
**Date:** 2026-06-08 · **Companion:** `2026-06-08-matrix-synapse-to-tuwunel-design.md`
|
||||
|
||||
## Executed steps
|
||||
|
||||
1. **Vault** — generated a 32-byte `registration_token`, stored at
|
||||
`secret/matrix`.
|
||||
2. **`stacks/matrix` rewrite** — replaced Synapse with tuwunel: removed the
|
||||
`matrix-db-creds` ExternalSecret, both init-containers (`install-psycopg2`,
|
||||
`inject-db-password`), the `extra-packages` volume, and the Reloader
|
||||
annotation; added the `matrix-secrets` ExternalSecret (vault-kv `dataFrom`),
|
||||
the `TUWUNEL_*` env, `securityContext` 1000, and the tuwunel image. Encrypted
|
||||
PVC, Service (`80→8008`), and ingress (`auth="none"`, proxied) unchanged.
|
||||
- The image is in the deployment's `ignore_changes` (KEEL_IGNORE_IMAGE); it
|
||||
was **temporarily un-ignored** for this base-image swap, then re-added at
|
||||
step 4 so Keel resumes tag management.
|
||||
- `tg init -reconfigure` was required first (Tier-1 PG-backend creds rotate
|
||||
weekly → "Backend configuration block has changed").
|
||||
3. **Apply** — `Plan: 1 to add, 2 to change, 1 to destroy`. tuwunel 1.7.1 came up
|
||||
1/1, created a fresh RocksDB on the encrypted PVC (no permission errors —
|
||||
fsGroup worked).
|
||||
4. **Verify** — all `200`: `/_tuwunel/server_version`, `.well-known/matrix/
|
||||
{client,server}`, `/_matrix/client/versions`, `/_matrix/federation/v1/version`.
|
||||
Registered `@viktor:matrix.viktorbarzin.me` (first user → admin) via the token
|
||||
flow; `whoami` confirmed. Creds stored at `secret/matrix`
|
||||
(`admin_user`, `admin_password`).
|
||||
5. **Lock down** — `TUWUNEL_ALLOW_REGISTRATION=false` + re-added image
|
||||
`ignore_changes`; applied. Registration now returns `403 M_FORBIDDEN`.
|
||||
6. **Cleanup** —
|
||||
- `stacks/vault`: removed the `pg_matrix` static role + its `allowed_roles`
|
||||
entry (targeted apply — the full plan also wanted an **unrelated** OIDC
|
||||
`tune`-TTL change, deliberately NOT applied; see residual items).
|
||||
- Dropped the orphaned `matrix` Postgres DB (16 MB) + `matrix` role on the
|
||||
CNPG primary (`pg-cluster-2`).
|
||||
- Docs updated: `.claude/CLAUDE.md` (PG-rotation list), `service-catalog.md`,
|
||||
`upgrade-config.json` (removed synapse image-rename + matrix PG entry),
|
||||
`authentication.md` + `authentik-state.md` (Matrix OIDC → orphaned).
|
||||
|
||||
## Rollback
|
||||
|
||||
Fresh start was confirmed, so there is no Synapse data to preserve. To revert the
|
||||
*service*: restore the Synapse `main.tf` from git, re-add the `pg_matrix` Vault
|
||||
role, and restore the `matrix` Postgres DB from the daily per-db dump
|
||||
(`/backup/per-db/matrix/`). The reused encrypted PVC still holds Synapse's old
|
||||
`homeserver.yaml` / signing key / media at the volume root alongside the new
|
||||
RocksDB dir.
|
||||
|
||||
## Residual / follow-up items (flagged to user)
|
||||
|
||||
- **Authentik Matrix OAuth2 app — REMOVED 2026-06-08** (user-confirmed). It was
|
||||
UI-managed (NOT in the authentik TF stack), so it was deleted via the Authentik
|
||||
API: application `matrix` + OAuth2 provider `pk=6`. tuwunel uses native password
|
||||
auth, so nothing consumed it.
|
||||
- **Pre-existing drift in `stacks/vault`**: `vault_jwt_auth_backend.oidc` shows a
|
||||
`tune` diff (explicit `768h` default/max lease TTLs being dropped). This
|
||||
predates this migration and was **not** applied. Resolve separately.
|
||||
- **Synapse leftover files** remain on the encrypted PVC volume root (unused by
|
||||
tuwunel). They can be removed with `rm` once the new server has proven stable.
|
||||
|
||||
## Follow-up: open registration + bot mitigations (2026-06-08, user-chosen)
|
||||
|
||||
Registration was opened **fully (tokenless)** — `TUWUNEL_ALLOW_REGISTRATION=true`
|
||||
+ `TUWUNEL_YES_I_AM_VERY_VERY_SURE_I_WANT_AN_OPEN_REGISTRATION_SERVER_PRONE_TO_ABUSE=true`,
|
||||
and the `TUWUNEL_REGISTRATION_TOKEN` env was dropped (the Vault `secret/matrix` token +
|
||||
`matrix-secrets` ESO are kept so that a single env change reverts to token-gated). tuwunel
|
||||
has **no CAPTCHA** (only Synapse does) and a browser challenge would break native
|
||||
clients, so bot defense is layered instead:
|
||||
|
||||
- **Traefik rate-limit on `/register`** — a `register-ratelimit` Middleware
|
||||
(`stacks/matrix`) on a path-scoped `ingress_register` carve-out (longer prefix
|
||||
wins over the catch-all). Keyed on the **request Host (global `/register` cap),
|
||||
not source IP** — because the host is reachable both via Cloudflare-IPv4
|
||||
(`CF-Connecting-IP`) and **IPv6-direct (HE tunnel → pfSense HAProxy → Traefik,
|
||||
no CF header)**; a per-source key let IPv6 bots bypass the limit entirely (found during
|
||||
testing). 10/min, burst 20, **per Traefik replica (×3)**.
|
||||
- **CrowdSec** (already on the ingress chain) is the hard backstop — bans abusive
|
||||
IPs on both paths; covers the per-replica looseness of the soft rate-limit.
|
||||
- **Notification:** Loki ruler rule `MatrixNewUserRegistered` (`stacks/monitoring`,
|
||||
matches `... registered on this server`, never the rejection line) → `lane=security`
|
||||
→ existing `#security` Slack receiver. Also note tuwunel's admin bot
|
||||
(`@conduit:matrix.viktorbarzin.me`) **natively posts every registration to the
|
||||
server admin room**, so there's an in-Matrix notice too.
|
||||
- **Verification:** open signup returns 200 (`@regtest1`, since deactivated via
|
||||
`!admin users deactivate` in the admin room); Traefik access logs confirm
|
||||
`/register` routes through the rate-limited carve-out router. A live 429 was not
|
||||
force-tested (aggregate burst ~60 across 3 replicas; hammering was avoided so as
|
||||
not to trip CrowdSec on the test source IP).
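The per-replica arithmetic above (10/min, burst 20, ×3 replicas) can be checked with a small simulation. This is a minimal sketch, assuming a plain token bucket per replica and perfectly even round-robin load balancing; `TokenBucket` and `allowed_across_replicas` are hypothetical names for illustration, not Traefik's implementation.

```python
class TokenBucket:
    """Simplified token bucket: `rate_per_min` refill, `burst` capacity (assumed model)."""

    def __init__(self, rate_per_min, burst):
        self.rate_per_min = rate_per_min
        self.burst = burst
        self.tokens = float(burst)  # starts full
        self.last = 0.0             # timestamp of the previous request, in seconds

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_min / 60.0)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def allowed_across_replicas(n_requests, now, buckets):
    """Send n_requests at time `now`, spread round-robin over the replicas' buckets."""
    return sum(buckets[i % len(buckets)].allow(now) for i in range(n_requests))


# Three replicas, each with its own independent 10/min, burst-20 bucket.
buckets = [TokenBucket(rate_per_min=10, burst=20) for _ in range(3)]
first_wave = allowed_across_replicas(100, 0.0, buckets)    # instantaneous flood
second_wave = allowed_across_replicas(100, 60.0, buckets)  # one minute later
print(first_wave, second_wave)
```

Under these assumptions a flood is admitted up to the aggregate burst (3 × 20) and then settles at the aggregate refill rate (3 × 10 per minute), which is the looseness CrowdSec is expected to cover.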
|
||||
|
||||
**Add a user:** anyone can self-register now. To provision manually instead:
|
||||
`!admin users create-user <name>` in the admin room (first user `@viktor` is admin).
|
||||
**Revert to token-gated:** drop the YES_I_AM... flag, re-add `TUWUNEL_REGISTRATION_TOKEN`.
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
# Workstation Membership v2 — Authentik-group-driven, email-identified
|
||||
|
||||
**Status:** designed 2026-06-09, awaiting implementation. **Supersedes the *membership* model** of `2026-06-07-multi-user-workstation-design.md` (which used `roster.yaml` as the source of truth). Everything else in v1 stands unchanged — config inheritance (managed `claudeMd` + `~/.claude` symlinks), the per-user git-crypt-locked clone, the generic OIDC kubeconfig, swap, the `o-rx` admin-tree hardening, the emo cutover. This doc changes **only how workstation membership is defined and reconciled.**
|
||||
|
||||
## Problem
|
||||
|
||||
In v1 a workstation user is defined across three places with three identifiers (`os_user` / `authentik_user` / `k8s_user`, plus `email`): a git `roster.yaml`, a separate Authentik group, and Vault `k8s_users`. The definition is confusing and scattered. Goal: **one definition, in Authentik, keyed by email; group membership grants the workstation.**
|
||||
|
||||
## Key principle: workstation access ≠ cluster authorization
|
||||
|
||||
These are independent axes and must not be conflated:
|
||||
|
||||
- **Workstation access** — "may you have an account on the devvm, reachable via t3?" A yes/no. Everyone who qualifies gets the *identical* non-admin setup (constrained account + locked `~/code` clone + generic kubeconfig + inherited config). The only distinction is **admin (wizard, the host owner — unlocked tree + sudo) vs non-admin (everyone else, identical).** No power-user/namespace-owner distinction exists at the workstation layer — it would not change a single provisioned file.
|
||||
- **Cluster authorization** — "what may you do via `kubectl`?" That is RBAC, already group-driven (`kubernetes-admins/power-users/namespace-owners` + Vault `k8s_users`) and applied at `kubectl` time by the user's own OIDC identity. The workstation neither knows nor cares. **Untouched by this design.**
|
||||
|
||||
Collapsing the (redundant) workstation tiers is what keeps v2 small.
|
||||
|
||||
## Model
|
||||
|
||||
- **`T3 Users` Authentik group is the single control for workstation access.** It does both halves: (1) the Authentik edge gate admits its members to `t3.viktorbarzin.me`; (2) the provisioner creates a devvm account + `t3-serve` instance + locked clone/config/kubeconfig for each member. Either half alone is useless; together = "you have a workstation." Both already exist from the 2026-06-08 work — this design changes how *membership* is sourced.
|
||||
- **A workstation user is fully defined by their Authentik account:** `email` (the one identity — OIDC subject, dispatch key, RBAC subject) + `T3 Users` membership + an optional `os_user` **attribute** (only to pin a legacy Linux name like `emo`; otherwise the os_user is derived from the email). Nothing in git or Vault defines workstation membership.
|
||||
- **The provisioner reconciles from the Authentik API** (lists `T3 Users` members + their `os_user` attribute) → provisions the identical non-admin workstation for each member ≠ wizard. `roster.yaml` **retires** as the membership source. wizard is special-cased as the admin/owner (keeps his unlocked tree + sudo; never gets a locked clone).
|
||||
- **`k8s_users`, cluster RBAC, Vault per-user isolation, Woodpecker/Cloudflared/dashboard — all untouched.** A workstation user's `kubectl` powers are whatever their existing cluster identity grants.
|
||||
|
||||
## Components
|
||||
|
||||
1. **Authentik** (`stacks/authentik`)
|
||||
- `T3 Users` group — already created + already wired into the edge policy (`admin-services-restriction.tf`, the `t3.viktorbarzin.me` branch). **Change:** drop the HCL `users = [...]` from the group resource so membership is managed *in Authentik* (UI/API), not in Terraform. Dropping the arg leaves current members intact (Terraform stops managing the list, doesn't clear it).
|
||||
- Optional per-user `os_user` **attribute** (Authentik user custom attribute) for legacy/override names.
|
||||
- A **read-only API token** scoped to read group membership, stored in Vault (`secret/authentik` → a new `t3_provision_token` field), dropped to a root-readable file (`/etc/t3-serve/authentik-token`, mode 0600 root) by `setup-devvm.sh` so the hourly root provisioner can call the API. (Root has no Vault token; this mirrors how other root-side secrets are staged.)
|
||||
|
||||
2. **Engine** (`scripts/workstation/roster_engine.py`, pure, pytest) — new functions:
|
||||
- `derive_os_user(email, os_user_attr) -> str` — `os_user_attr` if set, else `sanitize(local_part(email))`.
|
||||
- `desired_accounts(members, existing_ports) -> DesiredState` — given the Authentik member list (each `{email, os_user_attr?}`), produce the same `DesiredState` shape v1 derives from the roster (accounts, sticky ports, ttyd-map, dispatch). Keying: the ttyd-map/dispatch key is the **email local-part** (what `t3-dispatch` matches from `X-authentik-username`, e.g. `emil.barzin`); `os_user` is the derived/override Linux name (e.g. `emo`); `email` is the identity for RBAC/Vault. So a member resolves to a `local-part=os_user` map line — exactly the shape of today's `/etc/ttyd-user-map`. Reuses the existing additive-only + sticky-port logic.
|
||||
|
||||
3. **Provisioner** (`t3-provision-users.sh`) — replace the `roster.yaml` read with an Authentik API query (members of `T3 Users` + their `os_user` attribute) → feed the engine → apply (account, locked clone, kubeconfig, ttyd-map/dispatch, `t3-serve@`). **Best-effort:** if the token/API is unavailable, log a warning and make no membership changes (existing accounts untouched) — same posture as v1's k8s_users validation. wizard special-cased.
|
||||
|
||||
4. **Migration / retirement** — `roster.yaml` deleted; the provisioner no longer reads it.
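The keying described for the engine (email local-part as the map/dispatch key, `os_user` as the Linux name) and the reversible cut for removed members can be sketched as follows. This is a hedged illustration only: `ttyd_map_lines` and `reversible_cut` are hypothetical helpers, members are assumed to arrive with `os_user` already derived, and sticky ports and the wizard special case are left out.

```python
def ttyd_map_lines(members):
    """Build `local-part=os_user` lines from members whose os_user is already derived."""
    lines = []
    # Sort by email so the generated map is stable between reconciles.
    for member in sorted(members, key=lambda m: m["email"]):
        local_part = member["email"].split("@")[0]
        lines.append(f"{local_part}={member['os_user']}")
    return lines


def reversible_cut(previous_lines, current_lines):
    """Map lines that disappear on this reconcile; the accounts themselves are kept."""
    return sorted(set(previous_lines) - set(current_lines))


# Hypothetical member lists before and after removing one user from the group.
before = ttyd_map_lines([
    {"email": "emil.barzin@gmail.com", "os_user": "emo"},
    {"email": "guest@example.com", "os_user": "guest"},
])
after = ttyd_map_lines([{"email": "emil.barzin@gmail.com", "os_user": "emo"}])

print(before)
print(reversible_cut(before, after))
```

Keeping these functions pure (lists in, lists out) is what lets the engine be covered by pytest without touching the host or the Authentik API.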
|
||||
|
||||
## Data flow
|
||||
|
||||
```
Admin: create/locate Authentik user (email) → add to "T3 Users" group [+ optional os_user attr]
        │
hourly t3-provision-users (root) ── reads /etc/t3-serve/authentik-token
        │   GET Authentik API: members of "T3 Users" (+ os_user attr)
        ▼
roster_engine.desired_accounts(members) → desired state (email-keyed)
        ▼
for each member ≠ wizard: ensure account (os_user derived/override), locked ~/code clone,
        generic kubeconfig; regenerate /etc/ttyd-user-map + dispatch.json; enable t3-serve@<os_user>
        ▼
Authentik edge gate already admits "T3 Users" → member logs into t3.viktorbarzin.me → their instance
```
|
||||
|
||||
Remove from `T3 Users` → next reconcile: the member drops out of the regenerated map/dispatch (dispatcher 403s) — the **reversible cut**. Destructive `userdel` stays a separate, gated step (per the offboarding runbook).
|
||||
|
||||
## os_user derivation
|
||||
|
||||
`os_user = attributes.os_user` if present, else `sanitize(email.split("@")[0])` where `sanitize` = lowercase, replace each run of `[^a-z0-9_-]` with `_`, strip leading/trailing `_`, truncate to 32 chars (Linux username limit). Example: `emil.barzin@gmail.com → emil_barzin`. **Collisions** (two emails → same os_user) are resolved by setting an explicit `os_user` attribute; the engine flags a collision rather than silently merging. **Legacy:** emo's existing account is kept by setting his Authentik `os_user` attribute to `emo`.
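The derivation rule above can be sketched directly in Python. This is a minimal sketch following the stated steps in order (lowercase, replace runs, strip, truncate); the `ValueError` for an underivable name is an assumption about how the engine might signal that case, not something the design specifies.

```python
import re

MAX_USERNAME_LEN = 32  # truncation length stated in the derivation rule


def sanitize(local_part):
    """Lowercase, collapse each run of disallowed characters to `_`, strip, truncate."""
    name = local_part.lower()
    name = re.sub(r"[^a-z0-9_-]+", "_", name)
    name = name.strip("_")
    return name[:MAX_USERNAME_LEN]


def derive_os_user(email, os_user_attr=None):
    """Return the explicit attribute if set, else a name derived from the email."""
    if os_user_attr:
        return os_user_attr
    name = sanitize(email.split("@")[0])
    if not name:
        # Assumption: an empty result is treated as underivable rather than guessed.
        raise ValueError(f"cannot derive os_user from {email!r}")
    return name


print(derive_os_user("emil.barzin@gmail.com"))         # derived from the local part
print(derive_os_user("emil.barzin@gmail.com", "emo"))  # legacy pin via attribute
```

Collision detection is deliberately not part of this function, since it needs the whole member list rather than a single email.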
|
||||
|
||||
## Error handling
|
||||
|
||||
- Authentik token/API unavailable → warn, skip membership reconcile, leave existing accounts untouched (never break on a transient API failure).
|
||||
- A member with a colliding/underivable os_user and no override attribute → skip that member + warn (do not guess).
|
||||
- Additive-only for existing accounts (never strip groups, replace `~/code`, or rewrite secrets). Removed members → reversible cut now; `userdel` only via the explicit gated offboarding path.
|
||||
- wizard is never reconciled from the group (special-cased), so a mistaken group edit can't touch the admin/owner account.
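The first two rules above (skip the reconcile on API failure, skip colliding members without an override) can be sketched as one pure planning step. This is an illustrative sketch under stated assumptions: `plan_membership` is a hypothetical name, `fetch_members` stands in for the Authentik API query, each member dict carries an already-derived `os_user` plus an `override` flag, and returning `None` is the assumed signal for "make no membership changes".

```python
import logging
from collections import Counter

log = logging.getLogger("t3-provision")


def plan_membership(fetch_members):
    """Return the members to reconcile, or None to leave everything untouched."""
    try:
        members = fetch_members()
    except Exception as exc:  # transient API or token failure: never break on it
        log.warning("Authentik unavailable, skipping membership reconcile: %s", exc)
        return None

    # Count how many members claim each os_user to detect collisions.
    claims = Counter(m["os_user"] for m in members)
    planned = []
    for m in members:
        if claims[m["os_user"]] > 1 and not m["override"]:
            log.warning("os_user collision for %s, skipping (set an os_user attribute)", m["email"])
            continue
        planned.append(m)
    return planned


def unavailable():
    raise ConnectionError("token file missing")


members = [
    {"email": "a.b@example.com", "os_user": "a_b", "override": False},
    {"email": "a_b@example.org", "os_user": "a_b", "override": False},
    {"email": "c@example.com", "os_user": "c", "override": False},
]
print(plan_membership(unavailable))
print(plan_membership(lambda: members))
```

Distinguishing `None` (no information, change nothing) from an empty list (the group is genuinely empty) keeps a transient outage from looking like a mass offboarding.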
|
||||
|
||||
## apiserver-OIDC caveat (does NOT affect this design)
|
||||
|
||||
The generic kubeconfig's auth method (OIDC via kubelogin vs the dashboard's SA-token pattern) hinges on the contested question of whether the apiserver accepts Authentik OIDC tokens (a 2026-06-04 memory says it rejects them; the `kubernetes`-audience `AuthenticationConfiguration` this session's RBAC work bound against suggests otherwise). This affects only how `kubectl` authenticates — **not** workstation membership. To be verified during implementation with a live OIDC login; if OIDC is rejected, the kubeconfig falls back to per-user SA-tokens (the existing dashboard mechanism), with no change to this membership model.
|
||||
|
||||
## Testing
|
||||
|
||||
- **Unit (pytest, extends `test_roster_engine.py`):** `derive_os_user` (sanitization, attribute override, collision detection); `desired_accounts` (member list → desired state; the additive-only invariant; offboard diff for a removed member). Pure, no host/API I/O.
|
||||
- **Smoke (live):** add a throwaway Authentik user to `T3 Users` → run the provisioner → confirm the account + `t3-serve` instance + locked clone appear and login routes correctly; remove from the group → confirm the reversible cut (dispatcher 403s, account retained).
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Cluster RBAC re-architecture — `k8s_users` and the 5 consumers stay as-is.
|
||||
- Making Authentik the SSoT for the *cluster* (a separate, larger future epic).
|
||||
- Workstation tier-groups (not needed — the workstation is admin-vs-non-admin only).
|
||||
- Multiple admins (wizard is the sole special-cased admin; add a `T3 Admins` group only if a second admin is ever needed).
|
||||
|
||||
## Migration plan (summary; full steps in the implementation plan)
|
||||
|
||||
1. Drop the HCL `users` from the `T3 Users` group (members stay; now Authentik-managed); apply `stacks/authentik`.
|
||||
2. Set emo's Authentik `os_user` attribute to `emo` (legacy pin); wizard needs none (special-cased).
|
||||
3. Ship the engine functions + tests; switch the provisioner to the Authentik-API source; stage the read-only token via `setup-devvm.sh`.
|
||||
4. Verify a reconcile reproduces the current accounts (wizard/emo/ancamilea) exactly.
|
||||
5. Delete `roster.yaml` + its references; update `service-catalog.md` + `multi-tenancy.md`.
|
||||