Compare commits

..

No commits in common. "a3bcb5e12f5dadabda3678cc22ff528409f7729c" and "a5e9fd8c710da3ae6354a491fbef96fd29f821f9" have entirely different histories.

283 changed files with 5328 additions and 30502 deletions

View file

@ -28,16 +28,9 @@ Violations cause state drift, which causes future applies to break or silently r
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma)
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
- `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, foolery, any admin UI shipped without its own login).
- `auth = "app"` — the backend handles its own user authentication (NextAuth, Django, OAuth, bearer-token API, etc.); Authentik would only break it. No middleware attached; the app's own login is the gate. Examples: immich, linkwarden, tandoor, freshrss, affine, actualbudget, audiobookshelf, novelapp. **Functionally identical to `"none"`** — the distinct name exists to record intent at the call site.
- `auth = "public"` — Authentik anonymous binding via the dedicated `public` outpost (routes via `traefik-authentik-forward-auth-public``ak-outpost-public.authentik.svc:9000`). Strangers auto-bound to `guest`; logged-in users keep their identity in `X-authentik-username`. **Only works for top-level browser navigation** — CORS preflight rejects XHR/fetch and automation can't replay the cookie dance. Audit trail, not a gate.
- `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
- **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "<tier>": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
- **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
- **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
- **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct), declare a second `ingress_factory` with `ingress_path = ["/api"]` pointing at the bare backend service. Active on: blog, www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
- **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
- **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
- **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
@ -136,7 +129,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
## Monitoring & Alerting
@ -147,17 +140,6 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
## Security Posture (Wave 1 — locked 2026-05-18)
Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
## Storage & Backup Architecture
### Storage Class Decision Rule (for new services)
@ -195,7 +177,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "<service>-data-proxmox"
namespace = kubernetes_namespace.<ns>.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -231,7 +213,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "<service>-data-encrypted"
namespace = kubernetes_namespace.<ns>.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -287,8 +269,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
## Known Issues
- **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set <name> <json>` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`).
- **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects/<x>/*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects/<source>/` and run `openclaw memory index --force`. This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync.
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
- **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change.
## User Preferences

View file

@ -1,543 +0,0 @@
---
name: k8s-version-upgrade-DEPRECATED
description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
tools: Read, Write, Edit, Bash, Grep, Glob
model: opus
---
# DEPRECATED — Do NOT invoke this agent
Retired **2026-05-11** after a self-preemption incident: this agent ran inside
the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
workers at v1.34.2).
## Replaced by
A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
preempt itself because each Job's pod and its target node are always
different.
| Old | New |
|-----|-----|
| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |
## Where the logic lives now
- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
every Job pod.
- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
stuck Job, skip a phase, manually re-trigger from a specific phase).
## Why kept (not deleted)
Documents the prompted-agent design and is useful as historical reference when
reading post-mortem discussions or comparing approaches. The `name` field has
been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
`claude-agent-service`.
---
# Original prompt — DO NOT EXECUTE (reference only)
You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
## Your Job
Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
## Inputs
The user prompt contains a JSON object with these fields:
```json
{
"target_version": "1.34.5",
"kind": "patch",
"dry_run": false,
"stages": "all"
}
```
| Field | Required | Description |
|---|---|---|
| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload").
## Environment
- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role <role> --release <X.Y.Z>`.
### Credentials — fetched at startup
The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
```bash
KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
# SSH private key — mode 0400 required by openssh
$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
chmod 400 /tmp/k8s-upgrade-ssh-key
# Slack webhook (URL string)
SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
-o jsonpath='{.data.slack_webhook}' | base64 -d)
```
The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
```bash
SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
```
Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
## NEVER do
- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
- Never skip the etcd snapshot — even for patch
- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
- Never run two stages in parallel — sequential only
- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
## Slack + Pushgateway helpers
Every transition posts to Slack:
```bash
slack() {
local msg="$1"
local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
curl -sS -X POST -H 'Content-Type: application/json' \
--data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
"$hook"
}
```
Start every message with `[k8s-upgrade]` so it's grep-able.
Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
```bash
PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
push_metric() {
# push_metric <name> <value>
local name="$1" val="$2"
printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
| curl -sS --data-binary @- "$PG"
}
```
Pushes you must make at specific stages (skipped in dry_run):
| When | Metric | Value |
|---|---|---|
| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
## Stage 0: Parse inputs + announce
1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
2. Derive `target_minor` from `target_version` (split on `.`).
3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
```bash
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
viktorbarzin.me/k8s-upgrade-target="$target_version" \
--overwrite
push_metric k8s_upgrade_in_flight 1
push_metric k8s_upgrade_snapshot_taken 0
fi
```
4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
## Stage 1: Pre-flight (`stages` includes `preflight`)
Skip if `stages` excludes `preflight`.
### Check 1.1 — All nodes Ready, no pressure
```bash
kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
| jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
```
Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
### Check 1.2 — Halt-on-alert (same query kured uses)
```bash
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
if [ -n "$ALERTS" ]; then
slack "ABORT preflight — firing alerts:\n$ALERTS"
exit 1
fi
```
### Check 1.3 — 24h-quiet baseline
Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
```bash
RECENT_REBOOT=0
while IFS= read -r ts; do
[ -z "$ts" ] && continue
diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
[ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
if [ "$RECENT_REBOOT" -eq 1 ]; then
slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
exit 1
fi
```
### Check 1.4 — kubeadm upgrade plan reports our target
```bash
PLAN_TARGET=$($SSH \
wizard@k8s-master 'sudo kubeadm upgrade plan' \
| grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \
| grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
```
If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort:
"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.
```bash
JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"
if [ "$dry_run" = "false" ]; then
$KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"
# Wait up to 10 min for snapshot Job to complete
$KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
$KUBECTL -n default describe "job/$JOB_NAME" | tail -30
exit 1
}
# Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
echo "$LOG"
SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')
if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
exit 1
fi
TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
$KUBECTL annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
push_metric k8s_upgrade_snapshot_taken 1
else
TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
SIZE="dry-run"
fi
slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
```
## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
Only run if master containerd version < highest worker containerd version.
```bash
get_ctr_version() {
$SSH \
"wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
}
MASTER_CTR=$(get_ctr_version k8s-master)
WORKER_MAX="0.0.0"
for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
v=$(get_ctr_version "$n")
# Compare semver-ish
if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
WORKER_MAX="$v"
fi
done
if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
&& [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
# Master is behind — bump
slack "Master containerd $MASTER_CTR < workers $WORKER_MAX bumping master"
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master "sudo apt-mark unhold containerd.io \
&& sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
&& sudo apt-mark hold containerd.io \
&& sudo systemctl restart containerd"
# Wait until kubelet on master is Ready again
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
[ "$STATUS" = "True" ] && break
sleep 10
done
[ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
fi
slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
else
echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
fi
```
## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
Only run if `kind=minor`.
For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
```bash
target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
if [ "$dry_run" = "false" ]; then
$SSH \
"wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
&& curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
&& sudo apt-get update"
fi
```
Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
## Stage 5: Master upgrade (`stages` includes `master`)
```bash
# 5.1 Drain
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
fi
# 5.2 Run the library script via SSH pipe
if [ "$dry_run" = "false" ]; then
$SSH \
wizard@k8s-master 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role master --release "$target_version"
fi
# 5.3 Uncordon + wait Ready
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
fi
for i in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
# 5.4 All control-plane pods Running
NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
-l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
# 5.5 Re-check halt-on-alert
# (re-run the Check 1.2 query, abort if anything new fires)
slack "Master upgrade complete. Cluster on v$target_version. Healthy."
```
## Stage 6: Workers sequentially (`stages` includes `workers`)
Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).
For each worker `$node`:
1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
3. SSH pipe `update_k8s.sh --role worker --release $target_version`
4. `kubectl uncordon $node`
5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
7. Slack: `Worker $node complete ($i/4)`.
```bash
WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
i=0
for node in $WORKERS; do
i=$((i+1))
# Halt-on-alert recheck with retry
for attempt in $(seq 1 30); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -z "$ALERTS" ] && break
echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
sleep 60
done
[ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
--ignore-daemonsets --delete-emptydir-data --force --grace-period=300
$SSH \
"wizard@$node" 'bash -s' \
< $WORKSPACE_DIR/scripts/update_k8s.sh \
-- --role worker --release "$target_version"
kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
fi
# Wait Ready + version match
for w in $(seq 1 60); do
STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
-o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
sleep 15
done
[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
|| { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
# 10-min soak with halt-on-alert
echo "Soaking $node for 10 min..."
for sec in $(seq 1 10); do
ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
| sort -u)
[ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
sleep 60
done
slack "Worker $node upgrade complete ($i/4). Soaked clean."
done
```
Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts).
## Stage 7: Post-flight (`stages` includes `postflight`)
```bash
# All 5 nodes at target
VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
-o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
echo "$VERSIONS"
WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; }
# Upgrade Gates all inactive
FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
| jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
| grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
| sort -u)
[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING"
# pod-ready ratio >= 0.9
RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
--data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
| jq -r '.data.result[0].value[1] // "0"')
slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
# Clear the in-flight annotation + Pushgateway gauges
if [ "$dry_run" = "false" ]; then
kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path- || true
push_metric k8s_upgrade_in_flight 0
push_metric k8s_upgrade_snapshot_taken 0
fi
slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
```
## Rollback
This agent does NOT auto-rollback. If anything aborts mid-flight:
1. Slack the failure with the last known stage + node.
2. Leave the in-flight annotation in place (the operator clears it manually after triage).
3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
## Notes for tests
- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
## Edge cases
- **Slack down**: Don't block the upgrade — continue, log to stderr.
- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
## Verification claims you must make
When you `slack` a SUCCESS message, you must have actually verified:
- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
- No alerts firing outside the ignore-list
- pod-ready ratio computed from Prometheus
Do not declare success without those three confirmations.

View file

@ -127,65 +127,10 @@ Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
- Embedded outpost session storage moved from `/dev/shm` → Postgres table `authentik_providers_proxy_proxysession` in authentik 2025.10. The 2026-04-18 `/dev/shm`-fill outage class is no longer load-bearing in 2026.2.2; the `unauthenticated_age` cap is still the right lever for anonymous-session bloat from external monitors.
- `ProxyProvider.access_token_validity` and `remember_me_offset` stay UI-managed via `ignore_changes`.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
## Upgrade Validation Checklist
Run after **any** of these:
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
- `goauthentik/authentik` Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
# A row count > a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
```
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52` — the upstream patch might let us drop our `kubernetes_json_patches.service` workaround.

View file

@ -53,7 +53,6 @@
| insta2spotify | Instagram reel song ID to Spotify playlist | insta2spotify |
| trading-bot | Event-driven trading with sentiment analysis | trading-bot |
| claude-memory | Persistent memory MCP server | claude-memory |
| paperless-mcp | Paperless-ngx document search MCP (barryw/PaperlessMCP). Traefik bearer auth via Aetherinox api-token-middleware. `auth=none` at ingress; gateway-level bearer enforced by `paperless-mcp/bearer-auth` Middleware CRD. Tokens + paperless API token in Vault `secret/paperless-mcp`. | paperless-mcp |
| council-complaints | Islington civic reporting pilot | council-complaints |
## Optional
@ -79,7 +78,6 @@
| paperless-ngx | Document management | paperless-ngx |
| jsoncrack | JSON visualizer | jsoncrack |
| servarr | Media automation (Sonarr/Radarr/etc) | servarr |
| aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/MediaFusion/StremThru/Knaben). `auth=app` (own UUID+password); canary stream-probe + 3 alerts; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config. | servarr/aiostreams |
| ntfy | Push notifications | ntfy |
| cyberchef | Data transformation | cyberchef |
| diun | Docker image update notifier — detects new versions, fires webhook to n8n upgrade agent | diun |

View file

@ -7,9 +7,8 @@ description: |
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
(4) User mentions "health check", "cluster status", "cluster health",
(5) User asks "is everything running" or "any problems".
Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability, PVE host thermals + load) with safe
auto-fix for evicted pods.
Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
backups, external reachability) with safe auto-fix for evicted pods.
author: Claude Code
version: 2.0.0
date: 2026-04-19
@ -67,7 +66,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
```
## What It Checks (44 checks)
## What It Checks (42 checks)
| # | Check | Notes |
|---|-------|-------|
@ -113,8 +112,6 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL 83 °C (TjMax) |
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL 38 of 44 threads |
## Safe Auto-Fix Rules
@ -259,9 +256,9 @@ kubectl logs -n external-secrets deploy/external-secrets --tail=100
kubectl get pods -n cloudflared
kubectl logs -n cloudflared -l app=cloudflared --tail=100
# Authentik (Helm chart names the deployment goauthentik-server)
kubectl get deployment -n authentik goauthentik-server
kubectl logs -n authentik deploy/goauthentik-server --tail=100
# Authentik
kubectl get pods -n authentik -l app=authentik-server
kubectl logs -n authentik -l app=authentik-server --tail=100
# ExternalAccessDivergence alert
kubectl exec -n monitoring deploy/prometheus-server -- \
@ -298,133 +295,6 @@ kubectl exec -n monitoring deploy/prometheus-server -- \
- Exit code 143 → SIGTERM / graceful shutdown failed
3. Cross-check dbaas + NFS + secrets are healthy.
## Performance forensics — top consumers + optimization hints
When the cluster is healthy (script returns 0) but the host is hot or load
is elevated, switch from "what broke?" to "what's expensive?". Run these
in order; stop as soon as the root cause is obvious.
### Step 1 — Snapshot top consumers cluster-wide
```bash
# Top 15 pods by current CPU
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
# Top 5 nodes by CPU + memory pressure
kubectl top nodes
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
| python3 -m json.tool | head -80
```
### Step 2 — For each suspect pod, get the WHY
For every pod in the top-N, gather these BEFORE proposing a fix:
```bash
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
# What it does (image + command)
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
# Resource limits + current usage
kubectl -n $NS top pod $POD --containers
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
# Recent logs filtered for reconcile loops, watch storms, slow queries
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
# Restart count + recent OOM
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
# Self-exported metrics (for apps that publish on /metrics)
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
```
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
```bash
# Top request producers by verb+resource (last 30 min)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Top user agents (which clients are hammering)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
| python3 -m json.tool
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
| python3 -m json.tool
# etcd write rate + DB size
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
| python3 -m json.tool
```
### Step 4 — PVE host specific deep-dive (when temp / load is high)
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
thresholds — that's the first stop. When those WARN or FAIL, the
follow-up commands below trace which VM / process is the source:
```bash
# Per-core temps (broader than the package summary in check 43)
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
val=$(cat "$f"); echo " $label: $((val/1000))°C"
done'
# Per-VM CPU (each VM = one kvm process)
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
# Stale snapshots (any '_pre-*' that survived past their rollback window)
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
```
### Step 5 — Optimization decision
For each consumer in the top-N, fill in a row:
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
|---|---|---|---|---|---|---|
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
### Common causes + tunables (catalogue)
| Symptom | Likely cause | Tunable |
|---|---|---|
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
### What NOT to touch
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
### Source-of-truth notes
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
## Notes on the canonical / hardlink setup
The authoritative copy of this SKILL.md lives at

View file

@ -1,199 +0,0 @@
---
name: upgrade-state
description: |
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
unattended-upgrades+kured, K8s components via the version-check chain).
Use when:
(1) User asks "/upgrade-state" or "are we current",
(2) User asks "what's pending upgrade" or "what's the upgrade state",
(3) User asks if Keel / kured / k8s-version-check is healthy,
(4) User asks about kept-back / held packages or pending reboots,
(5) Periodic survey before the next `k8s-version-check` daily run.
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
author: Claude Code
version: 1.0.0
date: 2026-05-18
---
# Upgrade-state
## MANDATORY: Run the script first
When this skill is invoked, your **first action** must be to run
`upgrade_state.sh` and reason over its output before doing anything
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
is the authoritative surface.
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh
```
For programmatic use:
```bash
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
```
Then:
1. Report the rendered table verbatim — it answers the user's
"are we current" question in three lines.
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
underneath and propose a next action (links in the table below).
3. Only reach for ad-hoc commands when investigating beyond what the
script reported.
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
## What it covers (3 pipelines)
| Layer | What runs | Cadence | Data sources |
|---|---|---|---|
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
The K8s pipeline pushes a small set of gauges to the Prometheus
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
- `k8s_version_check_last_run_timestamp` — when detection last ran
- `k8s_upgrade_in_flight` — 0/1
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
been running >90 minutes. The script raises `✗` in the same window.
## Status-icon legend
| Icon | Meaning |
|---|---|
| `✓` | Healthy, fully current |
| `→` | Update available, not yet applied (K8s patch/minor) |
| `…` | In flight — chain currently running |
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
| `✗` | Broken: pod down, alert firing, chain stalled |
## Drill-down — when a row trips, what to do
### Apps `⚠` — pending approvals or errors
```bash
# Read recent Keel log lines
kubectl -n keel logs deploy/keel --since=24h --tail=200
# What is Keel currently tracking?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=count by (image) (registries_scanned_total)'
# Is the scrape live?
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
```
Common Keel errors:
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
- `registry authentication required` — bad imagePullSecret on the watched Deployment
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
### OS `⚠` — held packages with bumps
The script flags any package held via `apt-mark hold` that ALSO appears
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
owns those) and the kernel (kured handles the reboot half).
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
runc 1.1 → 1.4). These are held because they need cluster-wide
coordination, not silent in-release patching.
```bash
# Inspect the situation on the flagged node
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
# Unhold + upgrade a specific package
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
```
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
### OS `⚠` — pending reboot
A node has `/var/run/reboot-required`. Kured will reboot it inside the
next 02:00-06:00 London window (any day of the week).
```bash
# Force a manual reboot inside the window (rare)
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
ssh wizard@10.0.20.10X sudo systemctl reboot
```
### OS `✗` — kured not Running
```bash
kubectl -n kured get pods
kubectl -n kured logs daemonset/kured --tail=100
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
kubectl -n kured get pods -l name=kured-sentinel-gate
```
### K8s `→` — patch/minor available
Detection ran, target identified, chain NOT started. The chain spawns
on the same daily detection cycle — typically within ~24h of the
target first being detected.
```bash
# Inspect Pushgateway state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
# Trigger a manual run of the detection CronJob
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
```
### K8s `…` — in flight
The Job chain is running. Watch its progress:
```bash
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
```
### K8s `✗ stalled``K8sUpgradeStalled` would fire
Chain in-flight >90m. The Job is most likely stuck on drain or a
pre-flight check.
```bash
kubectl -n k8s-upgrade get jobs
kubectl -n k8s-upgrade describe job <stuck-job>
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
# If you need to clear the in-flight flag (after diagnosing):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
--header='Content-Type: text/plain'"
```
### K8s `✗ detection stale` — last detection >9 days
```bash
kubectl -n k8s-upgrade get cronjob k8s-version-check
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
```
If the CronJob hasn't fired on time, suspect:
- `suspend=true` on the CronJob (`var.enabled=false` in the
`k8s-version-upgrade` Terraform stack)
- Image-pull failure on the version-check pod
- Pushgateway scrape gone stale
## Companion command-line flags
```bash
bash infra/scripts/upgrade_state.sh # rendered table (default)
bash infra/scripts/upgrade_state.sh --json # machine output
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
```

View file

@ -1,4 +0,0 @@
# git-crypt encrypts these at rest; the working-tree plaintext is local-only.
# gitleaks scans the staged working-tree copy and can't see that they're
# encrypted on disk in git, so allowlist by fingerprint.
stacks/recruiter-responder/secrets/privkey.pem:private-key:1

View file

@ -154,37 +154,6 @@ lifecycle {
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects three annotations on every Deployment / StatefulSet / DaemonSet:
```
keel.sh/policy: force
keel.sh/trigger: poll
keel.sh/pollSchedule: "@every 1h"
```
To suppress the resulting Terraform drift, **enrolled workloads** must extend their `ignore_changes` block:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
The V2 snippet is added **per workload** as namespaces are phase-enrolled — not as a mass sweep. Workloads in un-enrolled namespaces do not receive the annotation and don't need the V2 block.
Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment metadata (not pod template); the policy's `exclude` clause respects it, no annotation gets injected, no `ignore_changes` needed.
**Audit**: `rg "KYVERNO_LIFECYCLE_V2" stacks/` — count should equal the number of enrolled workloads.
**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
## Tier System
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)

View file

@ -1,150 +0,0 @@
# Infra
Terragrunt-managed homelab declaring a 5-node Kubernetes cluster on a single Proxmox host. Vault is the secrets source of truth; everything else flows from this repo via `scripts/tg apply`.
## Language
### Code organization
**Service**:
The deployed app as a domain concept — one logical thing that runs in the cluster (e.g. immich, technitium, freshrss). Defined by exactly one **Stack**.
_Avoid_: bare "app" without the Service definition; "deployment" (collides with K8s `Deployment`).
**Stack**:
The HCL directory under `stacks/<name>/` that defines a Service, applied independently with `scripts/tg apply`. A Stack is the unit of Terraform organisation; a Service is the running thing. They are 1:1 but not synonyms.
_Avoid_: using "Stack" when you mean the running Service.
**Module**:
A reusable HCL primitive under `modules/`, consumed by Stacks via `source =`.
_Avoid_: "library", "package".
**Factory module**:
A Module that hides convention (defaults, drift handling, secret wiring) behind a small input surface. Canonical examples: `ingress_factory`, `nfs_volume`, `k8s_app`, `helm_app`, `postgres_app`.
_Avoid_: "wrapper".
**State tier**:
Terraform state-backend partition. **Tier 0** = bootstrap Stacks (`infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`) on local SOPS-encrypted state. **Tier 1** = every other Stack, on PG-backed state.
_Avoid_: "phase", "bootstrap stack" — say Tier 0 explicitly.
### Cluster
**Node**:
A K8s worker VM (`k8s-master`, `k8s-node1..4`). Default reading of the bare word "node" in this repo.
_Avoid_: "k8s node" (redundant), "host" (ambiguous).
**PVE node** / **PVE host**:
The single physical Dell R730 running Proxmox; sole hypervisor and sole NFS server. There is exactly one.
_Avoid_: "server", "hypervisor", "Proxmox" alone when you mean the host.
**Namespace tier**:
A namespace-prefix partition (`0-core-*`, `1-cluster-*`, `2-gpu-*`, `3-edge-*`, `4-aux-*`) driving PriorityClass, default resources, and ResourceQuota — generated by **Kyverno policy** from the namespace name. Orthogonal to **State tier**.
_Avoid_: "Service tier" (the partition is on the namespace, not the Service); collapsing Namespace tier with State tier — they are different axes.
**Kyverno policy**:
The convention engine of the cluster — a ClusterPolicy or Policy resource that mutates/generates/validates on admission. Owns Namespace tier limits/quotas, `dns_config` injection on every pod-owning workload, Forgejo pull-credential sync across namespaces, TLS-secret replication. When the repo says "this happens automatically", a Kyverno policy is usually the actor.
_Avoid_: bare "policy" (overloaded with Vault, RBAC, NetworkPolicy).
**Critical-path Service**:
One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas ≥3, PDB enforced, monitored independently.
_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
**Namespace-owner**:
A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains.
_Avoid_: bare "user", "tenant".
### Networking
**Public domain**:
`viktorbarzin.me`, served through Cloudflare. DNS records are either **proxied** (Cloudflare CDN/WAF in front) or **non-proxied** (direct A/AAAA reachable via Cloudflared Tunnel).
_Avoid_: "external", "outside".
**Internal domain**:
`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
_Avoid_: bare "lan", "private", "intranet".
**Ingress auth tier**:
The `auth = "..."` parameter on `ingress_factory`, one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API).
_Avoid_: "auth mode" — the canonical key is `auth`.
**Authentik outpost**:
A standalone Authentik deployment that terminates the proxy/auth flow for a specific binding model. The repo runs two distinct ones: the default outpost (used by `auth = "required"`) and the `public` outpost (anonymous binding, used by `auth = "public"`).
_Avoid_: conflating outpost with Authentik core; "Authentik instance".
**Cloudflared Tunnel**:
The channel by which non-proxied **public domain** traffic reaches the cluster, terminating at Traefik. Backs every `dns_type = "non-proxied"` record and is the fallback path for the wildcard `*.viktorbarzin.me`.
_Avoid_: "the tunnel" without "Cloudflared" (could mean Headscale).
**Ingress chain**:
The opinionated stack of Traefik middlewares that `ingress_factory` layers onto every Ingress. Slots, in order: forward-auth (per **Ingress auth tier**) → anti-AI scraping (default-on when no Authentik is in the path) → CrowdSec bouncer (fail-open) → retry (2× / 100ms) → rate-limit (429, not 503). Adding or removing a middleware is a Stack-level choice, but the chain order is convention.
_Avoid_: "middleware list", "Traefik chain". The Anubis PoW gate is upstream of this chain, not inside it.
### Storage
**proxmox-lvm-encrypted**:
Default StorageClass for any workload holding sensitive data (databases, auth, password managers, email, financial data). LUKS2 over a Proxmox LVM-thin LV.
_Avoid_: bare "encrypted PVC" — name the StorageClass.
**proxmox-lvm**:
Block StorageClass for non-sensitive workloads (caches, monitoring data, indexes, app state without secrets).
**NFS volume**:
RWX file storage for shared media libraries, large datasets, or anything that needs to be inspected from outside K8s. Provisioned via the `nfs_volume` Module.
_Avoid_: "shared storage" (ambiguous).
**nfs-truenas StorageClass**:
A historical SC name retained only because StorageClass strings are immutable on bound PVs. The underlying server is the **PVE host**, not TrueNAS; TrueNAS is decommissioned.
_Avoid_: assuming this means TrueNAS.
**3-2-1 backup**:
The named posture of where data lives: **Copy 1** = live on the PVE thin pool (sdc), **Copy 2** = sda backup disk (`/mnt/backup`), **Copy 3** = offsite Synology NAS. Per-PVC file-level rsync from LVM thin snapshots; databases additionally dump to NFS for per-DB restore.
_Avoid_: bare "backup" without saying which copy you mean (a service is "backed up" only once it's on Copy 2; Copy 3 is the disaster floor).
### Secrets
**Vault path**:
Convention: `secret/<service>` for Service-owned secrets, `secret/viktor` for personal/global, `secret/platform` for cluster-wide maps (`k8s_users`, `homepage_credentials`).
_Avoid_: conflating Vault path (e.g. `secret/viktor`) with Vault field (e.g. `forgejo_pull_token`).
**ExternalSecret** / **ESO**:
A K8s manifest that materialises a Vault KV value as a K8s Secret. Two ClusterSecretStores: `vault-kv` (KV engine) and `vault-database` (rotating DB creds).
**Plan-time secret**:
A secret value read in Terraform via `data "kubernetes_secret"` (i.e. via the ESO-created K8s Secret) at plan time, with no Vault provider call. Distinct from a **vault data source** read (`data "vault_kv_secret_v2"`), which still goes through the Vault provider. A few Stacks remain hybrid (plan-time for env vars, vault data source for module inputs).
**Sealed Secret**:
A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinct from ExternalSecret — Sealed Secrets carry their own bytes, ExternalSecrets reference Vault.
### CI/CD
**GHA build + Woodpecker deploy**:
The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
**Anubis**:
A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
## Relationships
- A **Service** is defined by exactly one **Stack**, which declares zero or more **Modules** and resolves to one or more K8s workloads.
- A **Namespace-owner** owns one or more namespaces and one or more public subdomains.
- A **Service** owns its **Vault path** at `secret/<service>`, surfaces values through **ExternalSecrets**, and reads them at plan time via **plan-time secrets**.
- An **Ingress** picks exactly one **Ingress auth tier**; the choice defines how strangers reach the backend.
- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
## Example dialogue
> **Dev:** "I'm adding a new **Service** — FastAPI backend with its own JWT login. Do I need Authentik?"
> **Domain expert:** "If the FastAPI login is the gate, set `auth = "app"` on the ingress. That records the intent that you _chose_ not to layer Authentik — leave a one-line comment above stating what gates the Service, or `scripts/tg` will refuse the apply."
> **Dev:** "And storage?"
> **Domain expert:** "Does it hold user data? If yes, `proxmox-lvm-encrypted` — that's the default for anything sensitive. Add a backup CronJob writing to `/mnt/main/<service>-backup/`. If the data is just caches, plain `proxmox-lvm` is fine."
> **Dev:** "What about a Secret with the JWT signing key?"
> **Domain expert:** "Put the key in `secret/<service>` in Vault, then declare an **ExternalSecret** to materialise it as a K8s Secret. Read it at plan time with `data "kubernetes_secret"` — that keeps Vault out of the plan path."
## Flagged ambiguities
- **"tier"** is overloaded — *Namespace tier* (`0-core`..`4-aux`, scheduling priority) is distinct from *State tier* (Tier 0 / Tier 1, Terraform backend partition). Always qualify which axis.
- **"node"** can mean a K8s Node (default) or a PVE node. For Proxmox-level statements, say **PVE node** explicitly.
- **"service"** spans two distinct concepts: the deployed app (capitalised **Service**, this repo's domain noun) and the K8s `Service` object (in backticks or qualified "K8s Service"). Lowercase "service" in prose is fine when context disambiguates; flag it when it doesn't.
- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.

View file

@ -7,11 +7,9 @@ ARG SOPS_VERSION=3.9.4
ARG KUBECTL_VERSION=1.34.0
ARG VAULT_VERSION=1.18.1
# Install system packages (single layer).
# python3: required by scripts/check-ingress-auth-comments.py, invoked
# by scripts/tg before every plan/apply.
# Install system packages (single layer)
RUN apk add --no-cache \
bash curl git git-crypt jq openssh-client openssl python3 unzip \
bash curl git git-crypt jq openssh-client openssl unzip \
&& rm -rf /var/cache/apk/*
# Terraform

View file

@ -44,7 +44,7 @@ graph TB
| Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
| PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
| Traefik ForwardAuth | - | `ingress_factory` module | Middleware for protected ingresses |
| Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
| Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |
@ -52,16 +52,7 @@ graph TB
### Forward Authentication Flow
Services pick an auth tier via the `auth` enum on the `ingress_factory` module (default `"required"`, fail-closed):
| Tier | Effect | When to use |
|------|--------|-------------|
| `"required"` | Authentik forward-auth gates every request | Backend has no own user auth — Authentik is the only gate |
| `"app"` | No Authentik middleware; backend's own login is the gate | Backend handles its own user auth (NextAuth, Django, OAuth, bearer-token API) |
| `"public"` | Authentik anonymous binding via `public` outpost | Audit trail without gating; only works for top-level browser navigation |
| `"none"` | No Authentik middleware at all | Anubis-fronted content, webhooks, OAuth callbacks, native-client APIs (CalDAV, WebDAV, Git) |
When `auth = "required"`, an unauthenticated request flows:
Services configured with `protected = true` in the `ingress_factory` module automatically get Traefik ForwardAuth middleware configured. When an unauthenticated user accesses a protected service:
1. Request hits Traefik ingress
2. ForwardAuth middleware calls Authentik embedded outpost
@ -73,8 +64,6 @@ When `auth = "required"`, an unauthenticated request flows:
Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.
**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
### Social Login & Invitation Flow
All new users must use an invitation link to register. The invitation-enrollment flow:
@ -155,9 +144,8 @@ The public client flow:
| Path | Purpose |
|------|---------|
| `stacks/authentik/` | Authentik deployment (servers, workers, PgBouncer) |
| `modules/kubernetes/ingress_factory/` | Auth-tier enum + per-ingress middleware composition |
| `stacks/traefik/modules/traefik/middleware.tf` | ForwardAuth middleware definitions (required + public outposts) |
| `scripts/check-ingress-auth-comments.py` | Comment-convention guard wired into `scripts/tg` |
| `stacks/platform/modules/ingress_factory/` | Traefik ForwardAuth middleware config |
| `stacks/platform/modules/traefik/middleware.tf` | ForwardAuth middleware definition |
| `stacks/vault/auth.tf` | Vault OIDC and K8s auth methods |
### Vault Paths
@ -172,40 +160,17 @@ The public client flow:
- `stacks/platform/` - Traefik ingress with ForwardAuth
- `stacks/vault/` - Vault auth methods
### Ingress Protection Examples
### Ingress Protection Example
Authentik-gated admin UI (default):
```hcl
module "myapp_ingress" {
source = "../../modules/kubernetes/ingress_factory"
name = "myapp"
namespace = "myapp"
tls_secret_name = var.tls_secret_name
# auth = "required" is the default — Authentik forward-auth is the gate.
}
```
source = "./modules/ingress_factory"
Backend with its own user auth (no Authentik in the way):
```hcl
module "myapp_ingress" {
source = "../../modules/kubernetes/ingress_factory"
name = "myapp"
namespace = "myapp"
tls_secret_name = var.tls_secret_name
# auth = "app": myapp uses NextAuth + Google OAuth; mobile clients can't follow Authentik 302.
auth = "app"
}
```
name = "myapp"
host = "myapp.viktorbarzin.me"
protected = true # Enables ForwardAuth middleware
Intentionally public webhook receiver:
```hcl
module "myapp_ingress" {
source = "../../modules/kubernetes/ingress_factory"
name = "webhook"
namespace = "webhooks"
tls_secret_name = var.tls_secret_name
# auth = "none": upstream signs payloads with HMAC; no user identity expected.
auth = "none"
# ... other config
}
```

View file

@ -1,10 +1,4 @@
# Automated Upgrades
This doc covers three independent automation paths:
1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
2. **OS-level upgrades on K8s nodes**`unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.
# Automated Service Upgrades
## Overview
@ -211,145 +205,3 @@ The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
## K8s Node OS Upgrades
Independent of the service-upgrade pipeline above. Drives apt package updates + reboots on the 5 K8s VMs (master + 4 workers).
### Stack
- **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
- **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
- **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
- **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
### Source of truth
| Concern | Location |
|---|---|
| Package config (uu, holds, blacklist) | `modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`) |
| kured Helm release + sentinel-gate DS | `stacks/kured/main.tf` |
| Upgrade Gates alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
### Day-2 changes
Cloud-init only runs on first boot. Existing nodes are brought into compliance with a one-shot SSH push — see the runbook section "Restore / re-apply unattended-upgrades config to existing nodes" in `docs/runbooks/k8s-node-auto-upgrades.md`.
### Why this design
The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades kernel push that corrupted containerd's overlayfs snapshotter cluster-wide. The remediations:
- 24h soak (sentinel-gate Check 4) gives a full day of observation between consecutive node reboots — broken updates show up as Prometheus alerts before any other node restarts.
- Prometheus halt-on-alert turns ANY firing alert into a hard block — including the 6 Node Runtime Health alerts and the 10 Upgrade Gates alerts that explicitly model "the cluster is in a bad state."
- Package-Blacklist on runtime components prevents the exact failure mode (containerd/runc auto-bumps).
- `Automatic-Reboot=false` keeps reboot policy in kured (window, ordering, gating), not in apt.
### Operational reference
See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
## K8s Version Upgrades
Independent of the OS-upgrade and service-upgrade pipelines. Drives
kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
### Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns)
│ probe apt-cache madison kubeadm (master) → latest available patch
│ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
│ push k8s_upgrade_available metric to Pushgateway
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ spawns Job 0 = k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master
Job 2 — worker (pinned: k8s-node1) drains k8s-node4
Job 3 — worker (pinned: k8s-node1) drains k8s-node3
Job 4 — worker (pinned: k8s-node1) drains k8s-node2
Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration
Job 6 — postflight (no pinning)
```
Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
so `apply` reconciles to a single Job per run — re-running a failed Job
won't duplicate downstream Jobs.
### Self-preemption history (the reason for the Job-chain rewrite)
The v1 design ran the whole upgrade inside the `claude-agent-service`
Deployment (1 replica, no nodeSelector). On 2026-05-11 the agent's pod was
scheduled to k8s-node4. When the agent ran `kubectl drain k8s-node4` during
Stage 6, it evicted itself — the bash process died after the drain but
before the SSH-pipe to install kubeadm on node4. The cluster ended up
half-upgraded (master at v1.34.7, workers at v1.34.2). The rewrite to a
chain of `nodeSelector`-pinned Jobs eliminates this failure mode because
each Job's pod and its drain target are always different nodes.
### Components
- **Detection CronJob + ConfigMaps + RBAC**: `infra/stacks/k8s-version-upgrade/main.tf`.
- Image is the claude-agent-service image (kubectl + ssh-client + curl + jq + envsubst).
- One unified ServiceAccount `k8s-upgrade-job` serves both the detection CronJob and every chain Job.
- **Phase body**: `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`.
Dispatches on `$PHASE` (preflight | master | worker | postflight). Computes
`NEXT_PHASE` / `NEXT_TARGET_NODE` / `NEXT_RUN_ON` and spawns the next Job.
Includes a `predrain_unstick` helper that pre-deletes pods on the target
node whose PDB has `disruptionsAllowed=0` (otherwise drain loops forever on
single-replica deployments like Anubis instances).
- **Job template**: `infra/stacks/k8s-version-upgrade/job-template.yaml`.
envsubst-rendered at runtime. Mounts a `creds` Secret, a `scripts`
ConfigMap, and a `template` ConfigMap into each Job pod.
- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
`--role master|worker --release X.Y.Z`. Piped via SSH into each node by
upgrade-step.sh.
- **Three Upgrade Gates alerts**:
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing``k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled``k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
- `k8s_upgrade_started_timestamp` (set in preflight; used by `K8sUpgradeStalled`)
- `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
- `k8s_version_check_last_run_timestamp` (staleness watchdog)
### Source of truth
| Concern | Location |
|---|---|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `stacks/k8s-version-upgrade/main.tf` |
| Phase orchestration | `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `scripts/update_k8s.sh` |
| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Deprecated agent prompt (reference) | `.claude/agents/k8s-version-upgrade.deprecated.md` |
### Why this design
The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
- **PDB-blocked pods don't stall the chain**. `predrain_unstick` deletes PDB=0 pods on the target node directly (bypassing the eviction API), so the parent Deployment recreates them elsewhere. This was the workaround applied manually during the 2026-05-11 recovery for Anubis single-replica instances.
### Secrets
| Secret | Vault Path | Purpose |
|--------|-----------|---------|
| SSH private key | `secret/k8s-upgrade.ssh_key` | Jobs SSH `wizard@<node>` |
| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
The previous `api_bearer_token` entry is gone — the chain does not POST to `claude-agent-service`.
### Operational reference
See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection, killing a stuck Job, skipping a phase, rollback paths (master / worker / mid-flight abort), and SSH key rotation.

View file

@ -18,7 +18,7 @@ graph TB
subgraph Proxmox["Proxmox VE"]
direction TB
MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
NODE1["VM 201: k8s-node1<br/>16c / 48GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@ -62,7 +62,7 @@ graph TB
| Model | Dell PowerEdge R730 |
| CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
| Total Cores/Threads | 22 cores / 44 threads |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~160GB total (5 K8s VMs x 32GB) |
| GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
| Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
| Hypervisor | Proxmox VE |
@ -72,20 +72,12 @@ graph TB
| VM | VMID | vCPUs | RAM | Network | Role | Taints |
|----|------|-------|-----|---------|------|--------|
| k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
| k8s-node1 | 201 | 16 | 48GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
| k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
| k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
> **node1 RAM (2026-05-10)**: bumped from 32 → 48 GiB out-of-band via
> `qm set 201 --memory 49152` because VMID 201 is intentionally not
> managed by Terraform yet (telmate/proxmox provider bug with iSCSI
> PVCs — see `infra/stacks/infra/main.tf` line 442). Driver: GPU
> multi-tenancy (frigate + ytdlp + llama-swap + immich-ml) was
> hitting 94% memory-request saturation on the old size. Adopt this
> VM into TF (`module "k8s-node1"`) once we've migrated to bpg/proxmox.
**Total Cluster Resources**: 48 vCPUs, ~160GB RAM (5 nodes x 32GB)
### GPU Passthrough

View file

@ -1,118 +0,0 @@
# llama-cpp / llama-swap
## Overview
In-cluster, OpenAI-compatible vision-LLM endpoint. A single
`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models
served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on
demand by `llama-swap`. One Service, one `/v1` endpoint, model
selected by the request body `model` field.
Initial use case: vision-LLM benchmark on a curated Immich album,
choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
**Qwen3-VL-4B** for instagram-poster's candidate-scoring path.
Future consumers (Home Assistant, agentic tooling) can hit the same
endpoint via LiteLLM at the cluster gateway.
First benchmark run (2026-05-10): see
`infra/docs/benchmarks/2026-05-10-vision-llm.md`. Verdict: **qwen3vl-4b**
for the request path (3.55 s p50, 100% parse, decisive top-N
distribution). qwen3vl-8b for caption polish on top picks.
## Why llama.cpp + llama-swap (not Ollama)
Verified across 7+7 research/challenger subagents (2026-05-10):
- **Broader OpenAI-compat surface**`tool_choice`, `image_url`
remote URLs, native bearer auth via `--api-key`, `/reranking`,
Anthropic `/v1/messages` shim.
- **Native observability**`/metrics`, `/health` returns 503 during
model load (proper K8s startup-probe semantics), `/slots` per-slot
tracking. Ollama still has the `/metrics` issue
[#3144](https://github.com/ollama/ollama/issues/3144) open.
- **Stricter structured output** — native GBNF on `/completion`,
JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`.
- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports
Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official
`qwen3-vl` tag (community GGUFs broken — split-mmproj
[#14575](https://github.com/ollama/ollama/issues/14575)) and the
`openbmb/minicpm-v4.5` Ollama tag is 8 months stale.
Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and
ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in)
— the latter is mooted by fronting llama.cpp with **LiteLLM** at the
gateway.
## Components
| Component | Resource | Purpose |
|-----------|----------|---------|
| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses |
| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) |
| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080``llama-swap.llama-cpp.svc.cluster.local` |
| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi |
| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache |
## Storage
NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`).
Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark
run (<10%). The download Job warms the kernel page cache after pulling
GGUFs so first inference reads from warm cache.
If steady-state cold-load latency becomes a problem, **Path B**: carve
~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1,
mount on-host, expose via a static `kubernetes_persistent_volume` with
`local` source + node1 affinity. NVMe-class load times. Out of scope
for the initial deployment.
## GPU allocation
The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4
allocation). The shared T4 is also used by Immich's ML pod
(`immich.immich-machine-learning`); only one of the two can hold the
GPU at a time. Operator must scale immich-ml to 0 before running a
benchmark and restore it after:
```bash
kubectl scale -n immich deploy/immich-machine-learning --replicas=0
# ... benchmark ...
kubectl scale -n immich deploy/immich-machine-learning --replicas=1
```
## Models served
| ID | HF repo | Quant | Ctx | mmproj |
|----|---------|-------|-----|--------|
| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
llama.cpp build pinned via the `llama-swap:cuda` image (ships a
recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
mtmd Flash-Attention regression fix
[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)).
## Endpoints
- `GET /v1/models` — list configured models
- `POST /v1/chat/completions` — standard OpenAI chat (vision via
`image_url` content parts, base64 or remote URL)
- `POST /completion` — llama.cpp native completion (preferred for
GBNF-constrained structured output to avoid 2026 regression magnet
on `/v1/chat/completions`)
- `GET /metrics` — Prometheus
- `GET /health` — 200 once a model is fully loaded; 503 during load
## Known issues / decisions
- **Cluster-wide GPU contention** — only one of llama-swap or
immich-ml can hold the T4. No GPU sharing solution wired in
(MPS/MIG would help but T4 has no MIG and MPS is overkill for two
workloads).
- **Filename-agnostic config** — the download Job creates stable
`model.gguf` / `mmproj.gguf` symlinks per model dir so the
llama-swap config doesn't need to track exact HF filenames (which
change between releases).
- **TF schema**`llama-cpp` (PG backend on dbaas).

View file

@ -57,7 +57,7 @@ graph TB
|-----------|---------|----------|---------|
| Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
| Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
| Loki | **DEPLOYED 2026-05-18** (SingleBinary mode, 30d retention, 50Gi PVC on `proxmox-lvm`, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
| Loki | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
| Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
| Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
| External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
@ -176,35 +176,6 @@ The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10
Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
| K2 | kube-audit | SA token used from outside cluster | critical |
| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
| K6 | kube-audit | Audit policy itself modified | critical |
| K7 | kube-audit | New `*,*` ClusterRole created | warning |
| K8 | kube-audit | Anonymous binding granted | critical |
| K9 | kube-audit | `me@viktorbarzin.me` request from non-allowlist sourceIP | critical |
| V1 | vault-audit | Root token created | critical |
| V2 | vault-audit | Audit device disabled/modified | critical |
| V3 | vault-audit | Seal status changed | critical |
| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
| V5 | vault-audit | Auth failure spike >10/min | warning |
| V6 | vault-audit | Token with policies different from parent created | critical |
| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires `x_forwarded_for_authorized_addrs`) | critical |
| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |
K1 (cluster-admin grant) intentionally skipped — see security.md.
Allowlist source-IP CIDRs (used by K2, K9, V7, S1): `10.0.20.0/22`, `192.168.1.0/24`, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-policy tuning. Retention: 90d for security streams.
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup

View file

@ -111,20 +111,16 @@ Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-g
This prevents resource exhaustion and enforces governance without manual quota management.
#### Security Policies
#### Security Policies (ALL in Audit Mode)
**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
**Why audit mode?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
| Policy | Purpose | Current | Planned (wave 1) |
|--------|---------|---------|------------------|
| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.
| Policy | Purpose | Enforcement |
|--------|---------|-------------|
| `deny-privileged-containers` | Block privileged pods | Audit |
| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit |
| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit |
| `require-trusted-registries` | Only allow approved image registries | Audit |
#### Operational Policies
@ -167,112 +163,6 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
### Audit Logging & Anomaly Detection (Wave 1)
Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
| Item | State |
|---|---|
| W1.2 Vault `file` audit device | **LIVE**`vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
| W1.2 Vault audit log shipping to Loki | **LIVE**`audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
| Source | Mechanism | Ships via | Loki job label |
|---|---|---|---|
| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
| # | Event | Severity |
|---|---|---|
| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
**Vault audit (V1-V7):**
| # | Event | Severity |
|---|---|---|
| V1 | Root token created | critical |
| V2 | Audit device disabled or modified | critical |
| V3 | Seal status changed (`sys/seal` write) | critical |
| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
| V5 | Authentication failure spike >10/min on any auth method | warning |
| V6 | Token created with policies different from parent (privilege escalation) | critical |
| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
**Host (S1):**
| # | Event | Severity |
|---|---|---|
| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
#### Allowlist — "expected source IPs" for K2, K9, V7, S1
| CIDR | Source |
|---|---|
| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
| K8s service CIDR | Service-to-apiserver traffic |
| Headscale tailnet | VPN-connected devices |
**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
#### Why no canary tokens
Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
#### Why no K1 (cluster-admin grant detection)
Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
#### IOPS / disk-wear
Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
**Approach (γ): cluster-wide observe-then-enforce.**
1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
**Known risks:**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
### TLS & HTTP/3
**Traefik** handles TLS termination:

View file

@ -1,253 +0,0 @@
# Vision-LLM benchmark — Malaga / Seville album
**Run ID:** `2026-05-10-1424` · **Date:** 2026-05-10 · **Operator:** wizard
100 photos randomly sampled (seed=42) from the Immich album `🇪🇸 Malaga
Seville` (`46565b85-7580-4ac1-91a6-1ece2cf8634d`, 1556 image assets +
9 videos), scored by three local vision-LLMs served by `llama-swap`
on a single Tesla T4. Goal: pick a model to wire into
`instagram-poster`'s `/candidates` ranking path.
## TL;DR
**Recommendation: `qwen3vl-4b`.**
- **Fastest** by a wide margin (3.55 s p50, 60% of qwen3vl-8b),
important once this is in the request path of `/candidates`.
- **100% structured-output success** — same as the other two; GBNF
grammar enforcement worked across the board.
- **Captions are competitive** with the 8B model in qualitative review
(tied or close on 8/10 sampled photos; 8B wins on Flair, 4B wins on
Latency).
- **Most decisive scorer** — 47/100 photos got IG-fit=9 vs 17 for
qwen3vl-8b and 9 for minicpm. We get more signal at the top end
for ranking.
Use qwen3vl-8b for *manual* caption refinement (top-1 of the day) if
caption polish matters. Use minicpm-v-4-5 for nothing immediate — it's
the most conservative scorer and the slowest at high quantiles, with
no offsetting wins in this dataset.
## Setup
- Hardware: 1× Tesla T4 (16 GiB VRAM), `nvidia.com/gpu` time-slicing
enabled (replicas=100), pod scheduled on `k8s-node1`.
- Server: `mostlygeek/llama-swap:cuda` (ships llama.cpp `b9085-046e28443`)
on `llama-swap.llama-cpp.svc.cluster.local:8080`.
- Models: GGUF Q4_K_M, mmproj F16 except qwen3vl-4b which used the
Q8_0 mmproj (alphabetically first matching the glob).
- Image prep: EXIF-transposed, long-edge resized to 1024 px, JPEG q=90,
base64-embedded as `image_url` data URLs.
- Generation: `temperature=0`, `top_k=1`, `enable_thinking=false`,
GBNF grammar pinning the JSON schema (6 fields, 110 ints, ≤8 tags).
- Run isolation: `immich-machine-learning` scaled to 0 for the
duration to avoid noisy GPU contention. *(Diagnostic note: the
scheduling failure that triggered this was actually node1 RAM —
not GPU — at 94% allocated. Time-slicing was already on. Bumping
node1 RAM is tracked as a follow-up.)*
## Headline numbers
| model | n | parse_ok | p50 latency | p95 latency | median IG-fit | median aesthetic |
|-------|---|----------|-------------|-------------|---------------|------------------|
| **qwen3vl-4b** | 100 | 100% | **3.55 s** | 4.06 s | 8.0 | 8.0 |
| minicpm-v-4-5 | 100 | 100% | 5.62 s | 6.00 s | 7.0 | 8.0 |
| qwen3vl-8b | 100 | 100% | 5.98 s | 6.64 s | 7.0 | 8.0 |
Total wall time for the run: **33 m 32 s** (300 calls + 3 cold loads
of ~30 s each).
## What each model is good at
### qwen3vl-4b — fast and decisive
- p50 3.55 s — comfortable for adding to `/candidates` request path.
- IG-fit distribution skews right (47 nines), spreading 6 → 9 fairly
evenly, which is what you want from a *ranker*.
- Captions are emoji-friendly, hashtag-friendly, sometimes
hallucinatory (e.g. labelled a Seville street as "Barcelona's
colourful streets" once).
- Failure mode to watch: occasional double-down on the same caption
template ("Lost in the tiles. 🌿" repeated across two unrelated
blue-dress photos).
### minicpm-v-4-5 — conservative, terse
- Most conservative scorer: 65% of photos got IG-fit=7. Only 9 nines.
Less useful as a top-N ranker because the top is squashed.
- Fastest p95 of the three (6.0 s) but slower p50 than qwen3vl-4b.
- Captions are short and lower-case ("azulejo dreams.",
"sunshine & secrets") — distinct voice but less Instagram-native.
### qwen3vl-8b — most polished captions
- Best subject identification (specifically named "Metropol Parasol"
and "Plaza de España" by name where the others said "modern
architecture" / "plaza").
- Captions read well: "Coffee & calm vibes ☕️", "where modern meets
historic under a brilliant sky".
- Slowest p50 (5.98 s) and tightest score distribution (median 7,
17 nines) — middle of the pack as a ranker.
## Top-10 agreement (Kendall-tau-style overlap)
How many of each model's top-10 IG-fit picks appear in another
model's top-10:
| pair | overlap |
|------|---------|
| qwen3vl-4b ↔ qwen3vl-8b | 5/10 |
| minicpm-v-4-5 ↔ qwen3vl-4b | 4/10 |
| minicpm-v-4-5 ↔ qwen3vl-8b | 4/10 |
Read: there's moderate but not strong agreement. The models pick
roughly half the same "best" photos and half different ones. For
ranking, that's a healthy sign — they're not collapsing to a single
notion of "good", so combining their scores would add real signal.
## Cost-equivalent context
Approximate cost to score the same 100 photos via cloud APIs
(prompt ≈ 1100 tokens incl. image, completion ≈ 100 tokens):
| backend | input | output | per-100 photos |
|---------|-------|--------|----------------|
| Local llama-swap on T4 | — | — | ≈ $0.04 (electricity, ~70 W × 7 min) |
| Anthropic Haiku 4.5 | $1.00/M | $5.00/M | ≈ $0.15 |
| Anthropic Sonnet 4.6 | $3.00/M | $15.00/M | ≈ $0.45 |
| Google Gemini 2.5 Flash | $0.30/M | $2.50/M | ≈ $0.05 |
Local is competitive with Gemini Flash on marginal cost. The case
for keeping it local is privacy (Immich originals never leave the
LAN), no rate-limits, and no per-call quota planning. The case
against is the GPU is finite — adding this to a request path means
sharing T4 time with frigate, ytdlp, and (when we restore it)
immich-ml. With time-slicing on, that's tractable but each tenant
gets fewer full-GPU kernels.
## Sample captions (10 representative photos, all 3 models)
`9f9c9faf-...` — man at ornate arch in Seville
- **minicpm-v-4-5** (7/8): sunshine & secrets
- **qwen3vl-4b** (9/8): Found the perfect spot to pause. 🌞 #travel #architecture
- **qwen3vl-8b** (9/8): Standing in Seville's magic, framed by history and color.
`0d53158b-...` — Metropol Parasol
- **minicpm-v-4-5** (9/8): Seville's skyward statement.
- **qwen3vl-4b** (9/8): Seville's iconic modern structure against a vibrant blue sky.
- **qwen3vl-8b** (9/8): The Metropol Parasol in Seville, where modern meets historic under a brilliant sky.
`23137c1c-...` — coffee shot
- **minicpm-v-4-5** (8/7): a morning pick-me-up and a smile.
- **qwen3vl-4b** (8/7): Coffee & chill. Perfectly imperfect moment.
- **qwen3vl-8b** (9/8): Coffee & calm vibes ☕️
`3376ee87-...` — woman against tiled wall
- **minicpm-v-4-5** (7/8): azulejo dreams.
- **qwen3vl-4b** (9/8): Lost in the tiles. 🌿
- **qwen3vl-8b** (9/8): Blue tiles, golden light, and a moment of quiet grace.
`4f2fea45-...` — courtyard
- **minicpm-v-4-5** (7/8): hidden gems of seville
- **qwen3vl-4b** (7/8): Timeless beauty in a Spanish courtyard. 🌿
- **qwen3vl-8b** (7/8): A serene courtyard in Seville, where palm trees sway under the sun.
`ea713729-...` — flower-market street (qwen3vl-4b confused location)
- **minicpm-v-4-5** (7/8): Seville's hidden gems.
- **qwen3vl-4b** (7/8): Walking through *Barcelona's* colorful streets, backlit by golden hour.
- **qwen3vl-8b** (7/8): Walking through Seville's vibrant streets, lavender in hand.
The full list of 10 sample sets is in the auto-generated section
below; the raw 300-row JSON is at `benchmark-2026-05-10-1424.json`
in this directory.
## Operational cost during the run
- llama-swap pod (1× T4 wholly allocated for the duration): ~33 min.
- Immich-ML downtime: ~33 min. New uploads weren't auto-tagged or
CLIP-embedded during this window. No user-visible impact (Immich
search against already-indexed assets still worked via pgvector).
- Network egress: zero — Immich originals stayed on the LAN, all
scoring traffic was in-cluster.
## Reproducibility
```bash
DATA_DIR=/tmp/benchmark \
IMMICH_API_KEY=… \
LLAMA_SWAP_URL=http://localhost:18080 \
poetry run python -m instagram_poster.benchmark run \
--album-id 46565b85-7580-4ac1-91a6-1ece2cf8634d \
--models qwen3vl-8b,minicpm-v-4-5,qwen3vl-4b \
--limit 100 --random-seed 42 --run-id 2026-05-10-1424
```
The same `--random-seed` reproduces the photo sample exactly. Prompt
version `4bbb7e7721da24d9` is the SHA-256 of the system prompt + user
prompt + GBNF grammar; rerunning under the same prompt version against
the same seed should produce within-noise identical scores (the models
themselves are temperature=0, top_k=1).
## Next steps
- **Wire `qwen3vl-4b` into `instagram-poster`** as an additional ranking
signal alongside CLIP-based recency in `/candidates`. Cache the score
per asset_id so we don't re-pay 4 s on every list refresh.
- **Bump k8s-node1 RAM** so immich-ml + llama-swap can co-exist (drain
→ resize → uncordon, with kubelet `systemReserved` adjusted in
`stacks/infra/main.tf`).
- **Re-benchmark with shared GPU** once node1 RAM is bumped, to get
realistic latency numbers when the T4 is also under load from
immich-ml and frigate.
- **Front llama-swap with LiteLLM** so Home Assistant and any other
consumer can hit one OpenAI-compat gateway. Track separately.
---
## Auto-generated report
Below is the unedited output of `python -m instagram_poster.benchmark
report --run-id 2026-05-10-1424`, kept for diff-checking against
future runs.
### Per-model summary
| model | n | parse_ok % | error % | p50 latency | p95 latency | median IG-fit | median aesthetic |
|-------|---|-----------|--------|------------|-------------|--------------|------------------|
| minicpm-v-4-5 | 100 | 100.0 | 0.0 | 5617 ms | 5998 ms | 7.0 | 8.0 |
| qwen3vl-4b | 100 | 100.0 | 0.0 | 3552 ms | 4063 ms | 8.0 | 8.0 |
| qwen3vl-8b | 100 | 100.0 | 0.0 | 5981 ms | 6637 ms | 7.0 | 8.0 |
### Score histograms (instagram_fit_score 110)
#### minicpm-v-4-5
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: ███████ (7)
7: █████████████████████████████████████████████████████████████████ (65)
8: ███████████████████ (19)
9: █████████ (9)
10: (0)
```
#### qwen3vl-4b
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: █████ (5)
7: ████████████████ (16)
8: ████████████████████████████████ (32)
9: ███████████████████████████████████████████████ (47)
10: (0)
```
#### qwen3vl-8b
```
1: (0) 2: (0) 3: (0) 4: (0) 5: (0)
6: ███████████ (11)
7: ███████████████████████████████████████████████████████ (55)
8: █████████████████ (17)
9: █████████████████ (17)
10: (0)
```
### Top-10 by IG-fit per model — see `benchmark-2026-05-10-1424.json`
(Tables omitted from the curated report; available in the JSON dump
alongside this file.)

File diff suppressed because it is too large Load diff

View file

@ -1,72 +0,0 @@
# Known Issues
Catalog of recurring or upstream-blocked failure modes with their
mitigations. Anything that requires a manual workaround should be
documented here — if a future session can hit the same issue, it
deserves an entry. Each entry should have: symptom, root cause, current
mitigation, and the trigger that lets us un-mitigate.
---
## 2026-05-17 — NVIDIA GPU driver fails on Ubuntu 26.04 (kernel 7.0.x)
**Symptom.** `nvidia-driver-daemonset-*` in `nvidia` namespace
CrashLoopBackOff on the GPU node. Logs say:
Could not resolve Linux kernel version
… or, post chart-upgrade, ImagePullBackOff on a `*-ubuntu26.04` tag.
**Root cause.** NVIDIA has not published any `nvcr.io/nvidia/driver:*-ubuntu26.04`
images (0 tags as of 2026-05-17; verified with skopeo). When a k8s node
running the GPU operator gets `do-release-upgrade`'d to Ubuntu 26.04
Resolute Raccoon, NFD relabels the node with
`feature.node.kubernetes.io/system-os_release.VERSION_ID=26.04` and the
operator computes the driver image tag `<version>-ubuntu26.04` — which
404s on pull. Both gpu-operator chart v25.10.1 and v26.3.1 exhibit the
same behaviour once NFD has detected 26.04.
**Current mitigation (active on k8s-node1 since 2026-05-17).**
1. Host kernel rolled back to `6.8.0-117-generic` (Ubuntu 24.04 HWE
kernel — still installed at `/lib/modules/6.8.0-117-generic`).
2. `apt-mark hold` on: `linux-image-6.8.0-117-generic`,
`linux-headers-6.8.0-117-generic`, `linux-modules-6.8.0-117-generic`,
`linux-image-generic`, `linux-headers-generic`, `linux-generic`.
3. `/etc/os-release` on k8s-node1 replaced with the Ubuntu 24.04 Noble
content (was a symlink to `/usr/lib/os-release`; now a regular file
under `/etc`). Backup at `/etc/os-release.bak-pre-spoof-2026-05-17`.
NFD-worker reads `/etc/os-release` and now reports
`system-os_release.VERSION_ID=24.04`, so the operator picks the
matching ubuntu24.04 driver image which DOES exist.
4. gpu-operator chart pinned to v25.10.1 in
`stacks/nvidia/modules/nvidia/main.tf`; driver pinned to 570.195.03
in `stacks/nvidia/modules/nvidia/values.yaml`.
**This is gross but stable.** The kernel matches what 24.04 ships, and
the `apt-mark hold` keeps it that way. /etc/os-release lying about the
OS only affects userland callers that key off it — none of our
deployed services do (we verified by grepping the cluster).
**Trigger to un-mitigate.** Periodically check for ubuntu26.04 driver
tags. Once they appear:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print(len([t for t in d['Tags'] if 'ubuntu26.04' in t]))"
When that returns a non-zero count:
1. Restore `/etc/os-release` from backup
(`/etc/os-release.bak-pre-spoof-2026-05-17`) on k8s-node1.
2. Remove apt-mark holds for the kernel packages.
3. `apt full-upgrade` to land the latest 26.04 kernel + reboot.
4. Bump the gpu-operator chart pin to the matching version that ships
ubuntu26.04 driver images. Bump `driver.version` in values.yaml to
the current chart default.
**See also.** `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
for full incident timeline + the recovery procedure.
**Beads.** `code-8vr0` (P1, OPEN).

View file

@ -1,265 +0,0 @@
# Infra Audit — 2026-04-20
**Status**: Design (post-research, post-challenge)
**Author**: Viktor Barzin (audit run by Claude)
**Scope**: `infra/` Terragrunt stacks + platform services (`claude-agent-service`, `claude-memory-mcp`, `beadboard`, `broker-sync`)
**Goals**: Reliability · Declarative-first · Reduced maintenance overhead · Maintained scalability
**Method**: 5 parallel research agents (R1 Reliability, R2 Declarative, R3 Maintenance, R4 Scalability, R5 Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog below.
## Context
The home-lab has grown into a mature stack (105 Tier-1 Terragrunt stacks + 6 Tier-0 SOPS, CNPG, Vault+ESO, Kyverno, Traefik, Authentik, CrowdSec, Woodpecker CI, Redis-Sentinel, MySQL-standalone, Proxmox-NFS). Recent work has been consolidation: MySQL InnoDB-Cluster → standalone (2026-04-16), Redis Phase 7 refactor (2026-04-19), NFS fsid=0 SEV1 post-mortem (2026-04-14), Authentik outpost /dev/shm fix (2026-04-18). This audit surveys everywhere that remains — what's brittle, what's manual, what's dark, what hasn't caught up to recent decisions — and ranks fixes by impact and by operator fatigue.
## Corrections up-front (challenger round)
Before reading the backlog, these findings from the research phase are **dropped, corrected, or reframed** — challengers spot-checked live state and proved them wrong, already-solved, or intentional-by-design. Being honest about this is the point of the challenge round:
| Finding as stated | Actual state | Action |
|---|---|---|
| R4#1: Worker nodes 86-91% memory saturation | Live `kubectl top nodes`: 44-51% across k8s-node{1-4} | **DROPPED** — bad metric pull |
| R4#2: Frigate CPU unbounded (1.5 CPU request, no limit) | Cluster policy is **all CPU limits removed** to avoid CFS throttling (`infra/.claude/CLAUDE.md` → Resource Management) | **DROPPED** — by design |
| R4#7: Redis no `maxmemory-policy` | `infra/stacks/redis/modules/redis/main.tf:254` sets `maxmemory-policy allkeys-lru` (Phase 7, 2026-04-19) | **DROPPED** — already solved |
| R2#1: 307 Kyverno lifecycle markers is a drift risk | Markers are the **canonical discoverability tag**`ignore_changes` only accepts static attribute paths, snippet convention is the only viable path; reframe as *"markers are fine, missing markers are the risk"* | **REFRAMED** |
| R2#3: 140 `ignore_changes` blocks | Actual: **310** across `.tf` files (2.2× off) | **CORRECTED** |
| R3#10: 65 CronJobs | Actual: 59 (10% off) | **CORRECTED** |
| R1#1: 47 deployments missing probes | Actual: **115 missing at least one probe; 103 missing both** | **CORRECTED (much worse than reported)** |
| R1#9: MySQL standalone no HA/PDB | Intentional post-2026-04-16 migration from InnoDB Cluster. Backup + restore matter; HA is explicit deferred. | **REFRAMED** — split into HA (deferred) / backup-restore (open) / connection pool (open) |
| R1#10: PDB gaps include Traefik, Authentik | Traefik & Authentik PDBs `minAvailable=2` exist (CLAUDE.md). The real gaps are **CrowdSec LAPI, Calico-apiserver, ESO webhook, Woodpecker-server** | **CORRECTED (list pruned)** |
| R5#2: 4 Kyverno security policies in Audit | **All 16 ClusterPolicies are in Audit** — zero in Enforce. | **CORRECTED (worse)** |
---
## Executive summary — top 5 cross-cutting themes
These are the themes that survive the challenge round and hit ≥2 concerns. Each headline is a 1-line hook; deep-dives below.
1. **Declarative escape hatches (NFS exports, master-node file provisioners, null_resource initializers)**`/etc/exports` is not in Terraform, which is the **root cause of the 2026-04-14 SEV1**; 6 null_resources + 3 SSH file provisioners still orchestrate critical state. *Hits R2 + R1 + R3.*
2. **Observability has blind spots where pain would actually come from** — no OOMKill alert routing, no NFS capacity monitor, no GPU utilization dashboard, no ESO refresh-lag alert, no CronJob success-rate summary. Alerts exist but they don't cover the operator's real failure modes. *Hits R1 + R3 + R4.*
3. **Supply-chain hygiene: image pinning + Renovate + admission signing** — 84 `:latest` tags in production TF, zero Renovate/Dependabot across 18 repos (~15 hr/mo toil by estimate), no cosign/trivy on push. Single theme unifies security posture, maintenance toil, and determinism. *Hits R3 + R5.*
4. **Reliability-probes & graceful shutdown are genuinely uneven** — 115 deployments missing at least one probe (incl. 103 missing both), 50+ Recreate deployments with no `terminationGracePeriodSeconds`/`preStop`. This is the quietly-largest reliability debt. *Hits R1 + R3 (pager toil).*
5. **Backup coverage is uneven: 30+ PVCs lack app-level CronJobs** — Proxmox host snapshots cover the disk, but Forgejo (!), Affine, Paperless, Hackmd, Matrix, Owntracks have no app-aware dumps. Restore granularity is file-level, not entity-level. *Hits R1 + R5 (compliance) + R3 (restore rehearsal toil).*
Honourable mentions that didn't make top 5 but sit just below: Kyverno audit→enforce transition (security), ESO refresh-lag alert (secrets reliability), Vault hardening (audit log offsite, root-token K8s-secret scope), Cloudflared tunnel-token SPOF (not replica SPOF — those are 3), Dolt PVC sizing + backup.
---
## Scoring method
Two parallel rankings — scan both.
**Rank A — Impact × Reversibility (the original formula)**
`score = Impact × (6 - Effort) × (6 - Risk)` — each dimension 1-5.
**Rank B — Operator fatigue weight**
`score = Impact × (6 - Effort) × FatigueWeight` where `FatigueWeight = 3` if the finding introduces *daily/weekly manual toil* and `1` otherwise. This re-ranks by how much pain the unfixed state causes per month.
Both rankings below. When they agree, that's the clear signal. When they diverge, that's where Rank B (fatigue) wins — Viktor has stated operator fatigue dominates abstract risk for a solo-operator lab.
---
## Ranked backlog (filtered, deduplicated, corrected)
Counts below reflect **post-challenge corrected numbers**. Every row has a reference verified either by a spot-check (file:line) or a live cluster command.
| ID | Title | Concerns | Impact | Effort | Risk | Rank A | Rank B | Refs |
|---|---|---|---:|---:|---:|---:|---:|---|
| F01 | NFS `/etc/exports` not in Terraform (SEV1 root cause) | R2+R1 | 5 | 3 | 2 | **60** | **45** | `infra/scripts/pve-nfs-exports`, PM 2026-04-14 |
| F02 | 115 deployments missing probes (103 missing both) | R1+R3 | 5 | 3 | 2 | **60** | **45** | `kubectl get deploy -A -o json` |
| F03 | Zero Renovate/Dependabot across 18 repos | R3+R5 | 4 | 2 | 1 | **80** | **48** | `find /home/wizard/code -name ".renovaterc*"` → 0 results |
| F04 | 84 `:latest` image tags in production TF | R3+R5+R4 | 4 | 2 | 2 | **64** | **48** | `grep -rn ':latest' infra/stacks` |
| F05 | No OOMKill / unschedulable / node-CPU alert | R1+R4+R3 | 5 | 3 | 1 | **75** | **45** | Grep Prometheus rules — no `OOMKilling` rule present |
| F06 | 6 `null_resource` DB initializers in `dbaas` stack | R2 | 4 | 3 | 3 | **36** | **36** | `grep -n null_resource infra/stacks/dbaas` |
| F07 | 3 SSH+file provisioners on k8s-master (audit, OIDC, etcd) | R2 | 4 | 3 | 3 | **36** | **36** | `stacks/platform/modules/rbac/apiserver-oidc.tf` |
| F08 | ESO refresh-lag alert missing (52 ExternalSecrets) | R1+R5+R3 | 4 | 2 | 1 | **80** | **48** | `stacks/external-secrets/` — no PrometheusRule for refresh lag |
| F09 | 30+ PVCs without app-level backup CronJobs | R1+R5 | 4 | 3 | 2 | **48** | **36** | Affine, Forgejo, Hackmd, Matrix, Owntracks, Paperless (no `*-backup` CJ) |
| F10 | Cloudflared tunnel-token SPOF (replicas OK, token shared) | R1+R5 | 3 | 4 | 2 | **24** | **8** | `stacks/cloudflared/` single tunnel credential |
| F11 | MySQL restore never rehearsed end-to-end | R1+R4+R3 | 4 | 2 | 2 | **64** | **48** | No `mysql-restore-drill` CJ; runbook untested post-migration |
| F12 | Kyverno policies all 16 in Audit — **sequence carefully** | R2+R5 | 4 | 3 | **4** | **24** | **24** | `kubectl get clusterpolicy` |
| F13 | 97 RollingUpdate deployments lack explicit surge bounds | R1 | 2 | 2 | 2 | **32** | **12** | TF defaults inherit from Helm/k8s (25%/25%) |
| F14 | CronJob success-rate dashboard + alert rollup missing | R3+R4 | 3 | 2 | 1 | **60** | **36** | `CronJobTooOld` rule — partial; no 24h rollup |
| F15 | Authentik outpost /dev/shm fix applied via Helm API only | R1+R5 | 3 | 2 | 2 | **48** | **48** | Not in TF — upgrade-reversion risk |
| F16 | Dolt (beads DB) no backup CronJob — 2Gi PVC near full | R1+R4 | 4 | 2 | 2 | **64** | **32** | `stacks/beads/` — no `dolt-backup` CJ |
| F17 | Vault StatefulSet `updateStrategy=OnDelete` (manual roll) | R1+R3 | 2 | 2 | 3 | **24** | **24** | `kubectl get sts -n vault -o yaml` |
| F18 | No NetworkPolicies cluster-wide | R4+R5 | 4 | **5** | **4** | **8** | **8** | `kubectl get netpol -A` → 0-2 |
| F19 | RBAC `oidc-power-user` has cluster-wide secrets r/w | R5 | 4 | 3 | 3 | **36** | **12** | `stacks/platform/modules/rbac/` |
| F20 | No image supply-chain verification (cosign, trivy on push) | R5 | 4 | 4 | 3 | **24** | **8** | No admission controller for signatures |
| F21 | Vault audit log offsite backup not configured | R5+R1 | 3 | 2 | 1 | **60** | **36** | `stacks/vault/` — no `audit-log-sync` CJ |
| F22 | Claude-agent, beadboard, broker-sync singletons | R1 | 2 | 2 | 2 | **32** | **12** | `kubectl get deploy -n claude-agent,beadboard,broker-sync` |
| F23 | 50+ Recreate deployments lack graceful-shutdown hooks | R1+R3 | 3 | 3 | 2 | **36** | **36** | `grep -L terminationGracePeriodSeconds stacks/**` |
| F24 | CoreDNS scaled via `kubectl scale` not TF | R2 | 3 | 2 | 2 | **48** | **32** | Command in runbook; no TF resource for replicas |
| F25 | GPU / inference-latency SLO unmonitored | R4+R5 | 3 | 3 | 2 | **36** | **36** | No dcgm dashboard; Frigate liveness checks only |
| F26 | Prometheus TSDB 200Gi — retention untracked | R4 | 2 | 2 | 1 | **40** | **20** | `stacks/monitoring/` |
| F27 | Pod Security Standards labels unset on all namespaces | R5 | 3 | 2 | 3 | **36** | **12** | `kubectl get ns -o json \| jq '.items[].metadata.labels'` |
| F28 | Authentik worker VPA upperBound 2.3× actual request | R4 | 2 | 2 | 2 | **32** | **20** | Goldilocks dashboard |
| F29 | 9 DB rotation targets, no post-rotation verification loop | R5+R3 | 3 | 2 | 2 | **48** | **36** | Vault DB engine every 7d; no auto-verify |
| F30 | Tier-0 SOPS workflow 7-step vs 3-step Tier-1 | R3 | 2 | 2 | 1 | **40** | **20** | `scripts/state-sync` — manual decrypt/encrypt/commit |
**Rank A leaders (top 8)**: F03, F08, F05, F11, F04, F16, F01, F02 — "big cluster wins, cheap to try"
**Rank B leaders (top 8)**: F03, F04, F08, F11, F15, F01, F02, F05 — "what's paining you weekly"
F03 (Renovate), F08 (ESO refresh alert), F11 (MySQL restore drill) and F01 (NFS in TF) lead in **both** rankings → these are the clear "do first" candidates.
---
## Per-concern deep dives
### R1 — Reliability (18 raw → 11 real after challenge)
Filtered: dropped R1#1/9/10 (incorrect numbers, intentional choices). What actually matters:
- **Probes (F02)** — 115 deployments missing at least one probe; 103 missing both. The corrected count is 2.4× the original claim. Worst offenders are batch workloads (CronJob-spawned) that legitimately skip probes — but long-lived ones (Affine, Hackmd, mailserver sidecars) genuinely need them. Triage: filter by `spec.replicas ≥ 1` and `containers[].command != ["/bin/sh","-c"]`-style short-runners, then add readiness+liveness one-by-one.
- **Cloudflared tunnel token SPOF (F10)** — Replicas are 3 (per CLAUDE.md), so the agent finding "SPOF" framed as replicas is wrong. The real SPOF is the *tunnel credential*. Secondary tunnel with weighted Cloudflare DNS records is the honest fix — medium effort, low urgency unless tunnel CA rolls keys.
- **PDB gaps (F13-like, excluded from table)** — After challenger correction, gaps are: CrowdSec LAPI (3 replicas, no PDB), ESO webhook+controller, Woodpecker-server. Not urgent — drain-test with `kubectl drain --dry-run` shows no current issue.
- **App-level backups (F09)** — Proxmox host captures the PVC contents nightly via LVM snapshot + rsync with `--link-dest` weekly versioning, so file-level recovery is covered. But for databases inside PVCs (e.g. Affine's Postgres in-pod, Paperless' SQLite), app-aware dumps give transactional consistency. Audit pass: enumerate every PVC without a sibling `*-backup` CronJob, add one for the ones that host embedded DBs.
- **MySQL restore drill (F11)** — Migrated 4 days ago. Runbook exists. End-to-end restore (dump → new DB → connect an app → verify) hasn't been rehearsed. SEV1 risk if a dump has been silently broken since migration.
- **Vault update strategy (F17)**`OnDelete` means helm upgrade leaves pods untouched; must manually `kubectl delete pod` to restart. Low impact (infrequent) but procedural toil.
- **Dolt PVC near-full + no backup (F16)**`bd list --status in_progress` runs against this DB; it's load-bearing for cross-session task state. Grow the PVC (resize annotation) + add dolt dump CronJob.
### R2 — Declarative Coverage & Drift (16 raw → 8 real)
Filtered: dropped R2#1 (Kyverno markers are by-design), corrected R2#3 to 310.
- **NFS exports (F01)** — The file is git-managed at `infra/scripts/pve-nfs-exports` but deployed via `scp + exportfs -ra`, not Terraform. This is the exact path that caused the 2026-04-14 SEV1 (fsid=0 on wrong exports line). Options: (a) `null_resource` with `local-exec scp + remote-exec exportfs -ra` triggered on hash of content (partial — SSH dep); (b) new module `pve_host_config` that templates and SCPs multiple PVE-host artifacts with checksum verification. (b) is the cleaner long-term fix.
- **Null-resource initializers (F06)** — 6 in `dbaas` (MySQL users, CNPG cluster, TF-state role, payslip DB, job-hunter DB). Some are genuinely unavoidable (bootstrapping DB before the DB exists); others could use `postgresql_grant` / `mysql_user` providers.
- **SSH file provisioners on k8s-master (F07)**`apiserver-oidc.tf`, `audit-policy.tf`, `etcd tuning`. One-way sync, no drift detection. Proposed quick wins (per `2026-02-22-node-drift-quick-wins-design.md` already exists). Continue/finish the plan.
- **CoreDNS scaling manual (F24)** — Current runbook uses `kubectl scale`/`set env`/`set affinity`. Drift-prone; convert to `kubernetes_deployment` TF resource overriding the Helm chart's scale/affinity fields.
- **MySQL InnoDB Cluster + operator TF resources still present** — Phase 4 cleanup. Low urgency, but removing reduces cognitive load on anyone reading `stacks/dbaas/`.
- **Technitium readiness-gate null_resource with `timestamp()` trigger** — Runs every apply, 3-6 min wall time. Replace with a real health-check on `terraform_data` with `triggers_replace = { checksum = sha256(config) }`.
- **GPU node taints + Proxmox CSI labels via null_resource kubectl** — No drift detection. Fix is in the `2026-02-22-node-drift-quick-wins-design.md` plan.
### R3 — Maintenance overhead (18 raw → 10 real)
- **Renovate (F03)** — The single highest-leverage maintenance fix. 18 repos × ~0.8 hrs/month manual version sweep = real time. Add `.github/renovate.json` (grouping rules for Terraform providers, K8s provider, Docker images) + auto-merge patch-level. Start with `infra/` only; expand after 2 weeks.
- **Image pinning (F04)** — 84 `:latest` tags in production TF. Root CLAUDE.md still says "use 8-char git SHA tags" but that's not enforced. Admission control via Kyverno `require-trusted-registries` is in Audit today — add a sibling policy `forbid-latest-tag` also in Audit. Separate from F03 because pin-to-SHA + Renovate is a synergistic pair.
- **MySQL restore drill (F11)** — tracked under R1 for impact; also a maintenance item because the restore *procedure* has not been test-updated since migration.
- **CronJob alert rollup (F14)** — 59 CronJobs; "which were healthy last 24h" takes ad-hoc `kubectl get jobs --sort-by` scrolling. Add a Grafana panel with `kube_cronjob_status_last_successful_time < now - 2×schedule` summary.
- **Graceful-shutdown toil (F23)** — 50+ Recreate deployments without `terminationGracePeriodSeconds` or `preStop`. Noisy pager hits after node drain. One-off sweep: add a 30s `terminationGracePeriodSeconds` default via Kyverno mutation rule.
- **Tier-0 SOPS workflow (F30)** — 7-step decrypt/edit/encrypt/commit vs Tier-1's 3-step. Combined `tg` wrapper flag `--edit <stack>` that auto-decrypts → EDITOR → auto-encrypts → commit in one command. Moderate win; low risk.
- **Stale `in_progress` beads** — 7 stale tasks in `bd list --status in_progress` at audit start. Session-end hook checks this; 3-5 days without notes is the signal. CLAUDE.md covers the rule — it's followed-sometimes, not enforced.
- **Runbook staleness** — no `last_reviewed` frontmatter on runbook MDs; trivial to add. One-off sweep then keep it honest.
- **CI/CD template unification** — "GHA build → Woodpecker deploy" is the documented pattern for 10 repos; rest still on Woodpecker-only. Track as follow-ups per repo in `bd`.
- **Kyverno DNS-config boilerplate 307 markers** — Not a problem (see correction at top). Do add a lint rule in CI that flags any `kubernetes_deployment` without `# KYVERNO_LIFECYCLE_V1` marker; that's the real drift risk.
### R4 — Scalability (18 raw → 9 real)
Filtered: dropped R4#1 (metric mispull), R4#2 (CPU-limit policy), R4#7 (Phase 7 solved).
- **CNPG memory headroom** — Currently 2Gi limit. Top-line metric at quiet time; add a `ContainerNearOOM > 85%` rule that watches CNPG specifically (general rule exists; CNPG is Tier 0 so deserves explicit binding).
- **HPA cluster-wide: zero** — Every stateless service is 1:1. Not urgent at current node-CPU 8-31%, but one big feature (Immich re-index, Authentik load spike) tips the balance. Pilot: HPA on Traefik (CPU-driven), observe, expand.
- **Redis no HPA + HAProxy singleton** — Wire Sentinel into direct client access (Phase 8 of Redis refactor, per R1#11 of raw findings). Currently all 17 consumers go via HAProxy — the single-point bypass was deliberate (simpler client config), but the HAProxy is now the SPOF Sentinel was meant to prevent. Worth a plan doc (`plans/2026-MM-DD-redis-phase8-sentinel-clients.md`).
- **PgBouncer pool sizing unknown** — Authentik has 3 pods, each opening N connections. At load spikes (big org sync), pool exhaustion. Short-term: `pgbouncer_show_pools` metric + alert at 80% util. Longer-term: pool-size tuning based on observed wait times.
- **Prometheus TSDB (F26)** — 200Gi retention unquantified. Risk: disk fills → scrape gaps → audit blind. Add `kubelet_volume_stats_used_bytes{persistentvolumeclaim="prometheus-server"} > 0.85 * capacity` alert.
- **NFS capacity not monitored** — PVE host has 1TB HDD LV. No `node_filesystem_avail_bytes` scrape from PVE host (it's outside the cluster). Install node_exporter on PVE host; scrape via Prometheus federation or remote_write.
- **VPA quarterly review unscheduled** — Goldilocks is in `Initial` mode (not Auto, by design). Review is manual per quarter. Calendar event + runbook link.
- **Registry single instance** — Registry outage = no pod restarts. Post-mortem 2026-04-19 documented a container-engine pin; replica count still 1. Consider HA registry backed by S3-compat store (MinIO in-cluster) for the second replica — but low urgency given probe CJ monitors integrity every 15m.
- **No ResourceQuota utilization alert** — Quota exhaustion invisible until a pod refuses to schedule. `kube_resourcequota{type="used"} / kube_resourcequota{type="hard"} > 0.85` rule.
### R5 — Security & Secrets (21 raw → 13 real)
- **Vault `vault-unseal-key` K8s Secret (F21-related)** — Challenger A said it wasn't present; it is (`kubectl get secret -n vault`). Used by auto-unseal. RBAC on the secret should restrict to `vault-server` SA only. Audit the `role` + `rolebinding` in `stacks/vault/`.
- **Vault audit log offsite (F21)** — Rotated logs not synced to NFS backup. Add a `vault-audit-log-sync` CronJob or append the audit log path to `nfs-change-tracker` inotify list (zero-Terraform change if the latter).
- **Kyverno audit → enforce (F12) — sequence carefully** — All 16 policies are in Audit today. Naive switch to Enforce will block legitimate workloads (Loki, Frigate, nvidia-device-plugin, wireguard have privileged/host-ns requirements — all documented). Plan: (a) generate `Kyverno PolicyException` CRs for known-good workloads first; (b) enforce one policy at a time, 1-week observation; (c) start with `require-trusted-registries` (least breakage risk). **DANGEROUS TO EXECUTE NAIVELY — don't batch.**
- **No NetworkPolicies (F18)** — Challenger correctly flagged the effort (5) and risk (4): wrong NetworkPolicy stops Authentik from reaching its DB in minutes. Approach: allow-list namespace-wide first (e.g. `authentik` ns can reach `dbaas` on 5432), expand over a month. Single biggest latent security improvement but needs runway.
- **RBAC oidc-power-user secrets r/w cluster-wide (F19)** — Scope down: list which Authentik groups get this binding, remove `secrets:*` from the cluster role, add namespace-scoped RoleBindings where needed. Medium effort, high leverage.
- **Image supply chain (F20)** — cosign verification + admission controller is the mature path. Trivy-on-push fits in GHA workflows. Both unblocked after F04 (pinning).
- **`:latest` tags (overlap F04)** — Security aspect: signed-image admission requires stable refs.
- **Privileged containers** — Loki, WireGuard, NVIDIA, Frigate known-exceptions. Document the exceptions inline (comment block on the TF resource) so future maintainers don't accidentally "fix" them.
- **Git history plaintext secrets** — Challenger B flagged unverified. One way to verify cheaply: `git secrets --scan-history`. Add it as a pre-audit one-off.
- **CrowdSec Metabase disabled, no Prometheus exporter** — R5#18. Enable the Prometheus exporter (no Metabase) for attack-pattern visibility; very cheap.
- **cert-manager evaluation paused** — Documented pause; TLS rotation relies on Cloudflare wildcard. Confirm no local `Ingress` uses a self-managed cert that could expire silently. `kubectl get cert -A` → expect 0.
- **Pod Security Standards (F27)** — Label every namespace `pod-security.kubernetes.io/enforce=restricted` (or baseline). Known-exception namespaces get explicit downgrades. Medium effort, paid back by making future admission decisions uniform.
- **CrowdSec LAPI quorum** — 3 replicas but quorum/consensus behavior undocumented. One-page runbook: what happens if 1, 2, or 3 LAPI pods die.
- **Authentik outpost fix (F15)** — Applied via API, not TF. Next Helm upgrade reverts. Add the `/dev/shm` emptyDir to `stacks/authentik/values.yaml` templatefile.
---
## Dangerous-to-execute (handle with care)
Flagged by challengers; each needs a gradual rollout plan, not a single commit.
1. **F12 — Kyverno Audit → Enforce en masse**. Write `PolicyException` CRs for known-safe workloads first. One policy per week. Observe.
2. **F18 — NetworkPolicies cluster-wide**. Default-deny breaks inter-namespace lookups silently. Namespace-by-namespace rollout, with `kubectl logs -f` tailing the policy-engine events.
3. **PDB additions without drain-test**. New PDB + tight `minAvailable` can deadlock during node cordons. `kubectl drain --dry-run` every new PDB on every node first.
4. **F20 — Signed-image admission**. Must follow F04 (pinning). Un-pinned admission = half the cluster fails to pull.
## Gaps the agents missed
From challenger "GAPS" analyses, collated:
- **Disaster-recovery drill coverage** — backup docs are comprehensive (CLAUDE.md is extensive). End-to-end *restore* rehearsal frequency = never documented. Track per-component: MySQL, PostgreSQL/CNPG, Vault, etcd, NFS, registry blobs.
- **Service mesh evaluation** — Never formally evaluated (Istio, Linkerd, Cilium-in-mesh-mode). Could subsume NetworkPolicy effort + mTLS + observability. Worth a design doc even if answer is "no, too much complexity for the gain."
- **Chaos engineering coverage** — Zero. No pod-kill cron, no node-failure drill. Low urgency given maturity, but would validate F02 probe quality and F23 graceful-shutdown coverage cheaply.
- **Operator onboarding friction** — Nobody else in the "lab team" but Emo exists in `claude-agent-service`. If Emo needs to take over a component for a week, what's the runbook?
- **Alert noise / fatigue rate** — No finding measured how many alerts actually page vs. auto-resolve. `alertmanager_notifications_total` by receiver is the metric; needs a Grafana panel.
- **Secrets-in-image-layers** — Docker images built locally may contain secrets from build env. `trivy image --scanners secret` on registry images is a one-off audit.
- **Runbook → post-mortem → runbook-update loop** — Post-mortem 2026-04-14 produced runbook updates; no general tracker that every incident produces a runbook change.
## Alternative framings (from challengers, preserved for future reference)
- **Split "MySQL singleton" into 3 items** (HA / backup / pool). Accepted — see R1 and R4 treatment.
- **6th concern: Observability & Pager Fatigue** — Considered; the themes already hit R1+R3+R4 under Theme 2 of the executive summary. Keeping 5 concerns but carving "Observability gaps" as a theme, not a new research axis.
- **One-thing-this-weekend**: Challenger B nominated *NFS in Terraform*, Challenger A nominated *`:latest` tag sweep*. F01 wins on SEV1 prevention; F04 wins on toil. Both valid. Pick by energy level: F01 is 1 deliberate session; F04 is low-cognition grep-replace.
- **Re-rank by operator fatigue (Rank B) always**. Partially accepted — presented side-by-side in the table.
---
## Recommended next moves
Ordered for a solo operator balancing SEV-prevention, fatigue reduction, and preserved energy for larger work:
**Week 1 (SEV-prevention + quick-wins, low cognitive load):**
- F01: NFS exports into a `pve_host_config` Terraform module (one deliberate session)
- F04: Sweep `:latest` tags, add Kyverno `forbid-latest-tag` in Audit
- F08: ESO refresh-lag PrometheusRule
- F05: OOMKill / Unschedulable / Node-CPU PrometheusRule
**Week 2 (fatigue reduction):**
- F03: Renovate in `infra/` only (narrow pilot)
- F14: CronJob success-rate Grafana panel + alert rollup
- F16: Dolt backup CronJob + PVC grow
- F11: First MySQL restore drill (scheduled, documented)
**Month 2 (durable fixes, gradual):**
- F06/F07: Replace null_resources + SSH provisioners with native TF resources, one at a time
- F02: Probe sweep — add readiness+liveness to the 20 long-lived deployments first
- F12: Kyverno Enforce transition, one policy per week
- F15: Authentik outpost /dev/shm into values.yaml
**Month 3+ (structural):**
- F18: NetworkPolicies — namespace-by-namespace
- F19: RBAC scope-down
- F20: Signed-image admission
- Service-mesh evaluation (design doc)
- Restore-drill calendar for every backup target
No beads tasks auto-filed by this audit — user decides which findings merit `bd create`.
---
## Appendix — verification references (spot-checked)
Every numeric claim in the backlog was confirmed by one of these commands at audit time (2026-04-20):
| Claim | Command | Result |
|---|---|---|
| Node memory 44-51% | `kubectl top nodes --no-headers` | k8s-node1: 45%, node2: 51%, node3: 49%, node4: 44%, master: 17% |
| 115 deploys missing ≥1 probe | `kubectl get deploy -A -o json \| jq '[.items[] \| select(.spec.template.spec.containers[0].readinessProbe == null or .spec.template.spec.containers[0].livenessProbe == null)] \| length'` | 115 |
| 103 deploys missing BOTH probes | same, with `and` | 103 |
| 310 ignore_changes blocks | `grep -r "ignore_changes" infra --include=*.tf --include=*.hcl \| wc -l` | 310 |
| 59 CronJobs | `kubectl get cronjobs -A --no-headers \| wc -l` | 59 |
| All 16 Kyverno ClusterPolicies in Audit | `kubectl get clusterpolicy -o jsonpath='...validationFailureAction...'` | 16/16 Audit, 0 Enforce |
| Redis `maxmemory-policy allkeys-lru` | `grep -n maxmemory-policy infra/stacks/redis` | `modules/redis/main.tf:254` |
| Zero Renovate configs | `find /home/wizard/code -name '.renovaterc*' -o -name 'renovate.json' \| grep -v node_modules` | 0 |
| Vault `vault-unseal-key` Secret exists | `kubectl get secret -n vault` | present (37d old) |
| NFS `/etc/exports` not in TF | `grep -rn 'fsid=' infra/stacks` | 0 matches; only `infra/scripts/pve-nfs-exports` |
| Frigate CPU limit by policy | `infra/.claude/CLAUDE.md` → "All CPU limits removed cluster-wide" | confirmed |
| MySQL standalone intentional | `infra/.claude/CLAUDE.md` → "migrated from InnoDB Cluster 2026-04-16" | confirmed |
Other claims (84 `:latest` tags, 52 ExternalSecrets, 30+ PVCs without backup CJs) were surfaced by research agents; challengers spot-checked a subset and agreed the order-of-magnitude holds. Full list in `/home/wizard/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` research digest.
## Deliverable disposition
- This document is the audit output.
- No `bd` tasks were created by the audit. Pick findings to ticket after reading.
- When filing: use `F##` as a tag, title with the finding's headline, acceptance criteria from the deep-dive paragraph, priority from Rank B.
- Plan file at `~/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` retains the full 91-finding digest + challenger reports for reference; can be deleted after any follow-up tickets are filed.

View file

@ -1,165 +0,0 @@
# Auto-Upgrade Apps Design
**Date**: 2026-05-16
**Status**: Approved (brainstorm + grill complete; implementation pending)
## Problem
Three constraints in tension across the cluster's ~70 services:
1. **Keep apps at latest.** Most services drift behind upstream; manual bumps don't scale.
2. **Stay Terraform-compatible.** Image refs live in `.tf`; we want declarative source of truth.
3. **Don't let the pull-through cache serve stale `:latest`.** Cache layer must not lie about what `:latest` means today.
The previous `Diun → n8n → Service Upgrade Agent` flow handled (1) via changelog-reviewed PR bumps for third-party. Self-hosted services have inconsistent CI: 1 of 11 fully wired (CI builds + pushes + rolls out), 6 partially wired (build but no rollout trigger), 4 with no CI at all. Self-hosted services typically pull `forgejo.viktorbarzin.me/viktor/<name>:<8-char-sha>` with Terraform tracking each SHA in `var.image_tag`.
The user wants to simplify by retiring the changelog-review agent and moving to a pure "latest, always" model, with the cache freshness concern handled at the cache layer (already done — see Architecture §1).
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Auto-roll for everything** (no PR-bump gate) | Retires the Service Upgrade Agent; Diun's role narrows to notification only |
| 2 | **Actuator: Keel** ([keel.sh](https://keel.sh)) | Annotation-driven Deployment/StatefulSet/DaemonSet auto-update operator |
| 3 | **Tag scheme: `:latest` where it exists, `:major` where it doesn't, glob+`ignore_changes` last resort** | `keel.sh/policy: force` for `:latest` / `:major`; tag string stays in Terraform |
| 4 | **Opt-out-pure (no skip-list)** | Every workload auto-rolls, including Vault, CNPG, operators, CNI, CSI. User accepts recoverability risk |
| 5 | **Phased rollout (9 phases)** | Low-risk → bootstrap. Catch up to latest as we phase in. Each phase soaks ~1 week |
| 6 | **Per-phase: single combined PR** | Switch image refs to floating tag + add to Kyverno mutate allowlist in same commit |
| 7 | **Diun is the audit source for catch-up** | Existing 6h-poll already reports outdated images; export as worklist per phase |
| 8 | **Polling, hourly** (`@every 1h`) | Not webhooks — single mechanism, all registries supported |
| 9 | **Rollback: `kubectl rollout undo` → pin in Terraform → add `keel.sh/policy: never`** | (c) from grill: immediate undo, durable Terraform pin within ≤1h before next Keel poll |
| 10 | **Implementation: Kyverno cluster-wide mutate** | One `ClusterPolicy` injects Keel annotations; phase boundary = `NamespaceSelector` allowlist |
| 11 | **Keel exempt from its own mutate** | One-line `NamespaceSelector` exclusion. Supervisor self-update has uniquely bad failure mode |
| 12 | **Uniform CI model for all self-hosted** | CI builds + pushes `:latest`, Keel polls and rolls. No per-repo `kubectl set image` step. Retires the GHA-migrated SHA-tag flow (memory id=388) |
## Architecture
### 1. Cache freshness — already correct
Pull-through cache at `10.0.20.10` already splits caching by URL at the nginx layer:
- `location ~ /v2/.*/blobs/``proxy_cache_valid 200 24h` — blobs cached (content-addressed, immutable)
- `location /v2/` (manifests) → pass through, no cache
Combined with `registry.proxy.ttl: 0` at the docker-registry layer, mutable manifests revalidate against upstream on every pull. **No cache changes needed for this design.** The CLAUDE.md note "Use 8-char git SHA tags — `:latest` causes stale pull-through cache" predates the nginx URL-split fix and should be updated as part of this work.
### 2. Detection — Keel polls upstream
Keel runs as a Deployment in its own namespace. Every annotated workload polls its registry hourly (Keel-managed; configurable per workload). On detection of a new digest under the watched tag:
- `keel.sh/policy: force` (for mutable tags `:latest`, `:16`, `:7`, etc.) → trigger Deployment update (pod template hash changes → restart)
- `keel.sh/policy: minor` / `major` / `glob` (only for images that publish neither `:latest` nor a stable floating tag) → rewrite tag string on the Deployment; requires `lifecycle { ignore_changes = [...image] }`
### 3. Application — kubelet pull through the cache
When Keel triggers restart:
1. kubelet asks the cache (via containerd hosts.toml) for `image:tag` manifest.
2. nginx passes the manifest request through to the docker-registry layer.
3. docker-registry (with `proxy.ttl: 0`) passes through to upstream.
4. Upstream returns current digest.
5. kubelet pulls blobs (mostly cached at nginx layer; new blobs from upstream).
6. New pod runs new image.
### 4. Annotation injection — Kyverno mutate
Single `ClusterPolicy` adds these annotations to every Deployment / StatefulSet / DaemonSet in opted-in namespaces:
```yaml
metadata:
annotations:
keel.sh/policy: force
keel.sh/trigger: poll
keel.sh/pollSchedule: "@every 1h"
```
Phase = a `match.any[].resources.namespaces` list. Phase advance = append namespaces. Keel namespace is excluded.
### 5. Terraform drift handling
Existing convention (`# KYVERNO_LIFECYCLE_V1` marker) handles `dns_config` injection. We extend with a new marker:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
This is added per workload as we phase in. Mechanical, grep-able.
## Phase ordering
| Phase | Set | Rationale |
|-------|-----|-----------|
| 0 | Foundation (Keel install, Kyverno ClusterPolicy with empty allowlist) | Build infra without enrolling anything |
| 1 | Self-hosted (forgejo-hosted: ~11 services) | We own the code; failures are easy to diagnose |
| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
| 3 | Exporters, sidecars, utilities | Stateless |
| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
| 6 | Authentik | Auth outage |
| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
| 9 | Bootstrap (Vault, CNPG PG cluster, mysql-standalone) | Lose recoverability if broken |
Per-phase: combined PR → apply (catch-up rolls happen) → soak 1 week → next phase. If a service breaks repeatedly, apply rollback runbook (decision #9) and proceed; re-enroll later or leave pinned.
## Risk register
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Bad upstream image rolls into prod | High | Service-level outage | Existing alerts (`KubePodCrashLooping`, `KubeletImagePullErrors`, `PodsStuckContainerCreating`); rollback runbook (decision #9) |
| Catch-up rollout overwhelms cache | Medium | ImagePullBackOff cascade (memory id=603) | Rate-limit catch-up to ~5 rollouts/6h via `-target=` per phase; same pacing as retired Service Upgrade Agent (memory id=612) |
| Calico / CSI auto-roll cascades (memory id=390: 26h outage) | Low-Medium | Cluster-level outage | Phase 8 is intentionally late; user opted into the risk; rollback to pinned chart version via Terraform |
| Vault auto-rolls to broken image | Low | Loss of secrets sync; 43 ExternalSecrets stop reconciling | Phase 9 last; Tier 0 SOPS state allows manual recovery |
| CNPG PG cluster auto-rolls to broken image | Low | Tier 1 Terraform state inaccessible; 105 stacks can't apply | Phase 9 last; Tier 0 stack `cnpg` is bootstrap-capable |
| Helm-atomic-trap services (memory id=981) | Medium | `terraform apply` hangs in pending-rollback | Identify `helm_release` services with `atomic = true`; either remove atomic or skip from Keel |
| Keel itself rolls to broken version | Low | Supervisor down; no auto-rolls until manual pin | Decision #11: exempt Keel from mutate |
| Terraform drift after Kyverno injects annotation | High at first | Spurious diffs on every plan | KYVERNO_LIFECYCLE_V2 marker (Architecture §5); applied incrementally per phase |
## What we give up
- **Terraform no longer tracks deployed version.** Image refs in `.tf` say `:latest` or `:16`, but the running digest is whatever Keel pulled. To know what's running: `kubectl describe pod`. This is a deliberate trade — the previous SHA-pinned flow tracked version in TF but required N stack edits per deploy.
- **No changelog review before rollout.** The Service Upgrade Agent's risk classification is gone. We rely on alerts to catch breakage post-deploy, not prevent it.
- **CLAUDE.md SHA-tag rule is reversed for this design.** The "use 8-char git SHA tags" rule predates the nginx URL-split fix. New rule (post-rollout): "use floating tags + Keel annotation" — to be updated in both `infra/.claude/CLAUDE.md` and the repo-root `CLAUDE.md` once Phase 1 is stable.
## Decisions resolved post-grill
### Q1 — Uniform CI model for ALL self-hosted (resolved 2026-05-16)
Every self-hosted service moves to the same shape:
```
CI (GHA or Woodpecker) → build → push :latest (optionally also :<SHA> for traceability) → done
Keel → poll registry → detect new digest → trigger rollout
```
The 10 GHA-migrated repos (memory id=388: Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) drop the `Woodpecker API → kubectl set image` step. Their `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` files become obsolete; remove during Phase 1.
Terraform image refs for all self-hosted: `<registry>/<repo>:latest` (with `${var.image_tag}` defaulting to `"latest"` where the variable exists).
### Q2 — No-CI self-hosted services (resolution: uniform participation)
| Service | Action |
|---------|--------|
| `wealthfolio` | Switch Terraform to upstream `wealthfolio/wealthfolio:latest` (DockerHub). No CI needed. |
| `chrome-service` | Verify whether `:v4` is a deliberate pin. If yes → tag stays, add `keel.sh/policy: never` label. If no → switch to `:latest` or `:major`. Investigate during Phase 1 prep. |
| `beadboard` (used by `beads-server`) | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
| `freedify` | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
## Open questions (still need resolution before Phase 1)
1. **`helm_release atomic = true` services**: count and identify before Phase 1. Either remove `atomic` (preferred — eliminates the memory id=981 trap), or skip from Kyverno mutate via per-namespace exclusion. Survey command: `grep -rn 'atomic.*true' infra/stacks/ infra/modules/`.
## Out of scope
- Cache TTL changes — current config is already correct (nginx URL-split).
- Webhook-based Keel triggers — polling is sufficient for this cadence.
- Replacing Diun — kept for notification visibility into new tags not yet under Keel annotation (during phase rollout).
- Keel approval gate (`keel.sh/approvals: N`) — user wants unattended auto-roll.
- Keel auto-rollback on health-check failure — out of scope for v1; revisit if breakage rate is high.

View file

@ -1,322 +0,0 @@
# Auto-Upgrade Apps Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
**Architecture:** Kyverno cluster-wide `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: @every 1h`). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the `NamespaceSelector` allowlist.
**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
**Key context:**
- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
---
## Phase 0 — Foundation
### Task 0.1: Resolve remaining open question
Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
Remaining open question:
**Helm-atomic services.** Survey:
```bash
grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
```
For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
---
### Task 0.2: Create the Keel stack
**Files:**
- Create: `stacks/keel/terragrunt.hcl`
- Create: `stacks/keel/main.tf`
- Create: `stacks/keel/variables.tf`
- Create: `stacks/keel/modules/keel/main.tf`
**Step 1:** Add `keel` to `terragrunt.hcl` `locals.tier0_stacks`**NO**. Keel is Tier 1 (depends on Kyverno + Keel image registry access). Keep it in Tier 1.
**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
Key Helm values:
- `polling.enabled: true`
- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
**Step 3:** Verify Keel can authenticate to all five registries (Docker Hub, ghcr, quay, k8s.io, kyverno via the local cache; Forgejo direct).
**Acceptance:**
- `kubectl -n keel get pod` shows Keel Ready.
- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
---
### Task 0.3: Author the Kyverno ClusterPolicy
**Files:**
- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
ClusterPolicy `inject-keel-annotations`:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: inject-keel-annotations
spec:
background: true
rules:
- name: add-keel-annotation
match:
any:
- resources:
kinds: [Deployment, StatefulSet, DaemonSet]
namespaces: [] # populated per phase
exclude:
any:
- resources:
namespaces: ["keel"] # decision #11
- resources:
# Workloads can opt out by setting this label
selector:
matchLabels:
keel.sh/policy: never
mutate:
patchStrategicMerge:
metadata:
annotations:
+(keel.sh/policy): force
+(keel.sh/trigger): poll
+(keel.sh/pollSchedule): "@every 1h"
```
- `+()` syntax adds only if not present (preserves per-workload overrides).
- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
**Acceptance:**
- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
---
### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
**Files:**
- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
- Modify: `.claude/CLAUDE.md` — reference the V2 marker
Snippet to copy-paste:
```hcl
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
```
Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
---
## Phase 1 — Self-hosted (uniform model)
**Set:** all self-hosted services. Three sub-categories:
- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
### Task 1.1: Audit current image refs
```bash
grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
```
Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
### Task 1.2: Per-service uniform conversion
For each Woodpecker-build-only service:
1. Edit Terraform: `local.image_tag` / `var.image_tag``"latest"`.
2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
For each GHA-migrated service:
1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
2. Add the KYVERNO_LIFECYCLE_V2 snippet.
3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
For each no-CI service:
- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
For `kms-website`: only add the Keel annotation; CI changes optional.
### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
```yaml
namespaces:
- claude-agent-service
- fire-planner
- job-hunter
- payslip-ingest
- recruiter-responder
- claude-memory-mcp
- kms-website
# GHA-migrated set:
- website # or whatever the namespace is named per repo
- k8s-portal
- f1-stream
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- audiobook-search
- council-complaints
# No-CI set:
- beads-server
- chrome-service
- freedify
- wealthfolio
```
Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
### Task 1.4: Soak
1 week. Monitor:
- Slack `#deployments` for Keel rollout notifications.
- `KubePodCrashLooping` alerts.
- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
**Acceptance:**
- All 7 services running latest digests within 24h of Phase 1 apply.
- No CrashLooping persisting >1h.
- No more than 2 services pinned-out during the soak week.
---
## Phase 2 — Stateless third-party web apps
**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from `kubectl get deploy -A` filtered against the phase-1 set + skip-bucket).
### Task 2.1: Audit current tags via Diun
```bash
# Diun's REST API or UI exports a "new tags available" report
# Use as the per-service decision source
```
For each service, pick floating tag:
- `:latest` if upstream publishes it and it's stable.
- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
- `glob` + `ignore_changes` as last resort.
### Task 2.2: Catch-up PR
Single combined PR:
- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
- Append Phase 2 namespaces to Kyverno allowlist.
Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
---
## Phases 39 — same template
For each phase, repeat:
1. Define the set (precise namespace list).
2. Audit current tags (Diun + grep).
3. Pick floating tag per service.
4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
5. Apply paced (≤5/hr).
6. Soak 1 week. Pin-out any service that breaks repeatedly.
Set definitions per phase: see design doc Phase Ordering table.
**Special-handling phases:**
- **Phase 7 (Operators).** Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
---
## Cleanup tasks (after Phase 9 stable)
### Task C.1: Retire Service Upgrade Agent
**Files:**
- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
### Task C.2: Update CLAUDE.md files
- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
- Reverse same in root `/CLAUDE.md` if duplicated.
- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
### Task C.3: Add a runbook
**Files:**
- Create: `docs/runbooks/keel-rollback.md`
Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
### Task C.4: Tidy Diun
Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
---
## Rollback (whole project)
If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
2. Existing annotations remain on workloads, but Keel continues to act on them — so also disable Keel: scale `keel` Deployment to 0.
3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
5. Reconsider opt-in approach for next iteration.
---
## Success criteria
- All ~70 services running latest within 8 weeks of Phase 0 completion.
- Zero unrolled-back outages caused by Keel.
- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
- Service Upgrade Agent + supporting infra retired.

File diff suppressed because it is too large Load diff

View file

@ -1,112 +0,0 @@
# MySQL 8.4.8 → 8.4.9 Upgrade — Design
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
**Beads**: (filed alongside this doc)
**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
## Background
On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
`mysql.ibd` + redo log after "Server upgrade started", then complete
silence — no CPU, no flushes, no errors, no completion. The `boot`
thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
completed.
Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
downtime: ~25 min. Forgejo + 7 dependent apps offline during that
window.
## Root cause — best evidence
We never proved this definitively because we couldn't connect to MySQL
during the stall, but the strongest hypothesis is **flush starvation
during the DD upgrade's mandatory checkpoint**:
1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
defs) + dirties pages across the system tablespace.
2. Reaches a point where it must checkpoint before continuing.
3. The page-cleaner thread can't drain dirty pages fast enough because
`innodb_io_capacity=100` (1.6 MB/s effective flush rate, default is
200, recommended for SSDs is 2000+) combined with
`innodb_page_cleaners=1`.
4. The `boot` thread waits on a pthread condvar that the flush
coordinator should signal but never does within probe timeout.
Why we're not 100 % certain:
- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
contribute its own flush latency.
- We didn't capture a stack trace from the stalled `boot` thread
(`/proc/1/task/118/stack` was `permission denied`).
- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
checking the MySQL bug tracker before retry).
**Organizational root cause** (definitive): the `mysql:8.4` floating
tag let Keel auto-bump without testing. Already fixed — image pinned
to `mysql:8.4.8` exactly.
## Decisions
| # | Decision | Notes |
|---|----------|-------|
| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly — no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade — current settings are ~10× too conservative for the workload. |
| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick `mysql-backup-per-db` CronJob manually. |
| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication — all separate decisions. |
## Verification gates
Before declaring done:
1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
2. `SHOW DATABASES;` lists all 20 user databases.
3. Table count per schema matches the pre-upgrade snapshot (recorded
in step 1 of the plan).
4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
5. `kubectl get deploy,sts -A` shows no unready workloads.
6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
better PASS/WARN/FAIL ratio as pre-upgrade.
7. Forgejo integrity probe reports 0 failures (manual trigger).
8. `RegistryCatalogInaccessible` not firing in Prometheus.
## Risks + mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap — adds 30 min). See plan Phase 1. |
| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook — DROP USER + CREATE USER from Vault values immediately after restore. |
| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
## Why not alternatives
- **In-place DD upgrade with bumped IO config**: simpler, but if it
still stalls we lose 3060 min waiting + still fall back to
wipe+restore. Same data risk; worse expected time. We *would* learn
whether the bumped IO settings fix the upgrade, but the fresh init
approach makes that knowledge unnecessary.
- **Parallel migration (new mysql-standalone-new pod alongside)**:
cleanest rollback (instant via service-selector flip), but needs TF
surgery to declare two StatefulSets temporarily and isn't worth the
complexity when the wipe+restore approach is now proven.
- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
Acceptable for now (we're pinned), but not a permanent answer.
## Out of scope
- A standby/replica MySQL for zero-downtime upgrades (separate
initiative — see future planning around CNPG-style HA for MySQL).
- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
encryption is a security requirement; debugging its flush latency is
separate).
- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
this upgrade).

View file

@ -1,349 +0,0 @@
# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
**Date**: 2026-05-19
**Status**: Drafted, **NOT scheduled**
**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
**Estimated downtime**: 2530 min (all MySQL-dependent apps offline)
**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
## Pre-flight (before the maintenance window)
### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
In a non-production session, before scheduling the real cutover:
```bash
# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
# same image (mysql:8.4.9), same configmap, brand-new PVC.
# Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
# so it doesn't pollute the real stack.
# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
# 4. Delete the smoketest StatefulSet + PVC.
```
Outcome:
- ✅ Init succeeds → proceed with the real upgrade with high confidence.
- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
### P.2 Read the MySQL 8.4.9 release notes + bug tracker
Specifically look for issues filed since 8.4.9 GA against the DD upgrade
path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
or 8.5.x, consider waiting.
### P.3 Confirm backup pipeline is healthy
```bash
# Latest per-db dumps exist for all 20 databases
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
# Pushgateway shows recent success
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
```
### P.4 Pin maintenance window and notify
Brief the user. Confirm window. Disable any background scrapers /
schedulers / bots that would create noise during the cutover.
## Execution (inside the maintenance window)
### Step 1 — Pre-flight snapshot
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Record current state for verification later
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
cat /tmp/mysql-pre-upgrade-table-counts.txt
```
### Step 2 — Trigger a fresh per-db dump
```bash
kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s)
# Wait for completion (typically <2 min)
kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade-<timestamp>
```
Verify all 20 databases dumped:
```bash
kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
'for d in $(ls /backup/per-db/); do
newest=$(ls -t /backup/per-db/$d/ | head -1)
echo "$d: $newest"
done'
```
Every entry should have a `dump_<today>_*.sql.gz` listed.
### Step 3 — Bump InnoDB IO config + image pin in Terraform
In `stacks/dbaas/modules/dbaas/main.tf`:
```diff
- innodb_io_capacity=100
- innodb_io_capacity_max=200
- innodb_page_cleaners=1
+ innodb_io_capacity=2000
+ innodb_io_capacity_max=4000
+ innodb_page_cleaners=4
```
```diff
- # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
- # repeatedly across multiple attempts. ...
- image = "mysql:8.4.8"
+ # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+ # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+ image = "mysql:8.4.9"
```
Commit but **do not apply yet**.
### Step 4 — Stop MySQL
```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
# Wait for pod deletion
kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
```
### Step 5 — Wipe the PVC
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
# Confirm PV vanishes (CSI cleans up the LV)
kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
```
### Step 6 — Apply Terraform (8.4.9 + bumped IO)
```bash
cd stacks/dbaas
/home/wizard/code/infra/scripts/tg apply
```
This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
takes ~30 s. Verify:
```bash
kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
```
**If the pod fails to become Ready within 5 min**: this is the
"root cause was not flush starvation" failure mode. Abort the upgrade,
revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
8.4.8 + restore). Total extra downtime ~25 min.
### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
cat <<YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: mysql-restore-per-db-$(date +%Y-%m-%d)
namespace: dbaas
spec:
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mysql:8.4.9
command: ["bash","-c"]
args:
- |
set -euo pipefail
for db in \$(ls /backup/per-db/); do
newest=\$(ls -t /backup/per-db/\$db/ | head -1)
echo "=== Restoring \$db from \$newest ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
-e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
gunzip -c "/backup/per-db/\$db/\$newest" | \
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
done
echo "=== All databases restored ==="
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
volumeMounts:
- { name: backup, mountPath: /backup, readOnly: true }
volumes:
- name: backup
persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
Expected time: ~3 min for all 20 databases.
### Step 8 — Recreate Vault-rotated + static users
The per-db restore did NOT touch `mysql.user`. Recreate all app users
fresh:
```bash
# Static users (forgejo, roundcubemail) from Vault
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
# user with the current password held in K8s secrets
for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
echo "Rotating $role"
vault write -f "database/rotate-role/$role"
done
# Technitium has a separate password-sync job — kick it
kubectl -n technitium create job --from=cronjob/technitium-password-sync \
technitium-postupgrade-$(date +%s)
```
### Step 9 — Restart MySQL-dependent apps
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"phpipam:deploy/phpipam" \
"technitium:sts/technitium" \
"vikunja:deploy/vikunja" \
"freshrss:deploy/freshrss" \
"finance:deploy/finance" \
"resume:deploy/resume" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
done
wait
```
Wait for all to become ready:
```bash
until [ "$(kubectl get deploy,sts -A -o json | \
jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
wc -l)" -eq 0 ]; do
sleep 5
done
echo "All workloads ready"
```
### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
```bash
for ns in chrome-service fire-planner freedify; do
kubectl -n "$ns" delete pod --all 2>/dev/null || true
done
```
### Step 11 — Clean up failed CronJob pods from the outage window
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
```
### Step 12 — Verify (matches design §Verification gates)
```bash
# 1. Version
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
# expect: 8.4.9
# 2-3. Databases + table counts
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
# expect: no diff (or only counts that grew between snapshots)
# 4. Forgejo
kubectl -n forgejo get pod
kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
# expect: 1/1 Running, "ORM engine initialized"
# 5. Cluster health
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
# 6. Registry integrity probe
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \
postupgrade-$(date +%s)
kubectl -n monitoring logs job/postupgrade-<timestamp> --tail=5
# expect: "Probe complete: 0 failures"
# 7. RegistryCatalogInaccessible not firing
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
# expect: empty / no RegistryCatalogInaccessible
```
### Step 13 — Commit + push the Terraform change
```bash
git add stacks/dbaas/modules/dbaas/main.tf
git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
"
git push
```
## Rollback path (if Step 6 or Step 7 fails catastrophically)
The wipe at Step 5 is destructive — once executed, the original disk
is gone. Rollback is **same procedure, image=8.4.8**:
1. Edit TF: `image = "mysql:8.4.8"`
2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
3. Re-wipe (already wiped; just `tg apply`)
4. Run the Step 7 restore Job again (now on 8.4.8)
5. Run Step 8-11
6. Update Terraform comment to reflect retained 8.4.8 pin.
Extra downtime: ~25 min on top of the existing window.
## Post-upgrade follow-ups
- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
- Re-evaluate whether the new IO config (2000/4) is overkill for the
workload after 1-2 weeks — could drop to 1000/2 if needed.
- Optional: file a follow-up task to investigate MySQL HA/replication
so the next upgrade isn't blocking.

View file

@ -1,135 +0,0 @@
# HA Control Plane (3 masters) — Design
**Date**: 2026-05-21
**Status**: Drafted, NOT scheduled
**Beads**: code-n0ow
**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
## Problem statement
The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
correct end-to-end but **cannot push through the cluster's
single-master architecture**. Each attempted upgrade today rolled
back via the same cascade:
1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
manifest (etcd → apiserver → controller-manager → scheduler).
2. While a manifest swap is in flight, the affected control-plane
component is briefly down — for apiserver, that means ~1060s of
"connection refused" to `10.96.0.1:443` from every kubelet and
operator pod in the cluster.
3. **Several operators die during that window** instead of waiting:
- **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
- gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
4. Kubelet restarts those pods → image pulls + initial reads → storm
of disk I/O on master (we observed 563 MB/s from tigera alone).
5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
annotation.
6. kubeadm declares the upgrade "did not change after 5m0s",
**rolls back to the previous manifest**, exits non-zero.
7. Chain Job retries (backoffLimit=1) → same storm → same failure.
Chain dead.
The container runtime, the script logic, the RBAC permissions are all
fine after today's fixes. The **single master is the bottleneck**.
## Why HA control plane fixes this
With 3 masters running etcd quorum + apiserver behind an LB:
| Failure mode | Single master | 3-master HA |
|---|---|---|
| Master reboot / kubeadm upgrade | Apiserver completely down 1060s | Other 2 masters serve clients; LB transparently fails over |
| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast |
| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
The k8s upgrade chain doesn't need to be aware of *any* of this — the
underlying availability of apiserver makes the chain's gates
naturally pass on each iteration.
## Decisions (proposed — to be confirmed)
| # | Decision | Notes |
|---|----------|-------|
| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
## Out of scope
- HA pfSense itself (separate, much bigger initiative)
- Multi-DC failover
- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
- Rebuilding cluster from scratch — we'll join into the existing one
## Risk register
| Risk | Mitigation |
|---|---|
| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
## Verification
After all 3 masters joined + LB up:
```bash
# All 3 masters listed
kubectl get nodes -l node-role.kubernetes.io/control-plane=
# etcd quorum healthy
kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
--endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health --cluster
# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
# Expect: full chain succeeds end-to-end without manual intervention
```
## Cost estimate
- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
- ~+128GB disk usage (2× 64GB master disks)
- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
## What's already in place from today's work
(All these are prerequisites that were fixed during today's
investigation — they stay relevant when HA lands.)
- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
`runc: unable to signal init: permission denied` on Ubuntu 26.04)
- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
`get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
fully fix the storm cascade)
- Quiet-baseline threshold 3600s → 600s
## Reference
Commits from today's session:
- `10b261d2` — first `grep -vE` pipefail
- `0c8b46df` — 2 more pipefail sites
- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
- `2dc7e001` — kubeadm apply 3-attempt retry

View file

@ -1,269 +0,0 @@
# OpenClaw devvm access + async task pattern — design
**Date:** 2026-05-22
**Stack:** `infra/stacks/openclaw`
**Status:** Approved (in-session, see chat history 2026-05-22)
## Goal
Give the OpenClaw pod (running in K8s) two new capabilities:
1. **Host-tools bundle** — common Linux CLIs the upstream OpenClaw image
doesn't ship (`ssh`, `scp`, `vault`, `dig`, `jq`, `yq`, `ripgrep`, `fd`,
`gnupg`, `tmux`, etc.). OpenClaw can't `apt install` because the
container runs as non-root `node` (uid 1000).
2. **devvm async task pattern** — OpenClaw spawns long-running work as
`tmux` sessions on devvm, sends prompts via `tmux send-keys`, captures
progress via `tmux capture-pane`. Sessions live on devvm, so they
survive OpenClaw pod restarts.
OpenClaw uses this combination as a **trusted fallback** for tasks too
expensive, sensitive, or stateful for in-pod execution: Vault lookups,
multi-step `claude-code` work, anything needing wizard's full home-lab
access.
## Why now
- The in-pod sandbox is `security=full` but the container is minimal —
no `ssh`, no `vault`, no `dig`, no `tmux`.
- The user wants OpenClaw to be a first-line agent that delegates heavy
work to the dev VM rather than duplicate that work in a constrained pod.
- Long-running work (multi-minute `claude-code` sessions) shouldn't be
tied to a single synchronous `claude -p` invocation — needs persistence
and pollability.
## Architecture decision: stay on K8s
Discussed migrating OpenClaw to run directly on devvm (would obviate the
host-tools bundle + most of the SSH setup). Decision: **stay on K8s**.
Reasons:
- Keeps HA (5-node cluster vs single devvm reboot)
- Keeps ingress/Authentik/Telegram entry chain intact
- Keeps Prometheus scrape + exporter sidecar
- Keeps PVC backup pipeline (LVM snapshots + Synology offsite)
- Resource isolation — a runaway LLM session can't stress wizard's daily-driver VM
- Migration cost is several days; this design is ~150 LoC + an 80-line wrapper
The mental model — "OpenClaw is sandboxed, delegates to wizard@devvm for
trusted heavy lifting" — is a clean security boundary. Worth preserving.
## Architecture
### Pod side (`infra/stacks/openclaw/main.tf`)
Two new init containers added to the OpenClaw Deployment, after the
existing four:
#### Init 5 — `install-host-tools`
- Image: `debian:bookworm-slim` (matches main container base for glibc compat)
- Idempotent: skips if `/tools/host-tools/.installed-v1` exists
- `apt-get install --download-only --no-install-recommends` for:
`openssh-client dnsutils iputils-ping wget gnupg jq ripgrep fd-find ncdu htop strace tcpdump tmux unzip`
- Iterates `.deb` files in `/var/cache/apt/archives/`, `dpkg-deb -x` each
into `/tools/host-tools/root/` (preserves `usr/bin`, `usr/sbin`,
`usr/lib` layout)
- Downloads static binaries to `/tools/host-tools/bin/`:
- `vault` (HashiCorp releases, pinned version)
- `yq` (mikefarah/yq GitHub releases, pinned version)
- Smoke test: invokes `--version` on each bundled binary; fails init if
any won't load (catches glibc / shared-lib drift at deploy time, not
runtime)
- Writes marker file with version
#### Init 6 — `setup-ssh-config`
- Image: uses the just-installed host-tools (debian:bookworm-slim base
with `/tools/host-tools/root/usr/bin` on PATH so `ssh-keyscan` works)
- Runs after `install-host-tools`
- Idempotent: skips if `/home/node/.openclaw/.ssh/.configured-v1` exists
- Creates `/home/node/.openclaw/.ssh/` (uid 1000)
- Copies `/ssh/id_rsa` (tmpfs secret mount) → `~/.ssh/id_rsa` with 0600
(the secret tmpfs mount has wider perms that openssh rejects)
- Writes `~/.ssh/config`:
```ssh-config
Host devvm
HostName 10.0.10.10
User wizard
IdentityFile ~/.ssh/id_rsa
UserKnownHostsFile ~/.ssh/known_hosts
StrictHostKeyChecking yes
```
PATH handling on the remote side: devvm's sshd uses the default
non-interactive PATH (`/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin`)
and does NOT load `~/.profile` or `~/.bashrc` (memory id=740). Client-side
`SetEnv PATH=…` doesn't help because sshd's `AcceptEnv` is `LANG LC_*` only.
Solution: install the binaries openclaw cares about into `/usr/local/bin/`
on devvm (see "Devvm side" below).
- Pre-seeds `~/.ssh/known_hosts` via `ssh-keyscan -H 10.0.10.10`
- Writes marker file
#### Main container
- `PATH` env updated: prepend
`/tools/host-tools/root/usr/bin:/tools/host-tools/root/usr/sbin:/tools/host-tools/bin`
- No other changes to the startup command
### Devvm side
#### `/usr/local/bin/openclaw-task` wrapper
Canonical source: `infra/stacks/openclaw/files/openclaw-task.sh`.
Installed to devvm at `/usr/local/bin/openclaw-task` (`sudo cp`, `sudo
chmod +x`) so non-interactive SSH finds it on the default PATH without
needing `~/.profile`. Updates: re-run the install steps from the
canonical source.
Also: `sudo ln -s /home/wizard/.local/bin/claude /usr/local/bin/claude`
so `ssh devvm claude …` works in non-interactive mode. `vault` and `tmux`
are already at `/usr/bin/` (system packages) so no symlink needed for
those.
POSIX shell script. Subcommands:
| Subcommand | Behavior |
|---|---|
| `new <id> <cmd...>` | Spawns detached tmux session `openclaw-task-<id>`, pipes pane output to `~/openclaw-tasks/<id>.log` |
| `claude <id> <prompt>` | Convenience: spawns interactive `claude` in a tmux session, send-keys the prompt + Enter |
| `send <id> <keys...>` | `tmux send-keys -t openclaw-task-<id> "$@"` — caller supplies `Enter` literal if needed |
| `capture <id> [lines]` | `tmux capture-pane -t … -p -S -<lines>` (default last 1000) |
| `log <id>` | `cat ~/openclaw-tasks/<id>.log` |
| `tail <id>` | `tail -n 100 -f ~/openclaw-tasks/<id>.log` (mainly for human ops) |
| `list` | tmux session list filtered to `openclaw-task-*`, one id per line |
| `status <id>` | `running` if tmux session alive, `ended` otherwise |
| `kill <id>` | `tmux kill-session -t openclaw-task-<id>` (log file is kept) |
| `purge <id>` | `kill` + `rm -f ~/openclaw-tasks/<id>.log` |
Task state lives entirely on devvm:
- tmux sessions persist across SSH disconnects and OpenClaw pod restarts
- `~/openclaw-tasks/<id>.log` is the durable transcript even after a
session is killed
- No central database — `tmux list-sessions` is the source of truth for
"what's running"
Naming convention: tmux sessions are prefixed `openclaw-task-` so they
don't collide with wizard's own tmux work (`0`, `Openclaw`, `read-only`).
### Memory note
File at `/workspace/memory/projects/openclaw-runtime/devvm-fallback.md`
teaching OpenClaw the pattern. Indexed by the existing daily
`memory-sync` CronJob (or via manual `node openclaw.mjs memory index
--force` for the initial seed).
Content (verbatim):
```markdown
# Using devvm as a fallback
When in-pod tools/permissions block you, SSH to devvm and use it. The
devvm runs as wizard with full home-lab access (Vault, kubectl, git
repos, Cloudflare, etc.) and has Claude Code v2+ installed.
## One-shot lookup
ssh devvm 'vault kv get -field=brave_api_key secret/openclaw'
ssh devvm 'claude -p "investigate why frigate is restarting"'
## Long-running async work — USE THIS for anything > ~2 min
Spawn in a tmux session on devvm. Sessions survive OpenClaw pod restarts.
# spawn
ssh devvm openclaw-task new my-task "claude -p --dangerously-skip-permissions 'do the thing'"
# poll progress (last 1000 lines of pane)
ssh devvm openclaw-task capture my-task
# interactive claude (send follow-up prompts)
ssh devvm openclaw-task claude my-task "initial prompt"
ssh devvm openclaw-task send my-task "follow-up prompt" Enter
# housekeeping
ssh devvm openclaw-task list
ssh devvm openclaw-task status my-task
ssh devvm openclaw-task kill my-task
Logs persist at ~/openclaw-tasks/<id>.log on devvm even after a session
is killed. Use `ssh devvm openclaw-task log <id>` to retrieve them.
```
## Devvm: no infra changes
Pre-existing state verified 2026-05-22:
- pubkey from `/ssh/id_rsa` (Vault `secret/openclaw → ssh_key`) matches the
`ssh-ed25519 AAAA…lug node@openclaw-58cd9f7987-884bv` line in
`~/.ssh/authorized_keys` (the comment is a stale pod name; the key
itself is stable from Vault)
- sshd listens on 0.0.0.0:22 ✓
- `claude` v2.1.126 at `/home/wizard/.local/bin/claude`
- `tmux` 3.4 installed, server already running with existing user sessions ✓
Only changes (one-time, done in the same session via `sudo`):
- Install `openclaw-task` wrapper to `/usr/local/bin/openclaw-task`
- Symlink `/home/wizard/.local/bin/claude``/usr/local/bin/claude`
## Tradeoffs / risks
- **Bundle size on NFS**: ~30MB extracted. Acceptable on
`/srv/nfs/openclaw/tools`.
- **Library version drift**: bundled binaries link against bookworm libs.
Smoke test in `install-host-tools` catches breakage on the next pod
restart if upstream OpenClaw image rebases.
- **Full-shell SSH**: explicit user choice. Blast radius if openclaw is
prompt-injected = full wizard access. Mitigation: keep OpenClaw's
plugin allowlist tight (current allow list: `memory-core, recruiter-api,
telegram, openrouter, brave, openai, codex`).
- **tmux server lifecycle on devvm**: if wizard's tmux server dies (rare —
usually only on devvm reboot), in-flight openclaw tasks are killed.
Acceptable for home lab. Task logs persist regardless.
- **Task log unbounded growth**: `~/openclaw-tasks/*.log` grows forever.
Out of scope here. User can add a `find -mtime +N -delete` cron later.
- **Init container order**: `setup-ssh-config` depends on
`install-host-tools` finishing first. K8s init containers run
sequentially in declaration order — natural ordering, no explicit
dependency mechanism needed.
## Testing — E2E flows required by user
1. **Tools present**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh -V` returns version,
same for `dig`, `vault`, `jq`, `yq`, `tmux`, `rg`.
2. **SSH happy path**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'hostname'`
returns `devvm`.
3. **Claude one-shot**:
`kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'claude -p "what is 1+1"'`
returns `2`.
4. **Async task lifecycle**:
- `ssh devvm openclaw-task new test-1 "sleep 30; echo done"`
- `ssh devvm openclaw-task list` contains `test-1`
- `ssh devvm openclaw-task status test-1` returns `running`
- wait 35s
- `ssh devvm openclaw-task log test-1` contains `done`
- `ssh devvm openclaw-task status test-1` returns `ended`
5. **Persistence test** (the key requirement):
- Spawn long task: `ssh devvm openclaw-task new persist-1 "sleep 120; echo survived > /tmp/persist-1.proof"`
- `kubectl -n openclaw delete pod <openclaw-pod>` — pod recreated
- Wait for new pod ready (init containers run, skip via marker, fast)
- `kubectl -n openclaw exec <new-pod> -c openclaw -- ssh devvm openclaw-task list`
contains `persist-1`
- Wait for original sleep to finish; verify `/tmp/persist-1.proof`
contains `survived` from new pod
6. **Memory note lookup**:
`kubectl -n openclaw exec <pod> -c openclaw -- node openclaw.mjs memory search 'devvm fallback'`
returns the note.
## Docs to update with the change
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-design.md` (this doc)
- `infra/docs/plans/2026-05-22-openclaw-devvm-access-plan.md` (implementation plan)
- `infra/.claude/reference/service-catalog.md` (one-line addition under
OpenClaw: "Has SSH to devvm with host-tools bundle; long-running async
tasks via `openclaw-task` wrapper on devvm")
- `infra/.claude/CLAUDE.md` "Known Issues" section is left alone — none of
the existing OpenClaw caveats change.

View file

@ -117,7 +117,7 @@ Contributing distractions:
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. **Done 2026-05-10**: `authentik_outpost.embedded` resource + `authentik_provider_proxy.catchall.access_token_validity` codified, plan-to-zero on the whole stack. The `Outpost.managed` field is server-set (not in provider schema) and preserved across applies because TF only writes known fields. Same-day work also flipped the outpost's session backend from filesystem (`/dev/shm`) to PostgreSQL — see `.claude/reference/authentik-state.md`. | **DONE** |
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
### P3 — Upstream + architectural
@ -125,8 +125,8 @@ Contributing distractions:
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Original idea: shrink steady-state session file count (~7× reduction) at the cost of daily re-auth. **Resolved differently 2026-05-10**: switched the outpost to the PostgreSQL session backend (`Outpost.managed = goauthentik.io/outposts/embedded` + `AUTHENTIK_POSTGRESQL__*` envFrom), which makes session count irrelevant for tmpfs sizing and lets us BUMP `access_token_validity` to `weeks=4` for better UX without cost. | **DONE (alt)** |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | Original framing: external, multi-replica outpost with Redis-backed sessions. **Resolved 2026-05-10** by enabling the postgres-backed session store on the embedded outpost itself (PR goauthentik/authentik#16628). Sessions now persist across pod restarts; the original "in-memory state" concern is moot. Multi-replica still requires a goauthentik upstream fix (PgBouncer-friendly session migration), but the loss-of-state class of failures is gone. | **DONE (alt)** |
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
## Lessons Learned

View file

@ -1,164 +0,0 @@
# Post-Mortem: kured Reboots Silently Stalled for 6 Days + Anubis HA Lift
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 |
| **Duration** | 6 days of unbooted pending-reboot packages (2026-05-10 → 2026-05-16) |
| **Severity** | SEV3 — no user-facing impact; latent risk (kernel/libc CVEs queued, not landing) |
| **Affected Services** | None directly; OS-reboot pipeline halted on all 5 K8s nodes |
| **Status** | Root cause fixed (kured Helm value), defensive defaults added (Anubis HA, kured drain-timeout, CNPG 3 instances) |
## Summary
After unattended-upgrades was re-enabled on the K8s nodes on 2026-05-10,
kured was supposed to drive rolling node reboots within the MonFri
02:0006:00 London window. Instead, kured logged "Reboot not required"
every hour for six straight days while the `kured-sentinel-gate`
DaemonSet on every host happily reported "ALL CHECKS PASSED — creating
/var/run/gated-reboot-required". The gate WAS open. kured was looking
in the wrong place.
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. The stack set
`rebootSentinel = "/sentinel/gated-reboot-required"` — which pointed
the chart at hostPath `/sentinel/` (an empty auto-created directory).
The sentinel-gate writes to `/var/run/gated-reboot-required` on the
host. Two different host directories. kured silently skipped reboots
for six days.
Found on 2026-05-16 while auditing why "automatic upgrades aren't
happening" alongside the K8s version-upgrade Job-chain (PM
2026-05-11). Fixed in one commit; took the opportunity to also
eliminate three latent drain-time hazards (Anubis single-replica PDB
deadlock, kured unbounded drain timeout, CNPG-only-2-instances).
## Impact
- **User-facing**: None. Existing kernels, libc, and userspace kept running. CVEs queued in `/var/run/reboot-required.pkgs` on every node but were never exploited.
- **Backlog**: All 5 nodes accumulated `linux-image-*` + `libc6` queued for reboot. Largest gap was master at ~6 days. Workers also 56 days.
- **Detection gap**: kured exposes no Prometheus signal for "I checked but said no". The hourly "Reboot not required" line in stdout is the only trace, and nobody was tailing it. The architecture had two layers (sentinel-gate gate + kured sentinel check) but no verification that the two layers were looking at the same path.
- **Side discovery**: 8 Anubis instances would have stalled drain anyway via single-replica + `PDB minAvailable=1` (the same trap that stalled the manual K8s upgrade on 2026-05-11). Even if the kured path bug were fixed in isolation, Monday's first reboot would have hit the Anubis trap and idled forever (kured default `--drain-timeout=0` = unlimited).
## Timeline (UTC)
| Time | Event |
|------|-------|
| **Mar 16 21:26** | kured-sentinel-gate DaemonSet introduced after the 26h overlayfs cascade outage. Original sentinel cool-down 30m. |
| **May 10 ~16:57** | Last successful kured pod restart picked up new Helm values. `rebootSentinel = "/sentinel/gated-reboot-required"`. Same commit re-enabled unattended-upgrades in cloud_init and stretched the sentinel cool-down 30m → 24h. |
| **May 10 ~17:00 → May 15 06:16** | unattended-upgrades on every node successfully installs kernel + libc patches, writes `/var/run/reboot-required`. |
| **May 1015** | sentinel-gate Check 14 all pass every 5 min on every host. Touches `/var/run/gated-reboot-required`. Logs "ALL CHECKS PASSED". |
| **May 1015** | kured polls `/sentinel/gated-reboot-required` (empty dir, file does not exist). Returns "Reboot not required" every hour. No reboots happen. |
| **May 11 20:4021:00** | Separate K8s-version-upgrade incident (master upgraded to v1.34.7, workers stalled mid-rollout because the upgrade agent drained its own host). Manual recovery 5/115/12. **kured stall noticed but not investigated**: cluster healthy, K8sVersionSkew firing was tracked as the urgent issue. |
| **May 11 22:47 → May 12 00:01** | Manual worker drains hit the Anubis single-replica PDB trap (drain loops). Resolved by direct-deleting Anubis pods to bypass eviction API. This was the first signal that single-replica `minAvailable=1` patterns deadlock drains. |
| **May 16 10:56 UTC** | While auditing "what runs the upgrades" for the user, the kured + sentinel-gate log/path mismatch became visible. |
| **May 16 11:13 UTC** | `stacks/kured/main.tf`: `rebootSentinel = "/sentinel/..."``"/var/run/gated-reboot-required"`. Re-init, plan, apply. |
| **May 16 11:14 UTC** | kured DaemonSet rolls out the new spec. Volume hostPath becomes `/var/run`. kured pod can now see `/sentinel/reboot-required` (32B, from uu) AND `/sentinel/gated-reboot-required` (0B, from gate). Confirmed via `kubectl exec` listing. |
| **May 16 11:44 UTC** | Anubis HA module change deployed: `shared_store_url` variable → `store: { backend: valkey }` block appended to policy YAML, default replicas 2, PDB `maxUnavailable=1`, topology `DoNotSchedule`. Cyberchef applied as canary. Confirmed: Redis DB 5 starts receiving challenge state. |
| **May 16 11:4811:53 UTC** | Remaining 7 Anubis stacks applied (DBs 612). 8/8 deployments at 2/2 Ready, replicas spread on different nodes. Smoke-tested 6 of 8 public URLs return 200. |
| **May 16 12:05 UTC** | kured `drainTimeout: "30m"` added + applied. pg-cluster bumped from 2 → 3 instances. |
| **May 16 12:11 UTC** | pg-cluster phase = "Cluster in healthy state", 3/3 ready. |
## Root Cause
The Helm chart `kured-5.11.0` computes:
```
{{- $sentinel_dir := dir .Values.configuration.rebootSentinel -}}
# template renders both volume mount and hostPath using $sentinel_dir
```
So `rebootSentinel` is doubly-purposed: it's both the **CLI arg path inside
the pod** AND the **hostPath on the node**. Setting it to `/sentinel/...`
caused:
- pod arg: `--reboot-sentinel=/sentinel/gated-reboot-required` (looks at `/sentinel/` inside the pod)
- hostPath: `/sentinel/` (auto-created empty directory by `type: Directory`)
- mountPath inside pod: `/sentinel/` (mapped from hostPath above)
Meanwhile the gate DaemonSet was configured with hostPath `/var/run`
mountPath `/host/var-run`, and wrote `gated-reboot-required` to its local
`/host/var-run/` which became the host's `/var/run/gated-reboot-required`.
The two daemons never touched the same directory.
**Why this was hard to spot**:
1. Both layers logged success: sentinel-gate said "ALL CHECKS PASSED", kured said "Reboot not required". Neither claimed an error.
2. No Prometheus alert exists for "kured polled, gate is open, kured still didn't act". The Upgrade Gates alert group catches firing-alert-during-rollout, not silently-skipped-rollout.
3. The Helm chart's auto-derivation of hostPath from a config value is undocumented surprising behavior. The mental model is "rebootSentinel is just the in-pod path"; the hostPath co-mutation is invisible.
## Remediation
### Primary fix
- `stacks/kured/main.tf`: `rebootSentinel = "/var/run/gated-reboot-required"`. Both the chart-derived hostPath and the kured CLI arg now align with where the gate writes.
### Defensive companion changes (same session)
| Change | Purpose | Stack |
|---|---|---|
| `drainTimeout = "30m"` on kured | Fail closed instead of looping forever if a future PDB or finalizer stalls drain. Node stays Schedulable (no silent capacity loss). | `stacks/kured/main.tf` |
| Anubis: shared-state Valkey/Redis backend | Eliminate the single-replica drain deadlock + provide real HA. PDB changed `minAvailable=1``maxUnavailable=1`. Replicas 1 → 2 with `topologySpreadConstraint: DoNotSchedule`. | `modules/kubernetes/anubis_instance/main.tf` + 8 callers |
| pg-cluster: 2 → 3 instances | Failover during primary's node drain no longer depends on the lone replica being caught up. CNPG always has a fully-current candidate. | `stacks/dbaas/modules/dbaas/main.tf` |
| Orphan `mysql-standalone` PDB deleted | Helm-stamped leftover (selector required 4 labels, pod has 3 → matched 0 pods). Was dead code; deletion is safe. | `kubectl` (not TF-managed) |
### Verified post-fix
- `kubectl -n kured exec deploy/kured -- ls /sentinel/` lists both `reboot-required` and `gated-reboot-required` on every node.
- 8 Anubis Deployments at 2/2 Ready; pods spread across different nodes (verified via `kubectl get pods -o wide`).
- Redis DBs 5, 7, 8, 10 receiving challenge state from real public traffic post-apply (Palo Alto Networks scanner hit blog).
- pg-cluster 3/3 healthy, phase = "Cluster in healthy state".
- kured args show `--drain-timeout=30m`.
## Lessons
1. **Auto-derivation in Helm charts is invisible drift surface.** The chart's
habit of deriving hostPath from a CLI-arg-shaped value is the kind of
"convenient default" that hides during normal review. Mitigation:
pin `hostFilePath` explicitly in `configuration` so the host path is
declared, not derived. (Did not do this in the fix because the
single-config approach is now correct; flagging as future improvement.)
2. **"Silently skipped" needs a Prometheus signal.** The Upgrade Gates
alerts cover "rollout in progress + something went wrong". They don't
cover "we haven't rolled in 7 days when we should have". Suggested:
add `KuredRebootBacklog` — fires when `kured_reboot_required ==
1` (kured exposes this) for more than 24h continuously. The kured
chart already serves `/metrics`; just needs a rule. (Deferred.)
3. **Single-replica `PDB: minAvailable=1` is a deadlock pattern.** It
reads as "protect this pod" but actually means "block all voluntary
disruption forever". Manifested in 9 places (8 Anubis + mysql-standalone
with broken selector). The Anubis fix is now in place via shared-store
replicas=2; the `mysql-standalone` selector was already broken so it
matched 0 pods (and was deleted as cruft). Worth auditing the cluster
periodically for any new pattern of the same shape.
4. **k8s-node1 containerd source drift** (Ubuntu archive's `containerd`
vs Docker's `containerd.io`) is benign but should be documented.
Audited during this session: not a blocker for kured because both
variants are in the Package-Blacklist and both are apt-held. The
version skew with master (1.6.22 vs 1.7.24/1.7.27) is what the
K8s version-upgrade Stage 3 "containerd bump" exists to fix.
5. **CNPG drain handling at 2 replicas is fragile.** Switchover works
but the lone replica must be caught up; in practice this means
on a busy cluster, a primary-node drain could stall for tens of
seconds while CNPG promotes. 3 instances eliminates this. Worth
considering for every long-running multi-instance stateful workload.
## Detection / Prevention Followups
- [ ] `KuredRebootBacklog` Prometheus alert. Spec: `kured_reboot_required == 1 and (time() - timestamp(kured_reboot_required)) > 86400`.
- [ ] Add a `hostFilePath` value to the kured Helm release for explicit declaration (current setup is correct but undocumented).
- [ ] Audit periodically for new single-replica + `minAvailable=1` PDB patterns (could be a Kyverno warn policy).
- [ ] Phase 4: clean up the InnoDB Cluster CR + remaining `mysql-cluster-pdb` once the bitnami legacy is fully decommissioned.
## File pointers
| What | Where | Commit |
|---|---|---|
| kured sentinel path fix | `infra/stacks/kured/main.tf` | c17d87e1 |
| Anubis HA (module + 8 callers) | `infra/modules/kubernetes/anubis_instance/` + 8 `stacks/<app>/main.tf` | 6e920f96 |
| kured drainTimeout + CNPG 3-replica | `infra/stacks/kured/main.tf` + `infra/stacks/dbaas/modules/dbaas/main.tf` | a726e963 |
| K8s version-upgrade Job-chain (related context) | `infra/stacks/k8s-version-upgrade/` | 01bc16d5 (5/11) |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` | (updated 5/11) |
| Runbook | `infra/docs/runbooks/k8s-version-upgrade.md` | (updated 5/11) |
| Deprecated agent prompt (self-preemption history) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | 01bc16d5 |

View file

@ -1,160 +0,0 @@
# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
**Beads:** `code-8vr0` (P1)
**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
## Summary
`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
LTS (Resolute Raccoon)** at some point, putting the running kernel at
`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
noble repositories (the container's base OS), which don't carry
`linux-headers-7.0.0-15-generic`, so the build aborts with:
Could not resolve Linux kernel version
Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
and `kernelModuleType: open`) succeeded at the chart level but produced
a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
and constructs the image tag `<version>-ubuntu26.04`, which 404s on
pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
to-working state pending an upstream fix or kernel rollback.
## Impact
- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
- All GPU-bound workloads Pending or 0/N Ready:
- `frigate/frigate`
- `immich/immich-machine-learning`
- `llama-cpp/llama-swap`
- `nvidia/nvidia-exporter`
- `ytdlp/yt-highlights`
- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
(Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
firing (node is schedulable, just no GPU advertised)
- No data loss; no user-facing service degradation outside the GPU stack
## Timeline (Europe/Sofia, UTC+3)
- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
the upgrade (rotated by `do-release-upgrade`).
- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
reports `system-os_release.VERSION_ID = 26.04`,
`kernel-version.full = 7.0.0-15-generic`.
- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
on every kubelet restart cycle. Error: "Could not resolve Linux kernel
version".
- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
cleanly but driver pod ImagePullBackOff on
`driver:580.105.08-ubuntu26.04`.
- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
the gotcha, file the kernel rollback as the next step.
## Root Causes
1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
support window. NVIDIA typically lags new Ubuntu LTS releases by
weeks-to-months on the driver-container front.
2. **gpu-operator chart was not pinned** prior to today. The TF
`helm_release` had `version` commented out, so any apply could
re-resolve to the latest chart and follow its OS-auto-detection
logic. With v25.10.1, the operator fell back to ubuntu24.04 image
suffix (which pulls successfully but fails to compile against kernel
7.0). With v26.3.1, the operator picks the correct (per-NFD)
ubuntu26.04 suffix — which doesn't exist.
3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
fires only when the metrics exporter itself stops scraping, not when
the operator's driver pod is unhealthy.
## What We Changed in This Session
- `stacks/nvidia/modules/nvidia/main.tf` — pinned
`helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
applies don't surprise us with v26.3.1's stricter OS detection.
- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
the situation; driver version stays at `570.195.03` as the last-known
config that produced a pullable image.
- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
this file.
## What We Did NOT Do (Pending User Decision)
- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
to `6.8.0-117-generic`. The 6.8 kernel is still installed at
`/lib/modules/6.8.0-117-generic` and the matching headers at
`/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
the driver image's apt sources (Ubuntu 24.04 noble) carry
`linux-headers-6.8.0-117-generic`. This would require draining the
node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
and rebooting — needs explicit user OK.
- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
GPU node. Should fire within 10 minutes of the operator failing to
publish the resource, regardless of which sub-pod failed.
## Recovery Procedure (next time)
### If the driver-installer fails with "Could not resolve Linux kernel version"
1. Identify the running kernel: `uname -r` on the affected node.
2. Check whether NVIDIA ships an image for that kernel/distro combo:
docker run --rm quay.io/skopeo/stable list-tags \
docker://nvcr.io/nvidia/driver \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
print([t for t in d['Tags'] if '<distro>' in t][:5])"
3. If yes, point the chart at the right version + ensure NFD reports
the matching OS.
4. If no (and a kernel rollback is acceptable):
- `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
- `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
- `nsenter -t 1 -m -p -u update-grub`
- `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
- Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
- After boot: `kubectl uncordon <node>` and wait for the GPU
daemonset to come Ready
## Action Items
- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [ ] Roll back k8s-node1 host kernel to 6.8.0-117-generic + apt-mark
hold (needs user authorization for node reboot)
- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
of 0 for >10m
- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
tags — file as a quarterly checkup once we see the first 26.04
tag, unpin the chart and revert this post-mortem's mitigation
- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
memory of the March 2026 outage says we disabled
`unattended-upgrades``do-release-upgrade` is a separate path
that should be gated too
## Lessons
- **Operator-style charts that auto-detect host OS can silently break
when the host fleet leapfrogs upstream image support.** Pin the chart
version + driver version, and treat upstream support gaps as a hard
blocker rather than a guaranteed-to-resolve race condition.
- **Drain-and-revert host kernel is the right escape hatch when
upstream image lags.** Make sure the previous kernel and its headers
stay installed (don't aggressively purge old kernels in apt
autoremove).
- **NFD labels are authoritative for the operator's image-tag
construction.** If you need to lie about OS version (e.g., to force a
24.04 image on a 26.04 host), edit the NFD label — but only as a last
resort; the chart upgrade made clear the operator will eventually
reconcile this.

View file

@ -1,133 +0,0 @@
# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
**Date:** 2026-05-17
**Author:** Viktor Barzin / Claude (incident response)
**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
**Duration:** ~2 hours from first detection to all-green
## Summary
The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
used `hostNetwork: true` and tried to bind the same host ports
(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
controller pod CrashLoopBackOff'd with `bind: address already in use`. The
upgrade also left behind multiple orphan controller pods in containerd that
kubelet could no longer reconcile — they held the host ports even after the
helm rollback removed them from K8s state.
The `csi-nfs-node` DaemonSet pod on master then could not start either: its
own `node-driver-registrar` and `liveness-probe` containers tried to bind
the same host ports and lost to the zombies.
## Impact
- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
- CSI plugin unregistered on master → no NFS volumes could be mounted on
master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
failed)
- Controller flap (2 replicas fighting) → intermittent
`csi-resizer`/`csi-snapshotter` failure for the whole cluster
- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
csi-node-driver (Calico) all bounced on master while kubelet thrashed
No data loss; no production-facing outages observed (CSI mounts on the four
worker nodes kept working).
## Timeline (Europe/Sofia, UTC+3)
- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
scheduled (no `affinity` block → both replicas land on `k8s-master`)
- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
CrashLoopBackOff
- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
begins
- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
on master deleted via K8s, but containerd retains them as live sandboxes
- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
added to the controller Deployment via patch (controllers now correctly
schedule on node1+node3)
- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
plus 5× csi-provisioner from zombie controller pods)
- 09:00 — Privileged pkill via `hostPID: true` pod failed
(`permission denied` from runc — containerd refused to signal init in the
zombie containers)
- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
the orphan containers via cgroup GC; ports freed
- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
`controller.affinity` to values
## Root Causes
1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
no `version = ...` field, so it floated to whatever the chart repo
advertised. Keel polled this and rolled forward.
2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
shipped with. Without it, the K8s scheduler chose master for both
controller replicas.
3. **Two controller replicas + hostNetwork = port conflict on the same node.**
The chart did not add `podAntiAffinity` between the replicas. Live state
has it now; TF now does too.
4. **Helm rollback does not always clean containerd sandboxes.** When the
prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
containerd can keep multiple sandbox instances for the same pod-UID.
Kubelet GC is the only thing that reliably reaps these — restarting it
forces a reconciliation pass that drops orphans.
## What We Fixed
- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
- `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
is already excluded from Kyverno-Keel injection, but the chart could still
drift on a `terraform apply` without a pin)
- `controller.affinity` block with `podAntiAffinity` (different hosts for
replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
- Inline comments explaining both decisions
- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
list (decision from authentik incident 2026-05-17). Verified still there
in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
## Recovery Procedure (next time)
If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
1. **Find which host ports are bound**`lsof -i :19809`, `lsof -i :29653`
(from a privileged hostPID pod on the affected node).
2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
tracks). Will fail with `unable to signal init: permission denied` if
the containers are sufficiently stuck.
3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
reconciles containerd state and reaps the orphans.
4. **Force-delete the DaemonSet pod** to clear the back-off
(`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
DaemonSet recreates it; with the ports free, containers start cleanly.
## Action Items
- [x] Pin `csi-driver-nfs` chart version in TF
- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
- [x] Document recovery procedure (this post-mortem)
- [ ] Audit other unpinned `helm_release` blocks — every chart used in
Kyverno-excluded namespaces should still be pinned to prevent
`terraform apply` drift. (Filed as follow-up — not blocking.)
- [ ] Consider adding a `kured` or daily script that detects orphan
containerd sandboxes whose pod-UID is unknown to the apiserver and
reaps them automatically. (Filed as follow-up — not blocking.)
## Lessons
- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
Keel injection, but the helm_release was unpinned — so a `terraform apply`
alone could re-trigger the same break. Both layers needed locking down.
- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
signal init, kubelet restart is the next escalation step before SSH/reboot.
- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
excluded.** CSI was correctly excluded — but the chart version itself was
still a moving target via plain `terraform apply`. The exclude-list catches
Keel; the version pin catches everything else.

View file

@ -1,207 +0,0 @@
# K8s Node Auto-Upgrades
## Overview
OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
## Architecture
```
apt-daily.timer (random within window)
│ apt-get update
apt-daily-upgrade.timer (random within window)
│ unattended-upgrades runs
│ - Allowed-Origins: -security, -updates, ESM
│ - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
│ - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
│ - Automatic-Reboot=false (kured handles reboots)
▼ if kernel/glibc/systemd updated
/var/run/reboot-required appears on the host
▼ (sentinel-gate DaemonSet polls every 5min)
kured-sentinel-gate checks:
├── 1. Host has /var/run/reboot-required
├── 2. ALL nodes Ready
├── 3. ALL calico-node pods Running
└── 4. NO node Ready-transition in last 24h (soak window)
▼ all pass
touch /var/run/gated-reboot-required
▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
kured checks Prometheus before draining:
│ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
│ ANY firing alert (except ignore-list) blocks the drain
│ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
▼ no blockers
kured drains the node (priority-ordered, 310s budget)
kured runs /bin/systemctl reboot
▼ node returns
kured uncordons + posts Slack notification (configuration.notifyUrl)
▼ 24h cool-down begins (sentinel-gate Check 4)
```
## Components
### unattended-upgrades (in-guest)
- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (lines for `is_k8s_template`)
- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
### kured (Helm release)
- **Stack**: `infra/stacks/kured/main.tf`
- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
- **Slack hook**: Vault `secret/kured``slack_kured_webhook`
### kured-sentinel-gate (DaemonSet)
- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
- **Image**: `bitnami/kubectl:latest`
- **Loop period**: every 300s
- **Gate logic**: 4 checks — see Architecture diagram
### Upgrade Gates Prometheus alerts
- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
## Common Operations
### Verify the system is healthy
```bash
# kured pods + sentinel-gate Running on all 5 nodes
kubectl -n kured get pods
# kured can reach Prometheus
kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
# Upgrade Gates rules loaded + state
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
# Per-node unattended-upgrades status
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
echo "=== $n ==="
ssh $n "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | wc -l"
done
```
### Halt rollout in an emergency
```bash
# Option 1: scale kured to 0 (most decisive)
kubectl -n kured scale ds kured --replicas=0
# When ready: kubectl -n kured scale ds kured --replicas=5
# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
```
### Force halt by adding a custom blocker alert
- Add a PrometheusRule expression that's always-1 (e.g. `vector(1)`) to the `Upgrade Gates` group temporarily.
- Apply, wait for sync (~120s), kured will block on the next poll.
- Remove when ready.
### Pause apt upgrades on a single node
```bash
ssh <node> sudo systemctl stop unattended-upgrades
ssh <node> sudo systemctl disable unattended-upgrades
# Re-enable when ready:
ssh <node> sudo systemctl enable --now unattended-upgrades
```
### Restore / re-apply unattended-upgrades config to existing nodes
Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
```bash
# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh $n sudo bash -s <<'EOF'
set -e
systemctl unmask unattended-upgrades 2>/dev/null || true
DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}:${distro_codename}-updates";
"${distro_id}ESMApps:${distro_codename}-apps-security";
"${distro_id}ESM:${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$";
"^runc$";
"^cri-tools$";
"^kubernetes-cni$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
CONF
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
CONF
apt-mark hold kubelet kubeadm kubectl
apt-mark hold containerd containerd.io runc 2>/dev/null || true
systemctl enable --now unattended-upgrades
EOF
done
```
### Roll back a bad apt upgrade
1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
2. Hold them: `sudo apt-mark hold <pkg>`.
3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
4. Reboot the node manually if the package needs it.
5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND drop the holds via the SSH push above so future apt runs skip it.
### kured halted — investigate which alert is blocking
```bash
# Show kured logs — it logs "blocking alerts" when halting
kubectl -n kured logs ds/kured --tail=100 | grep -i alert
# List currently firing alerts (any of these blocks kured):
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
jq -r '.data.alerts[] | select(.state == "firing") | " \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
```
The alert is either:
- One of the 10 `Upgrade Gates` (genuine cluster-health issue — fix it),
- A pre-existing alert (any of the ~211 in the library — investigate),
- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
### Verify the 24h soak is enforcing
```bash
# Sentinel-gate logs Check 4 outcome
kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
# kured won't drain another node until the most recent Ready-transition is >24h ago.
# If you need to override (e.g. emergency security patch), shorten the cool-down by
# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
```
## Past Incidents
- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
## File Pointers
| What | Where |
|------|-------|
| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
| Slack webhook | Vault `secret/kured``slack_kured_webhook` |
| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |

View file

@ -1,323 +0,0 @@
# K8s Version Upgrade Pipeline
## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself.
The chain (Sun 12:00 UTC weekly):
```
detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
```
This is **independent** of the OS-side `unattended-upgrades + kured`
pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
Schedules can overlap (kured runs daily 02:00-06:00 London; detection
here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
group blocks the version-upgrade preflight, so the chain self-defers
to the next Sunday rather than rolling on top of a half-fresh node.
## Architecture
```
k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
│ kubectl get nodes → running version
│ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor)
│ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available?
│ push k8s_upgrade_available{kind,running,target} → Pushgateway
▼ if a target is detected
envsubst on /template/job-template.yaml | kubectl apply -f -
│ creates k8s-upgrade-preflight-<target_version>
Job 0 — preflight (pinned: k8s-node1)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
└── spawn_next → k8s-upgrade-master-<target_version>
Job 1 — master upgrade (pinned: k8s-node1)
├── halt-on-alert recheck (no firing alerts)
├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
├── kubectl uncordon k8s-master; wait Ready + version match
├── verify control-plane pods Running
├── halt-on-alert recheck (allows RecentNodeReboot)
└── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
Job 2 — worker k8s-node4 (pinned: k8s-node1)
Job 3 — worker k8s-node3 (pinned: k8s-node1)
Job 4 — worker k8s-node2 (pinned: k8s-node1)
(identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
└── spawn_next → k8s-upgrade-postflight-<target_version>
Job 6 — postflight (no pinning)
├── Verify all 5 nodes at target version
├── Verify no firing Upgrade Gates alerts
├── Compute pod-ready ratio (should be ≥ 0.9)
├── Clear k8s-upgrade-* annotations on namespace
├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
└── Slack: ✅ K8s upgrade complete
```
**Pin choices summarised:**
- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
itself is upgraded **last**.
- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
- If anyone reorders the worker sequence, the pin for Job 5 needs to track
whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
→ the `case "${PHASE}:${TARGET_NODE:-}"` block.
## Components
### Shared resources (one-time, Terraform-managed)
| Resource | Purpose |
|---|---|
| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
### Pushgateway metrics
Pushed by upgrade-step.sh during phase execution; observed by the
`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
| Metric | Pushed by | Cleared by |
|---|---|---|
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Vault secrets
- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
## Common Operations
### Verify the pipeline is healthy
```bash
# CronJob present + not suspended
kubectl -n k8s-upgrade get cronjob k8s-version-check
# Latest detection run output
kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
# Pushgateway — running detection metric
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
# Upgrade Gates rules loaded
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -q -O- 'http://localhost:9090/api/v1/rules' | \
jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | " \(.name): \(.state)"'
```
### Manually trigger detection (no upgrade)
Use `detection_dry_run=true` to short-circuit before spawning Job 0:
```bash
# Toggle var in TF, apply, and trigger
# (in stacks/k8s-version-upgrade/main.tf)
# variable "detection_dry_run" { default = true }
# scripts/tg apply
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
# When done, flip back to false.
```
### Manually trigger the chain (skip detection)
Useful for testing or to force a specific target. Render Job 0 directly:
```bash
TARGET=1.34.7
KIND=patch
IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
-o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
cat <<EOF | envsubst | kubectl apply -f -
$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
EOF
# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
# the full env block. Easier: just trigger detection with the right inputs.
```
### Kill a stuck Job (chain halted mid-flight)
The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
fires after 90 min. Recovery:
```bash
# 1. Identify the failed Job
kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
kubectl -n k8s-upgrade logs job/<failed-job-name>
# 2. Diagnose. Common causes:
# - drain stuck on PDB-violating pod (predrain_unstick should handle this;
# but a brand-new PDB pattern could escape it — manually delete the pod)
# - SSH from Job pod failing (node restarted? known_hosts mismatch?)
# - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
# 3. Fix the root cause first.
# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
# `kubectl apply` of the same name reconciles to a single Job.
kubectl -n k8s-upgrade delete job/<failed-job-name>
# 5. Manually render + apply the same Job. Pull the template + spec from the
# next-Job-creation block in upgrade-step.sh — easiest is to copy from a
# sibling Job's YAML:
kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
| yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
| yq '.metadata.name = "<failed-job-name>"' \
| yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = "<right-phase>"' \
| kubectl apply -f -
# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
# is idempotent (deterministic name) so re-running won't duplicate downstream.
```
### Skip a phase (advanced; use sparingly)
If you've already done the work for a phase manually and want the chain to
jump past it, manually create the NEXT phase's Job with the deterministic
name. The previous phase's spawn-next will see the Job already exists and
short-circuit. Example: master already on target; jump straight to worker:
```bash
TARGET=1.34.7
TGT_LBL=${TARGET//./-}
# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
```
### Halt the pipeline in an emergency
```bash
# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
kubectl -n k8s-upgrade patch cronjob k8s-version-check \
-p '{"spec":{"suspend":true}}' --type=merge
# Re-enable: -p '{"spec":{"suspend":false}}'
# Option 2: delete all in-flight chain Jobs
kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
# This leaves the in-flight annotation + Pushgateway gauge intact —
# K8sUpgradeStalled will fire to surface the halt.
# Option 3: force a blocker alert (same regex kured uses)
# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
```
### Clear orphaned in-flight state
After deciding NOT to retry a halted chain:
```bash
kubectl annotate ns k8s-upgrade \
viktorbarzin.me/k8s-upgrade-in-flight- \
viktorbarzin.me/k8s-upgrade-target- \
viktorbarzin.me/k8s-upgrade-snapshot-path-
# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
| curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
kill %1
```
### Rollback paths
`kubeadm` does **not** support in-place downgrade. If a run fails:
#### Master broke during/after kubeadm upgrade
1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
```bash
ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
# Pre-upgrade versions are in the most recent "Commandline: apt-get install"
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get install --allow-downgrades -y \
kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```
#### Worker broke
1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
2. Downgrade apt packages on that node only (see above)
3. `kubectl uncordon <node>`
4. The cluster continues running on the master + remaining workers throughout
### One-shot SSH key rotation
1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
2. Update Vault:
```bash
vault kv patch secret/k8s-upgrade \
ssh_key=@/tmp/k8s-upgrade \
ssh_key_pub=@/tmp/k8s-upgrade.pub
```
3. Push the new pubkey to every node:
```bash
for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys'
done
```
4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
## Past Incidents
### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
## File Pointers
| What | Where |
|------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |

View file

@ -9,36 +9,15 @@ how to tune the rate limit, how to revoke if abused.
## Architecture
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB **dedicated**
LB IP `10.0.20.202:1688`. ETP=Local, so vlmcsd sees real WAN client IPs
in its log (pfSense WAN forwards do DNAT-only, no SNAT; ETP=Local skips
the kube-proxy SNAT too). Same pattern mailserver used pre-2026-04-19.
Sharing `10.0.20.200` isn't an option — all 10 services there are
ETP=Cluster and MetalLB requires a single ETP per shared IP.
- **Native DNS auto-discovery for LAN clients**: any Windows client with
DNS suffix `viktorbarzin.lan` activates with zero config — Windows
queries `_vlmcs._tcp.viktorbarzin.lan` SRV by default, the SRV target
resolves to `vlmcs.viktorbarzin.lan``10.0.20.202`, and `slmgr /ato`
succeeds. Records:
- `_vlmcs._tcp.viktorbarzin.lan` SRV 0 0 1688 vlmcs.viktorbarzin.lan
- `vlmcs.viktorbarzin.lan` A `10.0.20.202`
- `kms.viktorbarzin.lan` A `10.0.20.200` (Traefik — for the user-facing
website at `https://kms.viktorbarzin.lan/`; **not** the KMS server)
Manual override (e.g., for clients without the suffix or for clients
on the public internet): `slmgr /skms kms.viktorbarzin.me:1688` (WAN
path via pfSense forward) or `slmgr /skms 10.0.20.202:1688` (direct).
To revert a manually-overridden client back to auto-discovery:
`slmgr /ckms`.
- **Pod fluidity**: deployment has `replicas=1` (notifier dedup state is
per-pod) with no node affinity. TCP readiness/liveness probes on 1688
gate Pod Ready on the listener actually being up, so MetalLB only
advertises `10.0.20.202` from a node where vlmcsd is serving.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_kms_lb:1688`
(alias = `10.0.20.202`, dedicated to KMS). Description: `KMS public —
kms.viktorbarzin.me`. Other forwards using `k8s_shared_lb` (WireGuard,
HTTPS, shadowsocks, smtps, etc.) are unaffected.
- **Filter rule** on the WAN interface, TCP/1688 destination
`<k8s_kms_lb>`, with state-table per-source caps:
- **K8s service**: `windows-kms` in namespace `kms`, MetalLB shared LB IP
`10.0.20.200:1688`. ETP=Cluster, so client IPs in vlmcsd logs are SNAT'd
k8s node IPs (not real-world client IPs). Trade-off accepted —
preserving real client IPs would require a dedicated MetalLB IP with
ETP=Local or a PROXY-protocol bounce; vlmcsd doesn't speak PROXY-v2.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_shared_lb:1688`
(alias = `10.0.20.200`). Description: `KMS public — kms.viktorbarzin.me`.
- **Filter rule** on the WAN interface, TCP/1688, with state-table
per-source caps:
- `max-src-conn 50` — concurrent connections per source IP
- `max-src-conn-rate 10/60` — 10 new connections per 60 seconds per
source
@ -47,13 +26,6 @@ how to tune the rate limit, how to revoke if abused.
flushed. (`virusprot` is the only table pfSense's filter generator
targets for `overload`; see `/etc/inc/filter.inc`. Don't try to point
it at a custom table — the schema doesn't expose that knob.)
- **Probe filter in slack-notifier**: a bare TCP open/close (no
Application/Activation block from vlmcsd) is treated as a probe — Uptime
Kuma's port-type monitor on `windows-kms.kms.svc:1688` and the kubelet
readiness/liveness probes both hit this path. Probes increment
`kms_connection_probes_total{source}` (`source``internal_pod`,
`cluster_node`, `external`) and log to stdout, but never post to Slack.
Real activations still post.
## Where the logs are
@ -67,11 +39,8 @@ kubectl logs -n kms -l app=kms-service -c windows-kms --tail=50 -f
kubectl logs -n kms -l app=kms-service -c windows-kms | grep "Incoming KMS request"
```
Source IPs from the WAN are real client IPs (pfSense DNAT-only + ETP=Local
preserve them through the chain). LAN clients hitting the LB IP directly
appear as their own IP. Pod-source probes (Uptime Kuma) appear as a Calico
pod IP in `10.10.0.0/16`. Kubelet readiness/liveness probes appear as the
hosting node IP in `10.0.20.0/24`.
Source IPs in this log are the SNAT'd node IPs because the LB Service uses
ETP=Cluster on a shared MetalLB IP. Don't expect real WAN client IPs here.
### Slack notifier (kms namespace, k8s)
@ -84,17 +53,6 @@ also increment the Prometheus counter `kms_activations_total{product,status}`
exposed on the same pod at `:9101/metrics` (scraped by the cluster-wide
`kubernetes-pods` job; query via Prometheus or Grafana directly).
Probe-only TCP connections (open+close, no KMS RPC) are silently filtered
out of Slack and counted in `kms_connection_probes_total{source}`. Useful
queries:
```promql
# Probe rate by source
rate(kms_connection_probes_total[5m])
# Probes from the public WAN (a non-zero rate here means real port-scans
# are reaching us, not just internal monitoring)
rate(kms_connection_probes_total{source="external"}[5m])
```
### pfSense — virusprot table and filter hits
```bash
@ -135,19 +93,18 @@ The `overload` table entry survives pf reloads. Running
If the activation surface needs to come down (abuse, legal, audit):
1. **pfSense web UI**`Firewall → NAT → Port Forward` → find
`WAN TCP/1688 → k8s_kms_lb` → **delete** (or disable). Apply.
`WAN TCP/1688 → k8s_shared_lb` → **delete** (or disable). Apply.
2. **pfSense web UI**`Firewall → Rules → WAN` → find
`KMS public — kms.viktorbarzin.me`**delete** (or disable). Apply.
3. Verify externally: from a phone tether, `nc -zw3 kms.viktorbarzin.me 1688`
should now fail.
The k8s service stays reachable on the LAN
(`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
via Traefik on `10.0.20.200:443`) — only the WAN port-forward is removed.
(`10.0.20.200:1688` and the internal `kms.viktorbarzin.lan` ingress for
the webpage) — only the WAN port-forward is removed.
To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
port `1688`) and the filter rule with the same per-source caps. The alias
itself is independent of any forward and persists across delete/restore.
To put it back, recreate the NAT rule (target alias `k8s_shared_lb`,
port `1688`) and the filter rule with the same per-source caps.
## Related

View file

@ -1,256 +1,166 @@
# Restore MySQL (Standalone)
# Restore MySQL (InnoDB Cluster)
Last updated: 2026-05-18 (after the 8.4.9 DD-upgrade disaster recovery)
Applies to the `mysql-standalone` StatefulSet in the `dbaas` namespace
(raw `kubernetes_stateful_set_v1`, migrated from InnoDB Cluster on
2026-04-16). The historic InnoDB-Cluster recovery flow is gone.
Last updated: 2026-04-06
## Prerequisites
- `kubectl` against the cluster
- Root password: `kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d`
- A backup dump on NFS at `/srv/nfs/mysql-backup/` (exported via
`dbaas-mysql-backup-host` PVC inside the cluster)
- `kubectl` access to the cluster
- MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
- Backup dump available on NFS at `/mnt/main/mysql-backup/`
## Backup Locations
## Backup Location
- NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
- Size: ~11MB per dump
| Location | Purpose | Retention |
|---|---|---|
| `/srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` | Full daily dump (CronJob `mysql-backup`, daily 00:30 UTC) | 14 days |
| `/srv/nfs/mysql-backup/per-db/<dbname>/dump_*.sql.gz` | Per-DB dumps (CronJob `mysql-backup-per-db`, daily 00:45 UTC) | 14 days |
| Synology `Backup/Viki/nfs/mysql-backup/` | Offsite mirror via inotify-tracked rsync | unlimited |
Latest full dump is ~230MB compressed (~3GB uncompressed). Restore
of a full dump into a fresh MySQL pod takes ~3 minutes.
## Scenario A — Single database restored alongside the others
When one DB is corrupted but MySQL is otherwise fine.
## Restore Procedure
### 1. Identify the backup to restore
```bash
ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# List per-db dumps for the affected database
kubectl -n dbaas exec mysql-standalone-0 -- ls -lt /backup/per-db/<dbname>/
# Pipe a chosen dump into MySQL (REPLACE existing data in <dbname>):
kubectl -n dbaas exec -i mysql-standalone-0 -- \
sh -c "zcat /backup/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -uroot -p\"$ROOT_PWD\" <dbname>"
# Restart consumers
kubectl -n <ns> rollout restart deployment
# List available backups
kubectl run mysql-ls --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-ls","image":"mysql","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
-n dbaas
```
## Scenario B — Full disaster: data dictionary corrupt or PVC unsalvageable
This is the path executed on 2026-05-18 when a Keel-driven bump to
`mysql:8.4.9` left the data dictionary half-upgraded and 8.4.8 refused
to start (`Server upgrade of version 80408 is still pending`
MY-013379). Wipes the PVC and rehydrates from the daily dump.
**Estimated downtime: 25 minutes.** Plan accordingly — Forgejo +
registry + every MySQL app go offline during this.
### B.1 Stop the failing MySQL pod
### 2. Get the root password
```bash
kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d
```
### B.2 Verify the dump you intend to restore is healthy
### 3. Option A: Restore via port-forward (from outside cluster)
```bash
ssh root@192.168.1.127 'ls -la /srv/nfs/mysql-backup/dump_*.sql.gz | tail -5'
# Sanity-check the header
ssh root@192.168.1.127 'zcat /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz | head -20'
# Should show "MySQL dump 10.13 ... Server version 8.4.X"
# Port-forward to MySQL primary
kubectl port-forward svc/mysql -n dbaas 3307:3306 &
# Get root password
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Restore (decompress and pipe to mysql, use --host to avoid unix socket, specify non-default port)
zcat /path/to/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307
```
### B.3 Pin MySQL image in Terraform (if it auto-bumped)
If the upgrade was triggered by a Keel bump on a floating tag
(`mysql:8.4`), edit `stacks/dbaas/modules/dbaas/main.tf` to pin to a
known-good exact version (`mysql:8.4.8`). Commit but don't apply yet.
### B.4 Wipe the corrupted PVC
The PV reclaim policy defaults to **Retain** on
`proxmox-lvm-encrypted``kubectl delete pvc` alone leaves the PV
attached to the (corrupted) disk. Flip to `Delete` first so the CSI
driver actually cleans up the underlying LV.
### 3. Option B: Restore via in-cluster pod
```bash
PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
kubectl -n dbaas delete pvc data-mysql-standalone-0
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
kubectl run mysql-restore --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}]}}' \
-n dbaas
```
The PV transitions to `Released` then gets cleaned up by the CSI
controller; confirm with `kubectl get pv | grep <PV>` (eventually
disappears).
### B.5 Scale MySQL back up via Terraform
### 4. Verify restoration
```bash
cd stacks/dbaas && /home/wizard/code/infra/scripts/tg apply
```
# Check databases exist
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SHOW DATABASES;"
This recreates the PVC fresh (5Gi initial; pvc-autoresizer grows it
on demand) and starts a brand-new MySQL pod. The pod initializes an
empty datadir using `MYSQL_ROOT_PASSWORD` from the `cluster-secret`
K8s Secret — ~30s to ready.
# Check InnoDB Cluster status
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT * FROM performance_schema.replication_group_members;"
### B.6 Restore the full dump via a one-shot Job
```bash
cat <<'YAML' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: mysql-restore-$(date +%Y-%m-%d)
namespace: dbaas
spec:
ttlSecondsAfterFinished: 3600
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mysql:8.4.8
command: ["bash","-c"]
args:
- |
set -euo pipefail
gunzip -c /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | \
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD }
volumeMounts:
- { name: backup, mountPath: /backup, readOnly: true }
volumes:
- name: backup
persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
YAML
```
Watch progress: `kubectl -n dbaas logs -f job/<name>`. Takes ~3 min
for a 230MB compressed dump.
### B.7 Reset static MySQL users with passwords from Vault
**This step is mandatory.** `mysqldump` restores rows in `mysql.user`
verbatim, including password hashes. But `null_resource.mysql_static_user`
in Terraform writes the **current Vault password** to `forgejo` and
`roundcubemail` — and that current password rarely matches the dump's
hash. The apps will fail auth (forgejo logs `Error 1045 (28000): Access
denied for user 'forgejo'@'...'`) until you reset them.
```bash
FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
DROP USER IF EXISTS 'forgejo'@'%';
DROP USER IF EXISTS 'roundcubemail'@'%';
CREATE USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
CREATE USER 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
FLUSH PRIVILEGES;
SQL
```
`ALTER USER` sometimes hits `ERROR 1396 Operation ALTER USER failed`
on freshly-restored DBs (stale grant-table cache); `DROP USER` +
`CREATE USER` is the reliable form.
Vault-rotated app users (nextcloud, codimd, grafana, paperless,
phpipam, etc.) are managed by Vault DB engine and their dump password
already matches the live K8s secret, so they need no manual fixup.
### B.8 Restart MySQL-dependent apps
The dump restore brings MySQL up, but app pods still hold stale
connections (and forgejo has been crash-looping). Roll the
deployments to force fresh connections:
```bash
for ns_app in \
"forgejo:deploy/forgejo" \
"nextcloud:deploy/nextcloud" \
"hackmd:deploy/hackmd" \
"monitoring:deploy/grafana" \
"paperless-ngx:deploy/paperless-ngx" \
"uptime-kuma:deploy/uptime-kuma" \
"url:deploy/shlink" \
"realestate-crawler:deploy/realestate-crawler-api" \
"realestate-crawler:deploy/realestate-crawler-celery" \
"realestate-crawler:deploy/realestate-crawler-celery-beat" \
"realestate-crawler:deploy/realestate-crawler-ui"; do
ns=${ns_app%%:*}; app=${ns_app##*:}
kubectl -n "$ns" rollout restart "$app" &
# Check table counts for key databases
for db in speedtest wrongmove codimd nextcloud shlink grafana technitium; do
echo "=== $db ==="
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='$db' ORDER BY TABLE_ROWS DESC LIMIT 5;"
done
wait
```
If any deployments stay stuck in `ImagePullBackOff` (e.g.
`chrome-service`, `fire-planner`, `freedify`), those rely on the
Forgejo registry — once forgejo is back, just delete their pods to
force a fresh pull:
### 5. Verify application MySQL users exist
After any cluster rebuild or PVC recreation, the MySQL operator only recreates its own system users. Application users may be lost.
```bash
kubectl -n chrome-service delete pod --all
kubectl -n fire-planner delete pod --all
kubectl -n freedify delete pod --all
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Check all expected application users exist
kubectl exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"$ROOT_PWD" \
-e "SELECT user, host FROM mysql.user WHERE user IN ('nextcloud','forgejo','crowdsec','grafana','speedtest','wrongmove','codimd','shlink','technitium','uptimekuma');"
# If users are missing, force Vault to re-rotate their credentials:
# vault write -f database/rotate-role/mysql-<app>
# This will recreate the user with the correct password.
#
# For technitium specifically, also run the password sync CronJob:
# kubectl create job --from=cronjob/technitium-password-sync technitium-pw-resync -n technitium
#
# Note: forgejo and uptimekuma may be legacy users not managed by Vault rotation.
```
### B.9 Verify recovery
### 6. InnoDB Cluster Recovery
If the InnoDB Cluster itself is broken (not just data loss):
```bash
# Check cluster status via MySQL Shell
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster status
# Force rejoin a member
kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
```
## Restore Single Database (from per-db backup)
Per-database backups are stored at `/mnt/main/mysql-backup/per-db/<dbname>/` as gzipped SQL dumps.
### 1. List available per-db backups
```bash
ls -lt /mnt/main/mysql-backup/per-db/<dbname>/
```
### 2. Restore a single database
```bash
# Port-forward to MySQL
kubectl port-forward svc/mysql -n dbaas 3307:3306 &
ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Restore single database (this replaces only the target database)
zcat /path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 <dbname>
```
### 3. Verify
```bash
mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e \
"SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='<dbname>' ORDER BY TABLE_ROWS DESC LIMIT 10;"
```
### 4. Restart the affected service only
```bash
kubectl rollout restart deployment -n <namespace>
```
**Advantages over full restore**: Only the target database is affected. All other databases continue running with their current data.
## Alternative: Restore from sda Backup
If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
```bash
# All workloads ready
kubectl get deploy,sts -A -o json | jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | "\(.metadata.namespace)/\(.metadata.name)"'
# (empty output = healthy)
# 1. SSH to PVE host
ssh root@192.168.1.127
# Database integrity — table counts per schema
kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
-e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
WHERE table_schema NOT IN ('information_schema','performance_schema','sys') \
GROUP BY table_schema;"
# 2. Find the latest backup
ls -lt /mnt/backup/nfs-mirror/mysql-backup/
# Forgejo's registry catalog (catches the cascade alert)
kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe manual-postrestore-$(date +%s)
kubectl -n monitoring logs job/manual-postrestore-<timestamp> --tail=10
# Expect "Probe complete: 0 failures across N repos / M tags / K indexes"
# Cluster-health re-run
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
# Or mount sda backup on a pod:
kubectl run mysql-restore --rm -it --image=mysql \
--overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
-n dbaas
```
### B.10 Clean up failed CronJob pods from the outage window
## Alternative: Restore from Synology (if PVE host is down)
If the PVE host itself is unavailable:
```bash
kubectl delete pods -A --field-selector=status.phase=Failed
# 1. SSH to Synology NAS
ssh Administrator@192.168.1.13
# 2. Navigate to backup directory
cd /volume1/Backup/Viki/nfs/mysql-backup/
# 3. Copy dump to a temporary location accessible from cluster
# (e.g., via rsync to a surviving node, or restore PVE host first)
```
## Why the 8.4.9 upgrade got us — and the version pin
The MySQL 8.4.9 data-dictionary upgrade from 80408 → 80409 stalls
reliably on this hardware. ~24s of writes to `mysql.ibd` and the redo
log, then no further progress, no CPU, no completion. We bumped the
liveness probe to 600s (`initial_delay_seconds`) and still no
progress. Hypothesised root cause: `innodb_io_capacity=100` combined
with `innodb_page_cleaners=1` — the upgrade's spatial-reference-system
flush phase is IO-starved. **Don't retry 8.4.9 without first bumping
IO capacity and pinning a proper maintenance window.**
Until then, the StatefulSet pins to `mysql:8.4.8` exactly, not the
floating `mysql:8.4` tag. Keel will not silently bump it.
## See also
- `docs/runbooks/forgejo-registry-breakglass.md` — companion runbook
for when the cascade has reached the registry layer.
- Beads `code-eme8` / `code-k40p` — incident tracker entries (closed
in commit ea475c3d).
## Estimated Time
- Data restore: ~5 minutes (11MB dump)
- InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)

View file

@ -1,191 +0,0 @@
# Security Incident Response
What to do when a wave-1 security alert fires. Each alert links to a Loki query for investigation and concrete remediation steps.
**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
## General workflow
1. **Acknowledge in Alertmanager.** Silence only after triage starts.
2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
## Allowlist CIDRs
All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
- `192.168.1.0/24` — Proxmox + Sofia LAN
- K8s pod CIDR (verify at implementation time)
- K8s service CIDR
- Headscale tailnet
**Anything outside = alert.** No public-IP exceptions.
## Viktor's identity
`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
---
## K-alerts (K8s API audit)
### K2 — ServiceAccount token used from outside cluster
**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
```logql
{job="kube-audit"} | json | user_username =~ "system:serviceaccount:.*" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*"
```
**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
### K3 — Secret read in sensitive namespace by unexpected actor
**Meaning:** A Secret in `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an SA NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
```logql
{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
```
**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
### K4 — Exec into sensitive pod
**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
```logql
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
```
**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
### K5 — Mass delete
**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
```logql
sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
```
**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
### K6 — Audit policy modified
**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
### K7 — New ClusterRole with full wildcards
**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive.
```logql
{job="kube-audit"} | json | verb = "create" | objectRef_resource = "clusterroles" | requestObject_rules_0_verbs_0 = "*" | requestObject_rules_0_resources_0 = "*"
```
**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
### K8 — Anonymous binding
**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
### K9 — Viktor's identity from unexpected source IP
**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
```logql
{job="kube-audit"} | json | user_username = "me@viktorbarzin.me" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
```
**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
---
## V-alerts (Vault audit)
### V1 — Root token created
```logql
{job="vault-audit"} | json | request_path = "auth/token/create" | response_auth_policies = "root"
```
**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
### V2 — Audit device disabled/modified
**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
### V3 — Seal status changed
**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
### V4 — Policy modified
**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
### V5 — Auth failure spike
**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
### V6 — Token with policies different from parent
**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
### V7 — Viktor's Vault identity from unexpected source IP
**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
---
## S-alerts (Host)
### S1 — PVE sshd auth success from unexpected IP
```logql
{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
```
**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
---
## False-positive triage decision tree
```
Did the alert fire from a known operational event?
├─ Terraform apply at the same time? → likely V4 (policy modified)
├─ Keel auto-roll? → not a security path
├─ CI/CD pipeline running? → check V5 / K5
└─ Viktor doing recovery work? → K4, K9, S1 candidates
Extend allowlist if persistent
```
## Escalation
For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
2. Revoke all OIDC sessions in Authentik
3. Rotate Vault root keys + reseal
4. Restore from a pre-incident backup if data integrity is questionable
5. Post-mortem per `incident-response.md`
## Related
- [Security architecture](../architecture/security.md)
- [Monitoring architecture](../architecture/monitoring.md)
- [Incident response (general)](../architecture/incident-response.md)
- Beads epic: `code-8ywc`

View file

@ -67,44 +67,11 @@ runcmd:
- sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf
- systemctl restart systemd-journald
%{if is_k8s_template}
# Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight
# Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico,
# and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a
# 24h-soaked rolling window, gated by Prometheus alerts).
# Original outage (March 2026) was kernel update → containerd overlayfs corruption.
# Mitigations: 24h cool-down between node reboots, Prometheus halt-on-alert,
# apt-mark hold on k8s components, Package-Blacklist for runtime components.
- apt-get install -y unattended-upgrades update-notifier-common
- |
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'EOF'
Unattended-Upgrade::Allowed-Origins {
"$${distro_id}:$${distro_codename}";
"$${distro_id}:$${distro_codename}-security";
"$${distro_id}:$${distro_codename}-updates";
"$${distro_id}ESMApps:$${distro_codename}-apps-security";
"$${distro_id}ESM:$${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$$";
"^runc$$";
"^cri-tools$$";
"^kubernetes-cni$$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
EOF
- |
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
- systemctl unmask unattended-upgrades 2>/dev/null || true
- systemctl enable --now unattended-upgrades
# Disable unattended-upgrades to prevent unexpected kernel updates that can break containerd/kubelet
# (Root cause of 26h cluster outage: unattended-upgrades → kernel update → containerd failure)
- systemctl disable --now unattended-upgrades || true
- apt-get remove -y unattended-upgrades || true
- apt-mark hold kubelet kubeadm kubectl
- apt-mark hold containerd containerd.io runc 2>/dev/null || true
- systemctl stop kubelet
- containerd config default | sudo tee /etc/containerd/config.toml
- ${containerd_config_update_command}

View file

@ -192,9 +192,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi0" ? [1] : []
content {
disk {
storage = "local-lvm"
size = var.vm_disk_size
discard = true # Enable TRIM passthrough to LVM thin pool reduces CoW overhead
storage = "local-lvm"
size = var.vm_disk_size
discard = true # Enable TRIM passthrough to LVM thin pool reduces CoW overhead
}
}
}
@ -202,9 +202,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi1" ? [1] : []
content {
disk {
storage = "local-lvm"
size = var.vm_disk_size
discard = true
storage = "local-lvm"
size = var.vm_disk_size
discard = true
}
}
}

View file

@ -56,24 +56,8 @@ variable "image_tag" {
variable "replicas" {
type = number
default = null
description = "Optional replica count override. When null, defaults to 1 if shared_store_url is null and 2 otherwise. Capped at 2 — Redis can handle more but anti-affinity assumes ≤2 replicas per Anubis instance on a 5-node cluster."
validation {
condition = var.replicas == null || (var.replicas >= 1 && var.replicas <= 2)
error_message = "replicas must be 1 or 2 (or null to auto-pick from shared_store_url presence)."
}
}
variable "shared_store_url" {
type = string
default = null
description = "If set, Anubis stores in-flight challenge state in this Valkey/Redis-protocol URL instead of in-process memory, enabling HA across replicas. Format: redis://host:port/<db-index>. The DB index MUST be unique per Anubis instance (this module assumes 16 DBs available, common in standalone Redis). Cluster Redis is redis-master.redis.svc.cluster.local:6379 with HA via Sentinel + haproxy. Without this, replicas>1 causes ~50% PoW failures (challenge issued by pod A, solved against pod B → 500)."
validation {
condition = var.shared_store_url == null || can(regex("^redis://[a-zA-Z0-9_.-]+:[0-9]+/[0-9]+$", var.shared_store_url))
error_message = "shared_store_url must look like redis://host:port/<db-index> (explicit DB index required)."
}
default = 1
description = "Replica count. Default 1 because Anubis stores in-flight challenges in process memory — with N>1 a challenge issued by pod A and solved against pod B fails with `store: key not found` (HTTP 500). For HA, configure a shared store (Redis) and bump this. Per-pod 128Mi @ idle is cheap, single-pod restart is sub-second, so 1 is fine for content sites."
}
variable "memory" {
@ -104,21 +88,6 @@ locals {
"app.kubernetes.io/managed-by" = "terraform"
}
# Effective replicas: caller-override > shared-store-aware default.
effective_replicas = coalesce(var.replicas, var.shared_store_url == null ? 1 : 2)
# Anubis store config. With backend=valkey, multiple Anubis pods can share
# in-flight PoW state and a challenge issued by pod A is verifiable by pod
# B. Default backend is in-process memory which only works at replicas=1.
store_yaml_block = var.shared_store_url == null ? "" : <<-EOT
store:
backend: valkey
parameters:
url: "${var.shared_store_url}"
EOT
# Strict bot policy. Default Anubis policy only WEIGHs Mozilla|Opera UAs
# and lets unmatched UAs (curl, wget, Python-requests, scrapy, headless
# CLI scrapers) fall through to ALLOW. We import the same upstream
@ -126,8 +95,7 @@ locals {
# capability is filtered.
default_policy_yaml = <<-EOT
bots:
# Hard-deny known-bad bots first runs before the method bypass so
# a declared bad bot can't sneak through by sending a POST.
# Hard-deny known-bad bots first.
- import: (data)/bots/_deny-pathological.yaml
- import: (data)/bots/aggressive-brazilian-scrapers.yaml
# Hard-deny declared AI/LLM crawlers (ClaudeBot, GPTBot, Bytespider, ).
@ -139,29 +107,13 @@ locals {
# Allow /.well-known, /robots.txt, /favicon.*, /sitemap.xml keeps
# the internet working for benign crawlers and discovery clients.
- import: (data)/common/keep-internet-working.yaml
# Allow every non-GET request through. Rationale: AI scrapers steal
# the body of GETs (page content) they don't POST. State-mutating
# methods come from app XHRs (PrivateBin paste creation, Komga
# uploads, SPA actions) and CORS preflight (OPTIONS). Challenging
# those breaks the app, because the JS expects JSON and gets the
# Anubis HTML challenge page. CrowdSec + rate-limit + per-app auth
# already cover abuse on these methods.
- name: allow-non-get-methods
action: ALLOW
expression: method != "GET"
# Catch-all: every remaining (GET) request must solve the challenge.
# This closes the "unmatched UA falls through to ALLOW" gap that
# lets curl/wget/Python-requests scrape non-CDN-fronted hosts.
# Catch-all: every remaining request must solve the challenge. This
# closes the "unmatched UA falls through to ALLOW" gap that lets
# curl/wget/Python-requests scrape non-CDN-fronted hosts.
- name: catchall-challenge
path_regex: .*
action: CHALLENGE
EOT
# Final policy YAML: defaults (or caller override) plus an optional store
# block when shared_store_url is set. Store block is module-managed and
# appended universally callers passing a custom policy_yaml shouldn't
# include their own `store:` block (they would collide).
rendered_policy_yaml = "${coalesce(var.policy_yaml, local.default_policy_yaml)}${local.store_yaml_block}"
}
# Bot policy ConfigMap. Mounted into the pod and referenced by POLICY_FNAME.
@ -172,7 +124,7 @@ resource "kubernetes_config_map" "policy" {
labels = local.labels
}
data = {
"botPolicies.yaml" = local.rendered_policy_yaml
"botPolicies.yaml" = coalesce(var.policy_yaml, local.default_policy_yaml)
}
}
@ -216,7 +168,7 @@ resource "kubernetes_deployment" "anubis" {
}
spec {
replicas = local.effective_replicas
replicas = var.replicas
selector {
match_labels = { app = local.full_name }
@ -233,26 +185,14 @@ resource "kubernetes_deployment" "anubis" {
template {
metadata {
labels = local.labels
annotations = {
# Roll the deployment whenever the policy YAML changes Anubis
# reads the policy at startup, so a ConfigMap update alone
# doesn't take effect until pods restart.
"checksum/policy" = sha256(local.rendered_policy_yaml)
}
}
spec {
# Spread replicas across nodes to survive a single node failure.
# DoNotSchedule (not ScheduleAnyway) so 2 replicas are forced onto
# different hosts otherwise the scheduler may pile them on the
# same node and a single node reboot takes the whole Anubis instance
# down despite replicas=2. On a 5-node cluster the spread is always
# satisfiable; the worst case (4 nodes unavailable) leaves one
# replica Pending, but the other keeps serving.
topology_spread_constraint {
max_skew = 1
topology_key = "kubernetes.io/hostname"
when_unsatisfiable = "DoNotSchedule"
when_unsatisfiable = "ScheduleAnyway"
label_selector {
match_labels = { app = local.full_name }
}
@ -448,15 +388,7 @@ resource "kubernetes_pod_disruption_budget_v1" "anubis" {
namespace = var.namespace
}
spec {
# max_unavailable=1 means: at most one pod can be voluntarily disrupted
# at a time. With replicas=2 this allows clean rolling drains (one pod
# goes down other serves traffic first recreates elsewhere). With
# replicas=1 (no shared store) this is functionally equivalent to no
# PDB drain proceeds, brief outage, new pod schedules elsewhere.
# Was min_available=1 before 2026-05-16 which deadlocked drains on
# single-replica instances (eviction API can never satisfy the
# constraint at replicas=1). See PM-2026-05-11.
max_unavailable = "1"
min_available = "1"
selector {
match_labels = { app = local.full_name }
}

View file

@ -31,53 +31,9 @@ variable "tls_secret_name" {}
variable "backend_protocol" {
default = "HTTP"
}
variable "auth" {
type = string
default = "required"
description = <<-EOT
Auth posture for this ingress. Pick by asking "what gates the app?":
* "required" (default, fail-closed): Authentik forward-auth gates every
request. Pick this when the backend has NO built-in user auth and
Authentik is the only thing standing between strangers and the app.
Examples: prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any
admin UI shipped without its own login.
* "app": the backend handles its own user authentication (NextAuth,
Django sessions, OAuth, bearer-token API, etc.) and Authentik would
only get in the way. No Authentik middleware is attached; the app's
own login is the gate. Examples: immich, linkwarden, tandoor,
freshrss, affine, actualbudget, audiobookshelf, novelapp.
**Functionally identical to "none"** the distinct name exists to
record intent at the call site so future readers don't have to guess.
* "public": Authentik anonymous binding via the `public` outpost.
Strangers are auto-bound to the `guest` Authentik user; logged-in
users keep their identity in X-authentik-username. Only works for
top-level browser navigation CORS preflight rejects XHR/fetch and
automation can't replay the cookie dance. Audit trail, not a gate.
* "none": no Authentik middleware, no own-auth claim explicitly
public or unauthenticated-by-design. Use for: Anubis-fronted content
sites (where Anubis is the gate), native-client APIs that auth
themselves (Git, /v2/, WebDAV/CalDAV, CardDAV), webhook receivers,
OAuth callbacks, and Authentik outposts themselves.
**Anti-exposure rule** (the reason "app" exists as a distinct mode):
only pick "app" or "none" AFTER you have verified the app has its own
user auth (for "app") OR the endpoint is intentionally public (for
"none"). Picking either of these on a naked admin UI exposes it to the
internet. The default is "required" specifically so accidental omission
fails closed.
**Convention**: when using "app" or "none", add a comment line above
the `auth = "..."` line stating what gates the app or why it's public.
Future-you reads the call site, not the module description.
EOT
validation {
condition = contains(["required", "app", "public", "none"], var.auth)
error_message = "auth must be one of: required, app, public, none."
}
variable "protected" {
type = bool
default = false
}
variable "ingress_path" {
type = list(string)
@ -186,23 +142,8 @@ variable "homepage_enabled" {
}
locals {
effective_host = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
# Anti-AI default: ON when no Authentik auth fronts the ingress (auth =
# "none" or auth = "app" either the app gates users itself or the site
# is intentionally public). When Authentik gates the request
# (required/public), the auth flow already discourages bots.
effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : (var.auth == "none" || var.auth == "app")
# Auth middleware selection. "app" and "none" both attach no Authentik
# middleware "app" signals "the backend has its own user auth", "none"
# signals "intentionally public / native-client API / webhook". The
# distinction lives at the call site for human readers; the runtime
# effect is identical.
auth_middleware = (
var.auth == "required" ? "traefik-authentik-forward-auth@kubernetescrd" :
var.auth == "public" ? "traefik-authentik-forward-auth-public@kubernetescrd" :
null
)
effective_host = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : !var.protected
# External monitor enabled by default when the ingress has a public DNS
# record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
@ -313,7 +254,7 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null,
local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null,
local.auth_middleware,
var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,

View file

@ -1,124 +0,0 @@
#!/usr/bin/env python3
"""Enforce the inline-comment convention for ingress_factory auth tiers.
Every `auth = "app"` or `auth = "none"` line under a stack must have an
immediately-preceding comment block containing `# auth = "<tier>":`
that documents what gates the app (for "app") or why the endpoint is
intentionally public (for "none").
This is the static guard for the anti-exposure rule documented in
`infra/.claude/CLAUDE.md` "Auth" section. It's invoked by `scripts/tg`
before every plan/apply/destroy/refresh, so it fires regardless of who
or what is running terragrunt local laptop, CI, headless agent.
Stack-scoped by design: only checks the .tf files under the stack
being acted on. Other stacks' historical violations don't block work
on the current stack; each stack documents itself the next time it's
edited.
Usage:
check-ingress-auth-comments.py <stack-path> # scan one stack
check-ingress-auth-comments.py --all # scan every stack
"""
import argparse
import os
import re
import sys
AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"\s*$')
COMMENT_LINE = re.compile(r'^\s*#')
COMMENT_TIER = re.compile(r'auth\s*=\s*"(app|none)"')
def scan_dir(path):
violations = []
for root, _, files in os.walk(path):
for f in files:
if not f.endswith('.tf'):
continue
full = os.path.join(root, f)
try:
with open(full) as fh:
lines = fh.readlines()
except OSError:
continue
for i, line in enumerate(lines):
m = AUTH_LINE.match(line)
if not m:
continue
tier = m.group(1)
# Walk backwards through contiguous comment lines.
# Pass if ANY of them documents the matching tier.
ok = False
j = i - 1
while j >= 0 and COMMENT_LINE.match(lines[j]):
cm = COMMENT_TIER.search(lines[j])
if cm and cm.group(1) == tier:
ok = True
break
j -= 1
if not ok:
violations.append((full, i + 1, tier))
return violations
def main():
ap = argparse.ArgumentParser(description=__doc__.splitlines()[0])
g = ap.add_mutually_exclusive_group(required=True)
g.add_argument('path', nargs='?', help='Stack directory to scan')
g.add_argument('--all', action='store_true', help='Scan every stack under stacks/')
args = ap.parse_args()
if args.all:
scan_paths = ['stacks']
else:
if not os.path.isdir(args.path):
print(f"ERROR: {args.path} is not a directory", file=sys.stderr)
sys.exit(2)
scan_paths = [args.path]
violations = []
for p in scan_paths:
violations.extend(scan_dir(p))
if not violations:
return
print(
"\n"
"==============================================================\n"
"ingress_factory auth-comment convention violated\n"
"==============================================================\n"
"\n"
"Every `auth = \"app\"` or `auth = \"none\"` line must have a\n"
"preceding comment line documenting what gates the app (for\n"
"\"app\") or why the endpoint is intentionally public (for\n"
"\"none\"). This guard prevents accidentally exposing private\n"
"services. See infra/.claude/CLAUDE.md Auth section.\n"
"\n"
"Add a comment line directly above the auth line:\n"
"\n"
" # auth = \"app\": <what gates the app, e.g. NextAuth + OAuth>\n"
" auth = \"app\"\n"
"\n"
"or:\n"
"\n"
" # auth = \"none\": <why public, e.g. webhook receiver, CalDAV>\n"
" auth = \"none\"\n"
"\n"
"Violations:",
file=sys.stderr,
)
for path, line_no, tier in violations:
print(
f" {path}:{line_no}: auth = \"{tier}\" missing preceding "
f"`# auth = \"{tier}\":` comment",
file=sys.stderr,
)
print(file=sys.stderr)
sys.exit(1)
if __name__ == '__main__':
main()

View file

@ -23,11 +23,10 @@ FAIL_COUNT=0
FIX=false
QUIET=false
JSON=false
KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=44
TOTAL_CHECKS=42
# --- Helpers ---
info() { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
@ -196,19 +195,6 @@ check_pods() {
section 4 "Problematic Pods"
local bad count detail="" status="PASS"
# Skip pods owned by Jobs (which are owned by CronJobs). A failed CronJob
# retry isn't a problematic pod — the next CronJob fire will replace it.
# Real problems are deployments / statefulsets / daemonsets in trouble.
local job_owned_pods
job_owned_pods=$($KUBECTL get pods -A -o json 2>/dev/null | python3 -c '
import json, sys
d = json.load(sys.stdin)
for p in d["items"]:
owners = p["metadata"].get("ownerReferences", [])
if any(o.get("kind") == "Job" for o in owners):
print(f"{p[\"metadata\"][\"namespace\"]} {p[\"metadata\"][\"name\"]}")
' 2>/dev/null || true)
bad=$( {
$KUBECTL get pods -A --no-headers --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null \
| grep -E 'CrashLoopBackOff|Error|Pending|Init:|ImagePullBackOff|ErrImagePull' || true
@ -216,14 +202,6 @@ for p in d["items"]:
| grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull' || true
} | awk '!seen[$1,$2]++' | sed '/^$/d') || true
# Filter out Job-owned pods
if [[ -n "$job_owned_pods" && -n "$bad" ]]; then
bad=$(echo "$bad" | awk -v jp="$job_owned_pods" '
BEGIN { n = split(jp, lines, "\n"); for (i=1;i<=n;i++) skip[lines[i]] = 1 }
{ key = $1 " " $2; if (!(key in skip)) print }
')
fi
count=$(count_lines "$bad")
if [[ "$count" -eq 0 ]]; then
@ -250,21 +228,7 @@ check_evicted() {
section 5 "Evicted/Failed Pods"
local evicted count detail="" status="PASS"
# Exclude pods owned by Jobs — those are CronJob retries that K8s leaves
# behind for log inspection. They're not "evicted" in the cluster-health
# sense and the next CronJob fire replaces them.
evicted=$($KUBECTL get pods -A -o json --field-selector=status.phase=Failed 2>/dev/null | python3 -c '
import json, sys
try:
d = json.load(sys.stdin)
except Exception:
sys.exit(0)
for p in d.get("items", []):
owners = p["metadata"].get("ownerReferences", [])
if any(o.get("kind") == "Job" for o in owners):
continue
print(f"{p[\"metadata\"][\"namespace\"]}\t{p[\"metadata\"][\"name\"]}\t{p.get(\"status\",{}).get(\"reason\",\"\")}")
' 2>/dev/null || true)
evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
count=$(count_lines "$evicted")
if [[ "$count" -eq 0 ]]; then
@ -575,25 +539,18 @@ check_alerts() {
return 0
fi
# Only count warning + critical alerts. Info-level alerts (RecentNodeReboot,
# PVAutoExpanding, etc.) are informational by design and shouldn't be
# treated as a script-level WARN — the alert rules themselves already
# encode the severity.
firing_count=$(echo "$alerts" | python3 -c '
import json, sys
ACTIONABLE = {"warning", "critical"}
def actionable(labels):
return labels.get("severity", "info").lower() in ACTIONABLE
try:
data = json.load(sys.stdin)
if isinstance(data, list):
active = [a for a in data if a.get("status", {}).get("state") == "active" and actionable(a.get("labels", {}))]
active = [a for a in data if a.get("status", {}).get("state") == "active"]
count = len(active)
names = [a.get("labels", {}).get("alertname", "?") for a in active]
print(f"{count}:" + ",".join(names) if count > 0 else "0:")
elif isinstance(data, dict) and "data" in data:
alerts_list = data["data"].get("alerts", [])
firing = [a for a in alerts_list if a.get("state") == "firing" and actionable(a.get("labels", {}))]
firing = [a for a in alerts_list if a.get("state") == "firing"]
count = len(firing)
names = [a.get("labels", {}).get("alertname", "?") for a in firing]
print(f"{count}:" + ",".join(names) if count > 0 else "0:")
@ -641,55 +598,17 @@ check_uptime_kuma() {
return 0
fi
# Connect via kubectl port-forward to the internal Service. The public
# URL (uptime.viktorbarzin.me) is behind Authentik forward-auth, which
# 302-redirects the Socket.IO handshake the library uses — there's no
# way for an unauthenticated script to complete the OAuth dance.
# Port-forward gives us a direct path to the in-cluster ClusterIP
# service and works from any host with kubectl access.
local pf_port=18444 pf_pid
$KUBECTL port-forward -n uptime-kuma svc/uptime-kuma "$pf_port:80" >/dev/null 2>&1 &
pf_pid=$!
# Detach from job control so bash doesn't print "Killed" to stderr
# when we SIGKILL the port-forward at the end of this check — that
# message corrupts stdout when stderr is merged for JSON parsing.
disown "$pf_pid" 2>/dev/null || true
# Wait up to 5s for the local listener to come up.
local i
for i in 1 2 3 4 5; do
if (echo >"/dev/tcp/127.0.0.1/$pf_port") 2>/dev/null; then break; fi
sleep 1
done
result=$(UPTIME_KUMA_PASSWORD="$uk_pass" UK_URL="http://127.0.0.1:$pf_port" \
~/.venvs/claude/bin/python3 -c '
import sys, os, time
result=$(UPTIME_KUMA_PASSWORD="$uk_pass" ~/.venvs/claude/bin/python3 -c '
import sys, os
try:
from uptime_kuma_api import UptimeKumaApi
except ImportError:
print("ERROR:uptime-kuma-api not installed")
sys.exit(0)
# Retry up to 3 times — the Socket.IO handshake is occasionally flaky
# even against the internal service during cluster churn.
last_exc = None
api = None
for attempt in range(3):
try:
api = UptimeKumaApi(os.environ["UK_URL"], timeout=120, wait_events=0.2)
api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
break
except Exception as e:
last_exc = e
try: api.disconnect()
except Exception: pass
api = None
time.sleep(2 * (attempt + 1))
if api is None:
print(f"CONN_ERROR:{last_exc}")
sys.exit(0)
try:
api = UptimeKumaApi("https://uptime.viktorbarzin.me", timeout=120, wait_events=0.2)
api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
monitors = api.get_monitors()
heartbeats = api.get_heartbeats()
@ -744,13 +663,6 @@ except Exception as e:
print(f"CONN_ERROR:{e}")
' 2>/dev/null) || result="CONN_ERROR:python execution failed"
# Always tear down the port-forward. Use SIGKILL directly — kubectl
# port-forward sometimes ignores SIGTERM during teardown and we don't
# need a graceful exit for a localhost listener. Skip `wait` because
# in `set -m` mode the backgrounded child may not be reapable here,
# causing the script to hang indefinitely; the shell reaps it on exit.
kill -9 "$pf_pid" 2>/dev/null || true
if [[ "$result" == "ERROR:"* ]]; then
[[ "$QUIET" == true ]] && section_always 14 "Uptime Kuma Monitors"
warn "Uptime Kuma: ${result#ERROR:}"
@ -1162,14 +1074,9 @@ for item in data.get("items", []):
expiry = datetime.strptime(date_str.strip(), "%b %d %H:%M:%S %Y %Z")
expiry = expiry.replace(tzinfo=timezone.utc)
days_left = (expiry - datetime.now(timezone.utc)).days
# Threshold rationale (lowered from 30d):
# - cnpg-webhook-cert: CNPG operator auto-rotates at 7d before expiry
# - kyverno-*-tls-pair: Kyverno auto-rotates at 15d before expiry
# - viktorbarzin.me Lets Encrypt wildcard: renewed weekly via Woodpecker
# Anything still <14d at check time is genuinely worth surfacing.
if days_left <= 7:
print(f"FAIL:{ns}/{name}:{days_left}d")
elif days_left <= 14:
elif days_left <= 30:
print(f"WARN:{ns}/{name}:{days_left}d")
except ValueError:
pass
@ -1178,8 +1085,8 @@ for item in data.get("items", []):
' 2>/dev/null) || true
if [[ -z "$cert_issues" ]]; then
pass "All TLS certificates valid for >14 days"
json_add "tls_certs" "PASS" "All valid >14d"
pass "All TLS certificates valid for >30 days"
json_add "tls_certs" "PASS" "All valid >30d"
else
[[ "$QUIET" == true ]] && section_always 22 "TLS Certificate Expiry"
while IFS= read -r line; do
@ -1425,59 +1332,12 @@ check_ha_entities() {
local result
result=$(export HA_CACHE_DIR; python3 << 'PYEOF'
import os, json
from datetime import datetime, timezone, timedelta
# Noise filter rationale:
# * The HA "unavailable" state covers everything from "the iDRAC scrape failed
# 30 seconds ago" to "this iPhone hasn't checked in in 6 hours" to
# "this YAML rest sensor has been broken for a week". Counting all of them
# produces 400+ alerts that are mostly expected (phones in standby, lights
# off, TVs idle).
# * Three filters dramatically cut noise without hiding real outages:
# 1. SKIP_DOMAINS — domains that go unavailable transiently by design
# (mobile_app on backgrounded apps, notify per-device, button/scene/
# event are momentary).
# 2. STALE_HOURS — only count entities that have been unavailable for
# this long. A flapping integration that recovers in <24h is noise;
# one stuck for >24h is real.
# 3. SKIP_DEVICE_HINTS — friendly-name substrings for things that come
# and go (laptops, phones, TVs, vacuums, washers).
SKIP_DOMAINS = {"mobile_app", "device_tracker", "notify", "button", "scene",
"event", "image", "update"}
SKIP_DEVICE_HINTS = ("iphone", "ipad", "macbook", "mac mini", "tv", "bravia",
"playstation", "switch", "roomba", "vacuum", "rumi",
"ipad", "laptop", "phone", "перална", "сушилня",
"миялна", "laptop2")
STALE_HOURS = 24
cache = os.environ["HA_CACHE_DIR"]
with open(f"{cache}/states.json") as f:
states = json.load(f)
now = datetime.now(timezone.utc)
threshold = now - timedelta(hours=STALE_HOURS)
def is_stale(s):
if s.get("state") not in ("unavailable", "unknown"):
return False
domain = s["entity_id"].split(".")[0]
if domain in SKIP_DOMAINS:
return False
name = (s.get("attributes", {}).get("friendly_name") or "").lower()
if any(h in name for h in SKIP_DEVICE_HINTS):
return False
# last_changed = when the state last flipped. If it flipped to unavailable
# >24h ago and stayed there, the integration is genuinely broken.
lc = s.get("last_changed") or s.get("last_updated")
if not lc:
return True # no timestamp = treat as old
try:
dt = datetime.fromisoformat(lc.replace("Z", "+00:00"))
except ValueError:
return True
return dt < threshold
unavail = [s for s in states if is_stale(s)]
unavail = [s for s in states if s.get("state") in ("unavailable", "unknown")]
domains = {}
for s in unavail:
d = s["entity_id"].split(".")[0]
@ -1636,42 +1496,24 @@ with open(f"{cache}/states.json") as f:
autos = [s for s in states if s["entity_id"].startswith("automation.")]
total = len(autos)
# Noise filter rationale (was: any disabled OR not-triggered-in-30d):
# * "Disabled" alone is fine — Viktor disables automations intentionally
# (seasonal, holiday-only, paused). Only flag when ABANDONED, i.e.
# disabled for >180 days AND never triggered recently.
# * "Stale" alone is fine for low-frequency automations (annual reminders,
# manual triggers). Raise the bar to 180d (was 30d).
DISABLED_STALE_DAYS = 180
STALE_DAYS = 180
disabled = [a["entity_id"] for a in autos if a["state"] == "off"]
disabled_count = len(disabled)
now = datetime.now(timezone.utc)
def days_since(ts):
if not ts:
return None
try:
return (now - datetime.fromisoformat(ts.replace("Z", "+00:00"))).days
except Exception:
return None
disabled = []
stale = []
for a in autos:
lt_days = days_since(a.get("attributes", {}).get("last_triggered"))
changed_days = days_since(a.get("last_changed"))
if a["state"] == "off":
# Only flag a disabled automation if it has ALSO been untouched for
# the threshold — i.e. genuinely abandoned, not "paused for now".
# Use last_changed as a proxy for "user-touched recently".
if changed_days is None or changed_days > DISABLED_STALE_DAYS:
disabled.append(a["entity_id"])
else:
if lt_days is not None and lt_days > STALE_DAYS:
stale.append(f"{a['entity_id']}={lt_days}d")
continue
lt = a.get("attributes", {}).get("last_triggered")
if lt:
try:
t = datetime.fromisoformat(lt.replace("Z", "+00:00"))
days = (now - t).days
if days > 30:
stale.append(a["entity_id"] + "=" + str(days) + "d")
except:
pass
disabled_count = len(disabled)
stale_count = len(stale)
disabled_names = "; ".join(disabled)
stale_names = "; ".join(stale[:10])
@ -2465,107 +2307,6 @@ except Exception as e:
}
# --- 42. External Reachability: Traefik 5xx Rate ---
check_pve_thermals() {
section 43 "PVE Host Thermals — Xeon E5-2699v4 package + per-core temps"
local raw status="PASS"
# Read all hwmon temp inputs in one SSH round-trip. Output: one line per
# sensor, "<sensor_label> <celsius>". Falls back gracefully on missing
# labels (Xeon coretemp driver exposes both `Package id 0` and `Core N`).
raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 '
cd /sys/class/hwmon/hwmon0 2>/dev/null || exit 1
for tfile in temp*_input; do
[[ -e "$tfile" ]] || continue
base=${tfile%_input}
label=$(cat "${base}_label" 2>/dev/null || echo "$base")
val=$(cat "$tfile" 2>/dev/null)
[[ -n "$val" ]] && echo "$label $((val/1000))"
done
' 2>/dev/null || true)
if [[ -z "$raw" ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "Could not read hwmon temps from 192.168.1.127 (SSH BatchMode failed or path missing)"
json_add "pve_thermals" "WARN" "SSH failed or hwmon path missing"
return 0
fi
local pkg_temp max_core_temp max_core_label
pkg_temp=$(echo "$raw" | awk '/^Package id/{print $NF; exit}')
max_core_temp=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print m}')
max_core_label=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print lbl}')
# Healthy baseline for this R730 (verified Apr 20-May 8 2026 from
# Prometheus): peak 61-69°C, avg 51-55°C. Treat anything above 65°C
# as a signal that some VM/workload is using too much CPU and warrants
# investigation, even though the Xeon E5-2699v4 has TjMax=83°C /
# Tcrit=93°C. This catches load creep early, well before throttling.
# PASS < 65°C package (within baseline 55-65 °C band)
# WARN 65-82°C package (elevated — investigate top CPU consumer)
# FAIL >= 83°C package (at/above TjMax — throttling imminent)
local detail="package=${pkg_temp}°C max_core=${max_core_temp}°C (${max_core_label})"
if [[ -z "$pkg_temp" ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "Package temp not found in hwmon output"
json_add "pve_thermals" "WARN" "$detail"
elif [[ "$pkg_temp" -ge 83 ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
fail "PVE package temp ${pkg_temp}°C >= TjMax (83°C) — throttling imminent. $detail"
json_add "pve_thermals" "FAIL" "$detail"
status="FAIL"
elif [[ "$pkg_temp" -ge 65 ]]; then
[[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
warn "PVE package temp ${pkg_temp}°C above baseline (>65°C) — some VM is using too much CPU; check top kvm processes. $detail"
json_add "pve_thermals" "WARN" "$detail"
else
pass "PVE package ${pkg_temp}°C, hottest core ${max_core_temp}°C (${max_core_label}) — within 55-65°C baseline"
json_add "pve_thermals" "PASS" "$detail"
fi
}
check_pve_load() {
section 44 "PVE Host Load — load avg vs 44-thread capacity"
local raw load_1 load_5 load_15
raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
root@192.168.1.127 'cat /proc/loadavg' 2>/dev/null || true)
if [[ -z "$raw" ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
warn "Could not read /proc/loadavg from 192.168.1.127"
json_add "pve_load" "WARN" "SSH failed"
return 0
fi
load_1=$(echo "$raw" | awk '{print $1}')
load_5=$(echo "$raw" | awk '{print $2}')
load_15=$(echo "$raw" | awk '{print $3}')
# Round load_5 down for integer comparison (avoid bc dep)
local load_5_int
load_5_int=$(printf '%.0f' "$load_5")
# R730: 44 hw threads (22c × HT). Healthy avg ~ 15-22 (~30-50% utilisation
# of thread count). Warn when sustained 5-min above 30 (~70% threads
# busy). Fail when 5-min above 38 (~85% — close to scheduler saturation).
# PASS load_5 < 30
# WARN 30 <= load_5 < 38
# FAIL load_5 >= 38
local detail="1m=${load_1} 5m=${load_5} 15m=${load_15}"
if [[ "$load_5_int" -ge 38 ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
fail "PVE 5-min load ${load_5} >= 38 of 44 threads — saturation. $detail"
json_add "pve_load" "FAIL" "$detail"
elif [[ "$load_5_int" -ge 30 ]]; then
[[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
warn "PVE 5-min load ${load_5} in warn band (30-37 of 44 threads). $detail"
json_add "pve_load" "WARN" "$detail"
else
pass "PVE load avg $detail (< 30/44 threads)"
json_add "pve_load" "PASS" "$detail"
fi
}
check_external_traefik_5xx() {
section 42 "External — Traefik 5xx Rate (15m)"
local query_result detail="" status="PASS"
@ -2722,8 +2463,6 @@ main() {
check_monitoring_css
check_external_replicas
check_external_divergence
check_pve_thermals
check_pve_load
check_external_traefik_5xx
print_summary

View file

@ -207,15 +207,7 @@ else
dst="${BACKUP_ROOT}/pvc-data/${WEEK}/${ns_pvc}"
mkdir -p "${dst}"
rsync_rc=0
# Per-PVC rsync timeout (30 min). Without this, a single hung
# PVC blocks the entire backup until systemd's TimeoutStartSec
# kills the script (4h ceiling), leaving every later PVC
# unbacked and silently triggering WeeklyBackupFailing. Picked
# 30 min as well above the largest PVC's normal copy time
# (immich-postgres ~10 GiB, ~3 min on local ext4) and well
# below the unit-level budget so we still have headroom to
# finish the rest.
timeout 1800 rsync -az --delete \
rsync -az --delete \
${PREV:+--link-dest="${PREV}/${ns_pvc}/"} \
"${PVC_MOUNT}/" "${dst}/" 2>&1 || rsync_rc=$?
if [ "$rsync_rc" -eq 0 ]; then
@ -225,12 +217,6 @@ else
# (in-flight writes have corrupt metadata from skipped journal replay)
PVC_COUNT=$((PVC_COUNT + 1))
log " partial rsync (LUKS noload) for ${ns_pvc} — OK"
elif [ "$rsync_rc" -eq 124 ]; then
# `timeout` exit 124 = wall-clock killed the rsync. Track
# separately so the next run still produces a metric and
# doesn't pretend nothing happened.
warn "rsync timed out for ${ns_pvc} after 30 min — moving on"
PVC_FAIL=$((PVC_FAIL + 1))
else
warn "rsync failed for ${ns_pvc} (rc=$rsync_rc)"
PVC_FAIL=$((PVC_FAIL + 1))
@ -246,11 +232,7 @@ else
relpath="${dbfile#${PVC_MOUNT}/}"
dest_file="${BACKUP_ROOT}/sqlite-backup/${WEEK}/${ns_pvc}/${relpath}"
mkdir -p "$(dirname "${dest_file}")"
# 5-min sqlite timeout — same hang-prevention idea
# as rsync above. A corrupted SQLite or one held
# open by a writer in the snapshot can otherwise
# block .backup indefinitely.
if timeout 300 sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
if sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
log " SQLite: ${ns_pvc}/${relpath}"
else
cp "${dbfile}" "${dest_file}" 2>/dev/null || true
@ -344,7 +326,7 @@ fi
# ============================================================
log "--- Step 4: PVE host config ---"
mkdir -p "${BACKUP_ROOT}/pve-config/scripts"
timeout 300 rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
for script in /usr/local/bin/lvm-pvc-snapshot /usr/local/bin/daily-backup /usr/local/bin/offsite-sync-backup; do
[ -f "${script}" ] && cp "${script}" "${BACKUP_ROOT}/pve-config/scripts/" 2>/dev/null || true
done

View file

@ -102,30 +102,6 @@ for arg in "$@"; do
esac
done
# Detect if this is a plan/apply/destroy/refresh — anything that reads or
# writes infra state. Cheap pre-flight check below scans only the current
# stack's .tf files for the ingress_factory auth-comment convention. Other
# tg verbs (init, fmt, validate) skip the check.
is_tf_op=false
for arg in "$@"; do
case "$arg" in
plan|apply|destroy|refresh) is_tf_op=true ;;
esac
done
# Anti-exposure guard: every `auth = "app"` or `auth = "none"` in this stack
# must have a preceding `# auth = "<tier>":` comment documenting what gates
# the app or why the endpoint is intentionally public. See:
# - infra/modules/kubernetes/ingress_factory/main.tf (variable description)
# - infra/.claude/CLAUDE.md "Auth" section
# Stack-scoped: untouched stacks aren't blocked from future applies until
# they're actually edited, at which point the convention applies.
if $is_tf_op && [ -n "$STACK_NAME" ]; then
if ! "$REPO_ROOT/scripts/check-ingress-auth-comments.py" "$REPO_ROOT/stacks/$STACK_NAME"; then
exit 1
fi
fi
# Acquire lock for mutating operations (Tier 0 only — Tier 1 uses pg_advisory_lock)
if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
if command -v vault &>/dev/null && [ -n "${VAULT_TOKEN:-}" ]; then

View file

@ -1,114 +1,36 @@
#!/usr/bin/env bash
#
# K8s component upgrader. Run on a single node (master OR worker) at a time.
# The caller is responsible for:
# - draining + uncordoning the node (this script does not touch kubectl)
# - sequencing nodes (master first, then workers one at a time)
# - pre-flight checks (etcd snapshot, halt-on-alert, etc)
#
# Used by:
# - the k8s-version-upgrade agent (infra/.claude/agents/k8s-version-upgrade.md)
# - manual operators following the runbook (infra/docs/runbooks/k8s-version-upgrade.md)
#
# Old manual orchestration loop (kept for reference — the agent does the
# equivalent now):
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do
# kb drain $n --ignore-daemonsets --delete-emptydir-data
# s wizard@$n 'bash -s' < update_k8s.sh --role worker --release 1.34.5
# kb uncordon $n
# done
set -euo pipefail
# run for all nodes using :
# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do echo $n; kb drain $n --ignore-daemonsets --delete-emptydir-data; s wizard@$n 'bash -s' <update_k8s.sh; kb uncordon $n; done
ROLE=""
RELEASE=""
set -e
export stable_version='1.34' # change me
export release="$stable_version.2" # change me
usage() {
cat <<EOF
Usage: $0 --role <master|worker> --release <X.Y.Z>
echo "Upgrading to $stable_version"
--role master|worker (required)
--release kubeadm/kubelet/kubectl target patch version, e.g. 1.34.5
Behavior:
- Rewrites /etc/apt/sources.list.d/kubernetes.list to the v\$MINOR/deb repo
derived from --release (so a 1.34.x release uses v1.34/deb, 1.35.x uses
v1.35/deb, etc).
- apt-get install kubeadm=<release>-* (apt-mark unhold first).
- master: kubeadm upgrade plan && kubeadm upgrade apply v<release> -y
- worker: kubeadm upgrade node
- apt-get install kubelet=<release>-* kubectl=<release>-* then re-hold.
- systemctl daemon-reload && systemctl restart kubelet
EOF
}
while [[ $# -gt 0 ]]; do
case "$1" in
--role) ROLE="$2"; shift 2;;
--release) RELEASE="$2"; shift 2;;
-h|--help) usage; exit 0;;
*) echo "Unknown arg: $1" >&2; usage; exit 2;;
esac
done
if [[ -z "$ROLE" || -z "$RELEASE" ]]; then
echo "ERROR: --role and --release are required" >&2
usage
exit 2
fi
if [[ "$ROLE" != "master" && "$ROLE" != "worker" ]]; then
echo "ERROR: --role must be 'master' or 'worker' (got: $ROLE)" >&2
exit 2
fi
# Derive minor track (e.g. 1.34.5 → 1.34)
STABLE_VERSION="$(echo "$RELEASE" | awk -F. '{print $1"."$2}')"
echo "==> Upgrading $(hostname) ($ROLE) to v$RELEASE (track v$STABLE_VERSION)"
# Apt repo URL is pinned per minor track. Rewrite + re-import the signing key
# every run — cheap, idempotent, and handles the minor-bump case where the
# old track's repo no longer carries the target version.
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/ /" \
| sudo tee /etc/apt/sources.list.d/kubernetes.list
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo mkdir -p /etc/apt/keyrings
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/Release.key" \
| sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/Release.key" | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
sudo apt-mark unhold kubeadm kubelet kubectl
sudo apt-get update
sudo apt-get install -y "kubeadm=$RELEASE-*"
sudo apt-get update
sudo apt-get install -y kubeadm="$release-*"
if [[ "$ROLE" == "master" ]]; then
echo "==> Master path: kubeadm upgrade plan + apply"
sudo kubeadm upgrade plan
# The first apply may fail with "static Pod hash for component <X> did
# not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload
# a static pod is too tight on our cluster (apiserver-to-kubelet status
# sync latency post-master-reboot can exceed it). The etcd image IS
# actually updated by then, so a 2nd attempt sees etcd already on
# target and skips it. Up to 3 attempts with a 30s delay between.
attempt=1
while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
if (( attempt >= 3 )); then
echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
exit 1
fi
echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
sleep 30
attempt=$(( attempt + 1 ))
done
echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
HOSTNAME=$(hostname)
SEARCH_STR="master"
if [[ "$HOSTNAME" == *"$SEARCH_STR"* ]]; then
echo "Upgrading master"
sudo kubeadm upgrade plan && sudo kubeadm upgrade apply v$release -y
else
echo "==> Worker path: kubeadm upgrade node"
sudo kubeadm upgrade node
echo "Upgrading worker"
sudo kubeadm upgrade node
fi
sudo apt-get install -y "kubelet=$RELEASE-*" "kubectl=$RELEASE-*"
sudo apt-get install -y kubelet="$release-*" kubectl="$release-*"
sudo apt-mark hold kubeadm kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet
echo "==> Done: $(hostname) is on v$RELEASE"

View file

@ -1,14 +1,8 @@
#!/usr/bin/env bash
#
# OS-major upgrade (Ubuntu do-release-upgrade). NOT in the auto-upgrade
# pipeline — minor apt patches are handled by unattended-upgrades + kured;
# K8s component bumps are handled by the k8s-version-upgrade agent. Run this
# script manually when bumping Ubuntu LTS major versions.
#
# See:
# - infra/docs/runbooks/k8s-node-auto-upgrades.md (apt + reboot)
# - infra/docs/runbooks/k8s-version-upgrade.md (kubeadm/kubelet/kubectl)
# sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
sudo do-release-upgrade
sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y

View file

@ -1,619 +0,0 @@
#!/usr/bin/env bash
#
# upgrade_state.sh — survey the three autonomous-upgrade pipelines.
#
# Companion to cluster_healthcheck.sh, surfaced via the /upgrade-state skill.
# Read-only by design — no --fix.
#
# The three pipelines:
# 1. Apps — Keel polls registries hourly and rolls Deployments tagged
# keel.sh/policy. Metrics on container :9300/metrics.
# 2. OS — unattended-upgrades patches in-release per node; kured
# reboots within a daily 02:00-06:00 London window.
# 3. K8s — k8s-version-check CronJob (Sun 12:00 UTC) detects new
# kubeadm patch/minor releases; Job-chain drains+upgrades
# node-by-node. Pushgateway holds k8s_upgrade_* gauges.
#
# Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
set -euo pipefail
# --- Colors ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
BOLD='\033[1m'
NC='\033[0m'
# --- Globals ---
JSON=false
KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="/home/wizard/code/infra/config"
KUBECTL=""
NODES=(k8s-master:10.0.20.100 k8s-node1:10.0.20.101 k8s-node2:10.0.20.102 k8s-node3:10.0.20.103 k8s-node4:10.0.20.104)
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no)
NOW_EPOCH=$(date -u +%s)
HIGHEST_EXIT=0 # 0 healthy, 1 attention, 2 stalled
# Results — collectors fill these.
APPS_STATUS_ICON=""; APPS_STATUS_TEXT=""
APPS_LAST_CHECK=""; APPS_NEXT=""; APPS_NOTES=""
APPS_ENROLLED=0; APPS_PENDING=0; APPS_UPDATES_LINE=""; APPS_ERROR_LINE=""
OS_STATUS_ICON=""; OS_STATUS_TEXT=""
OS_LAST_CHECK=""; OS_NEXT=""; OS_NOTES=""
OS_DISTRO_SUMMARY=""; OS_KERNEL_SUMMARY=""
OS_PENDING_REBOOT_NODES=""; OS_HELD_DETAIL=""
OS_LAST_UU=""; OS_LAST_KURED=""
K8S_STATUS_ICON=""; K8S_STATUS_TEXT=""
K8S_LAST_CHECK=""; K8S_NEXT=""; K8S_NOTES=""
K8S_RUNNING=""; K8S_PATCH=""; K8S_MINOR=""
K8S_LAST_DETECT_LINE=""; K8S_IN_FLIGHT="no"; K8S_LAST_CHAIN=""
# --- Helpers ---
log() { [[ "$JSON" == true ]] && return 0; echo -e "$*"; }
raise_exit() {
local n="$1"
if [[ "$n" -gt "$HIGHEST_EXIT" ]]; then HIGHEST_EXIT="$n"; fi
return 0
}
usage() {
cat <<EOF
Usage: $0 [--json] [--kubeconfig <path>]
Read-only audit of the three autonomous-upgrade pipelines (apps, OS, k8s).
--json machine-readable JSON
--kubeconfig PATH override kubeconfig
Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
EOF
}
parse_args() {
while [[ $# -gt 0 ]]; do
case "$1" in
--json) JSON=true; shift ;;
--kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
-h|--help) usage; exit 0 ;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
}
# Prometheus query — Prometheus + reload + backup share a network namespace,
# so reaching localhost:9090 works from any of the three sidecars.
prom_q() {
local q="$1"
$KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- "http://localhost:9090/api/v1/query?query=${q}" 2>/dev/null || true
}
pg_metrics() {
$KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true
}
ssh_node() {
local ip="$1"; shift
ssh "${SSH_OPTS[@]}" "wizard@$ip" "$@" 2>/dev/null || true
}
human_age() {
local secs="$1"
if [[ "$secs" -lt 60 ]]; then printf '%ds ago' "$secs"
elif [[ "$secs" -lt 3600 ]]; then printf '%dm ago' $((secs/60))
elif [[ "$secs" -lt 86400 ]]; then printf '%dh ago' $((secs/3600))
else printf '%dd ago' $((secs/86400))
fi
}
# Pushgateway emits floats and scientific notation — coerce to integer
# epoch seconds. Returns 0 if the input is empty / zero / unparseable.
to_epoch_int() {
local v="${1:-}"
if [[ -z "$v" || "$v" == "0" ]]; then echo 0; return; fi
python3 -c "import sys; v=sys.argv[1]; print(int(float(v)))" "$v" 2>/dev/null || echo 0
}
# --- 1. Apps (Keel) ---
collect_apps() {
local pending tracked enrolled updates_24h errors
# Enrolled: count Deployments with keel.sh/policy != never (Keel itself
# is policy=never). The Kyverno auto-injection labels namespaces
# keel.sh/enrolled=true, but the annotation is what Keel watches.
enrolled=$($KUBECTL get deploy -A -o json 2>/dev/null | python3 -c '
import json, sys
data = json.load(sys.stdin)
n = sum(1 for d in data["items"]
if (d["metadata"].get("annotations") or {}).get("keel.sh/policy", "never") != "never")
print(n)
' 2>/dev/null || echo 0)
APPS_ENROLLED="$enrolled"
# Pending approvals (sum across Keel pods).
pending=$(prom_q 'sum(pending_approvals)' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
print(int(float(r[0]["value"][1])) if r else 0)
except Exception:
print(0)
' 2>/dev/null || echo 0)
APPS_PENDING="$pending"
# Tracked images — proxy for "is the scrape live?".
tracked=$(prom_q 'count(count by (image) (registries_scanned_total))' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
print(int(float(r[0]["value"][1])) if r else 0)
except Exception:
print(0)
' 2>/dev/null || echo 0)
# Last scrape age — `up{job="kubernetes-pods", app="keel"}` is 1 if the
# most recent scrape succeeded. We surface the wallclock age via a tiny
# `time() - timestamp(up{...})` query.
APPS_LAST_CHECK=$(prom_q 'time()-timestamp(up{job="kubernetes-pods",app="keel"})' | python3 -c '
import json, sys
try:
r = json.load(sys.stdin)["data"]["result"]
if not r: print("scrape not live")
else:
secs = int(float(r[0]["value"][1]))
if secs < 60: print(f"{secs}s ago")
elif secs < 3600: print(f"{secs//60}m ago")
else: print(f"{secs//3600}h ago")
except Exception:
print("?")
' 2>/dev/null || echo "?")
# Recent updates: count lines in Keel logs that report a successful
# rollout. Keel logs an "update completed" message per rollout.
local log_24h
log_24h=$($KUBECTL -n keel logs deploy/keel --since=24h --tail=2000 2>/dev/null || true)
updates_24h=$(echo "$log_24h" | grep -cE 'update completed|successfully updated|deployment updated' 2>/dev/null || true)
[[ -z "$updates_24h" ]] && updates_24h=0
APPS_UPDATES_LINE="$updates_24h in last 24h (tracked images: $tracked)"
# Known-benign Keel error patterns to suppress. Each is a real error
# line Keel emits, but the surrounding behaviour is fine, so flagging
# them in /upgrade-state is just noise.
# - `bot.Run(): can not get configuration for bot [slack]` — Keel
# 1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN
# is set, then fails because we don't supply an `xapp-` app-level
# token. We don't want the interactive bot (no approvals; opt-out
# auto-update). The Slack NOTIFICATION sender works independently
# of the bot, so rollout messages still post to #general.
# - `failed to check digest` with a transient network error —
# Keel polls ~175 image manifests against public registries
# hourly. Occasional `i/o timeout` / `connection refused` /
# `TLS handshake timeout` / `no such host` / `EOF` /
# `context deadline exceeded` are inherent to public-internet
# polling at that scale and auto-recover on the next poll.
# Actionable digest-check failures surface as HTTP 401/404
# (auth, removed-tag) — those are NOT filtered.
# - `failed to check digest` with HTTP 5xx — upstream registry
# having a problem (DockerHub maintenance, Forgejo restart,
# etc.). Same recovery pattern as network errors: next hourly
# poll succeeds once upstream is back. Persistent 5xx for >24h
# would indicate a real registry-side issue, but that surfaces
# via the registry's own monitoring (e.g. forgejo-integrity-probe
# + RegistryCatalogInaccessible), not via Keel logs.
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]'
benign_re+='|SLACK_APP_TOKEN must have the (previf|prefix)'
benign_re+='|failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|: EOF)'
benign_re+='|failed to check digest.*non-successful response \(status=5[0-9][0-9]'
errors=$(echo "$log_24h" | grep -iE '"level":"(error|fatal)"|level=error' | grep -vE "$benign_re" | tail -3 || true)
if [[ -z "$errors" ]]; then
APPS_ERROR_LINE="(none in last 24h)"
else
APPS_ERROR_LINE="$(echo "$errors" | wc -l | tr -d ' ') error(s); newest: $(echo "$errors" | tail -1 | cut -c1-120)"
fi
# Keel pod state.
local pod_status
pod_status=$($KUBECTL -n keel get pods -l app=keel -o jsonpath='{.items[*].status.phase}' 2>/dev/null || true)
if [[ "$pod_status" != *"Running"* ]]; then
APPS_STATUS_ICON="✗"; APPS_STATUS_TEXT="down"
APPS_NOTES="Keel pod not Running ($pod_status)"
raise_exit 2
elif [[ "$pending" -gt 0 || -n "$errors" ]]; then
APPS_STATUS_ICON="⚠"; APPS_STATUS_TEXT="attn"
APPS_NOTES="$enrolled enrolled; $pending pending; $(echo "$errors" | wc -l | tr -d ' ') recent error(s)"
raise_exit 1
else
APPS_STATUS_ICON="✓"; APPS_STATUS_TEXT="healthy"
APPS_NOTES="$enrolled enrolled, 0 pending, 0 errors"
fi
APPS_NEXT="rolling, hourly poll"
}
# --- 2. OS (apt + kured) ---
collect_os() {
local distros kernels distro_uniq kernel_uniq
distros=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.osImage}{"\n"}{end}' 2>/dev/null)
kernels=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' 2>/dev/null)
distro_uniq=$(echo "$distros" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
kernel_uniq=$(echo "$kernels" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
OS_DISTRO_SUMMARY="$distro_uniq"
OS_KERNEL_SUMMARY="$kernel_uniq"
# SSH fan-out — parallel background subshells, write per-node results to tmp files.
local tmpdir; tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' RETURN
local entry name ip
for entry in "${NODES[@]}"; do
name="${entry%%:*}"; ip="${entry##*:}"
(
local out reboot held upgradable uu_log
reboot=$(ssh_node "$ip" 'test -f /var/run/reboot-required && echo yes || echo no')
held=$(ssh_node "$ip" 'apt-mark showhold 2>/dev/null')
upgradable=$(ssh_node "$ip" 'apt list --upgradable 2>/dev/null | tail -n +2')
uu_log=$(ssh_node "$ip" 'tail -1 /var/log/unattended-upgrades/unattended-upgrades.log 2>/dev/null')
printf 'reboot=%s\n' "$reboot" > "$tmpdir/$name"
printf 'held<<<EOF\n%s\nEOF\n' "$held" >> "$tmpdir/$name"
printf 'upgradable<<<EOF\n%s\nEOF\n' "$upgradable" >> "$tmpdir/$name"
printf 'uu_log=%s\n' "$uu_log" >> "$tmpdir/$name"
) &
done
wait
# Aggregate.
local pending_reboots=() held_with_bumps_lines=() newest_uu_ts=0 newest_uu_iso=""
for entry in "${NODES[@]}"; do
name="${entry%%:*}"
[[ -f "$tmpdir/$name" ]] || continue
local reboot held upgradable uu_log uu_ts
reboot=$(awk -F= '/^reboot=/{print $2}' "$tmpdir/$name")
held=$(awk '/^held<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
upgradable=$(awk '/^upgradable<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
uu_log=$(awk -F= '/^uu_log=/{sub(/^uu_log=/,""); print}' "$tmpdir/$name")
[[ "$reboot" == "yes" ]] && pending_reboots+=("$name")
# Held + upgradable, excluding k8s components (managed by k8s pipeline).
local pkg from to bump
while IFS= read -r line; do
[[ -z "$line" ]] && continue
pkg=$(echo "$line" | awk -F/ '{print $1}')
# Skip k8s and kernel/linux-image — the chain handles those.
case "$pkg" in
kubeadm|kubectl|kubelet) continue ;;
linux-image-*|linux-headers-*|linux-modules-*|linux-generic|linux-headers-generic|linux-image-generic) continue ;;
esac
# Only flag if the package is held.
if echo "$held" | grep -qx "$pkg"; then
to=$(echo "$line" | awk '{print $2}')
from=$(echo "$line" | sed -n 's/.*from: \([^ ]*\).*/\1/p')
bump="$pkg ${from%-*}${to%-*}"
held_with_bumps_lines+=("$name: $bump")
fi
done <<<"$upgradable"
# Newest uu timestamp (ISO at start of log line).
uu_ts=$(echo "$uu_log" | sed -E 's/^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}).*/\1/')
if [[ -n "$uu_ts" ]]; then
local epoch; epoch=$(date -u -d "$uu_ts" +%s 2>/dev/null || echo 0)
if [[ "$epoch" -gt "$newest_uu_ts" ]]; then
newest_uu_ts="$epoch"; newest_uu_iso="$uu_ts"
fi
fi
done
OS_PENDING_REBOOT_NODES="${pending_reboots[*]:-}"
if [[ ${#held_with_bumps_lines[@]} -gt 0 ]]; then
OS_HELD_DETAIL=$(printf '%s\n' "${held_with_bumps_lines[@]}" | sort -u | paste -sd '; ' -)
fi
if [[ "$newest_uu_ts" -gt 0 ]]; then
local age=$((NOW_EPOCH - newest_uu_ts))
OS_LAST_UU="$newest_uu_iso UTC ($(human_age "$age"))"
OS_LAST_CHECK="$(human_age "$age") (uu daily)"
else
OS_LAST_UU="(no uu log accessible)"
OS_LAST_CHECK="?"
fi
# Last kured reboot — newest Ready transition across worker nodes.
# `Ready -> True` is what kured causes when the node returns; we surface
# the most recent timestamp and the node it belongs to.
local kured_raw kured_iso kured_node kured_ep kured_age
kured_raw=$($KUBECTL get nodes -o json 2>/dev/null | python3 -c '
import json, sys
from datetime import datetime
data = json.load(sys.stdin)
best = (0, "", "")
for n in data["items"]:
name = n["metadata"]["name"]
for c in n["status"].get("conditions", []):
if c["type"] == "Ready":
dt = datetime.strptime(c["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ")
ep = int(dt.timestamp())
if ep > best[0]:
best = (ep, name, c["lastTransitionTime"])
print(f"{best[0]}|{best[1]}|{best[2]}")
' 2>/dev/null || echo "0||")
kured_ep="${kured_raw%%|*}"
kured_node=$(echo "$kured_raw" | cut -d'|' -f2)
kured_iso=$(echo "$kured_raw" | cut -d'|' -f3)
if [[ "$kured_ep" -gt 0 ]]; then
kured_age=$((NOW_EPOCH - kured_ep))
OS_LAST_KURED="$kured_iso ($kured_node, $(human_age "$kured_age"))"
else
OS_LAST_KURED="?"
fi
OS_NEXT="daily 02:00-06:00 London"
# Kured pod health.
local kured_pods kured_unhealthy
kured_pods=$($KUBECTL -n kured get pods -l app.kubernetes.io/name=kured -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' 2>/dev/null)
kured_unhealthy=$(echo "$kured_pods" | grep -cv '^Running$' 2>/dev/null || true)
local notes=()
[[ -n "$OS_HELD_DETAIL" ]] && notes+=("held with bumps: $OS_HELD_DETAIL")
[[ -n "$OS_PENDING_REBOOT_NODES" ]] && notes+=("pending reboot: $OS_PENDING_REBOOT_NODES")
if [[ "$kured_unhealthy" -gt 0 ]]; then
OS_STATUS_ICON="✗"; OS_STATUS_TEXT="kured down"
OS_NOTES="kured pods not all Running"
raise_exit 2
elif [[ ${#notes[@]} -gt 0 ]]; then
OS_STATUS_ICON="⚠"; OS_STATUS_TEXT="attn"
OS_NOTES="${notes[*]}"
raise_exit 1
else
OS_STATUS_ICON="✓"; OS_STATUS_TEXT="healthy"
OS_NOTES="distros uniform; no held bumps; no pending reboots"
fi
}
# --- 3. K8s (kubeadm/kubelet/kubectl) ---
collect_k8s() {
local kver_list kver_uniq metrics target_patch target_minor last_run in_flight started
kver_list=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' 2>/dev/null)
kver_uniq=$(echo "$kver_list" | sort -u)
local n_uniq; n_uniq=$(echo "$kver_uniq" | wc -l | tr -d ' ')
if [[ "$n_uniq" -eq 1 ]]; then
K8S_RUNNING="$kver_uniq across $(echo "$kver_list" | wc -l | tr -d ' ')/$(echo "$kver_list" | wc -l | tr -d ' ') nodes"
else
K8S_RUNNING="mixed: $(echo "$kver_uniq" | paste -sd', ' -)"
fi
local running_ver; running_ver=$(echo "$kver_uniq" | head -1)
metrics=$(pg_metrics)
# All five may legitimately be absent (cluster never ran the upgrade
# chain, kind="minor" not detected, etc.) — `|| true` keeps pipefail
# from killing the script on no-match.
target_patch=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="patch"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
target_minor=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="minor"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
# Pushgateway emits these with `{instance="",job="..."}` labels — the
# `awk '$1 ~ /^name(\{|$)/'` form matches both bare and labelled metrics.
last_run=$(echo "$metrics" | awk '$1 ~ /^k8s_version_check_last_run_timestamp(\{|$)/{print $2}' | head -1 || true)
in_flight=$(echo "$metrics" | awk '$1 ~ /^k8s_upgrade_in_flight(\{|$)/{print $2}' | head -1 || true)
started=$(echo "$metrics" | awk '$1 ~ /^k8s_upgrade_started_timestamp(\{|$)/{print $2}' | head -1 || true)
# Pushgateway timestamps come back in scientific notation
# (e.g. 1.779052159e+09) — convert to plain integer seconds.
local last_run_int started_int
last_run_int=$(to_epoch_int "$last_run")
started_int=$(to_epoch_int "$started")
if [[ "$last_run_int" -gt 0 ]]; then
local age=$((NOW_EPOCH - last_run_int))
K8S_LAST_CHECK="$(human_age "$age") (daily cron)"
if [[ -n "$target_patch" ]]; then
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_patch (patch)"
elif [[ -n "$target_minor" ]]; then
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_minor (minor)"
else
K8S_LAST_DETECT_LINE="last run $(human_age "$age"): no upgrade available"
fi
else
K8S_LAST_CHECK="(metric missing)"
K8S_LAST_DETECT_LINE="(no k8s_version_check_last_run_timestamp in Pushgateway)"
fi
K8S_PATCH="${target_patch:-none}"
K8S_MINOR="${target_minor:-none}"
# In-flight / last chain.
if [[ "${in_flight:-0}" == "1" ]]; then
K8S_IN_FLIGHT="yes"
local since=0
[[ "$started_int" -gt 0 ]] && since=$((NOW_EPOCH - started_int))
K8S_LAST_CHAIN="in-flight (started $(human_age "$since"))"
else
K8S_IN_FLIGHT="no"
if [[ "$started_int" -gt 0 ]]; then
local age=$((NOW_EPOCH - started_int))
K8S_LAST_CHAIN="$(human_age "$age")"
else
K8S_LAST_CHAIN="never (or zeroed)"
fi
fi
K8S_NEXT="$(next_daily_noon_utc)"
# Status logic.
local stalled=0
if [[ "${in_flight:-0}" == "1" && "$started_int" -gt 0 ]]; then
# K8sUpgradeStalled fires after 5400s (90m) per monitoring stack.
local since=$((NOW_EPOCH - started_int))
[[ "$since" -gt 5400 ]] && stalled=1
fi
local last_run_age=999999999
[[ "$last_run_int" -gt 0 ]] && last_run_age=$((NOW_EPOCH - last_run_int))
if [[ "$stalled" == "1" ]]; then
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="stalled"
K8S_NOTES="K8sUpgradeStalled would fire — chain in-flight >90m"
raise_exit 2
elif [[ "$last_run_age" -gt $((9*86400)) ]]; then
K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="detection stale"
K8S_NOTES="last detection >9d ago"
raise_exit 2
elif [[ "${in_flight:-0}" == "1" ]]; then
K8S_STATUS_ICON="…"; K8S_STATUS_TEXT="in-flight"
K8S_NOTES="upgrade chain running"
raise_exit 1
elif [[ -n "$target_patch" ]]; then
K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_patch"
K8S_NOTES="running $running_ver → v$target_patch (patch) available"
raise_exit 1
elif [[ -n "$target_minor" ]]; then
K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_minor"
K8S_NOTES="running $running_ver → v$target_minor (minor) available"
raise_exit 1
else
K8S_STATUS_ICON="✓"; K8S_STATUS_TEXT="current"
K8S_NOTES="running $running_ver, nothing newer"
fi
}
# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was
# weekly Sunday until 2026-05-18; now `0 12 * * *` in the
# k8s-version-upgrade stack. If we're still before today's 12:00 UTC,
# the next run is today; otherwise it's tomorrow.
next_daily_noon_utc() {
local hr days_ahead
hr=$(date -u +%H)
if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi
date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC"
}
# --- Renderers ---
# The table uses `column -t` so we don't have to compute visual widths
# manually (the status icons are multi-byte UTF-8 and ANSI escapes don't
# play nice with `printf %-Xs`). Trade-off: no in-cell colour, but the
# icon character already carries the signal.
render_table() {
echo
printf "${BOLD}Upgrade state — %s${NC}\n" "$(date -u +'%Y-%m-%d %H:%M UTC')"
echo
{
echo "Layer|Status|Last check|Next upgrade|Notes"
echo "-----|------|----------|------------|-----"
printf 'Apps|%s %s|%s|%s|%s\n' "$APPS_STATUS_ICON" "$APPS_STATUS_TEXT" "$APPS_LAST_CHECK" "$APPS_NEXT" "$APPS_NOTES"
printf 'OS |%s %s|%s|%s|%s\n' "$OS_STATUS_ICON" "$OS_STATUS_TEXT" "$OS_LAST_CHECK" "$OS_NEXT" "$OS_NOTES"
printf 'K8s |%s %s|%s|%s|%s\n' "$K8S_STATUS_ICON" "$K8S_STATUS_TEXT" "$K8S_LAST_CHECK" "$K8S_NEXT" "$K8S_NOTES"
} | column -t -s '|' -o ' | '
echo
printf "${BOLD}--- Apps (Keel) ---${NC}\n"
echo "Enrolled deployments: $APPS_ENROLLED"
echo "Recent rollouts: $APPS_UPDATES_LINE"
echo "Pending approvals: $APPS_PENDING"
echo "Last Keel error: $APPS_ERROR_LINE"
echo
printf "${BOLD}--- OS (apt + kured) ---${NC}\n"
echo "Ubuntu per node: $OS_DISTRO_SUMMARY"
echo "Kernel per node: $OS_KERNEL_SUMMARY"
echo "Pending reboot: ${OS_PENDING_REBOOT_NODES:-none}"
echo "Held packages with upstream bumps: ${OS_HELD_DETAIL:-none (excluding k8s components)}"
echo "Last uu run (newest across nodes): $OS_LAST_UU"
echo "Last kured reboot (newest Ready transition): $OS_LAST_KURED"
echo "Next kured window: $OS_NEXT"
echo
printf "${BOLD}--- K8s (kubeadm/kubelet/kubectl) ---${NC}\n"
echo "Running: $K8S_RUNNING"
echo "Latest patch (apt): ${K8S_PATCH}"
echo "Next minor available: ${K8S_MINOR}"
echo "Detection: $K8S_LAST_DETECT_LINE"
echo "In-flight: $K8S_IN_FLIGHT | Last chain start: $K8S_LAST_CHAIN"
echo "Next detection: $K8S_NEXT"
echo
}
render_json() {
# Pipe values into Python via env vars so we don't need to worry about
# embedded quotes/backslashes in error lines.
APPS_STATUS_ICON="$APPS_STATUS_ICON" APPS_STATUS_TEXT="$APPS_STATUS_TEXT" \
APPS_LAST_CHECK="$APPS_LAST_CHECK" APPS_NEXT="$APPS_NEXT" APPS_NOTES="$APPS_NOTES" \
APPS_ENROLLED="$APPS_ENROLLED" APPS_PENDING="$APPS_PENDING" \
APPS_UPDATES_LINE="$APPS_UPDATES_LINE" APPS_ERROR_LINE="$APPS_ERROR_LINE" \
OS_STATUS_ICON="$OS_STATUS_ICON" OS_STATUS_TEXT="$OS_STATUS_TEXT" \
OS_LAST_CHECK="$OS_LAST_CHECK" OS_NEXT="$OS_NEXT" OS_NOTES="$OS_NOTES" \
OS_DISTRO_SUMMARY="$OS_DISTRO_SUMMARY" OS_KERNEL_SUMMARY="$OS_KERNEL_SUMMARY" \
OS_PENDING_REBOOT_NODES="$OS_PENDING_REBOOT_NODES" OS_HELD_DETAIL="$OS_HELD_DETAIL" \
OS_LAST_UU="$OS_LAST_UU" OS_LAST_KURED="$OS_LAST_KURED" \
K8S_STATUS_ICON="$K8S_STATUS_ICON" K8S_STATUS_TEXT="$K8S_STATUS_TEXT" \
K8S_LAST_CHECK="$K8S_LAST_CHECK" K8S_NEXT="$K8S_NEXT" K8S_NOTES="$K8S_NOTES" \
K8S_RUNNING="$K8S_RUNNING" K8S_PATCH="$K8S_PATCH" K8S_MINOR="$K8S_MINOR" \
K8S_LAST_DETECT_LINE="$K8S_LAST_DETECT_LINE" K8S_IN_FLIGHT="$K8S_IN_FLIGHT" K8S_LAST_CHAIN="$K8S_LAST_CHAIN" \
HIGHEST_EXIT="$HIGHEST_EXIT" \
python3 -c '
import json, os
from datetime import datetime, timezone
def env(k): return os.environ.get(k, "")
out = {
"as_of_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
"highest_exit": int(env("HIGHEST_EXIT")),
"apps": {
"status": env("APPS_STATUS_ICON"),
"status_text": env("APPS_STATUS_TEXT"),
"last_check": env("APPS_LAST_CHECK"),
"next_upgrade": env("APPS_NEXT"),
"notes": env("APPS_NOTES"),
"enrolled": int(env("APPS_ENROLLED") or 0),
"pending_approvals": int(env("APPS_PENDING") or 0),
"updates_line": env("APPS_UPDATES_LINE"),
"errors_line": env("APPS_ERROR_LINE"),
},
"os": {
"status": env("OS_STATUS_ICON"),
"status_text": env("OS_STATUS_TEXT"),
"last_check": env("OS_LAST_CHECK"),
"next_upgrade": env("OS_NEXT"),
"notes": env("OS_NOTES"),
"distros": env("OS_DISTRO_SUMMARY"),
"kernels": env("OS_KERNEL_SUMMARY"),
"pending_reboot_nodes": env("OS_PENDING_REBOOT_NODES"),
"held_with_bumps": env("OS_HELD_DETAIL"),
"last_uu_run": env("OS_LAST_UU"),
"last_kured_reboot": env("OS_LAST_KURED"),
},
"k8s": {
"status": env("K8S_STATUS_ICON"),
"status_text": env("K8S_STATUS_TEXT"),
"last_check": env("K8S_LAST_CHECK"),
"next_upgrade": env("K8S_NEXT"),
"notes": env("K8S_NOTES"),
"running": env("K8S_RUNNING"),
"patch_target": env("K8S_PATCH"),
"minor_target": env("K8S_MINOR"),
"last_detection_line": env("K8S_LAST_DETECT_LINE"),
"in_flight": env("K8S_IN_FLIGHT"),
"last_chain": env("K8S_LAST_CHAIN"),
},
}
print(json.dumps(out, indent=2))
'
}
main() {
parse_args "$@"
collect_apps
collect_os
collect_k8s
if [[ "$JSON" == true ]]; then
render_json
else
render_table
fi
exit "$HIGHEST_EXIT"
}
main "$@"

Binary file not shown.

Binary file not shown.

View file

@ -87,5 +87,5 @@ module "ingress" {
name = "<app-name>"
tls_secret_name = var.tls_secret_name
dns_type = "proxied" # "proxied" (Cloudflare CDN), "non-proxied" (direct A/AAAA), or "none"
auth = "required" # "required" (Authentik login), "public" (anonymous bound to guest), or "none" (no auth)
protected = false # Set true to require Authentik login
}

View file

@ -29,20 +29,6 @@ provider "registry.terraform.io/goauthentik/authentik" {
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
@ -70,18 +56,6 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "actualbudget"
}
}

View file

@ -18,11 +18,6 @@ variable "budget_encryption_password" {
# and are unknown at plan time on first apply, so we cannot base `count` on
# them directly. Callers pass these booleans as hardcoded plan-time constants
# that reflect whether the corresponding credentials are expected to exist.
variable "enabled" {
type = bool
default = true
description = "Deploy this instance. When false, only the PVC is kept (data preservation); deployment, service, ingress, http-api, and cronjob are not created. Flip back to true to bring the instance back."
}
variable "enable_http_api" {
type = bool
default = false
@ -49,7 +44,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "actualbudget-${var.name}-data-encrypted"
namespace = "actualbudget"
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -63,17 +58,9 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "actualbudget" {
count = var.enabled ? 1 : 0
metadata {
name = "actualbudget-${var.name}"
namespace = "actualbudget"
@ -140,7 +127,6 @@ resource "kubernetes_deployment" "actualbudget" {
}
resource "kubernetes_service" "actualbudget" {
count = var.enabled ? 1 : 0
metadata {
name = "budget-${var.name}"
namespace = "actualbudget"
@ -162,12 +148,7 @@ resource "kubernetes_service" "actualbudget" {
}
module "ingress" {
count = var.enabled ? 1 : 0
source = "../../../modules/kubernetes/ingress_factory"
# auth = "app": Actual Budget enforces a server password + per-user login
# on its own sync API. Authentik forward-auth was 302-ing the mobile/web
# sync clients; Actual's own auth gates users.
auth = "app"
source = "../../../modules/kubernetes/ingress_factory"
namespace = "actualbudget"
name = "budget-${var.name}"
tls_secret_name = var.tls_secret_name
@ -182,7 +163,7 @@ resource "random_string" "api-key" {
}
resource "kubernetes_deployment" "actualbudget-http-api" {
count = var.enabled && var.enable_http_api ? 1 : 0
count = var.enable_http_api ? 1 : 0
metadata {
name = "actualbudget-http-api-${var.name}"
namespace = "actualbudget"
@ -248,7 +229,6 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
}
resource "kubernetes_service" "actualbudget-http-api" {
count = var.enabled && var.enable_http_api ? 1 : 0
metadata {
name = "budget-http-api-${var.name}"
namespace = "actualbudget"
@ -270,7 +250,7 @@ resource "kubernetes_service" "actualbudget-http-api" {
}
resource "kubernetes_cron_job_v1" "bank-sync" {
count = var.enabled && var.enable_bank_sync ? 1 : 0
count = var.enable_bank_sync ? 1 : 0
metadata {
name = "bank-sync-${var.name}"
namespace = "actualbudget"
@ -291,93 +271,48 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
spec {
container {
name = "bank-sync"
image = "alpine:3.20"
image = "curlimages/curl"
command = ["/bin/sh", "-c", <<-EOT
set -u
apk add --no-cache curl jq >/dev/null 2>&1
USER_NAME='${var.name}'
SYNC_ID='${var.sync_id}'
API_KEY='${random_string.api-key.result}'
PW='${var.budget_encryption_password}'
PG="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-$USER_NAME"
API="http://budget-http-api-$USER_NAME"
PUSHGATEWAY="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-${var.name}"
START=$(date +%s)
# Enumerate active accounts: open + on-budget.
ACCOUNTS=$(curl -fsS "$API/v1/budgets/$SYNC_ID/accounts" \
-H "x-api-key: $API_KEY" \
-H "budget-encryption-password: $PW" \
| jq -c '.data[] | select(.closed == false and .offbudget == false) | {id, name}')
if [ -z "$ACCOUNTS" ]; then
echo "ERROR: GET /accounts returned no eligible accounts; aborting"
exit 1
fi
: > /tmp/payload
rm -f /tmp/any_success
# Per-account sync. Each account has its own PSD2/GoCardless
# quota (4 successful pulls per 24h), so we treat them
# independently one rate-limited account doesn't mark the
# run as a failure.
echo "$ACCOUNTS" | while IFS= read -r ACCT; do
[ -z "$ACCT" ] && continue
ID=$(echo "$ACCT" | jq -r '.id')
NAME=$(echo "$ACCT" | jq -r '.name')
LABEL=$(echo "$NAME" | sed -E 's/[^a-zA-Z0-9]+/_/g')
HTTP_CODE=$(curl -s -o /tmp/r.txt -w '%%{http_code}' \
-X POST "$API/v1/budgets/$SYNC_ID/accounts/$ID/banksync" \
-H 'accept: application/json' \
-H "x-api-key: $API_KEY" \
-H "budget-encryption-password: $PW") || HTTP_CODE=0
NOW=$(date +%s)
if [ "$HTTP_CODE" = "200" ]; then
echo "OK account=$NAME"
printf 'bank_sync_account_success{account="%s"} 1\n' "$LABEL" >> /tmp/payload
printf 'bank_sync_account_last_success_timestamp{account="%s"} %s\n' "$LABEL" "$NOW" >> /tmp/payload
: > /tmp/any_success
else
echo "FAIL account=$NAME http=$HTTP_CODE body=$(cat /tmp/r.txt)"
printf 'bank_sync_account_success{account="%s"} 0\n' "$LABEL" >> /tmp/payload
fi
done
HTTP_CODE=$(curl -s -o /tmp/response.txt -w '%%{http_code}' \
-X POST --location \
'http://budget-http-api-${var.name}/v1/budgets/${var.sync_id}/accounts/banksync' \
--header 'accept: application/json' \
--header 'budget-encryption-password: ${var.budget_encryption_password}' \
--header 'x-api-key: ${random_string.api-key.result}')
END=$(date +%s)
DUR=$((END - START))
DURATION=$((END - START))
if [ -f /tmp/any_success ]; then
ANY=1
if [ "$HTTP_CODE" = "200" ]; then
SUCCESS=1
LAST_SUCCESS=$END
else
ANY=0
SUCCESS=0
echo "Bank sync failed with HTTP $HTTP_CODE:"
cat /tmp/response.txt
echo ""
fi
# Pushgateway POST preserves prior values for label sets not
# in the payload, so per-account last_success_timestamp values
# for accounts that failed this run keep their prior good
# values that's what BankSyncAccountStale alerts on.
# Pushgateway POST preserves metrics not in the payload, so on
# failure we omit bank_sync_last_success_timestamp to keep the
# prior success value this prevents BankSyncStale from firing
# alongside BankSyncFailing after a single failed run.
{
printf '# HELP bank_sync_account_success Per-account sync result (1=ok, 0=fail)\n'
printf '# TYPE bank_sync_account_success gauge\n'
printf '# HELP bank_sync_account_last_success_timestamp Per-account Unix timestamp of last successful sync\n'
printf '# TYPE bank_sync_account_last_success_timestamp gauge\n'
cat /tmp/payload
printf '# HELP bank_sync_success 1 if at least one account synced this run\n'
printf '# HELP bank_sync_success Whether the last bank sync succeeded (1=ok, 0=fail)\n'
printf '# TYPE bank_sync_success gauge\n'
printf 'bank_sync_success %s\n' "$ANY"
printf '# HELP bank_sync_duration_seconds Total duration of the cron run\n'
printf 'bank_sync_success %s\n' "$SUCCESS"
printf '# HELP bank_sync_duration_seconds Duration of the last bank sync run\n'
printf '# TYPE bank_sync_duration_seconds gauge\n'
printf 'bank_sync_duration_seconds %s\n' "$DUR"
if [ "$ANY" = "1" ]; then
printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the most recent successful sync of any account\n'
printf 'bank_sync_duration_seconds %s\n' "$DURATION"
if [ "$SUCCESS" = "1" ]; then
printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the last successful sync\n'
printf '# TYPE bank_sync_last_success_timestamp gauge\n'
printf 'bank_sync_last_success_timestamp %s\n' "$END"
printf 'bank_sync_last_success_timestamp %s\n' "$LAST_SUCCESS"
fi
} | curl -fsS --data-binary @- "$PG"
} | curl -s --data-binary @- "$PUSHGATEWAY"
EOT
]
}
@ -391,24 +326,3 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
}
# State migration for the new `enabled` toggle (2026-05-13): adding
# count to these resources shifts their addresses to [0]. Without
# moved {}, Terraform would destroy+recreate. Existing http-api / bank-sync
# resources already had count, so no migration needed there.
moved {
from = kubernetes_deployment.actualbudget
to = kubernetes_deployment.actualbudget[0]
}
moved {
from = kubernetes_service.actualbudget
to = kubernetes_service.actualbudget[0]
}
moved {
from = kubernetes_service.actualbudget-http-api
to = kubernetes_service.actualbudget-http-api[0]
}
moved {
from = module.ingress
to = module.ingress[0]
}

View file

@ -57,7 +57,6 @@ resource "kubernetes_namespace" "actualbudget" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -121,10 +120,6 @@ module "anca" {
}
# https://budget-emo.viktorbarzin.me/
# Disabled 2026-05-13: Emo isn't using this instance. PVC is preserved so
# we can flip enabled back to true to bring the instance back as-was.
# The empty accounts list (vs. anca/viktor) was causing the daily bank-sync
# CronJob to fail and trigger BankSyncStale.
module "emo" {
source = "./factory"
name = "emo"
@ -133,10 +128,16 @@ module "emo" {
nfs_server = var.nfs_server
depends_on = [kubernetes_namespace.actualbudget]
tier = local.tiers.edge
enabled = false
enable_http_api = false
enable_bank_sync = false
enable_http_api = true
enable_bank_sync = true
budget_encryption_password = lookup(local.credentials["emo"], "password", null)
sync_id = lookup(local.credentials["emo"], "sync_id", null)
homepage_annotations = {}
homepage_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Budget Emo"
"gethomepage.dev/description" = "Personal budget"
"gethomepage.dev/icon" = "actual-budget.png"
"gethomepage.dev/group" = "Finance & Personal"
"gethomepage.dev/pod-selector" = ""
}
}

View file

@ -88,7 +88,6 @@ resource "kubernetes_namespace" "affine" {
name = "affine"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -156,7 +155,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "affine-data-encrypted"
namespace = kubernetes_namespace.affine.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -170,13 +169,6 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "affine" {
@ -332,12 +324,8 @@ resource "kubernetes_deployment" "affine" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -363,11 +351,7 @@ resource "kubernetes_service" "affine" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": AFFiNE has its own workspace auth + bearer-token API
# used by desktop/mobile sync clients. Authentik forward-auth was 302-ing
# those API callers; AFFiNE's own auth gates users.
auth = "app"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "non-proxied"
namespace = kubernetes_namespace.affine.metadata[0].name
name = "affine"

View file

@ -53,130 +53,11 @@ resource "authentik_provider_proxy" "catchall" {
# doesn't require an HCL edit.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
# Cookie / proxysession TTL. Drives `Max-Age` on `authentik_proxy_*`
# cookies and the `expires` column in `authentik_providers_proxy_proxysession`.
# See note on the embedded outpost below bumping this requires an outpost
# pod restart for the gorilla session store to rebind.
access_token_validity = "weeks=4"
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth, access_token_validity]
}
}
# -----------------------------------------------------------------------------
# Embedded outpost record. Adopted into Terraform 2026-05-10 as part of the
# postgres-session-backend fix:
# - `managed` is set server-side to `goauthentik.io/outposts/embedded` so
# the outpost binary's `IsEmbedded()` check returns true it loads the
# PostgreSQL session backend (PR #16628). The Terraform provider does
# NOT expose `managed` in the schema, so the field is preserved across
# applies (TF only writes fields it knows about).
# - kubernetes_json_patches.deployment carries:
# * dshm 2Gi tmpfs (covers the 2026-04-18 ENOSPC class of issues)
# * resources requests/limits
# * `app.kubernetes.io/component=server` pod label so the K8s service
# selector lights up endpoints (works around goauthentik 2026.2.2
# service.py:52 selector mismatch on standalone embedded outposts).
# * AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} envFrom the
# shared `goauthentik` Secret so the postgres session backend has
# credentials to connect to the dbaas cluster.
# - kubernetes_json_patches.service replaces the controller-set selector
# (which incorrectly targets `app.kubernetes.io/name=authentik`, i.e.
# the goauthentik-server pods) with the outpost's own labels.
# -----------------------------------------------------------------------------
resource "authentik_outpost" "embedded" {
name = "authentik Embedded Outpost"
type = "proxy"
protocol_providers = [authentik_provider_proxy.catchall.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "trace"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
authentik_host_insecure = false
kubernetes_service_type = "ClusterIP"
kubernetes_ingress_path_type = null
kubernetes_image_pull_secrets = []
kubernetes_ingress_class_name = null
kubernetes_disabled_components = []
kubernetes_ingress_annotations = {}
kubernetes_ingress_secret_name = "authentik-outpost-tls"
kubernetes_httproute_annotations = {}
kubernetes_httproute_parent_refs = []
kubernetes_json_patches = {
deployment = [
{
op = "add"
path = "/spec/template/spec/volumes"
value = [{ name = "dshm", emptyDir = { medium = "Memory", sizeLimit = "2Gi" } }]
},
{
op = "add"
path = "/spec/template/spec/containers/0/volumeMounts"
value = [{ name = "dshm", mountPath = "/dev/shm" }]
},
{
op = "add"
path = "/spec/template/spec/containers/0/resources"
value = { limits = { memory = "2560Mi" }, requests = { cpu = "100m", memory = "128Mi" } }
},
{
op = "add"
path = "/spec/template/metadata/labels/app.kubernetes.io~1component"
value = "server"
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__HOST", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__HOST" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__PORT", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PORT" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__USER", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__USER" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__PASSWORD", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PASSWORD" } } }
},
{
op = "add"
path = "/spec/template/spec/containers/0/env/-"
value = { name = "AUTHENTIK_POSTGRESQL__NAME", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__NAME" } } }
},
]
service = [
{
op = "replace"
path = "/spec/selector"
value = {
"app.kubernetes.io/managed-by" = "goauthentik.io"
"app.kubernetes.io/name" = "authentik-outpost-proxy"
"goauthentik.io/outpost-name" = "authentik-embedded-outpost"
"goauthentik.io/outpost-type" = "proxy"
"goauthentik.io/outpost-uuid" = "0eecac0797c7443c892505f2f4fe3e47"
}
},
]
}
})
}
# -----------------------------------------------------------------------------
# Default User Login stage bound to default-authentication-flow.
# Adopted into Terraform 2026-05-01 to set session_duration=weeks=4 so users

View file

@ -1,217 +0,0 @@
# =============================================================================
# Public Guest user + auto-login flow + public proxy provider + dedicated
# outpost.
#
# Backs the `auth = "public"` tier of the ingress_factory module. Architecture:
#
# * `guest` user (in `Public Guests` group, NOT `Allow Login Users`).
# * `public-auto-login` flow: anonymous user enters expression policy sets
# `pending_user = guest` user_login stage logs them in. No UI shown.
# * `Provider for Public` proxy provider (forward_domain, cookie_domain
# `viktorbarzin.me`) with `authentication_flow = public-auto-login`.
# * Dedicated `Public Outpost` Deployment+Service (managed by Authentik's
# K8s controller). Bound to the public provider only there is no other
# provider claiming `viktorbarzin.me` on this outpost, so every request
# it sees runs the public flow regardless of host.
# * `public-auth.viktorbarzin.me` ingress exposes the public outpost's
# `/outpost.goauthentik.io/*` path so OAuth callbacks land on it (the
# embedded outpost doesn't know about the public provider, so callbacks
# can't go to authentik.viktorbarzin.me).
#
# Traffic flow for a stranger hitting an `auth = "public"` ingress:
# 1. Traefik's `authentik-forward-auth-public` middleware public outpost.
# 2. No session cookie 302 to `https://authentik.viktorbarzin.me/...`
# with redirect_uri = `https://public-auth.viktorbarzin.me/.../callback`.
# 3. Authentik runs `public-auto-login` (no UI), issues session.
# 4. 302 public-auth.viktorbarzin.me callback public outpost validates
# state and sets `authentik_proxy_<public-hash>` cookie on `viktorbarzin.me`.
# 5. 302 original URL Traefik retries forward_auth public outpost
# validates cookie 200 with `X-authentik-username: guest`.
#
# A user already logged into anything else on viktorbarzin.me (the catchall)
# still gets recognised here Authentik prefers an existing session and the
# public provider's authorization_flow auto-approves anyone, so their real
# username shows up in `X-authentik-username`. Strangers get `guest`.
# =============================================================================
resource "authentik_user" "guest" {
username = "guest"
name = "Guest"
path = "users/system"
is_active = true
type = "internal"
# No password set: the user_login stage in `public_auto_login` logs the
# request in via pending_user pre-set by an expression policy. There is no
# UI path for `guest` to authenticate via password the user is also kept
# out of `Allow Login Users`, so even a leaked password cannot be used to
# complete the standard login flow.
lifecycle {
ignore_changes = [attributes, email]
}
}
resource "authentik_group" "public_guests" {
name = "Public Guests"
users = [authentik_user.guest.id]
# NOT a child of "Allow Login Users" keeps a hypothetical leaked password
# from promoting `guest` to a real user via the standard login flow.
}
# Pre-stage policy: sets pending_user = guest before user_login stage runs.
# Mutates `request.context["flow_plan"].context["pending_user"]` the
# canonical pattern (the user_login stage reads pending_user from
# `flow_plan.context`). Direct `request.context["pending_user"]` mutations
# don't propagate, since policy request.context is not the same dict as
# flow_plan.context.
resource "authentik_policy_expression" "set_guest_user" {
name = "set-public-guest-user"
expression = trimspace(<<-EOT
request.context["flow_plan"].context["pending_user"] = ak_user_by(username="guest")
return True
EOT
)
}
# Dedicated user_login stage for the public flow. 4-week session matches the
# default authentication stage; means a stranger only goes through the auto-
# bind once per ~month per device.
resource "authentik_stage_user_login" "public_guest_login" {
name = "public-guest-login"
session_duration = "weeks=4"
}
# `authentication = "none"` lets anonymous requests run the flow.
# `designation = "authentication"` because the flow's outcome is "request is
# now authenticated as guest"; the public proxy provider's authorization_flow
# then runs implicit consent.
resource "authentik_flow" "public_auto_login" {
name = "Public Auto Login"
slug = "public-auto-login"
title = "Public Guest Login"
designation = "authentication"
authentication = "none"
}
resource "authentik_flow_stage_binding" "public_login" {
target = authentik_flow.public_auto_login.uuid
stage = authentik_stage_user_login.public_guest_login.id
order = 10
# Re-evaluate at stage runtime: at plan time, flow_plan may not yet be in
# request.context, so the expression policy's mutation would no-op. With
# evaluate_on_plan=false + re_evaluate_policies=true, the policy fires
# right before the stage runs, when flow_plan is fully populated.
evaluate_on_plan = false
re_evaluate_policies = true
}
resource "authentik_policy_binding" "set_guest_before_login" {
target = authentik_flow_stage_binding.public_login.id
policy = authentik_policy_expression.set_guest_user.id
order = 0
}
# -----------------------------------------------------------------------------
# Public proxy provider forward_domain so it claims any host on
# viktorbarzin.me. Used only on the dedicated `public` outpost (where it is
# the sole bound provider), so there's no dispatch ambiguity with the
# catchall (which lives on the embedded outpost).
# -----------------------------------------------------------------------------
resource "authentik_provider_proxy" "public" {
name = "Provider for Public"
mode = "forward_domain"
external_host = "https://public-auth.viktorbarzin.me"
cookie_domain = "viktorbarzin.me"
# When a request hits with NO Authentik session, this flow runs first and
# auto-binds the request to the `guest` user (no UI prompt).
authentication_flow = authentik_flow.public_auto_login.uuid
# Once authenticated (or already authenticated), implicit-consent auto-approves.
authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
invalidation_flow = data.authentik_flow.default_provider_invalidation.id
access_token_validity = "weeks=4"
lifecycle {
ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
}
}
resource "authentik_application" "public" {
name = "Public"
slug = "public"
protocol_provider = authentik_provider_proxy.public.id
# No bound policies. policy_engine_mode = "any" + zero bindings = everyone
# passes (the auto-login flow has already established `guest` as the user).
policy_engine_mode = "any"
lifecycle {
ignore_changes = [meta_description, meta_launch_url, meta_icon, group, backchannel_providers, open_in_new_tab]
}
}
# Dedicated outpost so the public provider can claim viktorbarzin.me without
# colliding with the catchall (which already claims viktorbarzin.me on the
# embedded outpost). Authentik's K8s controller deploys this as
# `ak-outpost-public` (Deployment + Service in the `authentik` namespace).
resource "authentik_outpost" "public" {
name = "public"
type = "proxy"
protocol_providers = [authentik_provider_proxy.public.id]
service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
config = jsonencode({
log_level = "info"
docker_labels = null
authentik_host = "https://authentik.viktorbarzin.me/"
docker_network = null
container_image = null
docker_map_ports = true
refresh_interval = "minutes=5"
kubernetes_replicas = 1
kubernetes_namespace = "authentik"
authentik_host_browser = ""
object_naming_template = "ak-outpost-%(name)s"
authentik_host_insecure = false
kubernetes_service_type = "ClusterIP"
kubernetes_ingress_path_type = null
kubernetes_image_pull_secrets = []
kubernetes_ingress_class_name = null
kubernetes_disabled_components = []
kubernetes_ingress_annotations = {}
kubernetes_ingress_secret_name = "authentik-outpost-tls"
kubernetes_httproute_annotations = {}
kubernetes_httproute_parent_refs = []
kubernetes_json_patches = {
deployment = [
{
op = "add"
path = "/spec/template/spec/containers/0/resources"
value = { limits = { memory = "256Mi" }, requests = { cpu = "10m", memory = "64Mi" } }
},
]
}
})
}
# Ingress for `public-auth.viktorbarzin.me` exposes the public outpost's
# /outpost.goauthentik.io/* path so OAuth callbacks land on it. The
# `Provider for Public` external_host points here, so all redirect_uris in
# the OAuth flow resolve to this hostname.
module "ingress_public_outpost" {
source = "../../modules/kubernetes/ingress_factory"
# Public-tier outpost callback the OAuth flow's redirect_uris all resolve
# here; gating it with forward-auth would loop the public outpost onto itself.
# auth = "none": Public outpost callback path for OAuth flow; protecting with forward-auth creates circular dependency.
auth = "none"
namespace = "authentik"
name = "public-outpost"
host = "public-auth"
service_name = "ak-outpost-public"
port = 9000
ingress_path = ["/outpost.goauthentik.io"]
tls_secret_name = var.tls_secret_name
dns_type = "proxied"
anti_ai_scraping = false
exclude_crowdsec = true
homepage_enabled = false
depends_on = [authentik_outpost.public]
}

View file

@ -29,7 +29,6 @@ resource "kubernetes_namespace" "authentik" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -71,12 +70,8 @@ resource "helm_release" "authentik" {
module "ingress" {
source = "../../../../modules/kubernetes/ingress_factory"
# Authentik's own UI cannot be gated by Authentik forward-auth that
# creates a chicken-and-egg loop (users can't reach the login page).
# auth = "none": Authentik UI cannot be gated by Authentik forward-auth (chicken-and-egg loop prevents login).
auth = "none"
dns_type = "proxied"
source = "../../../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik"
service_name = "goauthentik-server"
@ -96,11 +91,7 @@ module "ingress" {
}
module "ingress-outpost" {
source = "../../../../modules/kubernetes/ingress_factory"
# Authentik forward-auth outpost callback path protecting this with
# forward-auth would loop the outpost back onto itself.
# auth = "none": Authentik outpost callback path for forward-auth flow; protecting with forward-auth creates circular dependency.
auth = "none"
source = "../../../../modules/kubernetes/ingress_factory"
namespace = kubernetes_namespace.authentik.metadata[0].name
name = "authentik-outpost"
host = "authentik"

View file

@ -66,13 +66,9 @@ resource "kubernetes_deployment" "pgbouncer" {
}
}
container {
name = "pgbouncer"
image = "edoburu/pgbouncer:latest"
# `:latest` tag keep `Always` so pod restarts pick up upstream
# updates. The previous `IfNotPresent` value was declared at module
# creation but the live cluster has reconciled to `Always` (likely
# via a Helm/operator default). Match reality to drop the drift.
image_pull_policy = "Always"
name = "pgbouncer"
image = "edoburu/pgbouncer:latest"
image_pull_policy = "IfNotPresent"
port {
container_port = 6432

View file

@ -78,10 +78,7 @@ global:
addPrometheusAnnotations: true
worker:
# 2 replicas: workers handle background tasks (LDAP sync, email,
# certificate renewal) — no user-facing traffic, so 2-of-3 isn't
# needed for availability. Drop saves ~100m sustained CPU.
replicas: 2
replicas: 3
# Same unauthenticated_age cap as server — both the server (Django session
# middleware) and worker (cleanup tasks) need to see the value.
env:

View file

@ -29,7 +29,6 @@ resource "kubernetes_namespace" "beads" {
name = "beads-server"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -44,7 +43,7 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
name = "dolt-data"
namespace = kubernetes_namespace.beads.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -56,13 +55,6 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
requests = { storage = "2Gi" }
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_config_map" "dolt_init" {
@ -75,23 +67,6 @@ resource "kubernetes_config_map" "dolt_init" {
CREATE USER IF NOT EXISTS 'beads'@'%' IDENTIFIED BY '';
GRANT ALL PRIVILEGES ON *.* TO 'beads'@'%' WITH GRANT OPTION;
EOT
"02-create-presence-table.sql" = <<-EOT
CREATE DATABASE IF NOT EXISTS beads;
USE beads;
CREATE TABLE IF NOT EXISTS presence_claims (
session_id VARCHAR(128) NOT NULL,
resource_label VARCHAR(255) NOT NULL,
purpose TEXT NOT NULL,
claimed_at DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
expires_at DATETIME(3) NOT NULL,
host VARCHAR(128) NOT NULL,
user VARCHAR(64) NOT NULL,
agent_name VARCHAR(64) DEFAULT 'claude-code',
PRIMARY KEY (session_id, resource_label),
INDEX idx_resource (resource_label),
INDEX idx_expires (expires_at)
);
EOT
}
}
@ -103,16 +78,6 @@ resource "kubernetes_deployment" "dolt" {
app = "dolt"
tier = local.tiers.aux
}
annotations = {
# Keel is namespace-enrolled (keel.sh/enrolled=true on the namespace),
# but this deployment opts OUT of auto-updates: dolthub/dolt-sql-server:latest
# currently resolves to a broken 0.50.10 build. Pinned image lives in the
# container spec below. Codified here so TF state matches live, no drift.
"keel.sh/policy" = "never"
"keel.sh/match-tag" = "true"
"keel.sh/trigger" = "poll"
"keel.sh/pollSchedule" = "@every 1h"
}
}
spec {
replicas = 1
@ -133,12 +98,7 @@ resource "kubernetes_deployment" "dolt" {
spec {
container {
name = "dolt"
# Pinned to 2.0.3 :latest currently resolves to 0.50.10 on dolthub
# (different versioning stream) whose docker-entrypoint.sh references
# an undefined docker_process_sql function and crash-loops on every
# init script in /docker-entrypoint-initdb.d. Keel can upgrade this
# tag in-cluster; the lifecycle.ignore_changes below preserves that.
image = "dolthub/dolt-sql-server:2.0.3"
image = "dolthub/dolt-sql-server:latest"
port {
name = "mysql"
@ -210,59 +170,7 @@ resource "kubernetes_deployment" "dolt" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
# Keel annotations are codified in metadata.annotations above (policy=never
# opts this deployment out of auto-updates see the comment there).
]
}
}
# One-shot Job to apply the presence_claims schema to the running Dolt server.
# The dolt_init ConfigMap only fires on fresh PVCs; since Dolt already exists
# with persistent state, this Job is the only path to update the live schema.
# The job name is hashed off the SQL content so a new Job runs whenever the
# schema changes; the SQL itself is idempotent (CREATE ... IF NOT EXISTS).
resource "kubernetes_job" "presence_schema_migrate" {
metadata {
name = "presence-schema-${substr(sha256(kubernetes_config_map.dolt_init.data["02-create-presence-table.sql"]), 0, 8)}"
namespace = kubernetes_namespace.beads.metadata[0].name
}
spec {
backoff_limit = 3
template {
metadata {}
spec {
restart_policy = "OnFailure"
container {
name = "migrate"
image = "mysql:8.4"
command = ["sh", "-c"]
args = [
"mysql -h dolt.beads-server.svc.cluster.local -P 3306 -u root < /sql/02-create-presence-table.sql"
]
volume_mount {
name = "sql"
mount_path = "/sql"
}
}
volume {
name = "sql"
config_map {
name = kubernetes_config_map.dolt_init.metadata[0].name
}
}
}
}
}
wait_for_completion = true
timeouts {
create = "5m"
}
depends_on = [kubernetes_deployment.dolt]
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
]
}
}
@ -466,11 +374,7 @@ resource "kubernetes_deployment" "workbench" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
]
}
}
@ -512,8 +416,7 @@ module "ingress" {
namespace = kubernetes_namespace.beads.metadata[0].name
name = "dolt-workbench"
tls_secret_name = var.tls_secret_name
# auth = "none": Dolt Workbench is client-side encrypted task database; no backend user auth required; Anubis PoW fronts ingress.
auth = "none"
protected = false
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
@ -663,7 +566,7 @@ resource "kubernetes_deployment" "beadboard" {
}
container {
name = "beadboard"
name = "beadboard"
# Phase 3 cutover 2026-05-07 Forgejo registry consolidation.
image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"
@ -743,11 +646,7 @@ resource "kubernetes_deployment" "beadboard" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
]
}
}
@ -778,7 +677,7 @@ module "beadboard_ingress" {
namespace = kubernetes_namespace.beads.metadata[0].name
name = "beadboard"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
exclude_crowdsec = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"

View file

@ -24,14 +24,6 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "blog"
}
}

View file

@ -10,7 +10,6 @@ resource "kubernetes_namespace" "website" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -77,12 +76,8 @@ resource "kubernetes_deployment" "blog" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -121,25 +116,23 @@ resource "kubernetes_service" "blog" {
# tiny PoW (~250ms desktop), get a 30-day cookie, and pass through. Replaces
# the global ai-bot-block forwardAuth for this site.
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "blog"
namespace = kubernetes_namespace.website.metadata[0].name
target_url = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/10"
source = "../../modules/kubernetes/anubis_instance"
name = "blog"
namespace = kubernetes_namespace.website.metadata[0].name
target_url = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
namespace = kubernetes_namespace.website.metadata[0].name
name = "blog"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
full_host = "viktorbarzin.me"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Anubis is the gatekeeper now drop the redundant ai-bot-block forwardAuth.
full_host = "viktorbarzin.me"
dns_type = "proxied"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false # Anubis is the gatekeeper now drop the redundant ai-bot-block forwardAuth.
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Blog"
@ -152,24 +145,12 @@ module "ingress" {
module "ingress-www" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
namespace = kubernetes_namespace.website.metadata[0].name
name = "blog-www"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
full_host = "www.viktorbarzin.me"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
full_host = "www.viktorbarzin.me"
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,10 +9,6 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

View file

@ -12,7 +12,6 @@ resource "kubernetes_namespace" "broker_sync" {
labels = {
"istio-injection" = "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -62,7 +61,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "broker-sync-data-encrypted"
namespace = kubernetes_namespace.broker_sync.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -74,13 +73,6 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
requests = { storage = "1Gi" }
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
locals {
@ -668,13 +660,8 @@ resource "kubernetes_cron_job_v1" "fidelity" {
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
# Unsuspended 2026-05-17 after the delta gains-offset emission landed
# (broker-sync @98c4729). Manual trigger:
# kubectl -n broker-sync create job fid-now \
# --from=cronjob/broker-sync-fidelity
# NB: storage_state expires every 30-90 days see code-r9n for the
# chrome-service-driven re-seed runbook.
suspend = false
# Suspended until the broker-sync image ships with Playwright + Chromium.
suspend = true
job_template {
metadata {}
spec {

View file

@ -22,9 +22,6 @@ resource "kubernetes_namespace" "calico_system" {
name = "calico-system"
labels = {
name = "calico-system"
# calico-system namespace is managed by tigera-operator auto-update is
# incompatible (operator reverts DaemonSet image from its Installation CR).
# "keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -68,66 +65,3 @@ resource "kubernetes_namespace" "tigera_operator" {
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
# Wave 1 W1.6 (beads code-8ywc): observation phase via Calico GlobalNetworkPolicy
# `action: Log`. This is the supported primitive on Calico OSS v3.26 the
# Calico-Enterprise FelixConfiguration.flowLogsFileEnabled approach is NOT
# accepted by the OSS CRD (verified 2026-05-19: "strict decoding error").
#
# How it works:
# - GNP selects pods by namespaceSelector
# - egress rule action=Log writes an iptables NFLOG entry that lands in the
# kernel log / journald with prefix "calico-packet:" on each node
# - Alloy DaemonSet already ships node-journal to Loki (job=node-journal)
# - LogQL query: {job="node-journal"} |= "calico-packet" surfaces egress flows
# - After ~1 week of observation, build the empirical per-namespace egress
# allowlist; then flip the same GNP to [Allow specific dests, Deny rest]
#
# Started with `recruiter-responder` as the pilot on 2026-05-19; expanded
# 2026-05-19 to all tier 3+4 namespaces (per locked plan tier 3-edge has
# 17 ns, tier 4-aux has 65 ns, all use Calico's WorkloadEndpoint policy
# path). Tier 0/1/2 stay out of observation in wave 1 (cluster infra +
# GPU workloads, deferred per the plan).
#
# `apply_only = true` on the kubectl_manifest means renaming the TF resource
# does NOT destroy the old GNP via TF we kubectl delete the legacy pilot
# GNP after this applies to clean it up. (Tracked manually.)
resource "kubectl_manifest" "wave1_egress_observe_tier34" {
yaml_body = yamlencode({
apiVersion = "projectcalico.org/v3"
kind = "GlobalNetworkPolicy"
metadata = {
name = "wave1-egress-observe-tier34"
annotations = {
"security.viktorbarzin.me/wave" = "1"
"security.viktorbarzin.me/purpose" = "observe-then-enforce egress for tier 3-edge + 4-aux"
}
}
spec = {
order = 2000
selector = "all()"
namespaceSelector = "tier in {\"3-edge\", \"4-aux\"}"
types = ["Egress"]
egress = [
# Rule 1: log every egress packet (LOG target writes to kernel/journal,
# alloy ships to Loki with job=node-journal,transport=kernel).
# LogQL: {job="node-journal"} |~ "calico-packet"
{ action = "Log" },
# Rule 2: allow everything (observation must NOT break workloads).
{ action = "Allow" },
]
}
})
apply_only = true
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,7 +9,6 @@ resource "kubernetes_namespace" "changedetection" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -69,7 +68,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "changedetection-data-proxmox"
namespace = kubernetes_namespace.changedetection.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "8Gi"
}
@ -83,13 +82,6 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "changedetection" {
@ -195,13 +187,8 @@ resource "kubernetes_deployment" "changedetection" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -231,7 +218,7 @@ module "ingress" {
namespace = kubernetes_namespace.changedetection.metadata[0].name
name = "changedetection"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Changedetection"

View file

@ -24,7 +24,6 @@ resource "kubernetes_namespace" "chrome_service" {
"istio-injection" = "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/server" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -75,7 +74,7 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
name = "chrome-service-profile-encrypted"
namespace = kubernetes_namespace.chrome_service.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -89,13 +88,6 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
# --- NFS backup target ---
@ -115,12 +107,6 @@ resource "kubernetes_deployment" "chrome_service" {
namespace = kubernetes_namespace.chrome_service.metadata[0].name
labels = merge(local.labels, {
tier = local.tiers.aux
# Deliberate pin: chrome-service's playwright image MUST match
# the playwright Python version in f1-stream (see local.image
# comment above). Opt out of Keel auto-update via this label
# the inject-keel-annotations ClusterPolicy excludes workloads
# selector-matching keel.sh/policy=never.
"keel.sh/policy" = "never"
})
annotations = {
"reloader.stakater.com/auto" = "true"
@ -318,12 +304,8 @@ resource "kubernetes_deployment" "chrome_service" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -372,7 +354,7 @@ module "ingress" {
namespace = kubernetes_namespace.chrome_service.metadata[0].name
name = "chrome"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
# noVNC defaults to /vnc.html auto-redirect / there.
ingress_path = ["/"]
extra_annotations = {

View file

@ -10,7 +10,6 @@ resource "kubernetes_namespace" "city-guesser" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -68,13 +67,8 @@ resource "kubernetes_deployment" "city-guesser" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -105,7 +99,7 @@ module "ingress" {
namespace = "city-guesser"
name = "city-guesser"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "City Guesser"

View file

@ -12,7 +12,7 @@ locals {
namespace = "claude-agent"
# Phase 3 cutover 2026-05-07 see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
image = "forgejo.viktorbarzin.me/viktor/claude-agent-service"
image_tag = "191ed5dd"
image_tag = "2fd7670d"
labels = {
app = "claude-agent-service"
}
@ -191,25 +191,27 @@ resource "kubernetes_cluster_role_binding" "claude_agent" {
}
# --- Storage ---
#
# The `workspace` volume in the deployment is intentionally emptyDir agent
# jobs do fresh git clones each run, so a per-pod scratch dir on node disk
# is faster and isolated. The 10Gi `claude-agent-workspace-encrypted` PVC
# that previously sat next to this comment was created but never wired
# into the deployment (sat idle from 2026-04-15 to 2026-05-11).
#
# For cases where the agent DOES need to persist state across pod restarts
# (caches, ad-hoc outputs, anything that should survive a pod reschedule),
# `module.persistent` below provides a 5Gi NFS-backed RWX volume mounted
# at /persistent. RWX so all 3 replicas can read/write the same dir;
# sequential job mutex in the service prevents concurrent writes.
module "persistent" {
source = "../../modules/kubernetes/nfs_volume"
name = "claude-agent-persistent"
namespace = kubernetes_namespace.claude_agent.metadata[0].name
nfs_server = "192.168.1.127"
nfs_path = "/srv/nfs/claude-agent-persistent"
storage = "5Gi"
resource "kubernetes_persistent_volume_claim" "workspace" {
wait_until_bound = false
metadata {
name = "claude-agent-workspace-encrypted"
namespace = kubernetes_namespace.claude_agent.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "20Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm-encrypted"
resources {
requests = {
storage = "10Gi"
}
}
}
}
# --- Deployment ---
@ -249,15 +251,11 @@ resource "kubernetes_deployment" "claude_agent" {
fs_group = 1000
}
# Fix workspace ownership. Kubelet creates the Dockerfile WORKDIR
# (/workspace/infra) inside the emptyDir as root:gid=fsGroup with
# the setgid bit uid 1000 can't write into it without explicit
# chown + chmod. Pre-create so the path is guaranteed, then chown
# recursively and chmod the infra subdir for safety.
# Fix workspace ownership (PVC may have root-owned files from prior run)
init_container {
name = "fix-perms"
image = "busybox:1.37"
command = ["sh", "-c", "mkdir -p /workspace/infra /persistent && chown -R 1000:1000 /workspace /persistent && chmod 0775 /workspace/infra /persistent"]
command = ["sh", "-c", "chown -R 1000:1000 /workspace"]
security_context {
run_as_user = 0
}
@ -265,10 +263,6 @@ resource "kubernetes_deployment" "claude_agent" {
name = "workspace"
mount_path = "/workspace"
}
volume_mount {
name = "persistent"
mount_path = "/persistent"
}
resources {
requests = {
memory = "32Mi"
@ -374,7 +368,6 @@ resource "kubernetes_deployment" "claude_agent" {
mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json
cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
cp /usr/share/agent-seed/recruiter-triage.md /home/agent/.claude/agents/recruiter-triage.md
EOT
]
@ -438,10 +431,6 @@ resource "kubernetes_deployment" "claude_agent" {
name = "workspace"
mount_path = "/workspace"
}
volume_mount {
name = "persistent"
mount_path = "/persistent"
}
volume_mount {
name = "sops-age-key"
mount_path = "/home/agent/.config/sops/age"
@ -464,16 +453,8 @@ resource "kubernetes_deployment" "claude_agent" {
volume {
name = "workspace"
# Per-pod ephemeral scratch agent does fresh git clones each
# job, so node-disk emptyDir is faster than a network-backed PVC
# and avoids RWO contention across the 3 replicas.
empty_dir {}
}
volume {
name = "persistent"
persistent_volume_claim {
claim_name = module.persistent.claim_name
claim_name = kubernetes_persistent_volume_claim.workspace.metadata[0].name
}
}

View file

@ -1,18 +0,0 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
dependency "external-secrets" {
config_path = "../external-secrets"
skip_outputs = true
}

View file

@ -6,7 +6,6 @@ variable "postgresql_host" { type = string }
variable "claude_memory_db_password" {
type = string
sensitive = true
default = "" # falls back to Vault `secret/claude-memory.db_password` below
}
data "vault_kv_secret_v2" "secrets" {
@ -19,7 +18,6 @@ resource "kubernetes_namespace" "claude-memory" {
name = "claude-memory"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -114,13 +112,11 @@ resource "kubernetes_job" "db_init" {
"sh", "-c",
<<-EOT
set -e
# -d postgres: psql defaults database name to username; root user
# doesn't have a root-named database, so be explicit.
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${coalesce(var.claude_memory_db_password, data.vault_kv_secret_v2.secrets.data["db_password"])}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE claude_memory OWNER claude_memory"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${var.claude_memory_db_password}'"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE DATABASE claude_memory OWNER claude_memory"
PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
echo "Database init complete"
EOT
]
@ -250,9 +246,6 @@ resource "kubernetes_deployment" "claude-memory" {
ignore_changes = [
spec[0].template[0].spec[0].container[0].image,
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
}
}
@ -281,11 +274,7 @@ resource "kubernetes_service" "claude-memory" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# MCP server called by Claude Code (and other tools/agents) via app-layer
# bearer-token auth; forward-auth would break programmatic clients.
# auth = "none": MCP server called by Claude Code via bearer-token auth; forward-auth would break programmatic clients.
auth = "none"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.claude-memory.metadata[0].name
name = "claude-memory"

View file

@ -50,22 +50,6 @@ locals {
}
}
# Zone-level Bot Management. ai_bots_protection was "block" CF returned
# 403 to declared AI bot UAs at the edge, so the in-cluster x402 gateway
# never got a chance to issue HTTP 402 with a payment offer. Flipped to
# "disabled" so AI bots reach Traefik x402, which returns 402 with the
# wallet address. Generic Bot Fight Mode + crawler protection stay on.
# (import {} stanza for adoption lives in the root stack TF restriction.)
resource "cloudflare_bot_management" "zone" {
zone_id = var.cloudflare_zone_id
enable_js = true
fight_mode = true
ai_bots_protection = "disabled"
# crawler_protection / is_robots_txt_managed are settable only via newer
# provider versions; they retain whatever the API currently has
# (crawler_protection=enabled, is_robots_txt_managed=true).
}
resource "cloudflare_zero_trust_tunnel_cloudflared_config" "sof" {
account_id = var.cloudflare_account_id
tunnel_id = var.cloudflare_tunnel_id
@ -168,57 +152,57 @@ resource "cloudflare_record" "mail_spf" {
}
resource "cloudflare_record" "mail_domainkey_rspamd" {
content = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
name = "mail._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
name = "mail._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_domainkey1" {
content = "b1.viktorbarzin-me.dkim.brevo.com."
name = "brevo1._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
content = "b1.viktorbarzin-me.dkim.brevo.com."
name = "brevo1._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_domainkey2" {
content = "b2.viktorbarzin-me.dkim.brevo.com."
name = "brevo2._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
content = "b2.viktorbarzin-me.dkim.brevo.com."
name = "brevo2._domainkey.viktorbarzin.me"
proxied = false
ttl = 1
type = "CNAME"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "brevo_code" {
content = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
name = "viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
name = "viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_mta_sts" {
content = "\"v=STSv1; id=20260412\""
name = "_mta-sts.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=STSv1; id=20260412\""
name = "_mta-sts.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_tlsrpt" {
content = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
name = "_smtp._tls.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
content = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
name = "_smtp._tls.viktorbarzin.me"
proxied = false
ttl = 1
type = "TXT"
zone_id = var.cloudflare_zone_id
}
resource "cloudflare_record" "mail_dmarc" {

View file

@ -6,8 +6,7 @@ resource "kubernetes_namespace" "cloudflared" {
metadata {
name = "cloudflared"
labels = {
tier = var.tier
"keel.sh/enrolled" = "true"
tier = var.tier
}
}
lifecycle {

View file

@ -52,7 +52,6 @@ resource "kubernetes_namespace" "coturn" {
name = "coturn"
labels = {
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -195,13 +194,8 @@ resource "kubernetes_deployment" "coturn" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}

View file

@ -29,7 +29,6 @@ resource "kubernetes_namespace" "crowdsec" {
labels = {
tier = var.tier
"resource-governance/custom-quota" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -283,7 +282,7 @@ module "ingress" {
dns_type = "proxied"
namespace = kubernetes_namespace.crowdsec.metadata[0].name
name = "crowdsec-web"
auth = "required"
protected = true
tls_secret_name = var.tls_secret_name
exclude_crowdsec = true
}

View file

@ -24,14 +24,6 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "cyberchef"
}
}

View file

@ -9,7 +9,6 @@ resource "kubernetes_namespace" "cyberchef" {
name = "cyberchef"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -78,12 +77,8 @@ resource "kubernetes_deployment" "cyberchef" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -110,24 +105,22 @@ resource "kubernetes_service" "cyberchef" {
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "cc"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
target_url = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/5"
source = "../../modules/kubernetes/anubis_instance"
name = "cc"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
target_url = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
dns_type = "proxied"
namespace = kubernetes_namespace.cyberchef.metadata[0].name
name = "cc"
service_name = module.anubis.service_name
port = module.anubis.service_port
extra_middlewares = ["traefik-x402@kubernetescrd"]
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
tls_secret_name = var.tls_secret_name
anti_ai_scraping = false
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "CyberChef"
@ -137,14 +130,3 @@ module "ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -9,10 +9,6 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

View file

@ -16,7 +16,6 @@ resource "kubernetes_namespace" "dashy" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -101,13 +100,8 @@ resource "kubernetes_deployment" "dashy" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -138,5 +132,5 @@ module "ingress" {
namespace = kubernetes_namespace.dashy.metadata[0].name
name = "dashy"
tls_secret_name = var.tls_secret_name
auth = "required" # hidden as we use homepage now
protected = true # hidden as we use homepage now
}

View file

@ -17,7 +17,6 @@ resource "kubernetes_namespace" "dawarich" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
}
@ -326,13 +325,7 @@ resource "kubernetes_deployment" "dawarich" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -439,13 +432,7 @@ resource "kubernetes_service" "dawarich" {
# }
# }
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# owntracks bridge hook posts to /api/v1/owntracks/points?api_key=... from
# outside the cluster; mobile location apps also POST programmatically with
# an api_key. Forward-auth would 302 these clients into a login they can't
# complete. Dawarich enforces api_key at app layer.
# auth = "none": Location tracking API mobile apps + OwnTracks bridge POST via api_key; forward-auth 302s break programmatic clients.
auth = "none"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.dawarich.metadata[0].name
name = "dawarich"

View file

@ -131,18 +131,6 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
"app.kubernetes.io/instance" = "mysql-standalone"
"app.kubernetes.io/component" = "primary"
}
# Explicit Keel opt-out. The dbaas namespace is already excluded
# from the `inject-keel-annotations` Kyverno ClusterPolicy, but the
# StatefulSet historically picked up Keel annotations anyway (from
# an earlier version of that policy that didn't have the exclusion
# list). `keel.sh/policy: never` makes Keel skip this resource even
# if those legacy annotations are still present, so we cannot be
# silently bumped to a new MySQL version again.
#
# Lifting this MUST go through docs/plans/2026-05-19-mysql-8.4.9-upgrade-*.
annotations = {
"keel.sh/policy" = "never"
}
}
spec {
service_name = "mysql-standalone"
@ -179,28 +167,8 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
}
container {
name = "mysql"
#
# DO NOT BUMP THIS IMAGE WITHOUT FOLLOWING THE PLAN
#
# Pinned to mysql:8.4.8 EXACTLY. The in-server DD upgrade from
# 80408 80409 stalls reliably on this hardware (24s of writes
# then no progress, no CPU, never completes). The 2026-05-18
# recovery from the failed auto-bump took ~25 min of full
# MySQL downtime + Forgejo/registry/7 apps cascade.
#
# To go to 8.4.9 (or any later version), follow:
# docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
# docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
# Beads: code-963q
#
# The upgrade path is wipe + re-init (NOT in-place DD upgrade).
# Requires: maintenance window, fresh dump, Vault user reset.
#
# History: code-eme8 (initial outage), code-k40p (recovery).
# See also: docs/runbooks/restore-mysql.md.
#
image = "mysql:8.4.8"
name = "mysql"
image = "mysql:8.4"
port {
container_port = 3306
@ -272,7 +240,7 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
metadata {
name = "data"
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "50Gi"
}
@ -378,7 +346,7 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
name = "dbaas-pgadmin-encrypted"
namespace = kubernetes_namespace.dbaas.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -392,13 +360,6 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_postgresql_backup_host" {
@ -830,7 +791,7 @@ module "ingress" {
namespace = kubernetes_namespace.dbaas.metadata[0].name
name = "pma"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {}
}
@ -1082,12 +1043,12 @@ module "ingress" {
# Ensure the CNPG cluster manifest exists (idempotent kubectl apply)
resource "null_resource" "pg_cluster" {
triggers = {
instances = "3"
instances = "2"
image = "ghcr.io/cloudnative-pg/postgis:16"
storage_size = "20Gi"
storage_class = "proxmox-lvm-encrypted"
memory_limit = "3Gi"
pg_params = "v3-shared1024-walcomp-workmem16-max200"
memory_limit = "2Gi"
pg_params = "v2-shared512-walcomp-workmem16"
}
provisioner "local-exec" {
@ -1099,26 +1060,13 @@ resource "null_resource" "pg_cluster" {
name: pg-cluster
namespace: dbaas
spec:
# 3 instances (1 primary + 2 replicas) so a single-node drain (e.g.
# kured's weekly OS-reboot wave) still leaves a primary candidate
# immediately available for switchover. Previously 2; CNPG would
# still failover with 2 but only if the lone replica was caught up
# during a long WAL backlog the failover would stall the drain.
# Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
instances: 3
instances: 2
imageName: ghcr.io/cloudnative-pg/postgis:16
postgresql:
parameters:
search_path: '"$user", public'
# Cluster grew past the 100-conn default ceiling (~90/100 idle
# steady-state in May 2026; authentik+matrix alone hold ~55).
# Bumped to 200 with shared_buffers/effective_cache_size/memory
# scaled proportionally. work_mem stays at 16MB that's per
# sort/hash op, not per connection, so 16MB * 200 isn't the
# worst case.
max_connections: "200"
shared_buffers: "1024MB"
effective_cache_size: "2560MB"
shared_buffers: "512MB"
effective_cache_size: "1536MB"
work_mem: "16MB"
wal_compression: "on"
random_page_cost: "4"
@ -1127,9 +1075,7 @@ resource "null_resource" "pg_cluster" {
enableSuperuserAccess: true
inheritedMetadata:
annotations:
# threshold = free-space % below which autoresizer expands.
# 10% means "expand when 90% used" (the conventional knob).
resize.topolvm.io/threshold: "10%"
resize.topolvm.io/threshold: "80%"
resize.topolvm.io/increase: "20%"
resize.topolvm.io/storage_limit: "100Gi"
storage:
@ -1138,9 +1084,9 @@ resource "null_resource" "pg_cluster" {
resources:
requests:
cpu: "50m"
memory: "3Gi"
memory: "2Gi"
limits:
memory: "3Gi"
memory: "2Gi"
EOF
EOT
}
@ -1203,8 +1149,7 @@ resource "null_resource" "pg_terraform_state_db" {
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'terraform_state'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE terraform_state WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1228,8 +1173,7 @@ resource "null_resource" "pg_payslip_ingest_db" {
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'payslip_ingest'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE payslip_ingest WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1253,8 +1197,7 @@ resource "null_resource" "pg_job_hunter_db" {
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'job_hunter'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE job_hunter WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1266,35 +1209,6 @@ resource "null_resource" "pg_job_hunter_db" {
}
}
# Postiz: 3 databases (postiz, temporal, temporal_visibility) all owned by the
# `postiz` role. Bundled bitnami PostgreSQL was retired 2026-05-09 in favour of
# this CNPG cluster covered by postgresql-backup-per-db automatically.
# Role password placeholder; Vault static role `pg-postiz` rotates 7d.
resource "null_resource" "pg_postiz_dbs" {
depends_on = [null_resource.pg_cluster]
triggers = {
role = "postiz"
dbs = "postiz,temporal,temporal_visibility"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'postiz'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE postiz WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
for db in postiz temporal temporal_visibility; do
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'$db'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE $db OWNER postiz"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE $db TO postiz"
done
'
EOT
}
}
# Create wealthfolio_sync database for the SQLitePG ETL sidecar that mirrors
# Wealthfolio's daily_account_valuation/accounts/activities into PG so Grafana
# can chart net worth, contributions, and growth.
@ -1350,35 +1264,6 @@ resource "null_resource" "pg_fire_planner_db" {
}
}
# Create instagram_poster database for the IG-curation pipeline. Initial use:
# benchmark_score table written by `instagram_poster.benchmark` CLI (vision-LLM
# scoring per Immich asset). Future: migrate story_queue/decision/ig_posted_media
# off the pod's sqlite PVC into this DB so the pod is fully stateless.
# Role password is managed by Vault Database Secrets Engine
# (static role `pg-instagram-poster`, 7d rotation).
resource "null_resource" "pg_instagram_poster_db" {
depends_on = [null_resource.pg_cluster]
triggers = {
db_name = "instagram_poster"
username = "instagram_poster"
}
provisioner "local-exec" {
command = <<-EOT
PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
bash -c '
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE ROLE instagram_poster WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
psql -U postgres -c "CREATE DATABASE instagram_poster OWNER instagram_poster"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE instagram_poster TO instagram_poster"
'
EOT
}
}
# Old PostgreSQL deployment kept commented for rollback reference
# resource "kubernetes_deployment" "postgres" {
# metadata {
@ -1515,7 +1400,7 @@ module "ingress-pgadmin" {
namespace = kubernetes_namespace.dbaas.metadata[0].name
name = "pgadmin"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
}

View file

@ -4,8 +4,7 @@ resource "kubernetes_namespace" "descheduler" {
metadata {
name = "descheduler"
labels = {
tier = local.tiers.cluster
"keel.sh/enrolled" = "true"
tier = local.tiers.cluster
}
}
lifecycle {
@ -95,14 +94,3 @@ resource "helm_release" "descheduler" { # rename me
values = [templatefile("${path.module}/values.yaml", {})]
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -10,7 +10,6 @@ resource "kubernetes_namespace" "diun" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -92,7 +91,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "diun-data-proxmox"
namespace = kubernetes_namespace.diun.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -106,13 +105,6 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "diun" {
@ -238,12 +230,6 @@ resource "kubernetes_deployment" "diun" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}

View file

@ -17,7 +17,6 @@ resource "kubernetes_namespace" "ebook2audiobook" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.gpu
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -121,13 +120,8 @@ resource "kubernetes_deployment" "ebook2audiobook" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -260,7 +254,7 @@ module "ingress" {
namespace = kubernetes_namespace.ebook2audiobook.metadata[0].name
name = "ebook2audiobook"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Ebook2Audiobook"
@ -328,13 +322,8 @@ resource "kubernetes_deployment" "audiblez" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -423,13 +412,8 @@ resource "kubernetes_deployment" "audiblez-web" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -461,7 +445,7 @@ module "audiblez-web-ingress" {
host = "audiblez"
dns_type = "non-proxied"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
max_body_size = "500m" # Allow large EPUB uploads
extra_annotations = {
"gethomepage.dev/enabled" = "true"

View file

@ -9,7 +9,6 @@ resource "kubernetes_namespace" "ebooks" {
name = "ebooks"
labels = {
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -151,7 +150,7 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
name = "ebooks-calibre-config-proxmox"
namespace = kubernetes_namespace.ebooks.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "50%"
"resize.topolvm.io/storage_limit" = "10Gi"
}
@ -165,13 +164,6 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_calibre_ingest_host" {
@ -213,7 +205,7 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
name = "ebooks-abs-config-proxmox"
namespace = kubernetes_namespace.ebooks.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -227,13 +219,6 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
module "nfs_audiobookshelf_metadata_host" {
@ -365,13 +350,7 @@ resource "kubernetes_deployment" "calibre-web-automated" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -399,7 +378,6 @@ resource "kubernetes_service" "calibre" {
module "calibre_ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required"
dns_type = "proxied"
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "calibre"
@ -492,13 +470,7 @@ resource "kubernetes_deployment" "annas-archive-stacks" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -530,7 +502,7 @@ module "stacks_ingress" {
name = "stacks"
service_name = "annas-archive-stacks"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "false"
}
@ -647,13 +619,7 @@ resource "kubernetes_deployment" "audiobookshelf" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -680,11 +646,7 @@ resource "kubernetes_service" "audiobookshelf" {
}
module "audiobookshelf_ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": Audiobookshelf has its own user/password login + API
# tokens used by the iOS/Android Audiobookshelf app. Authentik forward-auth
# was 302-ing the mobile clients; ABS's own auth gates users.
auth = "app"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "non-proxied"
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "audiobookshelf"
@ -928,13 +890,7 @@ resource "kubernetes_deployment" "book_search" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
}
@ -965,7 +921,7 @@ module "book_search_ingress" {
namespace = kubernetes_namespace.ebooks.metadata[0].name
name = "book-search"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Book Search"
@ -984,7 +940,6 @@ module "book_search_api_ingress" {
host = "book-search"
service_name = "book-search"
tls_secret_name = var.tls_secret_name
# auth = "none": Book Search API endpoints API key auth handled by backend; forward-auth would block downloads.
auth = "none"
protected = false
ingress_path = ["/api/download-url", "/api/download-status", "/api/send-to-kindle", "/shortcut"]
}

View file

@ -10,7 +10,6 @@ resource "kubernetes_namespace" "echo" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -75,13 +74,8 @@ resource "kubernetes_deployment" "echo" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -107,11 +101,7 @@ resource "kubernetes_service" "echo" {
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# echo is a header-reflecting diagnostic public so it's reachable for
# forward-auth smoke-testing. Anyone visiting echo.viktorbarzin.me sees
# exactly which X-authentik-* headers Traefik forwarded to backends.
auth = "public"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = kubernetes_namespace.echo.metadata[0].name
name = "echo"

View file

@ -11,7 +11,6 @@ resource "kubernetes_namespace" "excalidraw" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -33,7 +32,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "excalidraw-data-proxmox"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -47,13 +46,6 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "excalidraw" {
@ -125,13 +117,8 @@ resource "kubernetes_deployment" "excalidraw" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -162,7 +149,7 @@ module "ingress" {
namespace = kubernetes_namespace.excalidraw.metadata[0].name
name = "draw"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Excalidraw"

View file

@ -3,7 +3,6 @@ resource "kubernetes_namespace" "external_secrets" {
name = "external-secrets"
labels = {
tier = local.tiers.cluster
"keel.sh/enrolled" = "true"
}
}
lifecycle {

View file

@ -13,7 +13,6 @@ resource "kubernetes_namespace" "f1-stream" {
"istio-injection" : "disabled"
tier = local.tiers.aux
"chrome-service.viktorbarzin.me/client" = "true"
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -84,7 +83,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "f1-stream-data-proxmox"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -98,13 +97,6 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_deployment" "f1-stream" {
@ -203,12 +195,8 @@ resource "kubernetes_deployment" "f1-stream" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -249,12 +237,11 @@ module "tls_secret" {
# (which load before any user has a chance to solve PoW), CHALLENGE
# everything else the HTML pages.
module "anubis" {
source = "../../modules/kubernetes/anubis_instance"
name = "f1"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
target_url = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/6"
policy_yaml = <<-EOT
source = "../../modules/kubernetes/anubis_instance"
name = "f1"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
target_url = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
policy_yaml = <<-EOT
bots:
- import: (data)/bots/_deny-pathological.yaml
- import: (data)/bots/aggressive-brazilian-scrapers.yaml
@ -275,11 +262,6 @@ module "anubis" {
- name: f1-data-routes
path_regex: ^/(embed|embed-asset|extract|extractors|health|proxy|relay|schedule|streams)(/|\?|$)
action: ALLOW
# Allow non-GET methods unconditionally AI scrapers GET the body,
# they don't POST. Mutating XHRs and CORS preflight need to bypass.
- name: allow-non-get-methods
action: ALLOW
expression: method != "GET"
- name: catchall-challenge
path_regex: .*
action: CHALLENGE
@ -288,7 +270,6 @@ module "anubis" {
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
dns_type = "non-proxied"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
name = "f1"
@ -307,14 +288,3 @@ module "ingress" {
"gethomepage.dev/pod-selector" = ""
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00
# CI retrigger v3 2026-05-16T14:06:39Z
# CI retrigger v4 2026-05-16T14:13:59Z
# CI retrigger v5 2026-05-16T23:10:38Z
# CI retrigger v6 2026-05-16T23:18:58Z

View file

@ -33,8 +33,6 @@ resource "kubernetes_namespace" "fire_planner" {
# for headless verification (NetworkPolicy in chrome-service ns admits
# any namespace carrying this label).
"chrome-service.viktorbarzin.me/client" = "true"
# Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -232,10 +230,9 @@ resource "kubernetes_deployment" "fire_planner" {
}
init_container {
name = "alembic-migrate"
image = local.image
image_pull_policy = "Always"
command = ["python", "-m", "fire_planner", "migrate"]
name = "alembic-migrate"
image = local.image
command = ["python", "-m", "fire_planner", "migrate"]
env_from {
secret_ref {
@ -313,12 +310,7 @@ resource "kubernetes_deployment" "fire_planner" {
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
depends_on = [
@ -428,77 +420,6 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
]
}
# Weekly refresh of the COL cache: walks col_snapshot for rows
# expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
# the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
# Sundays the job is a no-op until rows age out. Schedule Sunday 04:00
# UTC so Numbeo's contributor activity (mostly weekday) doesn't race
# our reads.
resource "kubernetes_cron_job_v1" "fire_planner_col_refresh" {
metadata {
name = "fire-planner-col-refresh"
namespace = kubernetes_namespace.fire_planner.metadata[0].name
}
spec {
schedule = "0 4 * * 0"
concurrency_policy = "Forbid"
successful_jobs_history_limit = 3
failed_jobs_history_limit = 5
starting_deadline_seconds = 600
job_template {
metadata {
labels = local.labels
}
spec {
backoff_limit = 1
ttl_seconds_after_finished = 86400
template {
metadata {
labels = local.labels
}
spec {
restart_policy = "OnFailure"
image_pull_secrets {
name = "registry-credentials"
}
container {
name = "col-refresh"
image = local.image
command = ["python", "-m", "fire_planner", "col-refresh-stale", "--within-days", "7"]
env_from {
secret_ref {
name = "fire-planner-db-creds"
}
}
resources {
requests = {
cpu = "100m"
memory = "256Mi"
}
limits = {
memory = "512Mi"
}
}
}
}
}
}
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
}
depends_on = [
kubernetes_manifest.db_external_secret,
]
}
# Public ingress at fire-planner.viktorbarzin.me. Authentik-protected
# (forward-auth at the Traefik layer); Cloudflare-proxied for CDN +
# DDoS shielding. Backend FastAPI serves the SPA at / and the API
@ -510,7 +431,7 @@ module "ingress" {
name = "fire-planner"
port = 8080
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "FIRE Planner"
@ -522,14 +443,11 @@ module "ingress" {
# Second ingress at the same host for the /api/ prefix WITHOUT Authentik
# forward-auth. The SPA loads under Authentik (main ingress at /), then its
# fetch() XHRs hit /api/* directly ANY forward-auth here (required OR
# public-tier auto-bind) would 302 the XHR to a cross-origin Authentik
# login page, which fetch() rejects under CORS preflight rules. Even the
# `auth = "public"` flow needs a 302+cookie dance on first visit to set
# the guest session cookie, so it doesn't help XHR APIs. App-layer bearer
# auth still gates writes (POST/PATCH/DELETE on scenarios, /recompute,
# /simulate); read endpoints are open. Acceptable for a personal tool
# whose only data is anonymous numeric projections.
# fetch() XHRs hit /api/* directly forward-auth on /api/* would 302 the
# XHR to a cross-origin Authentik login page, which fetch().json() can't
# parse. App-layer bearer auth still gates writes (POST/PATCH/DELETE on
# scenarios, /recompute, /simulate); read endpoints are open. Acceptable
# for a personal tool whose only data is anonymous numeric projections.
module "ingress_api" {
source = "../../modules/kubernetes/ingress_factory"
dns_type = "none"
@ -540,8 +458,7 @@ module "ingress_api" {
port = 8080
ingress_path = ["/api/"]
tls_secret_name = var.tls_secret_name
# auth = "none": XHR-based API endpoints; forward-auth 302+cookie-dance breaks CORS preflight and browser fetch().
auth = "none"
protected = false
}
# Plan-time read of the ESO-created K8s Secret for Grafana datasource
@ -597,6 +514,3 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
})
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

View file

@ -9,7 +9,6 @@ resource "kubernetes_namespace" "foolery" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -66,7 +65,7 @@ module "ingress" {
namespace = kubernetes_namespace.foolery.metadata[0].name
name = "foolery"
tls_secret_name = var.tls_secret_name
auth = "required"
protected = true
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Foolery"

View file

@ -10,7 +10,6 @@ resource "kubernetes_namespace" "forgejo" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -31,7 +30,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
name = "forgejo-data-encrypted"
namespace = kubernetes_namespace.forgejo.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "50%"
"resize.topolvm.io/storage_limit" = "50Gi"
}
@ -141,16 +140,6 @@ resource "kubernetes_deployment" "forgejo" {
name = "FORGEJO__packages__ENABLED"
value = "true"
}
# Disable source archive ZIP/TAR generation. Bots crawling
# /<owner>/<repo>/archive/<sha>.zip on dot_files (and similar
# vim-plugin trees) caused 9.9s 500s and chewed ~440m sustained
# CPU. Git clone / OCI registry / API are unaffected only
# /archive/* URLs return 404 now. Toggle back to "false" if a
# legitimate consumer needs source ZIPs.
env {
name = "FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES"
value = "true"
}
volume_mount {
name = "data"
mount_path = "/data"
@ -180,13 +169,8 @@ resource "kubernetes_deployment" "forgejo" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE Keel manages tag updates
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -210,12 +194,7 @@ resource "kubernetes_service" "forgejo" {
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# Git + OCI registry (/v2/) native clients (git, docker/podman) use HTTP
# basic-auth / bearer tokens, NOT browser sessions. Forward-auth would 302
# them into a redirect they can't follow.
# auth = "none": Git + OCI registry clients use HTTP Basic auth / bearer tokens; native CLI tools cannot follow forward-auth redirects.
auth = "none"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "non-proxied"
namespace = kubernetes_namespace.forgejo.metadata[0].name
name = "forgejo"

View file

@ -225,7 +225,7 @@ module "ingress" {
name = "music-${var.name}"
tls_secret_name = var.tls_secret_name
dns_type = "non-proxied"
auth = var.protected ? "required" : "none"
protected = var.protected
extra_annotations = var.extra_annotations
}
@ -235,9 +235,9 @@ resource "kubernetes_ingress_v1" "stream-noauth" {
name = "music-${var.name}-stream"
namespace = "freedify"
annotations = {
"traefik.ingress.kubernetes.io/router.middlewares" = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
"traefik.ingress.kubernetes.io/router.priority" = "100"
"traefik.ingress.kubernetes.io/router.middlewares" = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
"traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
"traefik.ingress.kubernetes.io/router.priority" = "100"
}
}
spec {

View file

@ -55,7 +55,6 @@ resource "kubernetes_namespace" "freedify" {
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -99,14 +98,14 @@ module "viktor" {
# https://music-emo.viktorbarzin.me/
module "emo" {
source = "./factory"
name = "emo"
tag = "latest"
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.freedify]
tier = local.tiers.aux
protected = true
genius_token = lookup(local.credentials["emo"], "genius_token", null)
source = "./factory"
name = "emo"
tag = "latest"
tls_secret_name = var.tls_secret_name
depends_on = [kubernetes_namespace.freedify]
tier = local.tiers.aux
protected = true
genius_token = lookup(local.credentials["emo"], "genius_token", null)
gemini_api_key = lookup(local.credentials["emo"], "gemini_api_key", null)
navidrome_scan_url = data.kubernetes_secret.eso_secrets.data["navidrome_scan_url"]
ha_sofia_url = lookup(data.kubernetes_secret.eso_secrets.data, "ha_sofia_url", "")

View file

@ -24,14 +24,6 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

View file

@ -8,7 +8,6 @@ resource "kubernetes_namespace" "immich" {
name = "freshrss"
labels = {
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
lifecycle {
@ -68,7 +67,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
name = "freshrss-data-proxmox"
namespace = kubernetes_namespace.immich.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -82,13 +81,6 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
@ -97,7 +89,7 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
name = "freshrss-extensions-proxmox"
namespace = kubernetes_namespace.immich.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/threshold" = "80%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
@ -111,13 +103,6 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
}
@ -204,12 +189,8 @@ resource "kubernetes_deployment" "freshrss" {
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
]
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
}
@ -233,11 +214,7 @@ resource "kubernetes_service" "freshrss" {
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "app": FreshRSS has built-in user login and exposes Fever +
# GReader APIs (/api/fever.php, /api/greader.php) used by mobile RSS
# readers like Reeder/FeedMe. Authentik forward-auth was 302-ing those.
auth = "app"
source = "../../modules/kubernetes/ingress_factory"
dns_type = "proxied"
namespace = "freshrss"
name = "rss"
@ -256,6 +233,3 @@ module "ingress" {
"gethomepage.dev/widget.password" = local.homepage_credentials["freshrss"]["password"]
}
}
# CI retrigger 2026-05-16T13:42:57+00:00 bulk enrollment apply (pipeline #689 killed)
# CI retrigger v2 2026-05-16T13:46:35+00:00

View file

@ -9,10 +9,6 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
}
}

Some files were not shown because too many files have changed in this diff Show more