fire-planner: COL refresh CronJob + Grafana Cost-of-Living dashboard

Operational layer for the new col_snapshot cache shipped in fire-planner@e72fd22: stacks/fire-planner: - fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows age toward the 1-year TTL boundary (within 7 days). Calls python -m fire_planner col-refresh-stale, upserts via cache.upsert. monitoring/dashboards/cost-of-living.json (Finance folder): - Two template variables: $city (single-select from col_snapshot), $baseline_city (for COL ratio computation, defaults London). - Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded). - All-cities ranked table with gradient-gauged total + colored ratio. - Cache-freshness table flags rows approaching TTL expiry. Initial population needs a one-shot: post-Keel-rollout, kubectl -n fire-planner exec deploy/fire-planner -- \\ python -m fire_planner col-seed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
openclaw: revert model swap + document codex re-auth path
2026-05-22 14:17:01 +00:00 · 2026-05-22 14:17:01 +00:00 · 2026-05-22 14:17:01 +00:00 · 2026-05-22 14:17:01 +00:00 · 2026-05-22 14:17:01 +00:00 · 2026-05-22 14:17:01 +00:00
283 changed files with 30513 additions and 5339 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@ -28,9 +28,16 @@ Violations cause state drift, which causes future applies to break or silently r
 - **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
 - **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma)
 - **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. Auth: `protected = true`. Anti-AI: on by default. **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`.
+- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
+  - `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, foolery, any admin UI shipped without its own login).
+  - `auth = "app"` — the backend handles its own user authentication (NextAuth, Django, OAuth, bearer-token API, etc.); Authentik would only break it. No middleware attached; the app's own login is the gate. Examples: immich, linkwarden, tandoor, freshrss, affine, actualbudget, audiobookshelf, novelapp. **Functionally identical to `"none"`** — the distinct name exists to record intent at the call site.
+  - `auth = "public"` — Authentik anonymous binding via the dedicated `public` outpost (routes via `traefik-authentik-forward-auth-public` → `ak-outpost-public.authentik.svc:9000`). Strangers auto-bound to `guest`; logged-in users keep their identity in `X-authentik-username`. **Only works for top-level browser navigation** — CORS preflight rejects XHR/fetch and automation can't replay the cookie dance. Audit trail, not a gate.
+  - `auth = "none"` — no Authentik, no own-auth claim. Use for Anubis-fronted content (Anubis is the gate), native-client APIs (Git, `/v2/`, WebDAV/CalDAV, CardDAV), webhook receivers, OAuth callbacks, and Authentik outposts themselves.
+  - **Anti-exposure rule** (the reason `"app"` exists): only pick `"app"` or `"none"` AFTER you've verified the app has its own user auth (`"app"`) OR the endpoint is intentionally public (`"none"`). Default is `"required"` so accidental omission fails closed. **Convention**: when using `"app"` or `"none"`, add a comment line above the `auth = "..."` line stating what gates the app or why it's public. **Enforced by `scripts/tg`**: every `tg plan/apply/destroy/refresh` runs `scripts/check-ingress-auth-comments.py` against the current stack and aborts if any `auth = "app|none"` line lacks the preceding `# auth = "<tier>": ...` comment. Stack-scoped — untouched stacks aren't blocked until they're next edited.
+  - **Anti-AI**: on by default when `auth = "none"` or `auth = "app"` (no Authentik to discourage bots); redundant on `"required"` and `"public"`.
+  - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend).
 - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<backend>.<ns>.svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct), declare a second `ingress_factory` with `ingress_path = ["/api"]` pointing at the bare backend service. Active on: blog, www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering.
- **Docker images**: Always build for `linux/amd64`. Use 8-char git SHA tags — `:latest` causes stale pull-through cache.
+- **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern.
 - **Private registry**: `forgejo.viktorbarzin.me/viktor/<name>` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/<name>:<tag>` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.200` to avoid hairpin NAT. Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07.
 - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts.
 - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi.
@ -129,7 +136,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
-| MySQL Standalone | Raw `kubernetes_stateful_set_v1` with `mysql:8.4` (migrated from InnoDB Cluster 2026-04-16). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (15Gi, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Old InnoDB Cluster + operator still in TF (Phase 4 cleanup pending). Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
+| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
 | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |

 ## Monitoring & Alerting
@ -140,6 +147,17 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).

+## Security Posture (Wave 1 — locked 2026-05-18)
+
+Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`.
+
+- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
+- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale.
+- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
+- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
+
 ## Storage & Backup Architecture

 ### Storage Class Decision Rule (for new services)
@ -177,7 +195,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "<service>-data-proxmox"
    namespace = kubernetes_namespace.<ns>.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -213,7 +231,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
    name      = "<service>-data-encrypted"
    namespace = kubernetes_namespace.<ns>.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -269,7 +287,8 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {

 ## Known Issues
 - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec <rev> -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`.
+- **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set <name> <json>` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`).
+- **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects/<x>/*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects/<source>/` and run `openclaw memory index --force`. This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync.
 - **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change.

 ## User Preferences
--- a/.claude/agents/k8s-version-upgrade.deprecated.md
+++ b/.claude/agents/k8s-version-upgrade.deprecated.md
@ -0,0 +1,543 @@
+---
+name: k8s-version-upgrade-DEPRECATED
+description: "DEPRECATED 2026-05-11 — replaced by the Job-chain in stacks/k8s-version-upgrade. See header below."
+tools: Read, Write, Edit, Bash, Grep, Glob
+model: opus
+---
+
+# DEPRECATED — Do NOT invoke this agent
+
+Retired **2026-05-11** after a self-preemption incident: this agent ran inside
+the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was
+scheduled onto k8s-node4. When the agent tried to `kubectl drain k8s-node4`
+(Stage 6, first worker), it evicted itself. The bash process died mid-SSH,
+leaving node4 cordoned and the cluster half-upgraded (master at v1.34.7,
+workers at v1.34.2).
+
+## Replaced by
+
+A chain of small Kubernetes Jobs, each pinned (via `nodeSelector` +
+`kubernetes.io/hostname`) to a node that is NOT its drain target. No pod can
+preempt itself because each Job's pod and its target node are always
+different.
+
+| Old | New |
+|-----|-----|
+| Single agent run in claude-agent-service pod | Chain of 7 phase Jobs (preflight → master → worker × 4 → postflight) |
+| Whole pipeline in one prompt | Phase body in `stacks/k8s-version-upgrade/scripts/upgrade-step.sh`, dispatched per-phase via `case $PHASE` |
+| Detection CronJob POSTs to `claude-agent-service` | Detection CronJob renders Job 0 from `job-template.yaml` via `envsubst` + `kubectl apply` |
+| Drain blocks indefinitely on PDB=0 (e.g. single-replica Anubis) | New `predrain_unstick` deletes PDB-blocked pods so drain proceeds |
+| `K8sVersionSkew` + `EtcdPreUpgradeSnapshotMissing` alerts | Above + `K8sUpgradeStalled` (in_flight=1 and time()-started_timestamp > 5400s) |
+
+## Where the logic lives now
+
+- **`infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`** — universal
+  phase body. Dispatches on `$PHASE`. Each phase spawns the next Job.
+- **`infra/stacks/k8s-version-upgrade/job-template.yaml`** — Job template
+  rendered by `envsubst` at runtime. ConfigMap-mounted at `/template` in
+  every Job pod.
+- **`infra/stacks/k8s-version-upgrade/main.tf`** — Terraform stack: ConfigMaps,
+  unified `k8s-upgrade-job` ServiceAccount + RBAC, detection CronJob.
+- **`infra/docs/runbooks/k8s-version-upgrade.md`** — operator runbook (kill a
+  stuck Job, skip a phase, manually re-trigger from a specific phase).
+
+## Why kept (not deleted)
+
+Documents the prompted-agent design and is useful as historical reference when
+reading post-mortem discussions or comparing approaches. The `name` field has
+been suffixed with `-DEPRECATED` so the agent cannot be invoked by name from
+`claude-agent-service`.
+
+---
+
+# Original prompt — DO NOT EXECUTE (reference only)
+
+You are the K8s Version Upgrade Agent for a 5-node home-lab Kubernetes cluster (1 master, 4 workers, stacked etcd, no HA).
+
+## Your Job
+
+Given a target patch or minor version of `kubeadm`/`kubelet`/`kubectl`, you orchestrate the full rolling upgrade with safety gates between every node. You do NOT decide WHEN to run — the `k8s-version-check` CronJob in the `k8s-upgrade` namespace fires you off after detection. You only run when invoked.
+
+The sequence (Pre-flight → etcd snapshot → master containerd skew fix → apt repo URL change [minor only] → master kubeadm upgrade → workers sequentially → Post-flight) is non-negotiable. Skipping a step is how clusters die.
+
+## Inputs
+
+The user prompt contains a JSON object with these fields:
+
+```json
+{
+  "target_version": "1.34.5",
+  "kind": "patch",
+  "dry_run": false,
+  "stages": "all"
+}
+```
+
+| Field | Required | Description |
+|---|---|---|
+| `target_version` | yes | Exact `X.Y.Z` to land on (e.g. `1.34.5`). The script `infra/scripts/update_k8s.sh` accepts this via `--release`. |
+| `kind` | yes | `patch` (no apt-repo URL change) or `minor` (rewrite repo to v$NEW_MINOR/deb on every node before kubeadm). |
+| `dry_run` | no, default false | If true, run all SSH + kubectl READ commands but skip every mutating command (`apt-get install`, `kubeadm upgrade apply`, `kubeadm upgrade node`, `kubectl drain/uncordon`, etcd snapshot, systemctl restart). Log what you would do and exit 0. |
+| `stages` | no, default `all` | Comma-separated subset of: `preflight`, `snapshot`, `containerd`, `repo`, `master`, `workers`, `postflight`. Run only those stages and exit. Used by tests. |
+
+Parse the prompt's first JSON block to extract these. If anything is missing, abort with a Slack notification ("malformed payload").
+
+## Environment
+
+- **Working dir**: `/workspace/infra` (`WORKSPACE_DIR` env var)
+- **Kubeconfig**: `/workspace/infra/config` (use `kubectl --kubeconfig $WORKSPACE_DIR/config ...` in every kubectl call)
+- **Prometheus**: `http://prometheus-server.monitoring.svc.cluster.local:80` (in-cluster, no auth)
+- **Etcd snapshot**: triggered as a one-shot Job from the existing `default/backup-etcd` CronJob (defined in `stacks/infra-maintenance/`). The Job runs on `k8s-master` with hostNetwork (so etcdctl reaches etcd at 127.0.0.1:2379), mounts the PV-backed NFS export `192.168.1.127:/srv/nfs/etcd-backup`, and writes `etcd-snapshot-<TIMESTAMP>.db` there. Do NOT shell into master with etcdctl directly — the cert paths + NFS mount are already wired into the CronJob.
+- **Library script**: `/workspace/infra/scripts/update_k8s.sh` — pipe via SSH to each node, do NOT modify on the fly. Invoke as `ssh ... 'bash -s' < update_k8s.sh --role <role> --release <X.Y.Z>`.
+
+### Credentials — fetched at startup
+
+The k8s-upgrade ServiceAccount has GET on the `k8s-upgrade-creds` Secret in the `k8s-upgrade` namespace (granted by a RoleBinding in `stacks/k8s-version-upgrade/main.tf`). Fetch credentials into `/tmp` files at the start of every run:
+
+```bash
+KUBECTL="kubectl --kubeconfig $WORKSPACE_DIR/config"
+
+# SSH private key — mode 0400 required by openssh
+$KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
+  -o jsonpath='{.data.ssh_key}' | base64 -d > /tmp/k8s-upgrade-ssh-key
+chmod 400 /tmp/k8s-upgrade-ssh-key
+
+# Slack webhook (URL string)
+SLACK_WEBHOOK_K8S_UPGRADE=$($KUBECTL get secret -n k8s-upgrade k8s-upgrade-creds \
+  -o jsonpath='{.data.slack_webhook}' | base64 -d)
+```
+
+The rest of the prompt uses `/tmp/k8s-upgrade-ssh-key` for SSH and `$SLACK_WEBHOOK_K8S_UPGRADE` for Slack. SSH template:
+
+```bash
+SSH="ssh -i /tmp/k8s-upgrade-ssh-key -o StrictHostKeyChecking=accept-new -o UserKnownHostsFile=/tmp/known_hosts"
+```
+
+Every SSH call below uses `$SSH wizard@<host> '<cmd>'`. `accept-new` accepts the host key on first encounter then pins it — if a node was reimaged, clear `/tmp/known_hosts` before retry.
+
+## NEVER do
+
+- Never bypass the halt-on-alert check — even if a single alert "looks unrelated"
+- Never start the next worker before the previous one is Ready + all its pods rescheduled + 10-min soak observed
+- Never skip the etcd snapshot — even for patch
+- Never `kubectl edit/patch/delete` — read-only kubectl plus `drain`/`uncordon` only
+- Never `apt-mark hold` something without unholding it first, and vice versa — the script handles this; don't do it manually
+- Never run two stages in parallel — sequential only
+- Never run if `dry_run=false` AND the cluster has a node Not Ready, or any Upgrade Gates alert firing
+- Never push to git, never modify Terraform, never invoke claude-agent-service recursively
+
+## Slack + Pushgateway helpers
+
+Every transition posts to Slack:
+
+```bash
+slack() {
+  local msg="$1"
+  local hook="${SLACK_WEBHOOK_K8S_UPGRADE:-$SLACK_WEBHOOK_URL}"
+  curl -sS -X POST -H 'Content-Type: application/json' \
+    --data "$(jq -nc --arg t "[k8s-upgrade] $msg" '{text: $t}')" \
+    "$hook"
+}
+```
+
+Start every message with `[k8s-upgrade]` so it's grep-able.
+
+Pushgateway gauges drive the `EtcdPreUpgradeSnapshotMissing` and ops-visibility metrics:
+
+```bash
+PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/k8s-version-upgrade'
+
+push_metric() {
+  # push_metric <name> <value>
+  local name="$1" val="$2"
+  printf '# TYPE %s gauge\n%s %s\n' "$name" "$name" "$val" \
+    | curl -sS --data-binary @- "$PG"
+}
+```
+
+Pushes you must make at specific stages (skipped in dry_run):
+| When | Metric | Value |
+|---|---|---|
+| Stage 0 start | `k8s_upgrade_in_flight` | `1` |
+| Stage 0 start | `k8s_upgrade_target_minor` | `$target_minor` |
+| Stage 2 verified | `k8s_upgrade_snapshot_taken` | `1` |
+| Stage 7 clean | `k8s_upgrade_in_flight` | `0` |
+| Stage 7 clean | `k8s_upgrade_snapshot_taken` | `0` |
+
+If you abort mid-flight, leave `k8s_upgrade_in_flight=1` so the alert fires and surfaces the half-done state.
+
+## Stage 0: Parse inputs + announce
+
+1. Extract `target_version`, `kind`, `dry_run`, `stages` from the prompt JSON.
+2. Derive `target_minor` from `target_version` (split on `.`).
+3. Mark the in-flight annotation on the namespace AND push Pushgateway in-flight gauge:
+   ```bash
+   if [ "$dry_run" = "false" ]; then
+     kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
+       viktorbarzin.me/k8s-upgrade-in-flight="$(date -u +%FT%TZ)" \
+       viktorbarzin.me/k8s-upgrade-target="$target_version" \
+       --overwrite
+
+     push_metric k8s_upgrade_in_flight 1
+     push_metric k8s_upgrade_snapshot_taken 0
+   fi
+   ```
+4. Slack: `Starting k8s upgrade to v$target_version (kind=$kind, dry_run=$dry_run, stages=$stages)`.
+
+## Stage 1: Pre-flight (`stages` includes `preflight`)
+
+Skip if `stages` excludes `preflight`.
+
+### Check 1.1 — All nodes Ready, no pressure
+
+```bash
+kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o json \
+  | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="Ready") | .status), Mem=\(.status.conditions[] | select(.type=="MemoryPressure") | .status), Disk=\(.status.conditions[] | select(.type=="DiskPressure") | .status)"'
+```
+
+Abort if any node is not Ready=True, or has MemoryPressure=True or DiskPressure=True.
+
+### Check 1.2 — Halt-on-alert (same query kured uses)
+
+```bash
+ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+  | sort -u)
+
+if [ -n "$ALERTS" ]; then
+  slack "ABORT preflight — firing alerts:\n$ALERTS"
+  exit 1
+fi
+```
+
+### Check 1.3 — 24h-quiet baseline
+
+Re-uses the sentinel-gate Check 4 logic from `stacks/kured/main.tf`. Any node that transitioned Ready in the last 24h means the cluster just absorbed a node reboot — we want a clean baseline before starting a fresh rollout.
+
+```bash
+RECENT_REBOOT=0
+while IFS= read -r ts; do
+  [ -z "$ts" ] && continue
+  diff=$(( $(date +%s) - $(date -d "$ts" +%s) ))
+  [ "$diff" -lt 86400 ] && RECENT_REBOOT=1 && break
+done < <(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes -o jsonpath='{range .items[*]}{range .status.conditions[?(@.type=="Ready")]}{.lastTransitionTime}{"\n"}{end}{end}')
+
+if [ "$RECENT_REBOOT" -eq 1 ]; then
+  slack "ABORT preflight — node transitioned Ready <24h ago (soak window)"
+  exit 1
+fi
+```
+
+### Check 1.4 — kubeadm upgrade plan reports our target
+
+```bash
+PLAN_TARGET=$($SSH \
+  wizard@k8s-master 'sudo kubeadm upgrade plan' \
+  | grep -oE 'You can now apply the upgrade by executing the following command:.*v[0-9]+\.[0-9]+\.[0-9]+' \
+  | grep -oE 'v[0-9]+\.[0-9]+\.[0-9]+' | head -1 | tr -d v)
+```
+
+If `$PLAN_TARGET` does not start with the requested `target_version`, slack-abort:
+"`kubeadm upgrade plan` says target is $PLAN_TARGET but caller asked for $target_version — drift; aborting."
+
+Slack: `Pre-flight clean. Proceeding to etcd snapshot.`
+
+## Stage 2: Etcd snapshot (`stages` includes `snapshot`)
+
+Always run — patch OR minor. Triggers a one-shot Job from the existing `default/backup-etcd` CronJob and waits for it to complete.
+
+```bash
+JOB_NAME="pre-upgrade-etcd-${target_version}-$(date +%s)"
+
+if [ "$dry_run" = "false" ]; then
+  $KUBECTL -n default create job --from=cronjob/backup-etcd "$JOB_NAME"
+
+  # Wait up to 10 min for snapshot Job to complete
+  $KUBECTL -n default wait --for=condition=complete --timeout=600s "job/$JOB_NAME" || {
+    slack "ABORT Stage 2 — etcd snapshot Job did not complete in 10 min"
+    $KUBECTL -n default describe "job/$JOB_NAME" | tail -30
+    exit 1
+  }
+
+  # Parse the Job's pod log for "Backup done: <file> (<bytes> bytes)"
+  LOG=$($KUBECTL -n default logs "job/$JOB_NAME" -c backup-manage --tail=20)
+  echo "$LOG"
+  SNAPSHOT_LINE=$(echo "$LOG" | grep -E '^Backup done:')
+  SIZE=$(echo "$SNAPSHOT_LINE" | grep -oE '\([0-9]+ bytes\)' | grep -oE '[0-9]+')
+  SNAPSHOT_FILE=$(echo "$SNAPSHOT_LINE" | awk '{print $3}')
+
+  if [ -z "$SIZE" ] || [ "$SIZE" -lt 1024 ]; then
+    slack "ABORT Stage 2 — etcd snapshot empty or missing (size='$SIZE' line='$SNAPSHOT_LINE')"
+    exit 1
+  fi
+
+  TARGET_PATH="nfs://192.168.1.127:/srv/nfs/etcd-backup/$SNAPSHOT_FILE"
+  $KUBECTL annotate ns k8s-upgrade \
+    viktorbarzin.me/k8s-upgrade-snapshot-path="$TARGET_PATH" --overwrite
+
+  push_metric k8s_upgrade_snapshot_taken 1
+else
+  TARGET_PATH="WOULD: trigger default/backup-etcd Job, wait, verify size"
+  SIZE="dry-run"
+fi
+
+slack "Etcd snapshot saved at $TARGET_PATH (size=$SIZE)"
+```
+
+## Stage 3: Master containerd skew fix (`stages` includes `containerd`)
+
+Only run if master containerd version < highest worker containerd version.
+
+```bash
+get_ctr_version() {
+  $SSH \
+    "wizard@$1" 'containerd --version | awk "{print \$3}" | tr -d v'
+}
+
+MASTER_CTR=$(get_ctr_version k8s-master)
+WORKER_MAX="0.0.0"
+for n in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  v=$(get_ctr_version "$n")
+  # Compare semver-ish
+  if [ "$(printf '%s\n%s' "$v" "$WORKER_MAX" | sort -V | tail -1)" = "$v" ]; then
+    WORKER_MAX="$v"
+  fi
+done
+
+if [ "$(printf '%s\n%s' "$MASTER_CTR" "$WORKER_MAX" | sort -V | head -1)" = "$MASTER_CTR" ] \
+   && [ "$MASTER_CTR" != "$WORKER_MAX" ]; then
+  # Master is behind — bump
+  slack "Master containerd $MASTER_CTR < workers $WORKER_MAX — bumping master"
+
+  if [ "$dry_run" = "false" ]; then
+    $SSH \
+      wizard@k8s-master "sudo apt-mark unhold containerd.io \
+        && sudo apt-get install -y containerd.io='$WORKER_MAX-1' \
+        && sudo apt-mark hold containerd.io \
+        && sudo systemctl restart containerd"
+
+    # Wait until kubelet on master is Ready again
+    for i in $(seq 1 60); do
+      STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+      [ "$STATUS" = "True" ] && break
+      sleep 10
+    done
+    [ "$STATUS" = "True" ] || { slack "ABORT — k8s-master not Ready after containerd bump"; exit 1; }
+  fi
+
+  slack "Master containerd: $MASTER_CTR → $WORKER_MAX. Master Ready."
+else
+  echo "Master containerd $MASTER_CTR >= workers max $WORKER_MAX — skipping skew fix"
+fi
+```
+
+## Stage 4: Apt repo URL rewrite for minor bumps (`stages` includes `repo`)
+
+Only run if `kind=minor`.
+
+For each of `k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4`:
+
+```bash
+target_minor="$(echo "$target_version" | awk -F. '{print $1"."$2}')"
+
+if [ "$dry_run" = "false" ]; then
+  $SSH \
+    "wizard@$node" "echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list \
+      && curl -fsSL 'https://pkgs.k8s.io/core:/stable:/v$target_minor/deb/Release.key' | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes \
+      && sudo apt-get update"
+fi
+```
+
+Slack: `Repo rewritten to v$target_minor/deb on all 5 nodes.`
+
+## Stage 5: Master upgrade (`stages` includes `master`)
+
+```bash
+# 5.1 Drain
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config drain k8s-master \
+    --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+fi
+
+# 5.2 Run the library script via SSH pipe
+if [ "$dry_run" = "false" ]; then
+  $SSH \
+    wizard@k8s-master 'bash -s' \
+    < $WORKSPACE_DIR/scripts/update_k8s.sh \
+    -- --role master --release "$target_version"
+fi
+
+# 5.3 Uncordon + wait Ready
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config uncordon k8s-master
+fi
+
+for i in $(seq 1 60); do
+  STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+  KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node k8s-master \
+    -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
+  [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
+  sleep 15
+done
+
+[ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
+  || { slack "ABORT — master not Ready or wrong version after upgrade ($STATUS / $KUBELET)"; exit 1; }
+
+# 5.4 All control-plane pods Running
+NOT_READY=$(kubectl --kubeconfig $WORKSPACE_DIR/config -n kube-system get pods \
+  -l 'tier=control-plane' --no-headers | grep -v Running | wc -l)
+[ "$NOT_READY" -gt 0 ] && { slack "ABORT — $NOT_READY control-plane pods not Running"; exit 1; }
+
+# 5.5 Re-check halt-on-alert
+# (re-run the Check 1.2 query, abort if anything new fires)
+
+slack "Master upgrade complete. Cluster on v$target_version. Healthy."
+```
+
+## Stage 6: Workers sequentially (`stages` includes `workers`)
+
+Order: `k8s-node4 → k8s-node3 → k8s-node2 → k8s-node1`. Node1 last because it hosts GPU + Immich and benefits from the longest soak before any other worker is touched (ref: post-mortem-2026-03-16, memory id=570).
+
+For each worker `$node`:
+
+1. Re-check halt-on-alert. If anything fires (e.g. `RecentNodeReboot` on the previous worker), wait + retry up to 30 min, then abort.
+2. `kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
+3. SSH pipe `update_k8s.sh --role worker --release $target_version`
+4. `kubectl uncordon $node`
+5. Wait until `$node` Ready + kubeletVersion matches + all calico-node + kube-proxy pods on that node Running.
+6. **10-min soak**: poll halt-on-alert every 60s. If anything fires, abort. After 10 min clean, proceed.
+7. Slack: `Worker $node complete ($i/4)`.
+
+```bash
+WORKERS="k8s-node4 k8s-node3 k8s-node2 k8s-node1"
+i=0
+for node in $WORKERS; do
+  i=$((i+1))
+
+  # Halt-on-alert recheck with retry
+  for attempt in $(seq 1 30); do
+    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+      | sort -u)
+    [ -z "$ALERTS" ] && break
+    echo "Waiting for alerts to clear (attempt $attempt/30): $ALERTS"
+    sleep 60
+  done
+  [ -n "$ALERTS" ] && { slack "ABORT $node — alerts firing after 30min wait: $ALERTS"; exit 1; }
+
+  if [ "$dry_run" = "false" ]; then
+    kubectl --kubeconfig $WORKSPACE_DIR/config drain "$node" \
+      --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
+
+    $SSH \
+      "wizard@$node" 'bash -s' \
+      < $WORKSPACE_DIR/scripts/update_k8s.sh \
+      -- --role worker --release "$target_version"
+
+    kubectl --kubeconfig $WORKSPACE_DIR/config uncordon "$node"
+  fi
+
+  # Wait Ready + version match
+  for w in $(seq 1 60); do
+    STATUS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
+      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
+    KUBELET=$(kubectl --kubeconfig $WORKSPACE_DIR/config get node "$node" \
+      -o jsonpath='{.status.nodeInfo.kubeletVersion}' | tr -d v)
+    [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] && break
+    sleep 15
+  done
+  [ "$STATUS" = "True" ] && [ "$KUBELET" = "$target_version" ] \
+    || { slack "ABORT — $node not Ready or wrong version ($STATUS / $KUBELET)"; exit 1; }
+
+  # 10-min soak with halt-on-alert
+  echo "Soaking $node for 10 min..."
+  for sec in $(seq 1 10); do
+    ALERTS=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+      | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+      | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor|RecentNodeReboot)$' \
+      | sort -u)
+    [ -n "$ALERTS" ] && { slack "ABORT $node mid-soak — alerts: $ALERTS"; exit 1; }
+    sleep 60
+  done
+
+  slack "Worker $node upgrade complete ($i/4). Soaked clean."
+done
+```
+
+Note: during the soak we add `RecentNodeReboot` to the ignore-list because we KNOW we just rebooted-as-it-were that node (kubelet restart counts).
+
+## Stage 7: Post-flight (`stages` includes `postflight`)
+
+```bash
+# All 5 nodes at target
+VERSIONS=$(kubectl --kubeconfig $WORKSPACE_DIR/config get nodes \
+  -o jsonpath='{range .items[*]}{.metadata.name}:{.status.nodeInfo.kubeletVersion}{"\n"}{end}')
+echo "$VERSIONS"
+WRONG=$(echo "$VERSIONS" | grep -v ":v${target_version}$" | wc -l)
+[ "$WRONG" -ne 0 ] && { slack "ABORT post-flight — $WRONG node(s) not on v$target_version:\n$VERSIONS"; exit 1; }
+
+# Upgrade Gates all inactive
+FIRING=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts' \
+  | jq -r '.data.alerts[] | select(.state == "firing") | .labels.alertname' \
+  | grep -vE '^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$' \
+  | sort -u)
+[ -n "$FIRING" ] && slack "Post-flight WARN — alerts still firing (cluster on target, but check):\n$FIRING"
+
+# pod-ready ratio >= 0.9
+RATIO=$(curl -sf 'http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query' \
+  --data-urlencode 'query=sum(kube_pod_status_ready{condition="true"}) / sum(kube_pod_status_phase{phase="Running"})' \
+  | jq -r '.data.result[0].value[1] // "0"')
+slack "Pod-ready ratio: $RATIO (target ≥ 0.9)"
+
+# Clear the in-flight annotation + Pushgateway gauges
+if [ "$dry_run" = "false" ]; then
+  kubectl --kubeconfig $WORKSPACE_DIR/config annotate ns k8s-upgrade \
+    viktorbarzin.me/k8s-upgrade-in-flight- \
+    viktorbarzin.me/k8s-upgrade-target- \
+    viktorbarzin.me/k8s-upgrade-snapshot-path- || true
+
+  push_metric k8s_upgrade_in_flight 0
+  push_metric k8s_upgrade_snapshot_taken 0
+fi
+
+slack ":white_check_mark: K8s upgrade complete: cluster on v$target_version."
+```
+
+## Rollback
+
+This agent does NOT auto-rollback. If anything aborts mid-flight:
+
+1. Slack the failure with the last known stage + node.
+2. Leave the in-flight annotation in place (the operator clears it manually after triage).
+3. Operator follows `infra/docs/runbooks/k8s-version-upgrade.md` → "Rollback paths" section.
+
+The etcd snapshot path is annotated on the `k8s-upgrade` namespace for easy recovery.
+
+## Notes for tests
+
+- **Test 1 (CronJob dry-run)**: The CronJob has its own `--dry-run` env var that short-circuits before POST. This agent is not invoked.
+- **Test 2 (agent dry-run)**: Invoke with `{"dry_run": true}`. Every SSH + kubectl READ runs, every mutation skipped. The agent should print "WOULD: <cmd>" for each skipped mutation.
+- **Test 3 (snapshot-only)**: Invoke with `{"stages": "preflight,snapshot"}`. Pre-flight + etcd snapshot only. Slack notification confirms the file exists. No node touched after that.
+- **Test 4 (full run)**: `{"target_version": "1.34.7", "kind": "patch"}` once apt has it. Full sequence.
+- **Test 5 (synthetic minor)**: `{"target_version": "1.35.0", "kind": "minor", "dry_run": true}`. Confirms the repo-rewrite plan path without mutation.
+
+## Edge cases
+
+- **Slack down**: Don't block the upgrade — continue, log to stderr.
+- **SSH host key changes**: `accept-new` accepts only on first encounter — if a node was reimaged its host key changes; clear `/tmp/known_hosts` before retry.
+- **kubectl drain hangs on a PDB-violating pod**: 5-min grace-period is hard. If drain fails, `kubectl drain --disable-eviction --force` is NOT a valid escalation here — slack-abort and let the operator investigate.
+- **etcd snapshot dir missing/full**: stat the dir first. If <10 GiB free, abort.
+- **Network blip during apt-get**: the script `set -e`s — apt-get will fail loud, the agent's bash will see non-zero exit, we slack-abort. The node is left mid-upgrade (kubeadm half-applied). Operator follows the runbook.
+
+## Verification claims you must make
+
+When you `slack` a SUCCESS message, you must have actually verified:
+- All 5 nodes report the target kubelet version via `kubectl get nodes -o jsonpath`
+- No alerts firing outside the ignore-list
+- pod-ready ratio computed from Prometheus
+
+Do not declare success without those three confirmations.
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@ -127,10 +127,65 @@ Pinned via Terraform in `stacks/authentik/`:
 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
 | `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |

 Notes:
 - There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage moved from `/dev/shm` → Postgres table `authentik_providers_proxy_proxysession` in authentik 2025.10. The 2026-04-18 `/dev/shm`-fill outage class is no longer load-bearing in 2026.2.2; the `unauthenticated_age` cap is still the right lever for anonymous-session bloat from external monitors.
- `ProxyProvider.access_token_validity` and `remember_me_offset` stay UI-managed via `ignore_changes`.
+- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
+- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
+- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
+- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
+- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
 - The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
+
+## Upgrade Validation Checklist
+
+Run after **any** of these:
+- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
+- `goauthentik/authentik` Terraform provider version bump.
+- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
+
+The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
+
+```bash
+# 1. Service routes to the outpost pod (NOT the server pods).
+#    Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
+kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
+
+# 2. Service selector still excludes the server pods. Expected: includes
+#    `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
+#    `name: authentik`, the goauthentik upstream bug came back or our
+#    JSON patch was unset.
+kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
+
+# 3. Outpost mode + session backend. Expected log lines on startup:
+#      {"embedded":true,"event":"Outpost mode",...}
+#      {"event":"using PostgreSQL session backend",...}
+#    If embedded=false or `using filesystem session backend`, the postgres
+#    fix is broken — likely `Outpost.managed` got cleared, or the upstream
+#    schema started exposing `managed` and TF reset it.
+kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
+
+# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
+#    A row count > a few dozen indicates filesystem fallback is firing.
+kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
+
+# 5. Postgres session table is growing with traffic. Expected: rows with
+#    `expires` ~28 days out (matches access_token_validity = weeks=4).
+kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
+from django.db import connection; c = connection.cursor()
+c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
+print(c.fetchone())"
+
+# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
+curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
+  | grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
+
+# 7. Terraform plan-to-zero on the whole authentik stack.
+( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
+```
+
+Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
+
+If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52` — the upstream patch might let us drop our `kubernetes_json_patches.service` workaround.
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@ -53,6 +53,7 @@
 | insta2spotify | Instagram reel song ID to Spotify playlist | insta2spotify |
 | trading-bot | Event-driven trading with sentiment analysis | trading-bot |
 | claude-memory | Persistent memory MCP server | claude-memory |
+| paperless-mcp | Paperless-ngx document search MCP (barryw/PaperlessMCP). Traefik bearer auth via Aetherinox api-token-middleware. `auth=none` at ingress; gateway-level bearer enforced by `paperless-mcp/bearer-auth` Middleware CRD. Tokens + paperless API token in Vault `secret/paperless-mcp`. | paperless-mcp |
 | council-complaints | Islington civic reporting pilot | council-complaints |

 ## Optional
@ -78,6 +79,7 @@
 | paperless-ngx | Document management | paperless-ngx |
 | jsoncrack | JSON visualizer | jsoncrack |
 | servarr | Media automation (Sonarr/Radarr/etc) | servarr |
+| aiostreams | Stremio stream aggregator (Real-Debrid + Torrentio/Comet/MediaFusion/StremThru/Knaben). `auth=app` (own UUID+password); canary stream-probe + 3 alerts; weekly NFS config + Stremio-account-collection backups to `/srv/nfs/aiostreams-backup/`. PG-backed user config. | servarr/aiostreams |
 | ntfy | Push notifications | ntfy |
 | cyberchef | Data transformation | cyberchef |
 | diun | Docker image update notifier — detects new versions, fires webhook to n8n upgrade agent | diun |
--- a/.claude/skills/cluster-health/SKILL.md
+++ b/.claude/skills/cluster-health/SKILL.md
@ -7,8 +7,9 @@ description: |
  (3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
  (4) User mentions "health check", "cluster status", "cluster health",
  (5) User asks "is everything running" or "any problems".
-  Runs 42 cluster-wide checks (nodes, workloads, monitoring, certs,
-  backups, external reachability) with safe auto-fix for evicted pods.
+  Runs 44 cluster-wide checks (nodes, workloads, monitoring, certs,
+  backups, external reachability, PVE host thermals + load) with safe
+  auto-fix for evicted pods.
 author: Claude Code
 version: 2.0.0
 date: 2026-04-19
@ -66,7 +67,7 @@ bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
 bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 ```

-## What It Checks (42 checks)
+## What It Checks (44 checks)

 | # | Check | Notes |
 |---|-------|-------|
@ -112,6 +113,8 @@ bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
 | 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
 | 41 | External — ExternalAccessDivergence Alert | alert not firing |
 | 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
+| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
+| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |

 ## Safe Auto-Fix Rules

@ -256,9 +259,9 @@ kubectl logs -n external-secrets deploy/external-secrets --tail=100
 kubectl get pods -n cloudflared
 kubectl logs -n cloudflared -l app=cloudflared --tail=100

-# Authentik
-kubectl get pods -n authentik -l app=authentik-server
-kubectl logs -n authentik -l app=authentik-server --tail=100
+# Authentik (Helm chart names the deployment goauthentik-server)
+kubectl get deployment -n authentik goauthentik-server
+kubectl logs -n authentik deploy/goauthentik-server --tail=100

 # ExternalAccessDivergence alert
 kubectl exec -n monitoring deploy/prometheus-server -- \
@ -295,6 +298,133 @@ kubectl exec -n monitoring deploy/prometheus-server -- \
   - Exit code 143 → SIGTERM / graceful shutdown failed
 3. Cross-check dbaas + NFS + secrets are healthy.

+## Performance forensics — top consumers + optimization hints
+
+When the cluster is healthy (script returns 0) but the host is hot or load
+is elevated, switch from "what broke?" to "what's expensive?". Run these
+in order; stop as soon as the root cause is obvious.
+
+### Step 1 — Snapshot top consumers cluster-wide
+
+```bash
+# Top 15 pods by current CPU
+kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
+
+# Top 5 nodes by CPU + memory pressure
+kubectl top nodes
+
+# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
+  | python3 -m json.tool | head -80
+```
+
+### Step 2 — For each suspect pod, get the WHY
+
+For every pod in the top-N, gather these BEFORE proposing a fix:
+
+```bash
+NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
+
+# What it does (image + command)
+kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
+
+# Resource limits + current usage
+kubectl -n $NS top pod $POD --containers
+kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
+
+# Recent logs filtered for reconcile loops, watch storms, slow queries
+kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
+  | grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
+
+# Restart count + recent OOM
+kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
+
+# Self-exported metrics (for apps that publish on /metrics)
+kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
+```
+
+### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
+
+```bash
+# Top request producers by verb+resource (last 30 min)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
+  | python3 -m json.tool
+
+# Top user agents (which clients are hammering)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
+  | python3 -m json.tool
+
+# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
+  | python3 -m json.tool
+
+# etcd write rate + DB size
+kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
+  "http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
+  | python3 -m json.tool
+```
+
+### Step 4 — PVE host specific deep-dive (when temp / load is high)
+
+Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
+thresholds — that's the first stop. When those WARN or FAIL, the
+follow-up commands below trace which VM / process is the source:
+
+```bash
+# Per-core temps (broader than the package summary in check 43)
+ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
+    base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
+    val=$(cat "$f"); echo "  $label: $((val/1000))°C"
+done'
+
+# Per-VM CPU (each VM = one kvm process)
+ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
+
+# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
+ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
+
+# Stale snapshots (any '_pre-*' that survived past their rollback window)
+ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
+```
+
+### Step 5 — Optimization decision
+
+For each consumer in the top-N, fill in a row:
+
+| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
+|---|---|---|---|---|---|---|
+
+Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
+
+### Common causes + tunables (catalogue)
+
+| Symptom | Likely cause | Tunable |
+|---|---|---|
+| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
+| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
+| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
+| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
+| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
+| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
+| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
+| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
+
+### What NOT to touch
+
+- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
+- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
+- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
+
+### Source-of-truth notes
+
+- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
+- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
+- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
+
 ## Notes on the canonical / hardlink setup

 The authoritative copy of this SKILL.md lives at
--- a/.claude/skills/upgrade-state/SKILL.md
+++ b/.claude/skills/upgrade-state/SKILL.md
@ -0,0 +1,199 @@
+---
+name: upgrade-state
+description: |
+  Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
+  unattended-upgrades+kured, K8s components via the version-check chain).
+  Use when:
+  (1) User asks "/upgrade-state" or "are we current",
+  (2) User asks "what's pending upgrade" or "what's the upgrade state",
+  (3) User asks if Keel / kured / k8s-version-check is healthy,
+  (4) User asks about kept-back / held packages or pending reboots,
+  (5) Periodic survey before the next `k8s-version-check` daily run.
+  Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
+author: Claude Code
+version: 1.0.0
+date: 2026-05-18
+---
+
+# Upgrade-state
+
+## MANDATORY: Run the script first
+
+When this skill is invoked, your **first action** must be to run
+`upgrade_state.sh` and reason over its output before doing anything
+else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
+is the authoritative surface.
+
+```bash
+bash /home/wizard/code/infra/scripts/upgrade_state.sh
+```
+
+For programmatic use:
+
+```bash
+bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
+```
+
+Then:
+
+1. Report the rendered table verbatim — it answers the user's
+   "are we current" question in three lines.
+2. For every `⚠` or `✗` row, surface the relevant drill-down lines
+   underneath and propose a next action (links in the table below).
+3. Only reach for ad-hoc commands when investigating beyond what the
+   script reported.
+
+Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
+
+## What it covers (3 pipelines)
+
+| Layer | What runs | Cadence | Data sources |
+|---|---|---|---|
+| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
+| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
+| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
+
+The K8s pipeline pushes a small set of gauges to the Prometheus
+Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
+
+- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
+- `k8s_version_check_last_run_timestamp` — when detection last ran
+- `k8s_upgrade_in_flight` — 0/1
+- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
+
+`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
+been running >90 minutes. The script raises `✗` in the same window.
+
+## Status-icon legend
+
+| Icon | Meaning |
+|---|---|
+| `✓` | Healthy, fully current |
+| `→` | Update available, not yet applied (K8s patch/minor) |
+| `…` | In flight — chain currently running |
+| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
+| `✗` | Broken: pod down, alert firing, chain stalled |
+
+## Drill-down — when a row trips, what to do
+
+### Apps `⚠` — pending approvals or errors
+
+```bash
+# Read recent Keel log lines
+kubectl -n keel logs deploy/keel --since=24h --tail=200
+
+# What is Keel currently tracking?
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://localhost:9090/api/v1/query?query=count by (image) (registries_scanned_total)'
+
+# Is the scrape live?
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
+```
+
+Common Keel errors:
+- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
+- `registry authentication required` — bad imagePullSecret on the watched Deployment
+- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
+
+### OS `⚠` — held packages with bumps
+
+The script flags any package held via `apt-mark hold` that ALSO appears
+in `apt list --upgradable` — excluding k8s components (the K8s pipeline
+owns those) and the kernel (kured handles the reboot half).
+
+Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
+runc 1.1 → 1.4). These are held because they need cluster-wide
+coordination, not silent in-release patching.
+
+```bash
+# Inspect the situation on the flagged node
+ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
+
+# Unhold + upgrade a specific package
+ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
+```
+
+Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
+
+### OS `⚠` — pending reboot
+
+A node has `/var/run/reboot-required`. Kured will reboot it inside the
+next 02:00-06:00 London window (any day of the week).
+
+```bash
+# Force a manual reboot inside the window (rare)
+kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
+ssh wizard@10.0.20.10X sudo systemctl reboot
+```
+
+### OS `✗` — kured not Running
+
+```bash
+kubectl -n kured get pods
+kubectl -n kured logs daemonset/kured --tail=100
+# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
+kubectl -n kured get pods -l name=kured-sentinel-gate
+```
+
+### K8s `→` — patch/minor available
+
+Detection ran, target identified, chain NOT started. The chain spawns
+on the same daily detection cycle — typically within ~24h of the
+target first being detected.
+
+```bash
+# Inspect Pushgateway state
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
+
+# Trigger a manual run of the detection CronJob
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
+```
+
+### K8s `…` — in flight
+
+The Job chain is running. Watch its progress:
+
+```bash
+kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
+kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
+```
+
+### K8s `✗ stalled` — `K8sUpgradeStalled` would fire
+
+Chain in-flight >90m. The Job is most likely stuck on drain or a
+pre-flight check.
+
+```bash
+kubectl -n k8s-upgrade get jobs
+kubectl -n k8s-upgrade describe job <stuck-job>
+kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
+
+# If you need to clear the in-flight flag (after diagnosing):
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
+    "printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
+     wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
+       --header='Content-Type: text/plain'"
+```
+
+### K8s `✗ detection stale` — last detection >9 days
+
+```bash
+kubectl -n k8s-upgrade get cronjob k8s-version-check
+kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
+```
+
+If the CronJob hasn't fired on time, suspect:
+- `suspend=true` on the CronJob (`var.enabled=false` in the
+  `k8s-version-upgrade` Terraform stack)
+- Image-pull failure on the version-check pod
+- Pushgateway scrape gone stale
+
+## Companion command-line flags
+
+```bash
+bash infra/scripts/upgrade_state.sh                 # rendered table (default)
+bash infra/scripts/upgrade_state.sh --json          # machine output
+bash infra/scripts/upgrade_state.sh --kubeconfig X  # override kubeconfig
+```
--- a/.gitleaksignore
+++ b/.gitleaksignore
@ -0,0 +1,4 @@
+# git-crypt encrypts these at rest; the working-tree plaintext is local-only.
+# gitleaks scans the staged working-tree copy and can't see that they're
+# encrypted on disk in git, so allowlist by fingerprint.
+stacks/recruiter-responder/secrets/privkey.pem:private-key:1
--- a/AGENTS.md
+++ b/AGENTS.md
@ -154,6 +154,37 @@ lifecycle {

 **Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.

+### `# KYVERNO_LIFECYCLE_V2` — Keel auto-update annotations
+
+When a namespace is labeled `keel.sh/enrolled=true`, the `inject-keel-annotations` ClusterPolicy (`stacks/kyverno/modules/kyverno/keel-annotations.tf`) injects three annotations on every Deployment / StatefulSet / DaemonSet:
+
+```
+keel.sh/policy: force
+keel.sh/trigger: poll
+keel.sh/pollSchedule: "@every 1h"
+```
+
+To suppress the resulting Terraform drift, **enrolled workloads** must extend their `ignore_changes` block:
+
+```hcl
+lifecycle {
+  ignore_changes = [
+    spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+    metadata[0].annotations["keel.sh/policy"],
+    metadata[0].annotations["keel.sh/trigger"],
+    metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+  ]
+}
+```
+
+The V2 snippet is added **per workload** as namespaces are phase-enrolled — not as a mass sweep. Workloads in un-enrolled namespaces do not receive the annotation and don't need the V2 block.
+
+Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment metadata (not pod template); the policy's `exclude` clause respects it, no annotation gets injected, no `ignore_changes` needed.
+
+**Audit**: `rg "KYVERNO_LIFECYCLE_V2" stacks/` — count should equal the number of enrolled workloads.
+
+**Design context**: `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`.
+
 ## Tier System
 `0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
 - Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -0,0 +1,150 @@
+# Infra
+
+Terragrunt-managed homelab declaring a 5-node Kubernetes cluster on a single Proxmox host. Vault is the secrets source of truth; everything else flows from this repo via `scripts/tg apply`.
+
+## Language
+
+### Code organization
+
+**Service**:
+The deployed app as a domain concept — one logical thing that runs in the cluster (e.g. immich, technitium, freshrss). Defined by exactly one **Stack**.
+_Avoid_: bare "app" without the Service definition; "deployment" (collides with K8s `Deployment`).
+
+**Stack**:
+The HCL directory under `stacks/<name>/` that defines a Service, applied independently with `scripts/tg apply`. A Stack is the unit of Terraform organisation; a Service is the running thing. They are 1:1 but not synonyms.
+_Avoid_: using "Stack" when you mean the running Service.
+
+**Module**:
+A reusable HCL primitive under `modules/`, consumed by Stacks via `source =`.
+_Avoid_: "library", "package".
+
+**Factory module**:
+A Module that hides convention (defaults, drift handling, secret wiring) behind a small input surface. Canonical examples: `ingress_factory`, `nfs_volume`, `k8s_app`, `helm_app`, `postgres_app`.
+_Avoid_: "wrapper".
+
+**State tier**:
+Terraform state-backend partition. **Tier 0** = bootstrap Stacks (`infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`) on local SOPS-encrypted state. **Tier 1** = every other Stack, on PG-backed state.
+_Avoid_: "phase", "bootstrap stack" — say Tier 0 explicitly.
+
+### Cluster
+
+**Node**:
+A K8s worker VM (`k8s-master`, `k8s-node1..4`). Default reading of the bare word "node" in this repo.
+_Avoid_: "k8s node" (redundant), "host" (ambiguous).
+
+**PVE node** / **PVE host**:
+The single physical Dell R730 running Proxmox; sole hypervisor and sole NFS server. There is exactly one.
+_Avoid_: "server", "hypervisor", "Proxmox" alone when you mean the host.
+
+**Namespace tier**:
+A namespace-prefix partition (`0-core-*`, `1-cluster-*`, `2-gpu-*`, `3-edge-*`, `4-aux-*`) driving PriorityClass, default resources, and ResourceQuota — generated by **Kyverno policy** from the namespace name. Orthogonal to **State tier**.
+_Avoid_: "Service tier" (the partition is on the namespace, not the Service); collapsing Namespace tier with State tier — they are different axes.
+
+**Kyverno policy**:
+The convention engine of the cluster — a ClusterPolicy or Policy resource that mutates/generates/validates on admission. Owns Namespace tier limits/quotas, `dns_config` injection on every pod-owning workload, Forgejo pull-credential sync across namespaces, TLS-secret replication. When the repo says "this happens automatically", a Kyverno policy is usually the actor.
+_Avoid_: bare "policy" (overloaded with Vault, RBAC, NetworkPolicy).
+
+**Critical-path Service**:
+One of {Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared} — replicas ≥3, PDB enforced, monitored independently.
+_Avoid_: "core service" (collides with the `0-core-*` Namespace tier name).
+
+**Namespace-owner**:
+A non-admin identity declared in `secret/platform → k8s_users` (JSON map). Owns one or more namespaces and one or more public subdomains.
+_Avoid_: bare "user", "tenant".
+
+### Networking
+
+**Public domain**:
+`viktorbarzin.me`, served through Cloudflare. DNS records are either **proxied** (Cloudflare CDN/WAF in front) or **non-proxied** (direct A/AAAA reachable via Cloudflared Tunnel).
+_Avoid_: "external", "outside".
+
+**Internal domain**:
+`viktorbarzin.lan`, served by Technitium DNS. Resolves only inside the homelab network.
+_Avoid_: bare "lan", "private", "intranet".
+
+**Ingress auth tier**:
+The `auth = "..."` parameter on `ingress_factory`, one of `required` (Authentik forward-auth gates every request), `app` (the backend owns its login), `public` (anonymous Authentik binding for audit only), or `none` (Anubis-fronted content, or native-client API).
+_Avoid_: "auth mode" — the canonical key is `auth`.
+
+**Authentik outpost**:
+A standalone Authentik deployment that terminates the proxy/auth flow for a specific binding model. The repo runs two distinct ones: the default outpost (used by `auth = "required"`) and the `public` outpost (anonymous binding, used by `auth = "public"`).
+_Avoid_: conflating outpost with Authentik core; "Authentik instance".
+
+**Cloudflared Tunnel**:
+The channel by which non-proxied **public domain** traffic reaches the cluster, terminating at Traefik. Backs every `dns_type = "non-proxied"` record and is the fallback path for the wildcard `*.viktorbarzin.me`.
+_Avoid_: "the tunnel" without "Cloudflared" (could mean Headscale).
+
+**Ingress chain**:
+The opinionated stack of Traefik middlewares that `ingress_factory` layers onto every Ingress. Slots, in order: forward-auth (per **Ingress auth tier**) → anti-AI scraping (default-on when no Authentik is in the path) → CrowdSec bouncer (fail-open) → retry (2× / 100ms) → rate-limit (429, not 503). Adding or removing a middleware is a Stack-level choice, but the chain order is convention.
+_Avoid_: "middleware list", "Traefik chain". The Anubis PoW gate is upstream of this chain, not inside it.
+
+### Storage
+
+**proxmox-lvm-encrypted**:
+Default StorageClass for any workload holding sensitive data (databases, auth, password managers, email, financial data). LUKS2 over a Proxmox LVM-thin LV.
+_Avoid_: bare "encrypted PVC" — name the StorageClass.
+
+**proxmox-lvm**:
+Block StorageClass for non-sensitive workloads (caches, monitoring data, indexes, app state without secrets).
+
+**NFS volume**:
+RWX file storage for shared media libraries, large datasets, or anything that needs to be inspected from outside K8s. Provisioned via the `nfs_volume` Module.
+_Avoid_: "shared storage" (ambiguous).
+
+**nfs-truenas StorageClass**:
+A historical SC name retained only because StorageClass strings are immutable on bound PVs. The underlying server is the **PVE host**, not TrueNAS; TrueNAS is decommissioned.
+_Avoid_: assuming this means TrueNAS.
+
+**3-2-1 backup**:
+The named posture of where data lives: **Copy 1** = live on the PVE thin pool (sdc), **Copy 2** = sda backup disk (`/mnt/backup`), **Copy 3** = offsite Synology NAS. Per-PVC file-level rsync from LVM thin snapshots; databases additionally dump to NFS for per-DB restore.
+_Avoid_: bare "backup" without saying which copy you mean (a service is "backed up" only once it's on Copy 2; Copy 3 is the disaster floor).
+
+### Secrets
+
+**Vault path**:
+Convention: `secret/<service>` for Service-owned secrets, `secret/viktor` for personal/global, `secret/platform` for cluster-wide maps (`k8s_users`, `homepage_credentials`).
+_Avoid_: conflating Vault path (e.g. `secret/viktor`) with Vault field (e.g. `forgejo_pull_token`).
+
+**ExternalSecret** / **ESO**:
+A K8s manifest that materialises a Vault KV value as a K8s Secret. Two ClusterSecretStores: `vault-kv` (KV engine) and `vault-database` (rotating DB creds).
+
+**Plan-time secret**:
+A secret value read in Terraform via `data "kubernetes_secret"` (i.e. via the ESO-created K8s Secret) at plan time, with no Vault provider call. Distinct from a **vault data source** read (`data "vault_kv_secret_v2"`), which still goes through the Vault provider. A few Stacks remain hybrid (plan-time for env vars, vault data source for module inputs).
+
+**Sealed Secret**:
+A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinct from ExternalSecret — Sealed Secrets carry their own bytes, ExternalSecrets reference Vault.
+
+### CI/CD
+
+**GHA build + Woodpecker deploy**:
+The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too.
+_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy".
+
+**Anubis**:
+A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content-bearing sites without app-level auth (blog, wiki, landing pages). Never in front of Git, WebDAV, CalDAV, or API endpoints (clients can't solve PoW).
+
+## Relationships
+
+- A **Service** is defined by exactly one **Stack**, which declares zero or more **Modules** and resolves to one or more K8s workloads.
+- A **Namespace-owner** owns one or more namespaces and one or more public subdomains.
+- A **Service** owns its **Vault path** at `secret/<service>`, surfaces values through **ExternalSecrets**, and reads them at plan time via **plan-time secrets**.
+- An **Ingress** picks exactly one **Ingress auth tier**; the choice defines how strangers reach the backend.
+- A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync.
+- **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa.
+
+## Example dialogue
+
+> **Dev:** "I'm adding a new **Service** — FastAPI backend with its own JWT login. Do I need Authentik?"
+> **Domain expert:** "If the FastAPI login is the gate, set `auth = "app"` on the ingress. That records the intent that you _chose_ not to layer Authentik — leave a one-line comment above stating what gates the Service, or `scripts/tg` will refuse the apply."
+> **Dev:** "And storage?"
+> **Domain expert:** "Does it hold user data? If yes, `proxmox-lvm-encrypted` — that's the default for anything sensitive. Add a backup CronJob writing to `/mnt/main/<service>-backup/`. If the data is just caches, plain `proxmox-lvm` is fine."
+> **Dev:** "What about a Secret with the JWT signing key?"
+> **Domain expert:** "Put the key in `secret/<service>` in Vault, then declare an **ExternalSecret** to materialise it as a K8s Secret. Read it at plan time with `data "kubernetes_secret"` — that keeps Vault out of the plan path."
+
+## Flagged ambiguities
+
+- **"tier"** is overloaded — *Namespace tier* (`0-core`..`4-aux`, scheduling priority) is distinct from *State tier* (Tier 0 / Tier 1, Terraform backend partition). Always qualify which axis.
+- **"node"** can mean a K8s Node (default) or a PVE node. For Proxmox-level statements, say **PVE node** explicitly.
+- **"service"** spans two distinct concepts: the deployed app (capitalised **Service**, this repo's domain noun) and the K8s `Service` object (in backticks or qualified "K8s Service"). Lowercase "service" in prose is fine when context disambiguates; flag it when it doesn't.
+- **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which.
+- **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering.
--- a/ci/Dockerfile
+++ b/ci/Dockerfile
@ -7,9 +7,11 @@ ARG SOPS_VERSION=3.9.4
 ARG KUBECTL_VERSION=1.34.0
 ARG VAULT_VERSION=1.18.1

-# Install system packages (single layer)
+# Install system packages (single layer).
+# python3: required by scripts/check-ingress-auth-comments.py, invoked
+#   by scripts/tg before every plan/apply.
 RUN apk add --no-cache \
-    bash curl git git-crypt jq openssh-client openssl unzip \
+    bash curl git git-crypt jq openssh-client openssl python3 unzip \
    && rm -rf /var/cache/apk/*

 # Terraform
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@ -44,7 +44,7 @@ graph TB
 | Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) |
 | PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) |
 | Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik |
-| Traefik ForwardAuth | - | `ingress_factory` module | Middleware for protected ingresses |
+| Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` |
 | Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault |
 | Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication |

@ -52,7 +52,16 @@ graph TB

 ### Forward Authentication Flow

-Services configured with `protected = true` in the `ingress_factory` module automatically get Traefik ForwardAuth middleware configured. When an unauthenticated user accesses a protected service:
+Services pick an auth tier via the `auth` enum on the `ingress_factory` module (default `"required"`, fail-closed):
+
+| Tier | Effect | When to use |
+|------|--------|-------------|
+| `"required"` | Authentik forward-auth gates every request | Backend has no own user auth — Authentik is the only gate |
+| `"app"` | No Authentik middleware; backend's own login is the gate | Backend handles its own user auth (NextAuth, Django, OAuth, bearer-token API) |
+| `"public"` | Authentik anonymous binding via `public` outpost | Audit trail without gating; only works for top-level browser navigation |
+| `"none"` | No Authentik middleware at all | Anubis-fronted content, webhooks, OAuth callbacks, native-client APIs (CalDAV, WebDAV, Git) |
+
+When `auth = "required"`, an unauthenticated request flows:

 1. Request hits Traefik ingress
 2. ForwardAuth middleware calls Authentik embedded outpost
@ -64,6 +73,8 @@ Services configured with `protected = true` in the `ingress_factory` module auto

 Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion.

+**Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself.
+
 ### Social Login & Invitation Flow

 All new users must use an invitation link to register. The invitation-enrollment flow:
@ -144,8 +155,9 @@ The public client flow:
 | Path | Purpose |
 |------|---------|
 | `stacks/authentik/` | Authentik deployment (servers, workers, PgBouncer) |
-| `stacks/platform/modules/ingress_factory/` | Traefik ForwardAuth middleware config |
-| `stacks/platform/modules/traefik/middleware.tf` | ForwardAuth middleware definition |
+| `modules/kubernetes/ingress_factory/` | Auth-tier enum + per-ingress middleware composition |
+| `stacks/traefik/modules/traefik/middleware.tf` | ForwardAuth middleware definitions (required + public outposts) |
+| `scripts/check-ingress-auth-comments.py` | Comment-convention guard wired into `scripts/tg` |
 | `stacks/vault/auth.tf` | Vault OIDC and K8s auth methods |

 ### Vault Paths
@ -160,17 +172,40 @@ The public client flow:
 - `stacks/platform/` - Traefik ingress with ForwardAuth
 - `stacks/vault/` - Vault auth methods

-### Ingress Protection Example
+### Ingress Protection Examples

+Authentik-gated admin UI (default):
 ```hcl
 module "myapp_ingress" {
-  source = "./modules/ingress_factory"
+  source          = "../../modules/kubernetes/ingress_factory"
+  name            = "myapp"
+  namespace       = "myapp"
+  tls_secret_name = var.tls_secret_name
+  # auth = "required" is the default — Authentik forward-auth is the gate.
+}
+```

-  name      = "myapp"
-  host      = "myapp.viktorbarzin.me"
-  protected = true  # Enables ForwardAuth middleware
+Backend with its own user auth (no Authentik in the way):
+```hcl
+module "myapp_ingress" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  name            = "myapp"
+  namespace       = "myapp"
+  tls_secret_name = var.tls_secret_name
+  # auth = "app": myapp uses NextAuth + Google OAuth; mobile clients can't follow Authentik 302.
+  auth            = "app"
+}
+```

-  # ... other config
+Intentionally public webhook receiver:
+```hcl
+module "myapp_ingress" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  name            = "webhook"
+  namespace       = "webhooks"
+  tls_secret_name = var.tls_secret_name
+  # auth = "none": upstream signs payloads with HMAC; no user identity expected.
+  auth            = "none"
 }
 ```

--- a/docs/architecture/automated-upgrades.md
+++ b/docs/architecture/automated-upgrades.md
@ -1,4 +1,10 @@
-# Automated Service Upgrades
+# Automated Upgrades
+
+This doc covers three independent automation paths:
+
+1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc.
+2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`.
+3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`.

 ## Overview

@ -205,3 +211,145 @@ The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **
 - **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
 - **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
 - **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
+
+## K8s Node OS Upgrades
+
+Independent of the service-upgrade pipeline above. Drives apt package updates + reboots on the 5 K8s VMs (master + 4 workers).
+
+### Stack
+- **In-guest**: `unattended-upgrades` runs apt upgrades within Allowed-Origins (`-security`, `-updates`, ESM). Package-Blacklist excludes runtime components (`containerd`, `containerd.io`, `runc`, `cri-tools`, `kubernetes-cni`, `calico-*`, `cni-plugins-*`, `docker-ce`). `apt-mark hold` on `kubelet`, `kubeadm`, `kubectl` (and runtime pkgs as belt-and-braces). `Automatic-Reboot=false` — kured handles reboots.
+- **Reboot driver**: `kured` (chart `kured-5.11.0`, app `1.21.0`). Window 02:00-06:00 Europe/London every day of the week (Mon-Fri-only restriction dropped 2026-05-16 — see PM), period=1h, concurrency=1, reboot-delay=30s, drainTimeout=30m.
+- **Reboot gate (sentinel)**: `kured-sentinel-gate` DaemonSet creates `/var/run/gated-reboot-required` only when (a) host needs reboot, (b) all nodes Ready, (c) all calico-node pods Running, (d) **no node has transitioned Ready in the last 24h** (24h soak window).
+- **Reboot gate (Prometheus)**: kured `--prometheus-url` polls `prometheus-server.monitoring.svc:80` before each drain. ANY firing alert blocks unless it matches the ignore-regex `^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$`.
+- **Health alert library**: 10 alerts in the `Upgrade Gates` group (`prometheus_chart_values.tpl`): `KubeAPIServerDown`, `KubeStateMetricsDown`, `PrometheusRuleEvaluationFailing`, `PVCStuckPending`, `RecentNodeReboot` (the explicit 24h soak signal), `MysqlStandaloneDown`, `ClusterPodReadyRatioDropped`, `NodeMemoryPressure`, `NodeDiskPressure`, `KubeQuotaAlmostFull`. Plus the existing 200+ alerts in the cluster-wide library (anything firing blocks kured).
+- **Notifications**: kured `notifyUrl` posts drain-start/drain-finish to Slack via Vault `secret/kured.slack_kured_webhook`. Alertmanager separately routes critical alerts to `#alerts`.
+
+### Source of truth
+| Concern | Location |
+|---|---|
+| Package config (uu, holds, blacklist) | `modules/create-template-vm/cloud_init.yaml` (within `is_k8s_template`) |
+| kured Helm release + sentinel-gate DS | `stacks/kured/main.tf` |
+| Upgrade Gates alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` |
+
+### Day-2 changes
+Cloud-init only runs on first boot. Existing nodes are brought into compliance with a one-shot SSH push — see the runbook section "Restore / re-apply unattended-upgrades config to existing nodes" in `docs/runbooks/k8s-node-auto-upgrades.md`.
+
+### Why this design
+The 26h cluster outage on 2026-03-16 was triggered by an unattended-upgrades kernel push that corrupted containerd's overlayfs snapshotter cluster-wide. The remediations:
+- 24h soak (sentinel-gate Check 4) gives a full day of observation between consecutive node reboots — broken updates show up as Prometheus alerts before any other node restarts.
+- Prometheus halt-on-alert turns ANY firing alert into a hard block — including the 6 Node Runtime Health alerts and the 10 Upgrade Gates alerts that explicitly model "the cluster is in a bad state."
+- Package-Blacklist on runtime components prevents the exact failure mode (containerd/runc auto-bumps).
+- `Automatic-Reboot=false` keeps reboot policy in kured (window, ordering, gating), not in apt.
+
+### Operational reference
+See `docs/runbooks/k8s-node-auto-upgrades.md` for: verifying health, halting rollout, restoring config to a re-imaged node, rolling back a bad upgrade, and the past-incident timeline.
+
+## K8s Version Upgrades
+
+Independent of the OS-upgrade and service-upgrade pipelines. Drives
+kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs.
+
+### Architecture
+
+```
+k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns)
+  │ probe apt-cache madison kubeadm (master) → latest available patch
+  │ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor?
+  │ push k8s_upgrade_available metric to Pushgateway
+  │
+  ▼ if a target is detected
+envsubst on /template/job-template.yaml | kubectl apply -f -
+  │ spawns Job 0 = k8s-upgrade-preflight-<target_version>
+  ▼
+
+Job 0 — preflight       (pinned: k8s-node1)
+Job 1 — master upgrade  (pinned: k8s-node1)        drains k8s-master
+Job 2 — worker          (pinned: k8s-node1)        drains k8s-node4
+Job 3 — worker          (pinned: k8s-node1)        drains k8s-node3
+Job 4 — worker          (pinned: k8s-node1)        drains k8s-node2
+Job 5 — worker          (pinned: k8s-master)       drains k8s-node1  ← control-plane toleration
+Job 6 — postflight      (no pinning)
+```
+
+Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends
+by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl
+apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`)
+so `apply` reconciles to a single Job per run — re-running a failed Job
+won't duplicate downstream Jobs.
+
+### Self-preemption history (the reason for the Job-chain rewrite)
+
+The v1 design ran the whole upgrade inside the `claude-agent-service`
+Deployment (1 replica, no nodeSelector). On 2026-05-11 the agent's pod was
+scheduled to k8s-node4. When the agent ran `kubectl drain k8s-node4` during
+Stage 6, it evicted itself — the bash process died after the drain but
+before the SSH-pipe to install kubeadm on node4. The cluster ended up
+half-upgraded (master at v1.34.7, workers at v1.34.2). The rewrite to a
+chain of `nodeSelector`-pinned Jobs eliminates this failure mode because
+each Job's pod and its drain target are always different nodes.
+
+### Components
+
+- **Detection CronJob + ConfigMaps + RBAC**: `infra/stacks/k8s-version-upgrade/main.tf`.
+  - Image is the claude-agent-service image (kubectl + ssh-client + curl + jq + envsubst).
+  - One unified ServiceAccount `k8s-upgrade-job` serves both the detection CronJob and every chain Job.
+- **Phase body**: `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh`.
+  Dispatches on `$PHASE` (preflight | master | worker | postflight). Computes
+  `NEXT_PHASE` / `NEXT_TARGET_NODE` / `NEXT_RUN_ON` and spawns the next Job.
+  Includes a `predrain_unstick` helper that pre-deletes pods on the target
+  node whose PDB has `disruptionsAllowed=0` (otherwise drain loops forever on
+  single-replica deployments like Anubis instances).
+- **Job template**: `infra/stacks/k8s-version-upgrade/job-template.yaml`.
+  envsubst-rendered at runtime. Mounts a `creds` Secret, a `scripts`
+  ConfigMap, and a `template` ConfigMap into each Job pod.
+- **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes
+  `--role master|worker --release X.Y.Z`. Piped via SSH into each node by
+  upgrade-step.sh.
+- **Three Upgrade Gates alerts**:
+  - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
+  - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
+  - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
+- **Pushgateway metrics**:
+  - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
+  - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)
+  - `k8s_upgrade_started_timestamp` (set in preflight; used by `K8sUpgradeStalled`)
+  - `k8s_upgrade_available{kind,running,target}` (pushed by detection CronJob)
+  - `k8s_version_check_last_run_timestamp` (staleness watchdog)
+
+### Source of truth
+
+| Concern | Location |
+|---|---|
+| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `stacks/k8s-version-upgrade/main.tf` |
+| Phase orchestration | `stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
+| Job template | `stacks/k8s-version-upgrade/job-template.yaml` |
+| Per-node upgrade script | `scripts/update_k8s.sh` |
+| Alerts | `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
+| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
+| Deprecated agent prompt (reference) | `.claude/agents/k8s-version-upgrade.deprecated.md` |
+
+### Why this design
+
+The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` is an outage. Mitigations:
+
+- **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks.
+- **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain).
+- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration).
+- **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours.
+- **Master upgrade goes first, workers last**. If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched).
+- **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path.
+- **PDB-blocked pods don't stall the chain**. `predrain_unstick` deletes PDB=0 pods on the target node directly (bypassing the eviction API), so the parent Deployment recreates them elsewhere. This was the workaround applied manually during the 2026-05-11 recovery for Anubis single-replica instances.
+
+### Secrets
+
+| Secret | Vault Path | Purpose |
+|--------|-----------|---------|
+| SSH private key | `secret/k8s-upgrade.ssh_key` | Jobs SSH `wizard@<node>` |
+| SSH public key | `secret/k8s-upgrade.ssh_key_pub` | Deployed to nodes' `~/.ssh/authorized_keys` |
+| Slack webhook | `secret/k8s-upgrade.slack_webhook` | Pipeline notifications (separate channel from kured) |
+
+The previous `api_bearer_token` entry is gone — the chain does not POST to `claude-agent-service`.
+
+### Operational reference
+
+See `docs/runbooks/k8s-version-upgrade.md` for: verifying health, manually triggering detection, killing a stuck Job, skipping a phase, rollback paths (master / worker / mid-flight abort), and SSH key rotation.
--- a/docs/architecture/compute.md
+++ b/docs/architecture/compute.md
@ -18,7 +18,7 @@ graph TB
    subgraph Proxmox["Proxmox VE"]
        direction TB
        MASTER["VM 200: k8s-master<br/>8c / 32GB<br/>10.0.20.100"]
-        NODE1["VM 201: k8s-node1<br/>16c / 32GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
+        NODE1["VM 201: k8s-node1<br/>16c / 48GB<br/>GPU Passthrough<br/>nvidia.com/gpu=true:PreferNoSchedule"]
        NODE2["VM 202: k8s-node2<br/>8c / 32GB"]
        NODE3["VM 203: k8s-node3<br/>8c / 32GB"]
        NODE4["VM 204: k8s-node4<br/>8c / 32GB"]
@ -62,7 +62,7 @@ graph TB
 | Model | Dell PowerEdge R730 |
 | CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) |
 | Total Cores/Threads | 22 cores / 44 threads |
-| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~160GB total (5 K8s VMs x 32GB) |
+| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) |
 | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) |
 | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD |
 | Hypervisor | Proxmox VE |
@ -72,12 +72,20 @@ graph TB
 | VM | VMID | vCPUs | RAM | Network | Role | Taints |
 |----|------|-------|-----|---------|------|--------|
 | k8s-master | 200 | 8 | 32GB | vmbr1:vlan20 (10.0.20.100) | Control Plane | `node-role.kubernetes.io/control-plane:NoSchedule` |
-| k8s-node1 | 201 | 16 | 32GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
+| k8s-node1 | 201 | 16 | 48GB | vmbr1:vlan20 | GPU Worker | `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to whichever node carries the GPU) |
 | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None |
 | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None |

-**Total Cluster Resources**: 48 vCPUs, ~160GB RAM (5 nodes x 32GB)
+**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
+
+> **node1 RAM (2026-05-10)**: bumped from 32 → 48 GiB out-of-band via
+> `qm set 201 --memory 49152` because VMID 201 is intentionally not
+> managed by Terraform yet (telmate/proxmox provider bug with iSCSI
+> PVCs — see `infra/stacks/infra/main.tf` line 442). Driver: GPU
+> multi-tenancy (frigate + ytdlp + llama-swap + immich-ml) was
+> hitting 94% memory-request saturation on the old size. Adopt this
+> VM into TF (`module "k8s-node1"`) once we've migrated to bpg/proxmox.

 ### GPU Passthrough

--- a/docs/architecture/llama-cpp.md
+++ b/docs/architecture/llama-cpp.md
@ -0,0 +1,118 @@
+# llama-cpp / llama-swap
+
+## Overview
+
+In-cluster, OpenAI-compatible vision-LLM endpoint. A single
+`mostlygeek/llama-swap:cuda` Deployment fronts three GGUF models
+served by `llama.cpp`'s `llama-server` subprocesses, hot-swapped on
+demand by `llama-swap`. One Service, one `/v1` endpoint, model
+selected by the request body `model` field.
+
+Initial use case: vision-LLM benchmark on a curated Immich album,
+choosing between **Qwen3-VL-8B**, **MiniCPM-V-4.5**, and
+**Qwen3-VL-4B** for instagram-poster's candidate-scoring path.
+Future consumers (Home Assistant, agentic tooling) can hit the same
+endpoint via LiteLLM at the cluster gateway.
+
+First benchmark run (2026-05-10): see
+`infra/docs/benchmarks/2026-05-10-vision-llm.md`. Verdict: **qwen3vl-4b**
+for the request path (3.55 s p50, 100% parse, decisive top-N
+distribution). qwen3vl-8b for caption polish on top picks.
+
+## Why llama.cpp + llama-swap (not Ollama)
+
+Verified across 7+7 research/challenger subagents (2026-05-10):
+
+- **Broader OpenAI-compat surface** — `tool_choice`, `image_url`
+  remote URLs, native bearer auth via `--api-key`, `/reranking`,
+  Anthropic `/v1/messages` shim.
+- **Native observability** — `/metrics`, `/health` returns 503 during
+  model load (proper K8s startup-probe semantics), `/slots` per-slot
+  tracking. Ollama still has the `/metrics` issue
+  [#3144](https://github.com/ollama/ollama/issues/3144) open.
+- **Stricter structured output** — native GBNF on `/completion`,
+  JSON-schema-to-GBNF converter, optional `LLAMA_LLGUIDANCE=ON`.
+- **Vision coverage for our targets** — llama.cpp ≥ b9095 supports
+  Qwen3-VL and MiniCPM-V-4.5 natively; Ollama needs the official
+  `qwen3-vl` tag (community GGUFs broken — split-mmproj
+  [#14575](https://github.com/ollama/ollama/issues/14575)) and the
+  `openbmb/minicpm-v4.5` Ollama tag is 8 months stale.
+
+Ollama still wins for Llama-3.2-Vision (`mllama` cross-attention) and
+ecosystem polish (Go/JS SDKs, langchain-ollama, n8n nodes, HA built-in)
+— the latter is mooted by fronting llama.cpp with **LiteLLM** at the
+gateway.
+
+## Components
+
+| Component | Resource | Purpose |
+|-----------|----------|---------|
+| llama-swap Deployment | `kubernetes_deployment.llama_swap` | One pod, one OpenAI-compat endpoint, hot-swaps model subprocesses |
+| llama-swap ConfigMap | `kubernetes_config_map.llama_swap_config` | YAML model entries (cmd, ttl, checkEndpoint) |
+| llama-swap Service | `kubernetes_service.llama_swap` | ClusterIP `:8080` → `llama-swap.llama-cpp.svc.cluster.local` |
+| Models PVC | `module.nfs_models` (NFS-RWX `/srv/nfs-ssd/llamacpp`) | Shared GGUF store, 30Gi |
+| Download Job | `kubernetes_job_v1.download_models` | Pulls Q4_K_M GGUF + mmproj per model, creates stable `model.gguf` / `mmproj.gguf` symlinks, warms page cache |
+
+## Storage
+
+NFS-SSD on the Proxmox host (`192.168.1.127:/srv/nfs-ssd/llamacpp`).
+Cold model load is ~40s × 3 startups ≈ 2 min in a 25-30 min benchmark
+run (<10%). The download Job warms the kernel page cache after pulling
+GGUFs so first inference reads from warm cache.
+
+If steady-state cold-load latency becomes a problem, **Path B**: carve
+~50Gi from a Proxmox SSD as an LV, attach as a vdisk to k8s-node1,
+mount on-host, expose via a static `kubernetes_persistent_volume` with
+`local` source + node1 affinity. NVMe-class load times. Out of scope
+for the initial deployment.
+
+## GPU allocation
+
+The llama-swap pod requests `nvidia.com/gpu: 1` (whole-T4
+allocation). The shared T4 is also used by Immich's ML pod
+(`immich.immich-machine-learning`); only one of the two can hold the
+GPU at a time. Operator must scale immich-ml to 0 before running a
+benchmark and restore it after:
+
+```bash
+kubectl scale -n immich deploy/immich-machine-learning --replicas=0
+# ... benchmark ...
+kubectl scale -n immich deploy/immich-machine-learning --replicas=1
+```
+
+## Models served
+
+| ID | HF repo | Quant | Ctx | mmproj |
+|----|---------|-------|-----|--------|
+| `qwen3vl-8b` | `Qwen/Qwen3-VL-8B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
+| `minicpm-v-4-5` | `openbmb/MiniCPM-V-4_5-gguf` | Q4_K_M | 3072 | yes |
+| `qwen3vl-4b` | `Qwen/Qwen3-VL-4B-Instruct-GGUF` | Q4_K_M | 3072 | yes |
+
+llama.cpp build pinned via the `llama-swap:cuda` image (ships a
+recent llama.cpp ≥ b9095, which includes Qwen3-VL projection fix
+[#20899](https://github.com/ggml-org/llama.cpp/issues/20899) and
+mtmd Flash-Attention regression fix
+[#16962](https://github.com/ggml-org/llama.cpp/issues/16962)).
+
+## Endpoints
+
+- `GET /v1/models` — list configured models
+- `POST /v1/chat/completions` — standard OpenAI chat (vision via
+  `image_url` content parts, base64 or remote URL)
+- `POST /completion` — llama.cpp native completion (preferred for
+  GBNF-constrained structured output to avoid 2026 regression magnet
+  on `/v1/chat/completions`)
+- `GET /metrics` — Prometheus
+- `GET /health` — 200 once a model is fully loaded; 503 during load
+
+## Known issues / decisions
+
+- **Cluster-wide GPU contention** — only one of llama-swap or
+  immich-ml can hold the T4. No GPU sharing solution wired in
+  (MPS/MIG would help but T4 has no MIG and MPS is overkill for two
+  workloads).
+- **Filename-agnostic config** — the download Job creates stable
+  `model.gguf` / `mmproj.gguf` symlinks per model dir so the
+  llama-swap config doesn't need to track exact HF filenames (which
+  change between releases).
+- **TF schema** — `llama-cpp` (PG backend on dbaas).
--- a/docs/architecture/monitoring.md
+++ b/docs/architecture/monitoring.md
@ -57,7 +57,7 @@ graph TB
 |-----------|---------|----------|---------|
 | Prometheus | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Metrics collection and storage, scrape configs for all services |
 | Grafana | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Visualization, 14+ dashboards (API server, CoreDNS, GPU, UPS, etc.) |
-| Loki | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
+| Loki | **DEPLOYED 2026-05-18** (SingleBinary mode, 30d retention, 50Gi PVC on `proxmox-lvm`, ruler enabled → Alertmanager). Re-enabled from previous "operational overhead" disable. Ships logs via Alloy DaemonSet (now on all nodes including master after 2026-05-19 toleration add). | `stacks/monitoring/modules/monitoring/` | Log aggregation and querying |
 | Alertmanager | Latest (Diun monitored) | `stacks/monitoring/modules/monitoring/` | Alert routing with cascade inhibitions |
 | Uptime Kuma | Latest (Diun monitored) | `stacks/uptime-kuma/` | Internal + external HTTP monitors, status page |
 | External Monitor Sync | Python 3.12 | `stacks/uptime-kuma/` | CronJob (10min) syncs `[External]` monitors from `cloudflare_proxied_names` |
@ -176,6 +176,35 @@ The email monitoring system uses a CronJob (`email-roundtrip-monitor`, every 10

 Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (port 993) on `10.0.20.202`, and Dovecot exporter metrics on port 9166.

+#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
+
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
+
+| # | Source | Event | Severity |
+|---|---|---|---|
+| K2 | kube-audit | SA token used from outside cluster | critical |
+| K3 | kube-audit | Secret read in vault/sealed-secrets/external-secrets by non-allowlisted SA | critical |
+| K4 | kube-audit | Exec into vault/kube-system/dbaas/cnpg-system pod by non-allowlisted user | warning |
+| K5 | kube-audit | Mass delete (>5 Pod/Secret/CM in 60s) | critical |
+| K6 | kube-audit | Audit policy itself modified | critical |
+| K7 | kube-audit | New `*,*` ClusterRole created | warning |
+| K8 | kube-audit | Anonymous binding granted | critical |
+| K9 | kube-audit | `me@viktorbarzin.me` request from non-allowlist sourceIP | critical |
+| V1 | vault-audit | Root token created | critical |
+| V2 | vault-audit | Audit device disabled/modified | critical |
+| V3 | vault-audit | Seal status changed | critical |
+| V4 | vault-audit | Policy written/modified (allowlist Terraform actor) | warning |
+| V5 | vault-audit | Auth failure spike >10/min | warning |
+| V6 | vault-audit | Token with policies different from parent created | critical |
+| V7 | vault-audit | Viktor's entity_id from non-allowlist remote_addr (requires `x_forwarded_for_authorized_addrs`) | critical |
+| S1 | sshd-pve | sshd auth success from non-allowlist IP | critical |
+
+K1 (cluster-admin grant) intentionally skipped — see security.md.
+
+Allowlist source-IP CIDRs (used by K2, K9, V7, S1): `10.0.20.0/22`, `192.168.1.0/24`, K8s pod CIDR, K8s service CIDR, Headscale tailnet. Policy: no public-IP access; all admin paths transit LAN or Headscale.
+
+IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-policy tuning. Retention: 90d for security streams.
+
 #### Backup Alerts
 - **PostgreSQLBackupStale**: >36h since last backup
 - **MySQLBackupStale**: >36h since last backup
--- a/docs/architecture/security.md
+++ b/docs/architecture/security.md
@ -111,16 +111,20 @@ Namespaces are labeled with a tier (`tier: 0` through `tier: 4`). Kyverno auto-g

 This prevents resource exhaustion and enforces governance without manual quota management.

-#### Security Policies (ALL in Audit Mode)
+#### Security Policies

-**Why audit mode?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.
+**Why audit mode first?** Gradual rollout without breaking existing workloads. Policies collect violations, then selectively enforced after cleanup.

-| Policy | Purpose | Enforcement |
-|--------|---------|-------------|
-| `deny-privileged-containers` | Block privileged pods | Audit |
-| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit |
-| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit |
-| `require-trusted-registries` | Only allow approved image registries | Audit |
+**Wave 1 plan (locked 2026-05-18, see beads `code-8ywc`):** all four below flip from Audit → Enforce with `failurePolicy: Ignore` preserved and an exclude list covering the 31 critical namespaces (keel, calico-system, authentik, vault, cnpg-system, dbaas, monitoring, traefik, technitium, mailserver, kyverno, metallb-system, external-secrets, proxmox-csi, nfs-csi, nvidia, kube-system, cloudflared, crowdsec, reverse-proxy, reloader, descheduler, vpa, redis, sealed-secrets, headscale, wireguard, xray, infra-maintenance, metrics-server, tigera-operator). Phased: one policy per day with PolicyReport observation.
+
+| Policy | Purpose | Current | Planned (wave 1) |
+|--------|---------|---------|------------------|
+| `deny-privileged-containers` | Block privileged pods | Audit | **Enforce** |
+| `deny-host-namespaces` | Block hostNetwork/hostPID/hostIPC | Audit | **Enforce** |
+| `restrict-sys-admin` | Block CAP_SYS_ADMIN | Audit | **Enforce** |
+| `require-trusted-registries` | Only allow approved image registries (forgejo.viktorbarzin.me, docker.io, ghcr.io, quay.io, registry.k8s.io, gcr.io, oci://ghcr.io/sergelogvinov) | Audit | **Enforce** |
+
+Cosign `verify-images` is **deferred** beyond wave 1 — needs image-signing infrastructure (Sigstore / cosign + KMS) before it can enforce meaningfully.

 #### Operational Policies

@ -163,6 +167,112 @@ Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap l

 **Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`

+### Audit Logging & Anomaly Detection (Wave 1)
+
+Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
+
+| Item | State |
+|---|---|
+| W1.2 Vault `file` audit device | **LIVE** — `vault_audit.file` in `stacks/vault/main.tf:287`, writing to `/vault/audit/vault-audit.log` on `proxmox-lvm-encrypted` PVC |
+| W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly |
+| W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. |
+| W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. |
+| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. |
+| W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. |
+| W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. |
+| W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. |
+| W1.7 NetworkPolicy phased enforce | **PENDING** — needs ~1 week of W1.6 observation, then build empirical allowlist from Loki queries, flip GNP rules from `[Log, Allow]` to `[Allow specific dests, Deny rest]`. |
+
+The block below documents the locked design.
+
+Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
+
+#### Detection sources
+
+| Source | Mechanism | Ships via | Loki job label |
+|---|---|---|---|
+| K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` |
+| Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` |
+| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` |
+| Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) |
+
+#### Alert rules (16 total)
+
+Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
+
+**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
+
+| # | Event | Severity |
+|---|---|---|
+| K2 | ServiceAccount token used from outside cluster (sourceIPs not in pod CIDR or trusted LAN) | critical |
+| K3 | Secret READ in `vault`, `sealed-secrets`, `external-secrets` namespaces by a non-allowlisted ServiceAccount | critical |
+| K4 | Exec into a pod in `vault`, `kube-system`, `dbaas`, `cnpg-system` (excluding `me@viktorbarzin.me` + 1 break-glass SA) | warning |
+| K5 | >5 deletes of `Pod`, `Secret`, or `ConfigMap` in 60s by any single actor | critical |
+| K6 | `audit-log-path` flag or audit policy modified on kube-apiserver | critical |
+| K7 | New ClusterRole created with `verbs: ["*"]` and `resources: ["*"]` | warning |
+| K8 | Anonymous binding granted (any RoleBinding/CRB referencing `system:anonymous` or `system:unauthenticated`) | critical |
+| K9 | Authenticated request where `user.username == "me@viktorbarzin.me"` AND `sourceIPs[0]` NOT in allowlist CIDRs | critical |
+
+**Vault audit (V1-V7):**
+
+| # | Event | Severity |
+|---|---|---|
+| V1 | Root token created | critical |
+| V2 | Audit device disabled or modified | critical |
+| V3 | Seal status changed (`sys/seal` write) | critical |
+| V4 | Policy written or modified (allowlist Terraform-driven writes by source IP / token role) | warning |
+| V5 | Authentication failure spike >10/min on any auth method | warning |
+| V6 | Token created with policies different from parent (privilege escalation) | critical |
+| V7 | Vault audit event where `auth.entity_id == <viktor-entity-id>` AND `remote_addr` NOT in allowlist CIDRs | critical |
+
+**Host (S1):**
+
+| # | Event | Severity |
+|---|---|---|
+| S1 | PVE sshd auth success from source IP NOT in allowlist | critical |
+
+#### Allowlist — "expected source IPs" for K2, K9, V7, S1
+
+| CIDR | Source |
+|---|---|
+| `10.0.20.0/22` | VLAN 20 (K8s cluster + main LAN) |
+| `192.168.1.0/24` | Proxmox host LAN + Sofia LAN (same RFC1918 block in both physical locations; cross-site traffic transits Headscale so the CIDR matches only on-LAN clients in either location) |
+| K8s pod CIDR (verify at implementation time) | In-cluster pods talking to apiserver |
+| K8s service CIDR | Service-to-apiserver traffic |
+| Headscale tailnet | VPN-connected devices |
+
+**Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert.
+
+#### Why no canary tokens
+
+Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden.
+
+#### Why no K1 (cluster-admin grant detection)
+
+Viktor opted out. Gap covered indirectly by K7 (new `*,*` ClusterRole created), K8 (anonymous binding), and K3 (secret read on Vault namespace) — most attacker progressions toward cluster-admin trigger one of these.
+
+#### IOPS / disk-wear
+
+Custom audit policy reduces volume ~80-90% vs default Metadata-everywhere. Loki tuned for fewer larger chunks: `chunk_target_size: 1.5MB`, `chunk_idle_period: 30m`, snappy compression. Retention 90d for security streams (matches Technitium DNS query log precedent). Net estimate: ~1-2 GB/day additional disk writes after tuning.
+
+### NetworkPolicy Default-Deny Egress (Wave 1 — observe-then-enforce, tier 3+4)
+
+Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
+
+**Approach (γ): cluster-wide observe-then-enforce.**
+
+1. **Week 0:** Enable Calico flow logs cluster-wide. Apply a GlobalNetworkPolicy with selector `tier in {tier-3, tier-4}`, `action: Log` (no Deny). Ship flow logs to Loki.
+2. **Week 1:** Build per-namespace egress allowlist from observed traffic. Common allowlist module `tier3_egress_baseline` covers DNS, NTP, internal Vault/ESO/Authentik, Brevo SMTP, Cloudflare API, OAuth providers. Per-namespace add-ons for service-specific external destinations.
+3. **Week 2-3:** Apply default-deny + allowlist per-namespace, starting `recruiter-responder` (smallest egress footprint — local llama-cpp). Watch 24-48h per namespace, iterate. Roll out 3-5 namespaces/day.
+
+**Scope exclusions:** tier 0/1/2 namespaces (defer to wave 2), 31 critical infra namespaces (same exclude list as Kyverno).
+
+**DNS handling:** Calico GlobalNetworkPolicy supports domain-based rules via the `domains:` selector which queries CoreDNS internally. Static IPs reserved for fixed-IP services (Brevo SMTP relay).
+
+**Known risks:**
+- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
+- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
+
 ### TLS & HTTP/3

 **Traefik** handles TLS termination:
--- a/docs/benchmarks/2026-05-10-vision-llm.md
+++ b/docs/benchmarks/2026-05-10-vision-llm.md
@ -0,0 +1,253 @@
+# Vision-LLM benchmark — Malaga / Seville album
+
+**Run ID:** `2026-05-10-1424` · **Date:** 2026-05-10 · **Operator:** wizard
+
+100 photos randomly sampled (seed=42) from the Immich album `🇪🇸 Malaga
+Seville` (`46565b85-7580-4ac1-91a6-1ece2cf8634d`, 1556 image assets +
+9 videos), scored by three local vision-LLMs served by `llama-swap`
+on a single Tesla T4. Goal: pick a model to wire into
+`instagram-poster`'s `/candidates` ranking path.
+
+## TL;DR
+
+**Recommendation: `qwen3vl-4b`.**
+
+- **Fastest** by a wide margin (3.55 s p50, 60% of qwen3vl-8b),
+  important once this is in the request path of `/candidates`.
+- **100% structured-output success** — same as the other two; GBNF
+  grammar enforcement worked across the board.
+- **Captions are competitive** with the 8B model in qualitative review
+  (tied or close on 8/10 sampled photos; 8B wins on Flair, 4B wins on
+  Latency).
+- **Most decisive scorer** — 47/100 photos got IG-fit=9 vs 17 for
+  qwen3vl-8b and 9 for minicpm. We get more signal at the top end
+  for ranking.
+
+Use qwen3vl-8b for *manual* caption refinement (top-1 of the day) if
+caption polish matters. Use minicpm-v-4-5 for nothing immediate — it's
+the most conservative scorer and the slowest at high quantiles, with
+no offsetting wins in this dataset.
+
+## Setup
+
+- Hardware: 1× Tesla T4 (16 GiB VRAM), `nvidia.com/gpu` time-slicing
+  enabled (replicas=100), pod scheduled on `k8s-node1`.
+- Server: `mostlygeek/llama-swap:cuda` (ships llama.cpp `b9085-046e28443`)
+  on `llama-swap.llama-cpp.svc.cluster.local:8080`.
+- Models: GGUF Q4_K_M, mmproj F16 except qwen3vl-4b which used the
+  Q8_0 mmproj (alphabetically first matching the glob).
+- Image prep: EXIF-transposed, long-edge resized to 1024 px, JPEG q=90,
+  base64-embedded as `image_url` data URLs.
+- Generation: `temperature=0`, `top_k=1`, `enable_thinking=false`,
+  GBNF grammar pinning the JSON schema (6 fields, 1–10 ints, ≤8 tags).
+- Run isolation: `immich-machine-learning` scaled to 0 for the
+  duration to avoid noisy GPU contention. *(Diagnostic note: the
+  scheduling failure that triggered this was actually node1 RAM —
+  not GPU — at 94% allocated. Time-slicing was already on. Bumping
+  node1 RAM is tracked as a follow-up.)*
+
+## Headline numbers
+
+| model | n | parse_ok | p50 latency | p95 latency | median IG-fit | median aesthetic |
+|-------|---|----------|-------------|-------------|---------------|------------------|
+| **qwen3vl-4b** | 100 | 100% | **3.55 s** | 4.06 s | 8.0 | 8.0 |
+| minicpm-v-4-5 | 100 | 100% | 5.62 s | 6.00 s | 7.0 | 8.0 |
+| qwen3vl-8b | 100 | 100% | 5.98 s | 6.64 s | 7.0 | 8.0 |
+
+Total wall time for the run: **33 m 32 s** (300 calls + 3 cold loads
+of ~30 s each).
+
+## What each model is good at
+
+### qwen3vl-4b — fast and decisive
+- p50 3.55 s — comfortable for adding to `/candidates` request path.
+- IG-fit distribution skews right (47 nines), spreading 6 → 9 fairly
+  evenly, which is what you want from a *ranker*.
+- Captions are emoji-friendly, hashtag-friendly, sometimes
+  hallucinatory (e.g. labelled a Seville street as "Barcelona's
+  colourful streets" once).
+- Failure mode to watch: occasional double-down on the same caption
+  template ("Lost in the tiles. 🌿" repeated across two unrelated
+  blue-dress photos).
+
+### minicpm-v-4-5 — conservative, terse
+- Most conservative scorer: 65% of photos got IG-fit=7. Only 9 nines.
+  Less useful as a top-N ranker because the top is squashed.
+- Fastest p95 of the three (6.0 s) but slower p50 than qwen3vl-4b.
+- Captions are short and lower-case ("azulejo dreams.",
+  "sunshine & secrets") — distinct voice but less Instagram-native.
+
+### qwen3vl-8b — most polished captions
+- Best subject identification (specifically named "Metropol Parasol"
+  and "Plaza de España" by name where the others said "modern
+  architecture" / "plaza").
+- Captions read well: "Coffee & calm vibes ☕️", "where modern meets
+  historic under a brilliant sky".
+- Slowest p50 (5.98 s) and tightest score distribution (median 7,
+  17 nines) — middle of the pack as a ranker.
+
+## Top-10 agreement (Kendall-tau-style overlap)
+
+How many of each model's top-10 IG-fit picks appear in another
+model's top-10:
+
+| pair | overlap |
+|------|---------|
+| qwen3vl-4b ↔ qwen3vl-8b | 5/10 |
+| minicpm-v-4-5 ↔ qwen3vl-4b | 4/10 |
+| minicpm-v-4-5 ↔ qwen3vl-8b | 4/10 |
+
+Read: there's moderate but not strong agreement. The models pick
+roughly half the same "best" photos and half different ones. For
+ranking, that's a healthy sign — they're not collapsing to a single
+notion of "good", so combining their scores would add real signal.
+
+## Cost-equivalent context
+
+Approximate cost to score the same 100 photos via cloud APIs
+(prompt ≈ 1100 tokens incl. image, completion ≈ 100 tokens):
+
+| backend | input | output | per-100 photos |
+|---------|-------|--------|----------------|
+| Local llama-swap on T4 | — | — | ≈ $0.04 (electricity, ~70 W × 7 min) |
+| Anthropic Haiku 4.5 | $1.00/M | $5.00/M | ≈ $0.15 |
+| Anthropic Sonnet 4.6 | $3.00/M | $15.00/M | ≈ $0.45 |
+| Google Gemini 2.5 Flash | $0.30/M | $2.50/M | ≈ $0.05 |
+
+Local is competitive with Gemini Flash on marginal cost. The case
+for keeping it local is privacy (Immich originals never leave the
+LAN), no rate-limits, and no per-call quota planning. The case
+against is the GPU is finite — adding this to a request path means
+sharing T4 time with frigate, ytdlp, and (when we restore it)
+immich-ml. With time-slicing on, that's tractable but each tenant
+gets fewer full-GPU kernels.
+
+## Sample captions (10 representative photos, all 3 models)
+
+`9f9c9faf-...` — man at ornate arch in Seville
+- **minicpm-v-4-5** (7/8): sunshine & secrets
+- **qwen3vl-4b** (9/8): Found the perfect spot to pause. 🌞 #travel #architecture
+- **qwen3vl-8b** (9/8): Standing in Seville's magic, framed by history and color.
+
+`0d53158b-...` — Metropol Parasol
+- **minicpm-v-4-5** (9/8): Seville's skyward statement.
+- **qwen3vl-4b** (9/8): Seville's iconic modern structure against a vibrant blue sky.
+- **qwen3vl-8b** (9/8): The Metropol Parasol in Seville, where modern meets historic under a brilliant sky.
+
+`23137c1c-...` — coffee shot
+- **minicpm-v-4-5** (8/7): a morning pick-me-up and a smile.
+- **qwen3vl-4b** (8/7): Coffee & chill. Perfectly imperfect moment.
+- **qwen3vl-8b** (9/8): Coffee & calm vibes ☕️
+
+`3376ee87-...` — woman against tiled wall
+- **minicpm-v-4-5** (7/8): azulejo dreams.
+- **qwen3vl-4b** (9/8): Lost in the tiles. 🌿
+- **qwen3vl-8b** (9/8): Blue tiles, golden light, and a moment of quiet grace.
+
+`4f2fea45-...` — courtyard
+- **minicpm-v-4-5** (7/8): hidden gems of seville
+- **qwen3vl-4b** (7/8): Timeless beauty in a Spanish courtyard. 🌿
+- **qwen3vl-8b** (7/8): A serene courtyard in Seville, where palm trees sway under the sun.
+
+`ea713729-...` — flower-market street (qwen3vl-4b confused location)
+- **minicpm-v-4-5** (7/8): Seville's hidden gems.
+- **qwen3vl-4b** (7/8): Walking through *Barcelona's* colorful streets, backlit by golden hour.
+- **qwen3vl-8b** (7/8): Walking through Seville's vibrant streets, lavender in hand.
+
+The full list of 10 sample sets is in the auto-generated section
+below; the raw 300-row JSON is at `benchmark-2026-05-10-1424.json`
+in this directory.
+
+## Operational cost during the run
+
+- llama-swap pod (1× T4 wholly allocated for the duration): ~33 min.
+- Immich-ML downtime: ~33 min. New uploads weren't auto-tagged or
+  CLIP-embedded during this window. No user-visible impact (Immich
+  search against already-indexed assets still worked via pgvector).
+- Network egress: zero — Immich originals stayed on the LAN, all
+  scoring traffic was in-cluster.
+
+## Reproducibility
+
+```bash
+DATA_DIR=/tmp/benchmark \
+  IMMICH_API_KEY=… \
+  LLAMA_SWAP_URL=http://localhost:18080 \
+  poetry run python -m instagram_poster.benchmark run \
+    --album-id 46565b85-7580-4ac1-91a6-1ece2cf8634d \
+    --models qwen3vl-8b,minicpm-v-4-5,qwen3vl-4b \
+    --limit 100 --random-seed 42 --run-id 2026-05-10-1424
+```
+
+The same `--random-seed` reproduces the photo sample exactly. Prompt
+version `4bbb7e7721da24d9` is the SHA-256 of the system prompt + user
+prompt + GBNF grammar; rerunning under the same prompt version against
+the same seed should produce within-noise identical scores (the models
+themselves are temperature=0, top_k=1).
+
+## Next steps
+
+- **Wire `qwen3vl-4b` into `instagram-poster`** as an additional ranking
+  signal alongside CLIP-based recency in `/candidates`. Cache the score
+  per asset_id so we don't re-pay 4 s on every list refresh.
+- **Bump k8s-node1 RAM** so immich-ml + llama-swap can co-exist (drain
+  → resize → uncordon, with kubelet `systemReserved` adjusted in
+  `stacks/infra/main.tf`).
+- **Re-benchmark with shared GPU** once node1 RAM is bumped, to get
+  realistic latency numbers when the T4 is also under load from
+  immich-ml and frigate.
+- **Front llama-swap with LiteLLM** so Home Assistant and any other
+  consumer can hit one OpenAI-compat gateway. Track separately.
+
+---
+
+## Auto-generated report
+
+Below is the unedited output of `python -m instagram_poster.benchmark
+report --run-id 2026-05-10-1424`, kept for diff-checking against
+future runs.
+
+### Per-model summary
+
+| model | n | parse_ok % | error % | p50 latency | p95 latency | median IG-fit | median aesthetic |
+|-------|---|-----------|--------|------------|-------------|--------------|------------------|
+| minicpm-v-4-5 | 100 | 100.0 | 0.0 | 5617 ms | 5998 ms | 7.0 | 8.0 |
+| qwen3vl-4b | 100 | 100.0 | 0.0 | 3552 ms | 4063 ms | 8.0 | 8.0 |
+| qwen3vl-8b | 100 | 100.0 | 0.0 | 5981 ms | 6637 ms | 7.0 | 8.0 |
+
+### Score histograms (instagram_fit_score 1–10)
+
+#### minicpm-v-4-5
+```
+ 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
+ 6: ███████ (7)
+ 7: █████████████████████████████████████████████████████████████████ (65)
+ 8: ███████████████████ (19)
+ 9: █████████ (9)
+10: (0)
+```
+
+#### qwen3vl-4b
+```
+ 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
+ 6: █████ (5)
+ 7: ████████████████ (16)
+ 8: ████████████████████████████████ (32)
+ 9: ███████████████████████████████████████████████ (47)
+10: (0)
+```
+
+#### qwen3vl-8b
+```
+ 1: (0)   2: (0)   3: (0)   4: (0)   5: (0)
+ 6: ███████████ (11)
+ 7: ███████████████████████████████████████████████████████ (55)
+ 8: █████████████████ (17)
+ 9: █████████████████ (17)
+10: (0)
+```
+
+### Top-10 by IG-fit per model — see `benchmark-2026-05-10-1424.json`
+
+(Tables omitted from the curated report; available in the JSON dump
+alongside this file.)
--- a/docs/benchmarks/benchmark-2026-05-10-1424.json
+++ b/docs/benchmarks/benchmark-2026-05-10-1424.json
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@ -0,0 +1,72 @@
+# Known Issues
+
+Catalog of recurring or upstream-blocked failure modes with their
+mitigations. Anything that requires a manual workaround should be
+documented here — if a future session can hit the same issue, it
+deserves an entry. Each entry should have: symptom, root cause, current
+mitigation, and the trigger that lets us un-mitigate.
+
+---
+
+## 2026-05-17 — NVIDIA GPU driver fails on Ubuntu 26.04 (kernel 7.0.x)
+
+**Symptom.** `nvidia-driver-daemonset-*` in `nvidia` namespace
+CrashLoopBackOff on the GPU node. Logs say:
+
+    Could not resolve Linux kernel version
+
+… or, post chart-upgrade, ImagePullBackOff on a `*-ubuntu26.04` tag.
+
+**Root cause.** NVIDIA has not published any `nvcr.io/nvidia/driver:*-ubuntu26.04`
+images (0 tags as of 2026-05-17; verified with skopeo). When a k8s node
+running the GPU operator gets `do-release-upgrade`'d to Ubuntu 26.04
+Resolute Raccoon, NFD relabels the node with
+`feature.node.kubernetes.io/system-os_release.VERSION_ID=26.04` and the
+operator computes the driver image tag `<version>-ubuntu26.04` — which
+404s on pull. Both gpu-operator chart v25.10.1 and v26.3.1 exhibit the
+same behaviour once NFD has detected 26.04.
+
+**Current mitigation (active on k8s-node1 since 2026-05-17).**
+
+1. Host kernel rolled back to `6.8.0-117-generic` (Ubuntu 24.04 HWE
+   kernel — still installed at `/lib/modules/6.8.0-117-generic`).
+2. `apt-mark hold` on: `linux-image-6.8.0-117-generic`,
+   `linux-headers-6.8.0-117-generic`, `linux-modules-6.8.0-117-generic`,
+   `linux-image-generic`, `linux-headers-generic`, `linux-generic`.
+3. `/etc/os-release` on k8s-node1 replaced with the Ubuntu 24.04 Noble
+   content (was a symlink to `/usr/lib/os-release`; now a regular file
+   under `/etc`). Backup at `/etc/os-release.bak-pre-spoof-2026-05-17`.
+   NFD-worker reads `/etc/os-release` and now reports
+   `system-os_release.VERSION_ID=24.04`, so the operator picks the
+   matching ubuntu24.04 driver image which DOES exist.
+4. gpu-operator chart pinned to v25.10.1 in
+   `stacks/nvidia/modules/nvidia/main.tf`; driver pinned to 570.195.03
+   in `stacks/nvidia/modules/nvidia/values.yaml`.
+
+**This is gross but stable.** The kernel matches what 24.04 ships, and
+the `apt-mark hold` keeps it that way. /etc/os-release lying about the
+OS only affects userland callers that key off it — none of our
+deployed services do (we verified by grepping the cluster).
+
+**Trigger to un-mitigate.** Periodically check for ubuntu26.04 driver
+tags. Once they appear:
+
+    docker run --rm quay.io/skopeo/stable list-tags \
+        docker://nvcr.io/nvidia/driver \
+      | python3 -c "import json,sys; d=json.load(sys.stdin); \
+          print(len([t for t in d['Tags'] if 'ubuntu26.04' in t]))"
+
+When that returns a non-zero count:
+
+1. Restore `/etc/os-release` from backup
+    (`/etc/os-release.bak-pre-spoof-2026-05-17`) on k8s-node1.
+2. Remove apt-mark holds for the kernel packages.
+3. `apt full-upgrade` to land the latest 26.04 kernel + reboot.
+4. Bump the gpu-operator chart pin to the matching version that ships
+   ubuntu26.04 driver images. Bump `driver.version` in values.yaml to
+   the current chart default.
+
+**See also.** `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md`
+for full incident timeline + the recovery procedure.
+
+**Beads.** `code-8vr0` (P1, OPEN).
--- a/docs/plans/2026-04-20-infra-audit-design.md
+++ b/docs/plans/2026-04-20-infra-audit-design.md
@ -0,0 +1,265 @@
+# Infra Audit — 2026-04-20
+
+**Status**: Design (post-research, post-challenge)
+**Author**: Viktor Barzin (audit run by Claude)
+**Scope**: `infra/` Terragrunt stacks + platform services (`claude-agent-service`, `claude-memory-mcp`, `beadboard`, `broker-sync`)
+**Goals**: Reliability · Declarative-first · Reduced maintenance overhead · Maintained scalability
+**Method**: 5 parallel research agents (R1 Reliability, R2 Declarative, R3 Maintenance, R4 Scalability, R5 Security) → 91 raw findings → 2 independent challengers → filtered/corrected/ranked backlog below.
+
+## Context
+
+The home-lab has grown into a mature stack (105 Tier-1 Terragrunt stacks + 6 Tier-0 SOPS, CNPG, Vault+ESO, Kyverno, Traefik, Authentik, CrowdSec, Woodpecker CI, Redis-Sentinel, MySQL-standalone, Proxmox-NFS). Recent work has been consolidation: MySQL InnoDB-Cluster → standalone (2026-04-16), Redis Phase 7 refactor (2026-04-19), NFS fsid=0 SEV1 post-mortem (2026-04-14), Authentik outpost /dev/shm fix (2026-04-18). This audit surveys everywhere that remains — what's brittle, what's manual, what's dark, what hasn't caught up to recent decisions — and ranks fixes by impact and by operator fatigue.
+
+## Corrections up-front (challenger round)
+
+Before reading the backlog, these findings from the research phase are **dropped, corrected, or reframed** — challengers spot-checked live state and proved them wrong, already-solved, or intentional-by-design. Being honest about this is the point of the challenge round:
+
+| Finding as stated | Actual state | Action |
+|---|---|---|
+| R4#1: Worker nodes 86-91% memory saturation | Live `kubectl top nodes`: 44-51% across k8s-node{1-4} | **DROPPED** — bad metric pull |
+| R4#2: Frigate CPU unbounded (1.5 CPU request, no limit) | Cluster policy is **all CPU limits removed** to avoid CFS throttling (`infra/.claude/CLAUDE.md` → Resource Management) | **DROPPED** — by design |
+| R4#7: Redis no `maxmemory-policy` | `infra/stacks/redis/modules/redis/main.tf:254` sets `maxmemory-policy allkeys-lru` (Phase 7, 2026-04-19) | **DROPPED** — already solved |
+| R2#1: 307 Kyverno lifecycle markers is a drift risk | Markers are the **canonical discoverability tag** — `ignore_changes` only accepts static attribute paths, snippet convention is the only viable path; reframe as *"markers are fine, missing markers are the risk"* | **REFRAMED** |
+| R2#3: 140 `ignore_changes` blocks | Actual: **310** across `.tf` files (2.2× off) | **CORRECTED** |
+| R3#10: 65 CronJobs | Actual: 59 (10% off) | **CORRECTED** |
+| R1#1: 47 deployments missing probes | Actual: **115 missing at least one probe; 103 missing both** | **CORRECTED (much worse than reported)** |
+| R1#9: MySQL standalone no HA/PDB | Intentional post-2026-04-16 migration from InnoDB Cluster. Backup + restore matter; HA is explicit deferred. | **REFRAMED** — split into HA (deferred) / backup-restore (open) / connection pool (open) |
+| R1#10: PDB gaps include Traefik, Authentik | Traefik & Authentik PDBs `minAvailable=2` exist (CLAUDE.md). The real gaps are **CrowdSec LAPI, Calico-apiserver, ESO webhook, Woodpecker-server** | **CORRECTED (list pruned)** |
+| R5#2: 4 Kyverno security policies in Audit | **All 16 ClusterPolicies are in Audit** — zero in Enforce. | **CORRECTED (worse)** |
+
+---
+
+## Executive summary — top 5 cross-cutting themes
+
+These are the themes that survive the challenge round and hit ≥2 concerns. Each headline is a 1-line hook; deep-dives below.
+
+1. **Declarative escape hatches (NFS exports, master-node file provisioners, null_resource initializers)** — `/etc/exports` is not in Terraform, which is the **root cause of the 2026-04-14 SEV1**; 6 null_resources + 3 SSH file provisioners still orchestrate critical state. *Hits R2 + R1 + R3.*
+2. **Observability has blind spots where pain would actually come from** — no OOMKill alert routing, no NFS capacity monitor, no GPU utilization dashboard, no ESO refresh-lag alert, no CronJob success-rate summary. Alerts exist but they don't cover the operator's real failure modes. *Hits R1 + R3 + R4.*
+3. **Supply-chain hygiene: image pinning + Renovate + admission signing** — 84 `:latest` tags in production TF, zero Renovate/Dependabot across 18 repos (~15 hr/mo toil by estimate), no cosign/trivy on push. Single theme unifies security posture, maintenance toil, and determinism. *Hits R3 + R5.*
+4. **Reliability-probes & graceful shutdown are genuinely uneven** — 115 deployments missing at least one probe (incl. 103 missing both), 50+ Recreate deployments with no `terminationGracePeriodSeconds`/`preStop`. This is the quietly-largest reliability debt. *Hits R1 + R3 (pager toil).*
+5. **Backup coverage is uneven: 30+ PVCs lack app-level CronJobs** — Proxmox host snapshots cover the disk, but Forgejo (!), Affine, Paperless, Hackmd, Matrix, Owntracks have no app-aware dumps. Restore granularity is file-level, not entity-level. *Hits R1 + R5 (compliance) + R3 (restore rehearsal toil).*
+
+Honourable mentions that didn't make top 5 but sit just below: Kyverno audit→enforce transition (security), ESO refresh-lag alert (secrets reliability), Vault hardening (audit log offsite, root-token K8s-secret scope), Cloudflared tunnel-token SPOF (not replica SPOF — those are 3), Dolt PVC sizing + backup.
+
+---
+
+## Scoring method
+
+Two parallel rankings — scan both.
+
+**Rank A — Impact × Reversibility (the original formula)**
+`score = Impact × (6 - Effort) × (6 - Risk)` — each dimension 1-5.
+
+**Rank B — Operator fatigue weight**
+`score = Impact × (6 - Effort) × FatigueWeight` where `FatigueWeight = 3` if the finding introduces *daily/weekly manual toil* and `1` otherwise. This re-ranks by how much pain the unfixed state causes per month.
+
+Both rankings below. When they agree, that's the clear signal. When they diverge, that's where Rank B (fatigue) wins — Viktor has stated operator fatigue dominates abstract risk for a solo-operator lab.
+
+---
+
+## Ranked backlog (filtered, deduplicated, corrected)
+
+Counts below reflect **post-challenge corrected numbers**. Every row has a reference verified either by a spot-check (file:line) or a live cluster command.
+
+| ID | Title | Concerns | Impact | Effort | Risk | Rank A | Rank B | Refs |
+|---|---|---|---:|---:|---:|---:|---:|---|
+| F01 | NFS `/etc/exports` not in Terraform (SEV1 root cause) | R2+R1 | 5 | 3 | 2 | **60** | **45** | `infra/scripts/pve-nfs-exports`, PM 2026-04-14 |
+| F02 | 115 deployments missing probes (103 missing both) | R1+R3 | 5 | 3 | 2 | **60** | **45** | `kubectl get deploy -A -o json` |
+| F03 | Zero Renovate/Dependabot across 18 repos | R3+R5 | 4 | 2 | 1 | **80** | **48** | `find /home/wizard/code -name ".renovaterc*"` → 0 results |
+| F04 | 84 `:latest` image tags in production TF | R3+R5+R4 | 4 | 2 | 2 | **64** | **48** | `grep -rn ':latest' infra/stacks` |
+| F05 | No OOMKill / unschedulable / node-CPU alert | R1+R4+R3 | 5 | 3 | 1 | **75** | **45** | Grep Prometheus rules — no `OOMKilling` rule present |
+| F06 | 6 `null_resource` DB initializers in `dbaas` stack | R2 | 4 | 3 | 3 | **36** | **36** | `grep -n null_resource infra/stacks/dbaas` |
+| F07 | 3 SSH+file provisioners on k8s-master (audit, OIDC, etcd) | R2 | 4 | 3 | 3 | **36** | **36** | `stacks/platform/modules/rbac/apiserver-oidc.tf` |
+| F08 | ESO refresh-lag alert missing (52 ExternalSecrets) | R1+R5+R3 | 4 | 2 | 1 | **80** | **48** | `stacks/external-secrets/` — no PrometheusRule for refresh lag |
+| F09 | 30+ PVCs without app-level backup CronJobs | R1+R5 | 4 | 3 | 2 | **48** | **36** | Affine, Forgejo, Hackmd, Matrix, Owntracks, Paperless (no `*-backup` CJ) |
+| F10 | Cloudflared tunnel-token SPOF (replicas OK, token shared) | R1+R5 | 3 | 4 | 2 | **24** | **8** | `stacks/cloudflared/` single tunnel credential |
+| F11 | MySQL restore never rehearsed end-to-end | R1+R4+R3 | 4 | 2 | 2 | **64** | **48** | No `mysql-restore-drill` CJ; runbook untested post-migration |
+| F12 | Kyverno policies all 16 in Audit — **sequence carefully** | R2+R5 | 4 | 3 | **4** | **24** | **24** | `kubectl get clusterpolicy` |
+| F13 | 97 RollingUpdate deployments lack explicit surge bounds | R1 | 2 | 2 | 2 | **32** | **12** | TF defaults inherit from Helm/k8s (25%/25%) |
+| F14 | CronJob success-rate dashboard + alert rollup missing | R3+R4 | 3 | 2 | 1 | **60** | **36** | `CronJobTooOld` rule — partial; no 24h rollup |
+| F15 | Authentik outpost /dev/shm fix applied via Helm API only | R1+R5 | 3 | 2 | 2 | **48** | **48** | Not in TF — upgrade-reversion risk |
+| F16 | Dolt (beads DB) no backup CronJob — 2Gi PVC near full | R1+R4 | 4 | 2 | 2 | **64** | **32** | `stacks/beads/` — no `dolt-backup` CJ |
+| F17 | Vault StatefulSet `updateStrategy=OnDelete` (manual roll) | R1+R3 | 2 | 2 | 3 | **24** | **24** | `kubectl get sts -n vault -o yaml` |
+| F18 | No NetworkPolicies cluster-wide | R4+R5 | 4 | **5** | **4** | **8** | **8** | `kubectl get netpol -A` → 0-2 |
+| F19 | RBAC `oidc-power-user` has cluster-wide secrets r/w | R5 | 4 | 3 | 3 | **36** | **12** | `stacks/platform/modules/rbac/` |
+| F20 | No image supply-chain verification (cosign, trivy on push) | R5 | 4 | 4 | 3 | **24** | **8** | No admission controller for signatures |
+| F21 | Vault audit log offsite backup not configured | R5+R1 | 3 | 2 | 1 | **60** | **36** | `stacks/vault/` — no `audit-log-sync` CJ |
+| F22 | Claude-agent, beadboard, broker-sync singletons | R1 | 2 | 2 | 2 | **32** | **12** | `kubectl get deploy -n claude-agent,beadboard,broker-sync` |
+| F23 | 50+ Recreate deployments lack graceful-shutdown hooks | R1+R3 | 3 | 3 | 2 | **36** | **36** | `grep -L terminationGracePeriodSeconds stacks/**` |
+| F24 | CoreDNS scaled via `kubectl scale` not TF | R2 | 3 | 2 | 2 | **48** | **32** | Command in runbook; no TF resource for replicas |
+| F25 | GPU / inference-latency SLO unmonitored | R4+R5 | 3 | 3 | 2 | **36** | **36** | No dcgm dashboard; Frigate liveness checks only |
+| F26 | Prometheus TSDB 200Gi — retention untracked | R4 | 2 | 2 | 1 | **40** | **20** | `stacks/monitoring/` |
+| F27 | Pod Security Standards labels unset on all namespaces | R5 | 3 | 2 | 3 | **36** | **12** | `kubectl get ns -o json \| jq '.items[].metadata.labels'` |
+| F28 | Authentik worker VPA upperBound 2.3× actual request | R4 | 2 | 2 | 2 | **32** | **20** | Goldilocks dashboard |
+| F29 | 9 DB rotation targets, no post-rotation verification loop | R5+R3 | 3 | 2 | 2 | **48** | **36** | Vault DB engine every 7d; no auto-verify |
+| F30 | Tier-0 SOPS workflow 7-step vs 3-step Tier-1 | R3 | 2 | 2 | 1 | **40** | **20** | `scripts/state-sync` — manual decrypt/encrypt/commit |
+
+**Rank A leaders (top 8)**: F03, F08, F05, F11, F04, F16, F01, F02 — "big cluster wins, cheap to try"
+**Rank B leaders (top 8)**: F03, F04, F08, F11, F15, F01, F02, F05 — "what's paining you weekly"
+
+F03 (Renovate), F08 (ESO refresh alert), F11 (MySQL restore drill) and F01 (NFS in TF) lead in **both** rankings → these are the clear "do first" candidates.
+
+---
+
+## Per-concern deep dives
+
+### R1 — Reliability (18 raw → 11 real after challenge)
+
+Filtered: dropped R1#1/9/10 (incorrect numbers, intentional choices). What actually matters:
+
+- **Probes (F02)** — 115 deployments missing at least one probe; 103 missing both. The corrected count is 2.4× the original claim. Worst offenders are batch workloads (CronJob-spawned) that legitimately skip probes — but long-lived ones (Affine, Hackmd, mailserver sidecars) genuinely need them. Triage: filter by `spec.replicas ≥ 1` and `containers[].command != ["/bin/sh","-c"]`-style short-runners, then add readiness+liveness one-by-one.
+- **Cloudflared tunnel token SPOF (F10)** — Replicas are 3 (per CLAUDE.md), so the agent finding "SPOF" framed as replicas is wrong. The real SPOF is the *tunnel credential*. Secondary tunnel with weighted Cloudflare DNS records is the honest fix — medium effort, low urgency unless tunnel CA rolls keys.
+- **PDB gaps (F13-like, excluded from table)** — After challenger correction, gaps are: CrowdSec LAPI (3 replicas, no PDB), ESO webhook+controller, Woodpecker-server. Not urgent — drain-test with `kubectl drain --dry-run` shows no current issue.
+- **App-level backups (F09)** — Proxmox host captures the PVC contents nightly via LVM snapshot + rsync with `--link-dest` weekly versioning, so file-level recovery is covered. But for databases inside PVCs (e.g. Affine's Postgres in-pod, Paperless' SQLite), app-aware dumps give transactional consistency. Audit pass: enumerate every PVC without a sibling `*-backup` CronJob, add one for the ones that host embedded DBs.
+- **MySQL restore drill (F11)** — Migrated 4 days ago. Runbook exists. End-to-end restore (dump → new DB → connect an app → verify) hasn't been rehearsed. SEV1 risk if a dump has been silently broken since migration.
+- **Vault update strategy (F17)** — `OnDelete` means helm upgrade leaves pods untouched; must manually `kubectl delete pod` to restart. Low impact (infrequent) but procedural toil.
+- **Dolt PVC near-full + no backup (F16)** — `bd list --status in_progress` runs against this DB; it's load-bearing for cross-session task state. Grow the PVC (resize annotation) + add dolt dump CronJob.
+
+### R2 — Declarative Coverage & Drift (16 raw → 8 real)
+
+Filtered: dropped R2#1 (Kyverno markers are by-design), corrected R2#3 to 310.
+
+- **NFS exports (F01)** — The file is git-managed at `infra/scripts/pve-nfs-exports` but deployed via `scp + exportfs -ra`, not Terraform. This is the exact path that caused the 2026-04-14 SEV1 (fsid=0 on wrong exports line). Options: (a) `null_resource` with `local-exec scp + remote-exec exportfs -ra` triggered on hash of content (partial — SSH dep); (b) new module `pve_host_config` that templates and SCPs multiple PVE-host artifacts with checksum verification. (b) is the cleaner long-term fix.
+- **Null-resource initializers (F06)** — 6 in `dbaas` (MySQL users, CNPG cluster, TF-state role, payslip DB, job-hunter DB). Some are genuinely unavoidable (bootstrapping DB before the DB exists); others could use `postgresql_grant` / `mysql_user` providers.
+- **SSH file provisioners on k8s-master (F07)** — `apiserver-oidc.tf`, `audit-policy.tf`, `etcd tuning`. One-way sync, no drift detection. Proposed quick wins (per `2026-02-22-node-drift-quick-wins-design.md` already exists). Continue/finish the plan.
+- **CoreDNS scaling manual (F24)** — Current runbook uses `kubectl scale`/`set env`/`set affinity`. Drift-prone; convert to `kubernetes_deployment` TF resource overriding the Helm chart's scale/affinity fields.
+- **MySQL InnoDB Cluster + operator TF resources still present** — Phase 4 cleanup. Low urgency, but removing reduces cognitive load on anyone reading `stacks/dbaas/`.
+- **Technitium readiness-gate null_resource with `timestamp()` trigger** — Runs every apply, 3-6 min wall time. Replace with a real health-check on `terraform_data` with `triggers_replace = { checksum = sha256(config) }`.
+- **GPU node taints + Proxmox CSI labels via null_resource kubectl** — No drift detection. Fix is in the `2026-02-22-node-drift-quick-wins-design.md` plan.
+
+### R3 — Maintenance overhead (18 raw → 10 real)
+
+- **Renovate (F03)** — The single highest-leverage maintenance fix. 18 repos × ~0.8 hrs/month manual version sweep = real time. Add `.github/renovate.json` (grouping rules for Terraform providers, K8s provider, Docker images) + auto-merge patch-level. Start with `infra/` only; expand after 2 weeks.
+- **Image pinning (F04)** — 84 `:latest` tags in production TF. Root CLAUDE.md still says "use 8-char git SHA tags" but that's not enforced. Admission control via Kyverno `require-trusted-registries` is in Audit today — add a sibling policy `forbid-latest-tag` also in Audit. Separate from F03 because pin-to-SHA + Renovate is a synergistic pair.
+- **MySQL restore drill (F11)** — tracked under R1 for impact; also a maintenance item because the restore *procedure* has not been test-updated since migration.
+- **CronJob alert rollup (F14)** — 59 CronJobs; "which were healthy last 24h" takes ad-hoc `kubectl get jobs --sort-by` scrolling. Add a Grafana panel with `kube_cronjob_status_last_successful_time < now - 2×schedule` summary.
+- **Graceful-shutdown toil (F23)** — 50+ Recreate deployments without `terminationGracePeriodSeconds` or `preStop`. Noisy pager hits after node drain. One-off sweep: add a 30s `terminationGracePeriodSeconds` default via Kyverno mutation rule.
+- **Tier-0 SOPS workflow (F30)** — 7-step decrypt/edit/encrypt/commit vs Tier-1's 3-step. Combined `tg` wrapper flag `--edit <stack>` that auto-decrypts → EDITOR → auto-encrypts → commit in one command. Moderate win; low risk.
+- **Stale `in_progress` beads** — 7 stale tasks in `bd list --status in_progress` at audit start. Session-end hook checks this; 3-5 days without notes is the signal. CLAUDE.md covers the rule — it's followed-sometimes, not enforced.
+- **Runbook staleness** — no `last_reviewed` frontmatter on runbook MDs; trivial to add. One-off sweep then keep it honest.
+- **CI/CD template unification** — "GHA build → Woodpecker deploy" is the documented pattern for 10 repos; rest still on Woodpecker-only. Track as follow-ups per repo in `bd`.
+- **Kyverno DNS-config boilerplate 307 markers** — Not a problem (see correction at top). Do add a lint rule in CI that flags any `kubernetes_deployment` without `# KYVERNO_LIFECYCLE_V1` marker; that's the real drift risk.
+
+### R4 — Scalability (18 raw → 9 real)
+
+Filtered: dropped R4#1 (metric mispull), R4#2 (CPU-limit policy), R4#7 (Phase 7 solved).
+
+- **CNPG memory headroom** — Currently 2Gi limit. Top-line metric at quiet time; add a `ContainerNearOOM > 85%` rule that watches CNPG specifically (general rule exists; CNPG is Tier 0 so deserves explicit binding).
+- **HPA cluster-wide: zero** — Every stateless service is 1:1. Not urgent at current node-CPU 8-31%, but one big feature (Immich re-index, Authentik load spike) tips the balance. Pilot: HPA on Traefik (CPU-driven), observe, expand.
+- **Redis no HPA + HAProxy singleton** — Wire Sentinel into direct client access (Phase 8 of Redis refactor, per R1#11 of raw findings). Currently all 17 consumers go via HAProxy — the single-point bypass was deliberate (simpler client config), but the HAProxy is now the SPOF Sentinel was meant to prevent. Worth a plan doc (`plans/2026-MM-DD-redis-phase8-sentinel-clients.md`).
+- **PgBouncer pool sizing unknown** — Authentik has 3 pods, each opening N connections. At load spikes (big org sync), pool exhaustion. Short-term: `pgbouncer_show_pools` metric + alert at 80% util. Longer-term: pool-size tuning based on observed wait times.
+- **Prometheus TSDB (F26)** — 200Gi retention unquantified. Risk: disk fills → scrape gaps → audit blind. Add `kubelet_volume_stats_used_bytes{persistentvolumeclaim="prometheus-server"} > 0.85 * capacity` alert.
+- **NFS capacity not monitored** — PVE host has 1TB HDD LV. No `node_filesystem_avail_bytes` scrape from PVE host (it's outside the cluster). Install node_exporter on PVE host; scrape via Prometheus federation or remote_write.
+- **VPA quarterly review unscheduled** — Goldilocks is in `Initial` mode (not Auto, by design). Review is manual per quarter. Calendar event + runbook link.
+- **Registry single instance** — Registry outage = no pod restarts. Post-mortem 2026-04-19 documented a container-engine pin; replica count still 1. Consider HA registry backed by S3-compat store (MinIO in-cluster) for the second replica — but low urgency given probe CJ monitors integrity every 15m.
+- **No ResourceQuota utilization alert** — Quota exhaustion invisible until a pod refuses to schedule. `kube_resourcequota{type="used"} / kube_resourcequota{type="hard"} > 0.85` rule.
+
+### R5 — Security & Secrets (21 raw → 13 real)
+
+- **Vault `vault-unseal-key` K8s Secret (F21-related)** — Challenger A said it wasn't present; it is (`kubectl get secret -n vault`). Used by auto-unseal. RBAC on the secret should restrict to `vault-server` SA only. Audit the `role` + `rolebinding` in `stacks/vault/`.
+- **Vault audit log offsite (F21)** — Rotated logs not synced to NFS backup. Add a `vault-audit-log-sync` CronJob or append the audit log path to `nfs-change-tracker` inotify list (zero-Terraform change if the latter).
+- **Kyverno audit → enforce (F12) — sequence carefully** — All 16 policies are in Audit today. Naive switch to Enforce will block legitimate workloads (Loki, Frigate, nvidia-device-plugin, wireguard have privileged/host-ns requirements — all documented). Plan: (a) generate `Kyverno PolicyException` CRs for known-good workloads first; (b) enforce one policy at a time, 1-week observation; (c) start with `require-trusted-registries` (least breakage risk). **DANGEROUS TO EXECUTE NAIVELY — don't batch.**
+- **No NetworkPolicies (F18)** — Challenger correctly flagged the effort (5) and risk (4): wrong NetworkPolicy stops Authentik from reaching its DB in minutes. Approach: allow-list namespace-wide first (e.g. `authentik` ns can reach `dbaas` on 5432), expand over a month. Single biggest latent security improvement but needs runway.
+- **RBAC oidc-power-user secrets r/w cluster-wide (F19)** — Scope down: list which Authentik groups get this binding, remove `secrets:*` from the cluster role, add namespace-scoped RoleBindings where needed. Medium effort, high leverage.
+- **Image supply chain (F20)** — cosign verification + admission controller is the mature path. Trivy-on-push fits in GHA workflows. Both unblocked after F04 (pinning).
+- **`:latest` tags (overlap F04)** — Security aspect: signed-image admission requires stable refs.
+- **Privileged containers** — Loki, WireGuard, NVIDIA, Frigate known-exceptions. Document the exceptions inline (comment block on the TF resource) so future maintainers don't accidentally "fix" them.
+- **Git history plaintext secrets** — Challenger B flagged unverified. One way to verify cheaply: `git secrets --scan-history`. Add it as a pre-audit one-off.
+- **CrowdSec Metabase disabled, no Prometheus exporter** — R5#18. Enable the Prometheus exporter (no Metabase) for attack-pattern visibility; very cheap.
+- **cert-manager evaluation paused** — Documented pause; TLS rotation relies on Cloudflare wildcard. Confirm no local `Ingress` uses a self-managed cert that could expire silently. `kubectl get cert -A` → expect 0.
+- **Pod Security Standards (F27)** — Label every namespace `pod-security.kubernetes.io/enforce=restricted` (or baseline). Known-exception namespaces get explicit downgrades. Medium effort, paid back by making future admission decisions uniform.
+- **CrowdSec LAPI quorum** — 3 replicas but quorum/consensus behavior undocumented. One-page runbook: what happens if 1, 2, or 3 LAPI pods die.
+- **Authentik outpost fix (F15)** — Applied via API, not TF. Next Helm upgrade reverts. Add the `/dev/shm` emptyDir to `stacks/authentik/values.yaml` templatefile.
+
+---
+
+## Dangerous-to-execute (handle with care)
+
+Flagged by challengers; each needs a gradual rollout plan, not a single commit.
+
+1. **F12 — Kyverno Audit → Enforce en masse**. Write `PolicyException` CRs for known-safe workloads first. One policy per week. Observe.
+2. **F18 — NetworkPolicies cluster-wide**. Default-deny breaks inter-namespace lookups silently. Namespace-by-namespace rollout, with `kubectl logs -f` tailing the policy-engine events.
+3. **PDB additions without drain-test**. New PDB + tight `minAvailable` can deadlock during node cordons. `kubectl drain --dry-run` every new PDB on every node first.
+4. **F20 — Signed-image admission**. Must follow F04 (pinning). Un-pinned admission = half the cluster fails to pull.
+
+## Gaps the agents missed
+
+From challenger "GAPS" analyses, collated:
+
+- **Disaster-recovery drill coverage** — backup docs are comprehensive (CLAUDE.md is extensive). End-to-end *restore* rehearsal frequency = never documented. Track per-component: MySQL, PostgreSQL/CNPG, Vault, etcd, NFS, registry blobs.
+- **Service mesh evaluation** — Never formally evaluated (Istio, Linkerd, Cilium-in-mesh-mode). Could subsume NetworkPolicy effort + mTLS + observability. Worth a design doc even if answer is "no, too much complexity for the gain."
+- **Chaos engineering coverage** — Zero. No pod-kill cron, no node-failure drill. Low urgency given maturity, but would validate F02 probe quality and F23 graceful-shutdown coverage cheaply.
+- **Operator onboarding friction** — Nobody else in the "lab team" but Emo exists in `claude-agent-service`. If Emo needs to take over a component for a week, what's the runbook?
+- **Alert noise / fatigue rate** — No finding measured how many alerts actually page vs. auto-resolve. `alertmanager_notifications_total` by receiver is the metric; needs a Grafana panel.
+- **Secrets-in-image-layers** — Docker images built locally may contain secrets from build env. `trivy image --scanners secret` on registry images is a one-off audit.
+- **Runbook → post-mortem → runbook-update loop** — Post-mortem 2026-04-14 produced runbook updates; no general tracker that every incident produces a runbook change.
+
+## Alternative framings (from challengers, preserved for future reference)
+
+- **Split "MySQL singleton" into 3 items** (HA / backup / pool). Accepted — see R1 and R4 treatment.
+- **6th concern: Observability & Pager Fatigue** — Considered; the themes already hit R1+R3+R4 under Theme 2 of the executive summary. Keeping 5 concerns but carving "Observability gaps" as a theme, not a new research axis.
+- **One-thing-this-weekend**: Challenger B nominated *NFS in Terraform*, Challenger A nominated *`:latest` tag sweep*. F01 wins on SEV1 prevention; F04 wins on toil. Both valid. Pick by energy level: F01 is 1 deliberate session; F04 is low-cognition grep-replace.
+- **Re-rank by operator fatigue (Rank B) always**. Partially accepted — presented side-by-side in the table.
+
+---
+
+## Recommended next moves
+
+Ordered for a solo operator balancing SEV-prevention, fatigue reduction, and preserved energy for larger work:
+
+**Week 1 (SEV-prevention + quick-wins, low cognitive load):**
+- F01: NFS exports into a `pve_host_config` Terraform module (one deliberate session)
+- F04: Sweep `:latest` tags, add Kyverno `forbid-latest-tag` in Audit
+- F08: ESO refresh-lag PrometheusRule
+- F05: OOMKill / Unschedulable / Node-CPU PrometheusRule
+
+**Week 2 (fatigue reduction):**
+- F03: Renovate in `infra/` only (narrow pilot)
+- F14: CronJob success-rate Grafana panel + alert rollup
+- F16: Dolt backup CronJob + PVC grow
+- F11: First MySQL restore drill (scheduled, documented)
+
+**Month 2 (durable fixes, gradual):**
+- F06/F07: Replace null_resources + SSH provisioners with native TF resources, one at a time
+- F02: Probe sweep — add readiness+liveness to the 20 long-lived deployments first
+- F12: Kyverno Enforce transition, one policy per week
+- F15: Authentik outpost /dev/shm into values.yaml
+
+**Month 3+ (structural):**
+- F18: NetworkPolicies — namespace-by-namespace
+- F19: RBAC scope-down
+- F20: Signed-image admission
+- Service-mesh evaluation (design doc)
+- Restore-drill calendar for every backup target
+
+No beads tasks auto-filed by this audit — user decides which findings merit `bd create`.
+
+---
+
+## Appendix — verification references (spot-checked)
+
+Every numeric claim in the backlog was confirmed by one of these commands at audit time (2026-04-20):
+
+| Claim | Command | Result |
+|---|---|---|
+| Node memory 44-51% | `kubectl top nodes --no-headers` | k8s-node1: 45%, node2: 51%, node3: 49%, node4: 44%, master: 17% |
+| 115 deploys missing ≥1 probe | `kubectl get deploy -A -o json \| jq '[.items[] \| select(.spec.template.spec.containers[0].readinessProbe == null or .spec.template.spec.containers[0].livenessProbe == null)] \| length'` | 115 |
+| 103 deploys missing BOTH probes | same, with `and` | 103 |
+| 310 ignore_changes blocks | `grep -r "ignore_changes" infra --include=*.tf --include=*.hcl \| wc -l` | 310 |
+| 59 CronJobs | `kubectl get cronjobs -A --no-headers \| wc -l` | 59 |
+| All 16 Kyverno ClusterPolicies in Audit | `kubectl get clusterpolicy -o jsonpath='...validationFailureAction...'` | 16/16 Audit, 0 Enforce |
+| Redis `maxmemory-policy allkeys-lru` | `grep -n maxmemory-policy infra/stacks/redis` | `modules/redis/main.tf:254` |
+| Zero Renovate configs | `find /home/wizard/code -name '.renovaterc*' -o -name 'renovate.json' \| grep -v node_modules` | 0 |
+| Vault `vault-unseal-key` Secret exists | `kubectl get secret -n vault` | present (37d old) |
+| NFS `/etc/exports` not in TF | `grep -rn 'fsid=' infra/stacks` | 0 matches; only `infra/scripts/pve-nfs-exports` |
+| Frigate CPU limit by policy | `infra/.claude/CLAUDE.md` → "All CPU limits removed cluster-wide" | confirmed |
+| MySQL standalone intentional | `infra/.claude/CLAUDE.md` → "migrated from InnoDB Cluster 2026-04-16" | confirmed |
+
+Other claims (84 `:latest` tags, 52 ExternalSecrets, 30+ PVCs without backup CJs) were surfaced by research agents; challengers spot-checked a subset and agreed the order-of-magnitude holds. Full list in `/home/wizard/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` research digest.
+
+## Deliverable disposition
+
+- This document is the audit output.
+- No `bd` tasks were created by the audit. Pick findings to ticket after reading.
+- When filing: use `F##` as a tag, title with the finding's headline, acceptance criteria from the deep-dive paragraph, priority from Rank B.
+- Plan file at `~/.claude/plans/let-s-run-a-thorough-floating-pnueli.md` retains the full 91-finding digest + challenger reports for reference; can be deleted after any follow-up tickets are filed.
--- a/docs/plans/2026-05-16-auto-upgrade-apps-design.md
+++ b/docs/plans/2026-05-16-auto-upgrade-apps-design.md
@ -0,0 +1,165 @@
+# Auto-Upgrade Apps Design
+
+**Date**: 2026-05-16
+**Status**: Approved (brainstorm + grill complete; implementation pending)
+
+## Problem
+
+Three constraints in tension across the cluster's ~70 services:
+
+1. **Keep apps at latest.** Most services drift behind upstream; manual bumps don't scale.
+2. **Stay Terraform-compatible.** Image refs live in `.tf`; we want declarative source of truth.
+3. **Don't let the pull-through cache serve stale `:latest`.** Cache layer must not lie about what `:latest` means today.
+
+The previous `Diun → n8n → Service Upgrade Agent` flow handled (1) via changelog-reviewed PR bumps for third-party. Self-hosted services have inconsistent CI: 1 of 11 fully wired (CI builds + pushes + rolls out), 6 partially wired (build but no rollout trigger), 4 with no CI at all. Self-hosted services typically pull `forgejo.viktorbarzin.me/viktor/<name>:<8-char-sha>` with Terraform tracking each SHA in `var.image_tag`.
+
+The user wants to simplify by retiring the changelog-review agent and moving to a pure "latest, always" model, with the cache freshness concern handled at the cache layer (already done — see Architecture §1).
+
+## Decisions
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | **Auto-roll for everything** (no PR-bump gate) | Retires the Service Upgrade Agent; Diun's role narrows to notification only |
+| 2 | **Actuator: Keel** ([keel.sh](https://keel.sh)) | Annotation-driven Deployment/StatefulSet/DaemonSet auto-update operator |
+| 3 | **Tag scheme: `:latest` where it exists, `:major` where it doesn't, glob+`ignore_changes` last resort** | `keel.sh/policy: force` for `:latest` / `:major`; tag string stays in Terraform |
+| 4 | **Opt-out-pure (no skip-list)** | Every workload auto-rolls, including Vault, CNPG, operators, CNI, CSI. User accepts recoverability risk |
+| 5 | **Phased rollout (9 phases)** | Low-risk → bootstrap. Catch up to latest as we phase in. Each phase soaks ~1 week |
+| 6 | **Per-phase: single combined PR** | Switch image refs to floating tag + add to Kyverno mutate allowlist in same commit |
+| 7 | **Diun is the audit source for catch-up** | Existing 6h-poll already reports outdated images; export as worklist per phase |
+| 8 | **Polling, hourly** (`@every 1h`) | Not webhooks — single mechanism, all registries supported |
+| 9 | **Rollback: `kubectl rollout undo` → pin in Terraform → add `keel.sh/policy: never`** | (c) from grill: immediate undo, durable Terraform pin within ≤1h before next Keel poll |
+| 10 | **Implementation: Kyverno cluster-wide mutate** | One `ClusterPolicy` injects Keel annotations; phase boundary = `NamespaceSelector` allowlist |
+| 11 | **Keel exempt from its own mutate** | One-line `NamespaceSelector` exclusion. Supervisor self-update has uniquely bad failure mode |
+| 12 | **Uniform CI model for all self-hosted** | CI builds + pushes `:latest`, Keel polls and rolls. No per-repo `kubectl set image` step. Retires the GHA-migrated SHA-tag flow (memory id=388) |
+
+## Architecture
+
+### 1. Cache freshness — already correct
+
+Pull-through cache at `10.0.20.10` already splits caching by URL at the nginx layer:
+
+- `location ~ /v2/.*/blobs/` → `proxy_cache_valid 200 24h` — blobs cached (content-addressed, immutable)
+- `location /v2/` (manifests) → pass through, no cache
+
+Combined with `registry.proxy.ttl: 0` at the docker-registry layer, mutable manifests revalidate against upstream on every pull. **No cache changes needed for this design.** The CLAUDE.md note "Use 8-char git SHA tags — `:latest` causes stale pull-through cache" predates the nginx URL-split fix and should be updated as part of this work.
+
+### 2. Detection — Keel polls upstream
+
+Keel runs as a Deployment in its own namespace. Every annotated workload polls its registry hourly (Keel-managed; configurable per workload). On detection of a new digest under the watched tag:
+
+- `keel.sh/policy: force` (for mutable tags `:latest`, `:16`, `:7`, etc.) → trigger Deployment update (pod template hash changes → restart)
+- `keel.sh/policy: minor` / `major` / `glob` (only for images that publish neither `:latest` nor a stable floating tag) → rewrite tag string on the Deployment; requires `lifecycle { ignore_changes = [...image] }`
+
+### 3. Application — kubelet pull through the cache
+
+When Keel triggers restart:
+
+1. kubelet asks the cache (via containerd hosts.toml) for `image:tag` manifest.
+2. nginx passes the manifest request through to the docker-registry layer.
+3. docker-registry (with `proxy.ttl: 0`) passes through to upstream.
+4. Upstream returns current digest.
+5. kubelet pulls blobs (mostly cached at nginx layer; new blobs from upstream).
+6. New pod runs new image.
+
+### 4. Annotation injection — Kyverno mutate
+
+Single `ClusterPolicy` adds these annotations to every Deployment / StatefulSet / DaemonSet in opted-in namespaces:
+
+```yaml
+metadata:
+  annotations:
+    keel.sh/policy: force
+    keel.sh/trigger: poll
+    keel.sh/pollSchedule: "@every 1h"
+```
+
+Phase = a `match.any[].resources.namespaces` list. Phase advance = append namespaces. Keel namespace is excluded.
+
+### 5. Terraform drift handling
+
+Existing convention (`# KYVERNO_LIFECYCLE_V1` marker) handles `dns_config` injection. We extend with a new marker:
+
+```hcl
+lifecycle {
+  ignore_changes = [
+    spec[0].template[0].spec[0].dns_config,  # KYVERNO_LIFECYCLE_V1
+    metadata[0].annotations["keel.sh/policy"],
+    metadata[0].annotations["keel.sh/trigger"],
+    metadata[0].annotations["keel.sh/pollSchedule"],  # KYVERNO_LIFECYCLE_V2
+  ]
+}
+```
+
+This is added per workload as we phase in. Mechanical, grep-able.
+
+## Phase ordering
+
+| Phase | Set | Rationale |
+|-------|-----|-----------|
+| 0 | Foundation (Keel install, Kyverno ClusterPolicy with empty allowlist) | Build infra without enrolling anything |
+| 1 | Self-hosted (forgejo-hosted: ~11 services) | We own the code; failures are easy to diagnose |
+| 2 | Stateless third-party web apps (linkwarden, postiz, affine, etc.) | No migrations |
+| 3 | Exporters, sidecars, utilities | Stateless |
+| 4 | Stateful-but-tolerant (Grafana, Prometheus, etc.) | Restart-safe state |
+| 5 | State-coupled with migrations (Nextcloud, Forgejo, paperless-ngx, mailserver) | Schema-migration risk |
+| 6 | Authentik | Auth outage |
+| 7 | Operators (cnpg-operator, ESO, kured, descheduler) | Operator skew |
+| 8 | Critical infra (Calico, proxmox-csi, nfs-csi, traefik, metallb) | Node-level outage potential (memory id=390: 26h Calico cascade) |
+| 9 | Bootstrap (Vault, CNPG PG cluster, mysql-standalone) | Lose recoverability if broken |
+
+Per-phase: combined PR → apply (catch-up rolls happen) → soak 1 week → next phase. If a service breaks repeatedly, apply rollback runbook (decision #9) and proceed; re-enroll later or leave pinned.
+
+## Risk register
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| Bad upstream image rolls into prod | High | Service-level outage | Existing alerts (`KubePodCrashLooping`, `KubeletImagePullErrors`, `PodsStuckContainerCreating`); rollback runbook (decision #9) |
+| Catch-up rollout overwhelms cache | Medium | ImagePullBackOff cascade (memory id=603) | Rate-limit catch-up to ~5 rollouts/6h via `-target=` per phase; same pacing as retired Service Upgrade Agent (memory id=612) |
+| Calico / CSI auto-roll cascades (memory id=390: 26h outage) | Low-Medium | Cluster-level outage | Phase 8 is intentionally late; user opted into the risk; rollback to pinned chart version via Terraform |
+| Vault auto-rolls to broken image | Low | Loss of secrets sync; 43 ExternalSecrets stop reconciling | Phase 9 last; Tier 0 SOPS state allows manual recovery |
+| CNPG PG cluster auto-rolls to broken image | Low | Tier 1 Terraform state inaccessible; 105 stacks can't apply | Phase 9 last; Tier 0 stack `cnpg` is bootstrap-capable |
+| Helm-atomic-trap services (memory id=981) | Medium | `terraform apply` hangs in pending-rollback | Identify `helm_release` services with `atomic = true`; either remove atomic or skip from Keel |
+| Keel itself rolls to broken version | Low | Supervisor down; no auto-rolls until manual pin | Decision #11: exempt Keel from mutate |
+| Terraform drift after Kyverno injects annotation | High at first | Spurious diffs on every plan | KYVERNO_LIFECYCLE_V2 marker (Architecture §5); applied incrementally per phase |
+
+## What we give up
+
+- **Terraform no longer tracks deployed version.** Image refs in `.tf` say `:latest` or `:16`, but the running digest is whatever Keel pulled. To know what's running: `kubectl describe pod`. This is a deliberate trade — the previous SHA-pinned flow tracked version in TF but required N stack edits per deploy.
+- **No changelog review before rollout.** The Service Upgrade Agent's risk classification is gone. We rely on alerts to catch breakage post-deploy, not prevent it.
+- **CLAUDE.md SHA-tag rule is reversed for this design.** The "use 8-char git SHA tags" rule predates the nginx URL-split fix. New rule (post-rollout): "use floating tags + Keel annotation" — to be updated in both `infra/.claude/CLAUDE.md` and the repo-root `CLAUDE.md` once Phase 1 is stable.
+
+## Decisions resolved post-grill
+
+### Q1 — Uniform CI model for ALL self-hosted (resolved 2026-05-16)
+
+Every self-hosted service moves to the same shape:
+
+```
+CI (GHA or Woodpecker) → build → push :latest (optionally also :<SHA> for traceability) → done
+Keel → poll registry → detect new digest → trigger rollout
+```
+
+The 10 GHA-migrated repos (memory id=388: Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) drop the `Woodpecker API → kubectl set image` step. Their `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` files become obsolete; remove during Phase 1.
+
+Terraform image refs for all self-hosted: `<registry>/<repo>:latest` (with `${var.image_tag}` defaulting to `"latest"` where the variable exists).
+
+### Q2 — No-CI self-hosted services (resolution: uniform participation)
+
+| Service | Action |
+|---------|--------|
+| `wealthfolio` | Switch Terraform to upstream `wealthfolio/wealthfolio:latest` (DockerHub). No CI needed. |
+| `chrome-service` | Verify whether `:v4` is a deliberate pin. If yes → tag stays, add `keel.sh/policy: never` label. If no → switch to `:latest` or `:major`. Investigate during Phase 1 prep. |
+| `beadboard` (used by `beads-server`) | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
+| `freedify` | Add minimal Woodpecker CI: build on push → push `:latest`. User-owned. |
+
+## Open questions (still need resolution before Phase 1)
+
+1. **`helm_release atomic = true` services**: count and identify before Phase 1. Either remove `atomic` (preferred — eliminates the memory id=981 trap), or skip from Kyverno mutate via per-namespace exclusion. Survey command: `grep -rn 'atomic.*true' infra/stacks/ infra/modules/`.
+
+## Out of scope
+
+- Cache TTL changes — current config is already correct (nginx URL-split).
+- Webhook-based Keel triggers — polling is sufficient for this cadence.
+- Replacing Diun — kept for notification visibility into new tags not yet under Keel annotation (during phase rollout).
+- Keel approval gate (`keel.sh/approvals: N`) — user wants unattended auto-roll.
+- Keel auto-rollback on health-check failure — out of scope for v1; revisit if breakage rate is high.
--- a/docs/plans/2026-05-16-auto-upgrade-apps-plan.md
+++ b/docs/plans/2026-05-16-auto-upgrade-apps-plan.md
@ -0,0 +1,322 @@
+# Auto-Upgrade Apps Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** Move the cluster from a mix of pinned-SHA / pinned-semver / ad-hoc `:latest` references to a Keel-driven auto-update model where every workload tracks `:latest` (or a chosen `:major` floating tag) and rolls automatically when upstream advances.
+
+**Architecture:** Kyverno cluster-wide `ClusterPolicy` mutates Deployments / StatefulSets / DaemonSets in opted-in namespaces with Keel annotations (`keel.sh/policy: force`, `keel.sh/trigger: poll`, `keel.sh/pollSchedule: @every 1h`). Keel polls registries, triggers rollout on new digest. kubelet pulls fresh manifest via the nginx URL-split cache (manifests passthrough, blobs cached). Phase advance = expand the `NamespaceSelector` allowlist.
+
+**Tech Stack:** Keel, Kyverno, Terraform / Terragrunt, Helm, Diun (notification only), nginx, docker/distribution
+
+**Design doc:** `docs/plans/2026-05-16-auto-upgrade-apps-design.md`
+
+**Key context:**
+- Cache is already correctly configured (nginx URL-split + `proxy.ttl: 0`). No cache changes needed.
+- Per-stack `lifecycle.ignore_changes` is already required for the existing `dns_config` Kyverno mutation (KYVERNO_LIFECYCLE_V1 convention). This plan extends it with a V2 marker for Keel annotations.
+- Service Upgrade Agent (Diun → n8n → claude bumps tfvars) is retired by this design. n8n workflow + supporting scripts are removed once Phase 9 completes.
+- CLAUDE.md "use 8-char git SHA tags" rule is reversed by this design (see Open Q1 in design doc).
+
+---
+
+## Phase 0 — Foundation
+
+### Task 0.1: Resolve remaining open question
+
+Q1 and Q2 from the design doc are resolved (uniform `:latest` + Keel model for all self-hosted; per-service plan for no-CI services).
+
+Remaining open question:
+
+**Helm-atomic services.** Survey:
+```bash
+grep -rn 'atomic.*true' /home/wizard/code/infra/stacks/ /home/wizard/code/infra/modules/
+```
+
+For each match: either remove `atomic = true` (preferred) or add the namespace to a Kyverno exclusion list. Document inline before Phase 1 proceeds.
+
+---
+
+### Task 0.2: Create the Keel stack
+
+**Files:**
+- Create: `stacks/keel/terragrunt.hcl`
+- Create: `stacks/keel/main.tf`
+- Create: `stacks/keel/variables.tf`
+- Create: `stacks/keel/modules/keel/main.tf`
+
+**Step 1:** Add `keel` to `terragrunt.hcl` `locals.tier0_stacks` — **NO**. Keel is Tier 1 (depends on Kyverno + Keel image registry access). Keep it in Tier 1.
+
+**Step 2:** Deploy via Helm chart `keel-hq/keel` (verify current version via context7 before pinning).
+
+Key Helm values:
+- `polling.enabled: true`
+- `helmProvider.enabled: false` (we use annotations, not Helm hooks)
+- `notifications.slack.enabled: true` with channel `#deployments` (verify channel exists)
+- Registry credentials: mount Forgejo PAT from Vault via ExternalSecret (`secret/viktor/forgejo_pull_token`).
+
+**Step 3:** Verify Keel can authenticate to all five registries (Docker Hub, ghcr, quay, k8s.io, kyverno via the local cache; Forgejo direct).
+
+**Acceptance:**
+- `kubectl -n keel get pod` shows Keel Ready.
+- `kubectl -n keel logs deploy/keel | grep registry` shows successful manifest queries.
+
+---
+
+### Task 0.3: Author the Kyverno ClusterPolicy
+
+**Files:**
+- Create: `stacks/kyverno/modules/kyverno/keel-annotations.tf` (or extend `security-policies.tf`)
+
+ClusterPolicy `inject-keel-annotations`:
+
+```yaml
+apiVersion: kyverno.io/v1
+kind: ClusterPolicy
+metadata:
+  name: inject-keel-annotations
+spec:
+  background: true
+  rules:
+    - name: add-keel-annotation
+      match:
+        any:
+          - resources:
+              kinds: [Deployment, StatefulSet, DaemonSet]
+              namespaces: []  # populated per phase
+      exclude:
+        any:
+          - resources:
+              namespaces: ["keel"]  # decision #11
+          - resources:
+              # Workloads can opt out by setting this label
+              selector:
+                matchLabels:
+                  keel.sh/policy: never
+      mutate:
+        patchStrategicMerge:
+          metadata:
+            annotations:
+              +(keel.sh/policy): force
+              +(keel.sh/trigger): poll
+              +(keel.sh/pollSchedule): "@every 1h"
+```
+
+- `+()` syntax adds only if not present (preserves per-workload overrides).
+- `exclude.selector.matchLabels[keel.sh/policy=never]` is the per-workload escape hatch (used during rollback per decision #9).
+
+**Step 2:** Initially deploy with `namespaces: []` — policy exists but matches nothing.
+
+**Acceptance:**
+- `kubectl get clusterpolicy inject-keel-annotations` shows Ready.
+- `kubectl get deploy -A -o yaml | grep keel.sh/policy` shows no matches yet (empty allowlist).
+
+---
+
+### Task 0.4: Define the KYVERNO_LIFECYCLE_V2 marker convention
+
+**Files:**
+- Modify: `AGENTS.md` — add the V2 snippet to the "Kyverno Drift Suppression" section
+- Modify: `.claude/CLAUDE.md` — reference the V2 marker
+
+Snippet to copy-paste:
+
+```hcl
+lifecycle {
+  ignore_changes = [
+    spec[0].template[0].spec[0].dns_config,            # KYVERNO_LIFECYCLE_V1
+    metadata[0].annotations["keel.sh/policy"],
+    metadata[0].annotations["keel.sh/trigger"],
+    metadata[0].annotations["keel.sh/pollSchedule"],   # KYVERNO_LIFECYCLE_V2
+  ]
+}
+```
+
+Backfill order: per-phase, only on workloads about to be enrolled. Not a mass sweep.
+
+---
+
+## Phase 1 — Self-hosted (uniform model)
+
+**Set:** all self-hosted services. Three sub-categories:
+
+- **Woodpecker-build-only (6):** `claude-agent-service`, `fire-planner`, `job-hunter`, `payslip-ingest`, `recruiter-responder`, `claude-memory-mcp`.
+- **GHA-migrated (10, per memory id=388):** Website, k8s-portal, f1-stream, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints. (Note: claude-memory-mcp appears in both lists — verify.)
+- **No-CI (4, per design Q2):** `wealthfolio` (→ upstream), `chrome-service` (verify pin intent), `beadboard` (add CI), `freedify` (add CI).
+- **Already-uniform (1):** `kms-website` — already pushes `:latest` AND SHA; just needs Keel annotation.
+
+### Task 1.1: Audit current image refs
+
+```bash
+grep -rE 'image\s*=\s*"(forgejo\.viktorbarzin\.me|viktorbarzin)' /home/wizard/code/infra/stacks/ | sort
+```
+
+Tabulate per service: current tag, CI type (GHA / Woodpecker / none), action needed.
+
+### Task 1.2: Per-service uniform conversion
+
+For each Woodpecker-build-only service:
+1. Edit Terraform: `local.image_tag` / `var.image_tag` → `"latest"`.
+2. Add the KYVERNO_LIFECYCLE_V2 snippet (annotations ignore_changes).
+3. Verify `.woodpecker.yml` pushes `:latest` on every build (most do via `auto_tag: true`).
+
+For each GHA-migrated service:
+1. Edit Terraform: switch `image_tag` from SHA reference to `"latest"`.
+2. Add the KYVERNO_LIFECYCLE_V2 snippet.
+3. Edit `.github/workflows/build-and-deploy.yml`: push `:latest` (in addition to `:<8-char-sha>` for traceability). Remove the Woodpecker API POST step.
+4. Delete `.woodpecker/deploy.yml` and `.woodpecker/build-fallback.yml` from each repo (no longer needed).
+5. Remove the Woodpecker repo config for these repos from Terraform if applicable.
+
+For each no-CI service:
+- `wealthfolio`: change Terraform image to `wealthfolio/wealthfolio:latest` (upstream DockerHub). Validate the image starts cleanly.
+- `chrome-service`: check git blame on the `:v4` pin. If deliberate → label `keel.sh/policy: never`. If accidental → bump to upstream `:latest`.
+- `beadboard`, `freedify`: write a minimal `.woodpecker.yml` (single build step pushing to Forgejo `:latest`). Trigger an initial build to populate `:latest`.
+
+For `kms-website`: only add the Keel annotation; CI changes optional.
+
+### Task 1.3: Add Phase 1 namespaces to Kyverno allowlist
+
+Edit `stacks/kyverno/modules/kyverno/keel-annotations.tf`:
+
+```yaml
+namespaces:
+  - claude-agent-service
+  - fire-planner
+  - job-hunter
+  - payslip-ingest
+  - recruiter-responder
+  - claude-memory-mcp
+  - kms-website
+  # GHA-migrated set:
+  - website  # or whatever the namespace is named per repo
+  - k8s-portal
+  - f1-stream
+  - apple-health-data
+  - audiblez-web
+  - plotting-book
+  - insta2spotify
+  - audiobook-search
+  - council-complaints
+  # No-CI set:
+  - beads-server
+  - chrome-service
+  - freedify
+  - wealthfolio
+```
+
+Verify each namespace name from `kubectl get ns` before locking in (some may differ from the repo name).
+
+Apply. Watch `kubectl get deploy -n <ns> -o yaml | grep keel.sh` confirm annotations injected. Watch Keel logs for first poll cycle picking up the workloads.
+
+### Task 1.4: Soak
+
+1 week. Monitor:
+- Slack `#deployments` for Keel rollout notifications.
+- `KubePodCrashLooping` alerts.
+- Manual `kubectl rollout status` on each service after a Keel-triggered rollout.
+
+If any service breaks repeatedly: apply rollback runbook (decision #9), record the service in a "pin list" with reason, proceed.
+
+**Acceptance:**
+- All 7 services running latest digests within 24h of Phase 1 apply.
+- No CrashLooping persisting >1h.
+- No more than 2 services pinned-out during the soak week.
+
+---
+
+## Phase 2 — Stateless third-party web apps
+
+**Set:** linkwarden, postiz, affine, isponsorblocktv, audiobookshelf, freshrss, tandoor, immich (verify it qualifies — has external DB so app-restart is safe), excalidraw, hackmd, send, jsoncrack, sparkyfitness, etc. (~15-20 services — full list from `kubectl get deploy -A` filtered against the phase-1 set + skip-bucket).
+
+### Task 2.1: Audit current tags via Diun
+
+```bash
+# Diun's REST API or UI exports a "new tags available" report
+# Use as the per-service decision source
+```
+
+For each service, pick floating tag:
+- `:latest` if upstream publishes it and it's stable.
+- `:<major>` (e.g. `:2`, `:v3`) if `:latest` is unreliable.
+- `glob` + `ignore_changes` as last resort.
+
+### Task 2.2: Catch-up PR
+
+Single combined PR:
+- Per-stack: switch image tag from pinned semver to chosen floating tag (Diun-informed).
+- Per-stack: add KYVERNO_LIFECYCLE_V2 snippet.
+- Append Phase 2 namespaces to Kyverno allowlist.
+
+Apply with `-target=` per stack to pace rollouts (≤5 per hour to avoid cache burst — memory id=603).
+
+### Task 2.3: Soak — 1 week, same monitoring as Phase 1.
+
+---
+
+## Phases 3–9 — same template
+
+For each phase, repeat:
+
+1. Define the set (precise namespace list).
+2. Audit current tags (Diun + grep).
+3. Pick floating tag per service.
+4. Combined PR: image-ref change + lifecycle snippet + Kyverno allowlist update.
+5. Apply paced (≤5/hr).
+6. Soak 1 week. Pin-out any service that breaks repeatedly.
+
+Set definitions per phase: see design doc Phase Ordering table.
+
+**Special-handling phases:**
+
+- **Phase 7 (Operators).** Restart of an operator can confuse its managed CRD reconciles. Use `imagePullPolicy: Always` + readiness check before declaring stable. Investigate cnpg-operator and ESO restart behavior in advance.
+- **Phase 8 (Critical infra).** Calico/CSI DaemonSet rollouts impact each node briefly. Verify `updateStrategy.rollingUpdate.maxUnavailable: 1` on every DaemonSet before enrollment. Memory id=390 (26h Calico-cascade outage) is the cautionary tale.
+- **Phase 9 (Bootstrap).** Vault, CNPG, mysql-standalone. Coordinate with backup window. Take a fresh snapshot of `/srv/nfs/<db>-backup/` before applying the phase enrollment.
+
+---
+
+## Cleanup tasks (after Phase 9 stable)
+
+### Task C.1: Retire Service Upgrade Agent
+
+**Files:**
+- Modify: `stacks/n8n/` — remove the Service Upgrade Agent workflow
+- Delete: any supporting scripts (`infra/scripts/service-upgrade-*.sh` if they exist)
+- Modify: `stacks/diun/` — disable webhook notification to n8n (keep Slack notification for visibility)
+
+### Task C.2: Update CLAUDE.md files
+
+- Reverse the "use 8-char git SHA tags" rule in `infra/.claude/CLAUDE.md` "Docker images" line.
+- Reverse same in root `/CLAUDE.md` if duplicated.
+- Add a new section documenting the Keel model + KYVERNO_LIFECYCLE_V2 snippet.
+- Update memory via `mcp__claude_memory__memory_update` on entries 388, 612, 604 (CI/CD architecture, Service Upgrade Agent retirement, cache TTL clarification).
+
+### Task C.3: Add a runbook
+
+**Files:**
+- Create: `docs/runbooks/keel-rollback.md`
+
+Document the rollback flow (decision #9): `kubectl rollout undo` → Terraform pin → annotation `keel.sh/policy: never`.
+
+### Task C.4: Tidy Diun
+
+Drop image-pin overrides for MySQL, PostgreSQL, Redis from Diun config (no longer needed since they're Keel-managed; the previous skip was for the retired changelog-agent path).
+
+---
+
+## Rollback (whole project)
+
+If the auto-roll experiment goes badly cluster-wide (multiple cascading failures, repeated outages), revert:
+
+1. Set Kyverno ClusterPolicy `inject-keel-annotations` to empty `namespaces: []`.
+2. Existing annotations remain on workloads, but Keel continues to act on them — so also disable Keel: scale `keel` Deployment to 0.
+3. Pin every workload's Terraform image_tag back to its current running digest (use `kubectl get deploy -A -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.template.spec.containers[0].image}{"\n"}{end}'`).
+4. Document failure modes in `post-mortems/2026-XX-XX-keel-rollback.md`.
+5. Reconsider opt-in approach for next iteration.
+
+---
+
+## Success criteria
+
+- All ~70 services running latest within 8 weeks of Phase 0 completion.
+- Zero unrolled-back outages caused by Keel.
+- ≤5 services on the "pin list" (i.e. ≥93% auto-roll success rate).
+- `terragrunt plan` shows no spurious diffs from Kyverno-injected annotations (KYVERNO_LIFECYCLE_V2 working as intended).
+- Service Upgrade Agent + supporting infra retired.
--- a/docs/plans/2026-05-17-agent-presence-plan.md
+++ b/docs/plans/2026-05-17-agent-presence-plan.md
--- a/docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
+++ b/docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
@ -0,0 +1,112 @@
+# MySQL 8.4.8 → 8.4.9 Upgrade — Design
+
+**Date**: 2026-05-19
+**Status**: Drafted, **NOT scheduled**. Execute only inside a planned maintenance window with user sign-off.
+**Beads**: (filed alongside this doc)
+**Related**: `docs/runbooks/restore-mysql.md`, beads `code-eme8` / `code-k40p` (closed in `ea475c3d`)
+
+## Background
+
+On 2026-05-18, Keel auto-bumped the `mysql:8.4` floating tag on the
+`mysql-standalone` StatefulSet from 8.4.8 to 8.4.9. The in-server data
+dictionary upgrade (80408 → 80409) stalled reliably: ~24 s of writes to
+`mysql.ibd` + redo log after "Server upgrade started", then complete
+silence — no CPU, no flushes, no errors, no completion. The `boot`
+thread sat in user-space sleep (`State: S`, `wchan: 0`) for 10+
+minutes; the MySQLX socket appeared but `mysqld.sock` never did. Even
+with `liveness_probe.initial_delay_seconds = 600`, the upgrade never
+completed.
+
+Recovery (commit `ea475c3d`): pinned image to `mysql:8.4.8` exactly,
+wiped the corrupted PVC, restored from the 00:30 UTC mysqldump. Total
+downtime: ~25 min. Forgejo + 7 dependent apps offline during that
+window.
+
+## Root cause — best evidence
+
+We never proved this definitively because we couldn't connect to MySQL
+during the stall, but the strongest hypothesis is **flush starvation
+during the DD upgrade's mandatory checkpoint**:
+
+1. Upgrade rewrites `mysql.st_spatial_reference_systems` (5103 SRS
+   defs) + dirties pages across the system tablespace.
+2. Reaches a point where it must checkpoint before continuing.
+3. The page-cleaner thread can't drain dirty pages fast enough because
+   `innodb_io_capacity=100` (1.6 MB/s effective flush rate, default is
+   200, recommended for SSDs is 2000+) combined with
+   `innodb_page_cleaners=1`.
+4. The `boot` thread waits on a pthread condvar that the flush
+   coordinator should signal but never does within probe timeout.
+
+Why we're not 100 % certain:
+- LUKS2-encrypted block storage (`proxmox-lvm-encrypted`) may
+  contribute its own flush latency.
+- We didn't capture a stack trace from the stalled `boot` thread
+  (`/proc/1/task/118/stack` was `permission denied`).
+- A genuine MySQL 8.4.9 bug in the SRS-update path is possible (worth
+  checking the MySQL bug tracker before retry).
+
+**Organizational root cause** (definitive): the `mysql:8.4` floating
+tag let Keel auto-bump without testing. Already fixed — image pinned
+to `mysql:8.4.8` exactly.
+
+## Decisions
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | **Approach: wipe + re-init on 8.4.9** (logical migration via fresh init + dump-restore) | The DD upgrade is the broken path. A fresh 8.4.9 init starts at version 80409 directly — no upgrade ever runs. We've executed wipe+restore once in ~25 min; the path is now well-trodden. |
+| 2 | **Pre-flight: bump InnoDB IO config** | `innodb_io_capacity=2000`, `innodb_io_capacity_max=4000`, `innodb_page_cleaners=4`. These are the long-term-correct values regardless of the upgrade — current settings are ~10× too conservative for the workload. |
+| 3 | **Restore strategy: per-database dumps, NOT the full `--all-databases` dump** | Per-db dumps at `/srv/nfs/mysql-backup/per-db/<db>/` skip the `mysql` system schema entirely. Avoids the question of "will 8.4.8 mysql-schema rows confuse 8.4.9". User accounts get recreated via Vault + null_resource. |
+| 4 | **Fresh dump immediately before cutover, not yesterday's** | The daily dump runs at 00:30 UTC. The cutover dump must come from < 60 s before scale-to-0 to minimize data loss. Kick `mysql-backup-per-db` CronJob manually. |
+| 5 | **Maintenance window required** | All MySQL-dependent apps offline ~25 min: Forgejo (+ registry → ImagePullBackOff cascade), Nextcloud, HackMD, Grafana, Paperless, Uptime-Kuma, Shlink, realestate-crawler, phpipam, technitium, vikunja, freshrss, finance, resume. Pick a low-traffic window (suggest Sunday 03:00 UK). |
+| 6 | **Single rollback path: re-pin to 8.4.8 + same wipe/restore flow** | If 8.4.9 fresh init misbehaves post-restore, rollback IS the same procedure, just with image=8.4.8. The pinned 8.4.8 dump survives. No new failure modes. |
+| 7 | **Out of scope for this upgrade**: tuning that doesn't gate the upgrade | Right-sizing buffer pool, switching to async commits, changing storage class, replication — all separate decisions. |
+
+## Verification gates
+
+Before declaring done:
+1. `kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$PW" -e "SELECT VERSION();"` returns `8.4.9`.
+2. `SHOW DATABASES;` lists all 20 user databases.
+3. Table count per schema matches the pre-upgrade snapshot (recorded
+   in step 1 of the plan).
+4. `forgejo` logs show successful DB ping; `kubectl -n forgejo get pod` is 1/1 Running.
+5. `kubectl get deploy,sts -A` shows no unready workloads.
+6. `bash infra/scripts/cluster_healthcheck.sh --quiet` returns same or
+   better PASS/WARN/FAIL ratio as pre-upgrade.
+7. Forgejo integrity probe reports 0 failures (manual trigger).
+8. `RegistryCatalogInaccessible` not firing in Prometheus.
+
+## Risks + mitigations
+
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| 8.4.9 fresh init has *some other* unobserved bug | Low | Smoke-test on a parallel PVC in dbaas before touching the real one (optional but cheap — adds 30 min). See plan Phase 1. |
+| Per-db dump-restore misses a database the user added recently | Low | Compare `SHOW DATABASES` against the per-db dump directory listing pre-cutover. If a DB exists in MySQL but not in `/srv/nfs/mysql-backup/per-db/`, dump it manually first. |
+| Forgejo/roundcubemail static-user passwords drift again after restore | Certain | Already documented in runbook — DROP USER + CREATE USER from Vault values immediately after restore. |
+| The cutover dump itself is corrupt | Very low | mysqldump exits non-zero on failure. CronJob already pushes `backup_last_success_timestamp` to Pushgateway. Verify timestamp is fresh before proceeding. |
+| Apps fail to reconnect after MySQL restart | Low | Already-proven recipe: `kubectl rollout restart` on the affected deployments. Listed exhaustively in runbook §B.8. |
+| 8.4.9 fresh init *also* stalls (root cause was NOT flush starvation) | Medium-low | Pre-flight test on parallel PVC catches this before maintenance window. If real prod init stalls, immediately revert TF pin to 8.4.8, redo same dump-restore flow. Same 25 min downtime as the original recovery. |
+
+## Why not alternatives
+
+- **In-place DD upgrade with bumped IO config**: simpler, but if it
+  still stalls we lose 30–60 min waiting + still fall back to
+  wipe+restore. Same data risk; worse expected time. We *would* learn
+  whether the bumped IO settings fix the upgrade, but the fresh init
+  approach makes that knowledge unnecessary.
+- **Parallel migration (new mysql-standalone-new pod alongside)**:
+  cleanest rollback (instant via service-selector flip), but needs TF
+  surgery to declare two StatefulSets temporarily and isn't worth the
+  complexity when the wipe+restore approach is now proven.
+- **Wait for 8.4.10 / 8.5 LTS**: leaves us stuck on 8.4.8 indefinitely.
+  Acceptable for now (we're pinned), but not a permanent answer.
+
+## Out of scope
+
+- A standby/replica MySQL for zero-downtime upgrades (separate
+  initiative — see future planning around CNPG-style HA for MySQL).
+- Removing `proxmox-lvm-encrypted` LUKS2 from the equation (the
+  encryption is a security requirement; debugging its flush latency is
+  separate).
+- Replacing MySQL with PostgreSQL (long-term goal for some apps; not
+  this upgrade).
--- a/docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
+++ b/docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
@ -0,0 +1,349 @@
+# MySQL 8.4.8 → 8.4.9 Upgrade — Plan
+
+**Date**: 2026-05-19
+**Status**: Drafted, **NOT scheduled**
+**Design**: `2026-05-19-mysql-8.4.9-upgrade-design.md`
+**Estimated downtime**: 25–30 min (all MySQL-dependent apps offline)
+**Window**: Suggest Sunday 03:00 UK (low traffic, kured window doesn't fight us)
+
+## Pre-flight (before the maintenance window)
+
+### P.1 Optional smoke test on a parallel PVC (recommended, +30 min)
+
+In a non-production session, before scheduling the real cutover:
+
+```bash
+# 1. Create a temporary StatefulSet `mysql-smoketest` in dbaas with the
+#    same image (mysql:8.4.9), same configmap, brand-new PVC.
+#    Use a one-off kubectl apply -f /tmp/smoketest.yaml — NOT Terraform —
+#    so it doesn't pollute the real stack.
+# 2. Verify it inits to 8.4.9 cleanly (mysqld.sock appears, "ready for connections").
+# 3. Restore one of the smaller per-db dumps (e.g. resume, freshrss) into it.
+# 4. Delete the smoketest StatefulSet + PVC.
+```
+
+Outcome:
+- ✅ Init succeeds → proceed with the real upgrade with high confidence.
+- ❌ Init stalls → root cause was not flush starvation. Halt and re-investigate. The real upgrade is unsafe.
+
+### P.2 Read the MySQL 8.4.9 release notes + bug tracker
+
+Specifically look for issues filed since 8.4.9 GA against the DD upgrade
+path or `st_spatial_reference_systems`. If a known fix landed in 8.4.10
+or 8.5.x, consider waiting.
+
+### P.3 Confirm backup pipeline is healthy
+
+```bash
+# Latest per-db dumps exist for all 20 databases
+kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
+    'for d in $(ls /backup/per-db/); do echo -n "$d: "; ls -t /backup/per-db/$d/ | head -1; done'
+
+# Pushgateway shows recent success
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep mysql-backup-per-db
+```
+
+### P.4 Pin maintenance window and notify
+
+Brief the user. Confirm window. Disable any background scrapers /
+schedulers / bots that would create noise during the cutover.
+
+## Execution (inside the maintenance window)
+
+### Step 1 — Pre-flight snapshot
+
+```bash
+ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+
+# Record current state for verification later
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
+    -e "SELECT table_schema, COUNT(*) AS tables FROM information_schema.tables \
+        WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
+        GROUP BY table_schema;" > /tmp/mysql-pre-upgrade-table-counts.txt
+cat /tmp/mysql-pre-upgrade-table-counts.txt
+```
+
+### Step 2 — Trigger a fresh per-db dump
+
+```bash
+kubectl -n dbaas create job --from=cronjob/mysql-backup-per-db pre-upgrade-$(date +%s)
+# Wait for completion (typically <2 min)
+kubectl -n dbaas wait --for=condition=complete --timeout=300s job/pre-upgrade-<timestamp>
+```
+
+Verify all 20 databases dumped:
+
+```bash
+kubectl -n dbaas exec mysql-standalone-0 -- bash -c \
+    'for d in $(ls /backup/per-db/); do
+       newest=$(ls -t /backup/per-db/$d/ | head -1)
+       echo "$d: $newest"
+     done'
+```
+
+Every entry should have a `dump_<today>_*.sql.gz` listed.
+
+### Step 3 — Bump InnoDB IO config + image pin in Terraform
+
+In `stacks/dbaas/modules/dbaas/main.tf`:
+
+```diff
+-      innodb_io_capacity=100
+-      innodb_io_capacity_max=200
+-      innodb_page_cleaners=1
+      innodb_io_capacity=2000
+      innodb_io_capacity_max=4000
+      innodb_page_cleaners=4
+```
+
+```diff
+-          # Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
+-          # repeatedly across multiple attempts. ...
+-          image = "mysql:8.4.8"
+          # Re-pinned to 8.4.9 on 2026-MM-DD after the wipe+reinit upgrade
+          # path (see docs/plans/2026-05-19-mysql-8.4.9-upgrade-*).
+          image = "mysql:8.4.9"
+```
+
+Commit but **do not apply yet**.
+
+### Step 4 — Stop MySQL
+
+```bash
+kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
+# Wait for pod deletion
+kubectl -n dbaas wait --for=delete pod/mysql-standalone-0 --timeout=120s
+```
+
+### Step 5 — Wipe the PVC
+
+```bash
+PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
+kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
+kubectl -n dbaas delete pvc data-mysql-standalone-0
+# Confirm PV vanishes (CSI cleans up the LV)
+kubectl get pv | grep -q "$PV" && echo "WARNING: PV still present" || echo "PV cleaned up"
+```
+
+### Step 6 — Apply Terraform (8.4.9 + bumped IO)
+
+```bash
+cd stacks/dbaas
+/home/wizard/code/infra/scripts/tg apply
+```
+
+This creates a fresh 5 Gi PVC + new pod on `mysql:8.4.9`. Initial-init
+takes ~30 s. Verify:
+
+```bash
+kubectl -n dbaas wait --for=condition=ready pod/mysql-standalone-0 --timeout=300s
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
+# expect: 8.4.9
+```
+
+**If the pod fails to become Ready within 5 min**: this is the
+"root cause was not flush starvation" failure mode. Abort the upgrade,
+revert the image pin to 8.4.8 in TF, re-run from Step 4 (wipe + apply
+8.4.8 + restore). Total extra downtime ~25 min.
+
+### Step 7 — Restore per-db dumps (NOT the full --all-databases dump)
+
+```bash
+ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+
+cat <<YAML | kubectl apply -f -
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: mysql-restore-per-db-$(date +%Y-%m-%d)
+  namespace: dbaas
+spec:
+  ttlSecondsAfterFinished: 3600
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: restore
+        image: mysql:8.4.9
+        command: ["bash","-c"]
+        args:
+        - |
+          set -euo pipefail
+          for db in \$(ls /backup/per-db/); do
+            newest=\$(ls -t /backup/per-db/\$db/ | head -1)
+            echo "=== Restoring \$db from \$newest ==="
+            mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" \
+                -e "CREATE DATABASE IF NOT EXISTS \\\`\$db\\\`;"
+            gunzip -c "/backup/per-db/\$db/\$newest" | \
+              mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" "\$db"
+          done
+          echo "=== All databases restored ==="
+          mysql -h mysql.dbaas.svc.cluster.local -uroot -p"\$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
+        env:
+        - name: MYSQL_ROOT_PASSWORD
+          valueFrom: { secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD } }
+        volumeMounts:
+        - { name: backup, mountPath: /backup, readOnly: true }
+      volumes:
+      - name: backup
+        persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
+YAML
+```
+
+Watch: `kubectl -n dbaas logs -f job/mysql-restore-per-db-<date>`.
+Expected time: ~3 min for all 20 databases.
+
+### Step 8 — Recreate Vault-rotated + static users
+
+The per-db restore did NOT touch `mysql.user`. Recreate all app users
+fresh:
+
+```bash
+# Static users (forgejo, roundcubemail) from Vault
+FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
+RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
+
+kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
+CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
+CREATE USER IF NOT EXISTS 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
+GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
+GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
+FLUSH PRIVILEGES;
+SQL
+
+# Vault-DB-engine-rotated users: force re-rotation so Vault rewrites the
+# user with the current password held in K8s secrets
+for role in $(vault list -format=json database/roles | jq -r '.[]' | grep '^mysql-'); do
+  echo "Rotating $role"
+  vault write -f "database/rotate-role/$role"
+done
+
+# Technitium has a separate password-sync job — kick it
+kubectl -n technitium create job --from=cronjob/technitium-password-sync \
+    technitium-postupgrade-$(date +%s)
+```
+
+### Step 9 — Restart MySQL-dependent apps
+
+```bash
+for ns_app in \
+    "forgejo:deploy/forgejo" \
+    "nextcloud:deploy/nextcloud" \
+    "hackmd:deploy/hackmd" \
+    "monitoring:deploy/grafana" \
+    "paperless-ngx:deploy/paperless-ngx" \
+    "uptime-kuma:deploy/uptime-kuma" \
+    "url:deploy/shlink" \
+    "phpipam:deploy/phpipam" \
+    "technitium:sts/technitium" \
+    "vikunja:deploy/vikunja" \
+    "freshrss:deploy/freshrss" \
+    "finance:deploy/finance" \
+    "resume:deploy/resume" \
+    "realestate-crawler:deploy/realestate-crawler-api" \
+    "realestate-crawler:deploy/realestate-crawler-celery" \
+    "realestate-crawler:deploy/realestate-crawler-celery-beat" \
+    "realestate-crawler:deploy/realestate-crawler-ui"; do
+  ns=${ns_app%%:*}; app=${ns_app##*:}
+  kubectl -n "$ns" rollout restart "$app" &
+done
+wait
+```
+
+Wait for all to become ready:
+
+```bash
+until [ "$(kubectl get deploy,sts -A -o json | \
+    jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | .metadata.name' | \
+    wc -l)" -eq 0 ]; do
+  sleep 5
+done
+echo "All workloads ready"
+```
+
+### Step 10 — Force ImagePullBackOff pods to retry (Forgejo registry was offline)
+
+```bash
+for ns in chrome-service fire-planner freedify; do
+  kubectl -n "$ns" delete pod --all 2>/dev/null || true
+done
+```
+
+### Step 11 — Clean up failed CronJob pods from the outage window
+
+```bash
+kubectl delete pods -A --field-selector=status.phase=Failed
+```
+
+### Step 12 — Verify (matches design §Verification gates)
+
+```bash
+# 1. Version
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" -e "SELECT VERSION();"
+# expect: 8.4.9
+
+# 2-3. Databases + table counts
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
+    -e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
+        WHERE table_schema NOT IN ('information_schema','performance_schema','sys','mysql') \
+        GROUP BY table_schema;" > /tmp/mysql-post-upgrade-table-counts.txt
+diff /tmp/mysql-pre-upgrade-table-counts.txt /tmp/mysql-post-upgrade-table-counts.txt
+# expect: no diff (or only counts that grew between snapshots)
+
+# 4. Forgejo
+kubectl -n forgejo get pod
+kubectl -n forgejo logs deploy/forgejo --tail=20 | grep -iE "ORM engine|ready"
+# expect: 1/1 Running, "ORM engine initialized"
+
+# 5. Cluster health
+bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
+
+# 6. Registry integrity probe
+kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe \
+    postupgrade-$(date +%s)
+kubectl -n monitoring logs job/postupgrade-<timestamp> --tail=5
+# expect: "Probe complete: 0 failures"
+
+# 7. RegistryCatalogInaccessible not firing
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+    wget -qO- 'http://localhost:9090/api/v1/alerts' | \
+    python3 -c "import json,sys; d=json.load(sys.stdin); [print(a['labels']['alertname']) for a in d['data']['alerts'] if a['state']=='firing']"
+# expect: empty / no RegistryCatalogInaccessible
+```
+
+### Step 13 — Commit + push the Terraform change
+
+```bash
+git add stacks/dbaas/modules/dbaas/main.tf
+git commit -m "dbaas: pin MySQL to 8.4.9 after successful wipe+reinit upgrade
+
+Executed per docs/plans/2026-05-19-mysql-8.4.9-upgrade-{design,plan}.md.
+The full upgrade ran clean — fresh init on 8.4.9 sidestepped the DD
+upgrade stall. IO config bumped to 2000/4 (was 100/1) for the workload.
+"
+git push
+```
+
+## Rollback path (if Step 6 or Step 7 fails catastrophically)
+
+The wipe at Step 5 is destructive — once executed, the original disk
+is gone. Rollback is **same procedure, image=8.4.8**:
+
+1. Edit TF: `image = "mysql:8.4.8"`
+2. `kubectl -n dbaas scale sts mysql-standalone --replicas=0`
+3. Re-wipe (already wiped; just `tg apply`)
+4. Run the Step 7 restore Job again (now on 8.4.8)
+5. Run Step 8-11
+6. Update Terraform comment to reflect retained 8.4.8 pin.
+
+Extra downtime: ~25 min on top of the existing window.
+
+## Post-upgrade follow-ups
+
+- Update `infra/.claude/CLAUDE.md` MySQL row to reflect 8.4.9 pin.
+- Update `docs/runbooks/restore-mysql.md` to reflect 8.4.9.
+- Re-evaluate whether the new IO config (2000/4) is overkill for the
+  workload after 1-2 weeks — could drop to 1000/2 if needed.
+- Optional: file a follow-up task to investigate MySQL HA/replication
+  so the next upgrade isn't blocking.
--- a/docs/plans/2026-05-21-ha-control-plane-design.md
+++ b/docs/plans/2026-05-21-ha-control-plane-design.md
@ -0,0 +1,135 @@
+# HA Control Plane (3 masters) — Design
+
+**Date**: 2026-05-21
+**Status**: Drafted, NOT scheduled
+**Beads**: code-n0ow
+**Trigger**: today's k8s 1.34.7→1.34.8 autonomous-upgrade session repeatedly hit a storm cascade rooted in single-master apiserver outages
+
+## Problem statement
+
+The autonomous k8s upgrade pipeline (`stacks/k8s-version-upgrade/`) is
+correct end-to-end but **cannot push through the cluster's
+single-master architecture**. Each attempted upgrade today rolled
+back via the same cascade:
+
+1. Chain drains master → `kubeadm upgrade apply` swaps a static-pod
+   manifest (etcd → apiserver → controller-manager → scheduler).
+2. While a manifest swap is in flight, the affected control-plane
+   component is briefly down — for apiserver, that means ~10–60s of
+   "connection refused" to `10.96.0.1:443` from every kubelet and
+   operator pod in the cluster.
+3. **Several operators die during that window** instead of waiting:
+   - **tigera-operator**: logs `[ERROR] Get "https://10.96.0.1:443/api?timeout=32s": connect: connection refused` then exits 1 immediately
+   - gpu-operator, cnpg-cloudnative-pg, kube-controller-manager: similar leader-lease failures
+4. Kubelet restarts those pods → image pulls + initial reads → storm
+   of disk I/O on master (we observed 563 MB/s from tigera alone).
+5. **The storm slows apiserver-to-kubelet status sync** past kubeadm's
+   hardcoded 5-min watch on the pod's `kubernetes.io/config.hash`
+   annotation.
+6. kubeadm declares the upgrade "did not change after 5m0s",
+   **rolls back to the previous manifest**, exits non-zero.
+7. Chain Job retries (backoffLimit=1) → same storm → same failure.
+   Chain dead.
+
+The container runtime, the script logic, the RBAC permissions are all
+fine after today's fixes. The **single master is the bottleneck**.
+
+## Why HA control plane fixes this
+
+With 3 masters running etcd quorum + apiserver behind an LB:
+
+| Failure mode | Single master | 3-master HA |
+|---|---|---|
+| Master reboot / kubeadm upgrade | Apiserver completely down 10–60s | Other 2 masters serve clients; LB transparently fails over |
+| etcd quorum during one master being down | Total outage (1/1 broken) | Quorum maintained (2/3 healthy) |
+| Tigera/operators see apiserver as "down" | Yes → crashloop storm | No → keep running through |
+| kubeadm `static-pod hash` watch | Times out under load (today's bug) | Never under load; sync stays fast |
+| Pipeline upgrade success rate | Brittle / needs manual nursing | Truly autonomous |
+
+The k8s upgrade chain doesn't need to be aware of *any* of this — the
+underlying availability of apiserver makes the chain's gates
+naturally pass on each iteration.
+
+## Decisions (proposed — to be confirmed)
+
+| # | Decision | Notes |
+|---|----------|-------|
+| 1 | **3 masters** (not 5) | Quorum tolerates 1 failure, sufficient for home-lab. 5 would tolerate 2 but doubles etcd write amplification. |
+| 2 | **Sizing**: match current `k8s-master` (8 vCPU, 32GB RAM, ~64 GB disk) for all 3 | Symmetric. New VMs `k8s-master-2`, `k8s-master-3` on Proxmox. |
+| 3 | **Apiserver LB**: **pfSense HAProxy** (existing pattern, see mailserver-pfsense-haproxy.md) over keepalived+haproxy-on-each-master | Pros: no per-node moving parts, mirrors the mailserver layout already in production. Cons: pfSense becomes more SPoF — but it's already SPoF for everything else (DNS, gateway, ingress). |
+| 4 | **VIP**: pick an unused IP on the cluster VLAN, e.g. `10.0.20.99`, point all kubeconfigs + kubelet `--server` at it | Internal-only VIP; external API access stays via Cloudflared. |
+| 5 | **etcd**: kubeadm-managed (existing); just `kubeadm join --control-plane` brings new members into the etcd cluster automatically | Avoids running etcd separately. |
+| 6 | **kured-sentinel-gate**: extend "quorum-safe" check to verify ≥2 control-plane nodes Ready before allowing a reboot | Otherwise kured could reboot 2 masters at once and break quorum. |
+| 7 | **etcd backup**: today's `etcd-backup` CronJob already takes a snapshot from one member; that's still sufficient (etcd snapshot is a consistent point-in-time). No new work needed. | |
+| 8 | **Migration order**: add masters one at a time, run smoke (kubectl from each), then cut over kubeconfigs | Each `kubeadm join --control-plane` is reversible (just `kubeadm reset` + remove from etcd member list). |
+
+## Out of scope
+
+- HA pfSense itself (separate, much bigger initiative)
+- Multi-DC failover
+- External etcd cluster (we're sticking with kubeadm-managed stacked etcd)
+- Rebuilding cluster from scratch — we'll join into the existing one
+
+## Risk register
+
+| Risk | Mitigation |
+|---|---|
+| etcd quorum split-brain during member join | kubeadm join is atomic; if it fails, the new member doesn't join the quorum. Existing etcd stays healthy. |
+| LB misconfiguration → all kubectl breaks | Smoke-test from each master before flipping clients. Keep a kubeconfig pointing directly at one master as fallback. |
+| Existing kubeconfigs (dev VM, agents, woodpecker) need updating | List all consumers, update in a single TF apply. |
+| New masters get scheduled some workload pods unintentionally | Verify control-plane taint is applied at join time. |
+| Cluster-wide cert rotation might be needed | kubeadm join handles certs automatically using the `--certificate-key` from `kubeadm init phase upload-certs`. |
+| 32GB per master × 3 = 96GB RAM used for control plane alone | Proxmox host has headroom; not blocking. |
+
+## Verification
+
+After all 3 masters joined + LB up:
+
+```bash
+# All 3 masters listed
+kubectl get nodes -l node-role.kubernetes.io/control-plane=
+
+# etcd quorum healthy
+kubectl -n kube-system exec etcd-k8s-master -- etcdctl \
+    --endpoints=https://10.0.20.100:2379,https://10.0.20.X:2379,https://10.0.20.Y:2379 \
+    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+    --cert=/etc/kubernetes/pki/etcd/server.crt \
+    --key=/etc/kubernetes/pki/etcd/server.key \
+    endpoint health --cluster
+
+# Failover test: cordon master-1, reboot it, observe kubectl still works through LB
+kubectl drain k8s-master --delete-emptydir-data --ignore-daemonsets
+ssh wizard@k8s-master.viktorbarzin.lan sudo reboot
+
+# Pipeline test: re-trigger k8s upgrade chain (e.g. for whatever the next patch is)
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check ha-validation-$(date +%s)
+# Expect: full chain succeeds end-to-end without manual intervention
+```
+
+## Cost estimate
+
+- 2× VMs at 8 vCPU + 32GB RAM each = +64GB RAM on Proxmox host
+- ~+128GB disk usage (2× 64GB master disks)
+- ~2-4 hours of operator time end-to-end (VM provisioning + kubeadm join + LB config + smoke)
+
+## What's already in place from today's work
+
+(All these are prerequisites that were fixed during today's
+investigation — they stay relevant when HA lands.)
+
+- Master containerd 1.6.22 → 2.2.2, runc 1.1.8 → 1.4.0 (fixed
+  `runc: unable to signal init: permission denied` on Ubuntu 26.04)
+- Pipeline script bugs: 3× `grep -vE` pipefail, 1× RBAC missing
+  `get daemonsets`, 1× `RecentNodeReboot` not ignored in master phase
+- Kill-switch ConfigMap mechanism (`k8s-upgrade-killswitch`)
+- Kubeadm-apply retry wrapper in `update_k8s.sh` (helps but doesn't
+  fully fix the storm cascade)
+- Quiet-baseline threshold 3600s → 600s
+
+## Reference
+
+Commits from today's session:
+- `10b261d2` — first `grep -vE` pipefail
+- `0c8b46df` — 2 more pipefail sites
+- `fc0510aa` — kill-switch + RecentNodeReboot ignore + 600s threshold
+- `2dc7e001` — kubeadm apply 3-attempt retry
--- a/docs/plans/2026-05-22-openclaw-devvm-access-design.md
+++ b/docs/plans/2026-05-22-openclaw-devvm-access-design.md
@ -0,0 +1,269 @@
+# OpenClaw devvm access + async task pattern — design
+
+**Date:** 2026-05-22
+**Stack:** `infra/stacks/openclaw`
+**Status:** Approved (in-session, see chat history 2026-05-22)
+
+## Goal
+
+Give the OpenClaw pod (running in K8s) two new capabilities:
+
+1. **Host-tools bundle** — common Linux CLIs the upstream OpenClaw image
+   doesn't ship (`ssh`, `scp`, `vault`, `dig`, `jq`, `yq`, `ripgrep`, `fd`,
+   `gnupg`, `tmux`, etc.). OpenClaw can't `apt install` because the
+   container runs as non-root `node` (uid 1000).
+2. **devvm async task pattern** — OpenClaw spawns long-running work as
+   `tmux` sessions on devvm, sends prompts via `tmux send-keys`, captures
+   progress via `tmux capture-pane`. Sessions live on devvm, so they
+   survive OpenClaw pod restarts.
+
+OpenClaw uses this combination as a **trusted fallback** for tasks too
+expensive, sensitive, or stateful for in-pod execution: Vault lookups,
+multi-step `claude-code` work, anything needing wizard's full home-lab
+access.
+
+## Why now
+
+- The in-pod sandbox is `security=full` but the container is minimal —
+  no `ssh`, no `vault`, no `dig`, no `tmux`.
+- The user wants OpenClaw to be a first-line agent that delegates heavy
+  work to the dev VM rather than duplicate that work in a constrained pod.
+- Long-running work (multi-minute `claude-code` sessions) shouldn't be
+  tied to a single synchronous `claude -p` invocation — needs persistence
+  and pollability.
+
+## Architecture decision: stay on K8s
+
+Discussed migrating OpenClaw to run directly on devvm (would obviate the
+host-tools bundle + most of the SSH setup). Decision: **stay on K8s**.
+
+Reasons:
+- Keeps HA (5-node cluster vs single devvm reboot)
+- Keeps ingress/Authentik/Telegram entry chain intact
+- Keeps Prometheus scrape + exporter sidecar
+- Keeps PVC backup pipeline (LVM snapshots + Synology offsite)
+- Resource isolation — a runaway LLM session can't stress wizard's daily-driver VM
+- Migration cost is several days; this design is ~150 LoC + an 80-line wrapper
+
+The mental model — "OpenClaw is sandboxed, delegates to wizard@devvm for
+trusted heavy lifting" — is a clean security boundary. Worth preserving.
+
+## Architecture
+
+### Pod side (`infra/stacks/openclaw/main.tf`)
+
+Two new init containers added to the OpenClaw Deployment, after the
+existing four:
+
+#### Init 5 — `install-host-tools`
+
+- Image: `debian:bookworm-slim` (matches main container base for glibc compat)
+- Idempotent: skips if `/tools/host-tools/.installed-v1` exists
+- `apt-get install --download-only --no-install-recommends` for:
+  `openssh-client dnsutils iputils-ping wget gnupg jq ripgrep fd-find ncdu htop strace tcpdump tmux unzip`
+- Iterates `.deb` files in `/var/cache/apt/archives/`, `dpkg-deb -x` each
+  into `/tools/host-tools/root/` (preserves `usr/bin`, `usr/sbin`,
+  `usr/lib` layout)
+- Downloads static binaries to `/tools/host-tools/bin/`:
+  - `vault` (HashiCorp releases, pinned version)
+  - `yq` (mikefarah/yq GitHub releases, pinned version)
+- Smoke test: invokes `--version` on each bundled binary; fails init if
+  any won't load (catches glibc / shared-lib drift at deploy time, not
+  runtime)
+- Writes marker file with version
+
+#### Init 6 — `setup-ssh-config`
+
+- Image: uses the just-installed host-tools (debian:bookworm-slim base
+  with `/tools/host-tools/root/usr/bin` on PATH so `ssh-keyscan` works)
+- Runs after `install-host-tools`
+- Idempotent: skips if `/home/node/.openclaw/.ssh/.configured-v1` exists
+- Creates `/home/node/.openclaw/.ssh/` (uid 1000)
+- Copies `/ssh/id_rsa` (tmpfs secret mount) → `~/.ssh/id_rsa` with 0600
+  (the secret tmpfs mount has wider perms that openssh rejects)
+- Writes `~/.ssh/config`:
+
+  ```ssh-config
+  Host devvm
+    HostName 10.0.10.10
+    User wizard
+    IdentityFile ~/.ssh/id_rsa
+    UserKnownHostsFile ~/.ssh/known_hosts
+    StrictHostKeyChecking yes
+  ```
+
+  PATH handling on the remote side: devvm's sshd uses the default
+  non-interactive PATH (`/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin`)
+  and does NOT load `~/.profile` or `~/.bashrc` (memory id=740). Client-side
+  `SetEnv PATH=…` doesn't help because sshd's `AcceptEnv` is `LANG LC_*` only.
+  Solution: install the binaries openclaw cares about into `/usr/local/bin/`
+  on devvm (see "Devvm side" below).
+
+- Pre-seeds `~/.ssh/known_hosts` via `ssh-keyscan -H 10.0.10.10`
+- Writes marker file
+
+#### Main container
+
+- `PATH` env updated: prepend
+  `/tools/host-tools/root/usr/bin:/tools/host-tools/root/usr/sbin:/tools/host-tools/bin`
+- No other changes to the startup command
+
+### Devvm side
+
+#### `/usr/local/bin/openclaw-task` wrapper
+
+Canonical source: `infra/stacks/openclaw/files/openclaw-task.sh`.
+Installed to devvm at `/usr/local/bin/openclaw-task` (`sudo cp`, `sudo
+chmod +x`) so non-interactive SSH finds it on the default PATH without
+needing `~/.profile`. Updates: re-run the install steps from the
+canonical source.
+
+Also: `sudo ln -s /home/wizard/.local/bin/claude /usr/local/bin/claude`
+so `ssh devvm claude …` works in non-interactive mode. `vault` and `tmux`
+are already at `/usr/bin/` (system packages) so no symlink needed for
+those.
+
+POSIX shell script. Subcommands:
+
+| Subcommand | Behavior |
+|---|---|
+| `new <id> <cmd...>` | Spawns detached tmux session `openclaw-task-<id>`, pipes pane output to `~/openclaw-tasks/<id>.log` |
+| `claude <id> <prompt>` | Convenience: spawns interactive `claude` in a tmux session, send-keys the prompt + Enter |
+| `send <id> <keys...>` | `tmux send-keys -t openclaw-task-<id> "$@"` — caller supplies `Enter` literal if needed |
+| `capture <id> [lines]` | `tmux capture-pane -t … -p -S -<lines>` (default last 1000) |
+| `log <id>` | `cat ~/openclaw-tasks/<id>.log` |
+| `tail <id>` | `tail -n 100 -f ~/openclaw-tasks/<id>.log` (mainly for human ops) |
+| `list` | tmux session list filtered to `openclaw-task-*`, one id per line |
+| `status <id>` | `running` if tmux session alive, `ended` otherwise |
+| `kill <id>` | `tmux kill-session -t openclaw-task-<id>` (log file is kept) |
+| `purge <id>` | `kill` + `rm -f ~/openclaw-tasks/<id>.log` |
+
+Task state lives entirely on devvm:
+
+- tmux sessions persist across SSH disconnects and OpenClaw pod restarts
+- `~/openclaw-tasks/<id>.log` is the durable transcript even after a
+  session is killed
+- No central database — `tmux list-sessions` is the source of truth for
+  "what's running"
+
+Naming convention: tmux sessions are prefixed `openclaw-task-` so they
+don't collide with wizard's own tmux work (`0`, `Openclaw`, `read-only`).
+
+### Memory note
+
+File at `/workspace/memory/projects/openclaw-runtime/devvm-fallback.md`
+teaching OpenClaw the pattern. Indexed by the existing daily
+`memory-sync` CronJob (or via manual `node openclaw.mjs memory index
+--force` for the initial seed).
+
+Content (verbatim):
+
+```markdown
+# Using devvm as a fallback
+
+When in-pod tools/permissions block you, SSH to devvm and use it. The
+devvm runs as wizard with full home-lab access (Vault, kubectl, git
+repos, Cloudflare, etc.) and has Claude Code v2+ installed.
+
+## One-shot lookup
+    ssh devvm 'vault kv get -field=brave_api_key secret/openclaw'
+    ssh devvm 'claude -p "investigate why frigate is restarting"'
+
+## Long-running async work — USE THIS for anything > ~2 min
+Spawn in a tmux session on devvm. Sessions survive OpenClaw pod restarts.
+
+    # spawn
+    ssh devvm openclaw-task new my-task "claude -p --dangerously-skip-permissions 'do the thing'"
+
+    # poll progress (last 1000 lines of pane)
+    ssh devvm openclaw-task capture my-task
+
+    # interactive claude (send follow-up prompts)
+    ssh devvm openclaw-task claude my-task "initial prompt"
+    ssh devvm openclaw-task send my-task "follow-up prompt" Enter
+
+    # housekeeping
+    ssh devvm openclaw-task list
+    ssh devvm openclaw-task status my-task
+    ssh devvm openclaw-task kill my-task
+
+Logs persist at ~/openclaw-tasks/<id>.log on devvm even after a session
+is killed. Use `ssh devvm openclaw-task log <id>` to retrieve them.
+```
+
+## Devvm: no infra changes
+
+Pre-existing state verified 2026-05-22:
+
+- pubkey from `/ssh/id_rsa` (Vault `secret/openclaw → ssh_key`) matches the
+  `ssh-ed25519 AAAA…lug node@openclaw-58cd9f7987-884bv` line in
+  `~/.ssh/authorized_keys` (the comment is a stale pod name; the key
+  itself is stable from Vault)
+- sshd listens on 0.0.0.0:22 ✓
+- `claude` v2.1.126 at `/home/wizard/.local/bin/claude` ✓
+- `tmux` 3.4 installed, server already running with existing user sessions ✓
+
+Only changes (one-time, done in the same session via `sudo`):
+- Install `openclaw-task` wrapper to `/usr/local/bin/openclaw-task`
+- Symlink `/home/wizard/.local/bin/claude` → `/usr/local/bin/claude`
+
+## Tradeoffs / risks
+
+- **Bundle size on NFS**: ~30MB extracted. Acceptable on
+  `/srv/nfs/openclaw/tools`.
+- **Library version drift**: bundled binaries link against bookworm libs.
+  Smoke test in `install-host-tools` catches breakage on the next pod
+  restart if upstream OpenClaw image rebases.
+- **Full-shell SSH**: explicit user choice. Blast radius if openclaw is
+  prompt-injected = full wizard access. Mitigation: keep OpenClaw's
+  plugin allowlist tight (current allow list: `memory-core, recruiter-api,
+  telegram, openrouter, brave, openai, codex`).
+- **tmux server lifecycle on devvm**: if wizard's tmux server dies (rare —
+  usually only on devvm reboot), in-flight openclaw tasks are killed.
+  Acceptable for home lab. Task logs persist regardless.
+- **Task log unbounded growth**: `~/openclaw-tasks/*.log` grows forever.
+  Out of scope here. User can add a `find -mtime +N -delete` cron later.
+- **Init container order**: `setup-ssh-config` depends on
+  `install-host-tools` finishing first. K8s init containers run
+  sequentially in declaration order — natural ordering, no explicit
+  dependency mechanism needed.
+
+## Testing — E2E flows required by user
+
+1. **Tools present**:
+   `kubectl -n openclaw exec <pod> -c openclaw -- ssh -V` returns version,
+   same for `dig`, `vault`, `jq`, `yq`, `tmux`, `rg`.
+2. **SSH happy path**:
+   `kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'hostname'`
+   returns `devvm`.
+3. **Claude one-shot**:
+   `kubectl -n openclaw exec <pod> -c openclaw -- ssh devvm 'claude -p "what is 1+1"'`
+   returns `2`.
+4. **Async task lifecycle**:
+   - `ssh devvm openclaw-task new test-1 "sleep 30; echo done"`
+   - `ssh devvm openclaw-task list` contains `test-1`
+   - `ssh devvm openclaw-task status test-1` returns `running`
+   - wait 35s
+   - `ssh devvm openclaw-task log test-1` contains `done`
+   - `ssh devvm openclaw-task status test-1` returns `ended`
+5. **Persistence test** (the key requirement):
+   - Spawn long task: `ssh devvm openclaw-task new persist-1 "sleep 120; echo survived > /tmp/persist-1.proof"`
+   - `kubectl -n openclaw delete pod <openclaw-pod>` — pod recreated
+   - Wait for new pod ready (init containers run, skip via marker, fast)
+   - `kubectl -n openclaw exec <new-pod> -c openclaw -- ssh devvm openclaw-task list`
+     contains `persist-1`
+   - Wait for original sleep to finish; verify `/tmp/persist-1.proof`
+     contains `survived` from new pod
+6. **Memory note lookup**:
+   `kubectl -n openclaw exec <pod> -c openclaw -- node openclaw.mjs memory search 'devvm fallback'`
+   returns the note.
+
+## Docs to update with the change
+
+- `infra/docs/plans/2026-05-22-openclaw-devvm-access-design.md` (this doc)
+- `infra/docs/plans/2026-05-22-openclaw-devvm-access-plan.md` (implementation plan)
+- `infra/.claude/reference/service-catalog.md` (one-line addition under
+  OpenClaw: "Has SSH to devvm with host-tools bundle; long-running async
+  tasks via `openclaw-task` wrapper on devvm")
+- `infra/.claude/CLAUDE.md` "Known Issues" section is left alone — none of
+  the existing OpenClaw caveats change.
--- a/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
+++ b/docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
@ -117,7 +117,7 @@ Contributing distractions:

 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
-| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
+| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. **Done 2026-05-10**: `authentik_outpost.embedded` resource + `authentik_provider_proxy.catchall.access_token_validity` codified, plan-to-zero on the whole stack. The `Outpost.managed` field is server-set (not in provider schema) and preserved across applies because TF only writes known fields. Same-day work also flipped the outpost's session backend from filesystem (`/dev/shm`) to PostgreSQL — see `.claude/reference/authentik-state.md`. | **DONE** |
 | P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |

 ### P3 — Upstream + architectural
@ -125,8 +125,8 @@ Contributing distractions:
 | Priority | Action | Type | Details | Status |
 |----------|--------|------|---------|--------|
 | P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
-| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
-| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
+| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Original idea: shrink steady-state session file count (~7× reduction) at the cost of daily re-auth. **Resolved differently 2026-05-10**: switched the outpost to the PostgreSQL session backend (`Outpost.managed = goauthentik.io/outposts/embedded` + `AUTHENTIK_POSTGRESQL__*` envFrom), which makes session count irrelevant for tmpfs sizing and lets us BUMP `access_token_validity` to `weeks=4` for better UX without cost. | **DONE (alt)** |
+| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | Original framing: external, multi-replica outpost with Redis-backed sessions. **Resolved 2026-05-10** by enabling the postgres-backed session store on the embedded outpost itself (PR goauthentik/authentik#16628). Sessions now persist across pod restarts; the original "in-memory state" concern is moot. Multi-replica still requires a goauthentik upstream fix (PgBouncer-friendly session migration), but the loss-of-state class of failures is gone. | **DONE (alt)** |

 ## Lessons Learned

--- a/docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md
+++ b/docs/post-mortems/2026-05-16-kured-stalled-and-anubis-ha.md
@ -0,0 +1,164 @@
+# Post-Mortem: kured Reboots Silently Stalled for 6 Days + Anubis HA Lift
+
+| Field | Value |
+|-------|-------|
+| **Date** | 2026-05-16 |
+| **Duration** | 6 days of unbooted pending-reboot packages (2026-05-10 → 2026-05-16) |
+| **Severity** | SEV3 — no user-facing impact; latent risk (kernel/libc CVEs queued, not landing) |
+| **Affected Services** | None directly; OS-reboot pipeline halted on all 5 K8s nodes |
+| **Status** | Root cause fixed (kured Helm value), defensive defaults added (Anubis HA, kured drain-timeout, CNPG 3 instances) |
+
+## Summary
+
+After unattended-upgrades was re-enabled on the K8s nodes on 2026-05-10,
+kured was supposed to drive rolling node reboots within the Mon–Fri
+02:00–06:00 London window. Instead, kured logged "Reboot not required"
+every hour for six straight days while the `kured-sentinel-gate`
+DaemonSet on every host happily reported "ALL CHECKS PASSED — creating
+/var/run/gated-reboot-required". The gate WAS open. kured was looking
+in the wrong place.
+
+The kured Helm chart derives the sentinel hostPath from
+`dirname(configuration.rebootSentinel)`. The stack set
+`rebootSentinel = "/sentinel/gated-reboot-required"` — which pointed
+the chart at hostPath `/sentinel/` (an empty auto-created directory).
+The sentinel-gate writes to `/var/run/gated-reboot-required` on the
+host. Two different host directories. kured silently skipped reboots
+for six days.
+
+Found on 2026-05-16 while auditing why "automatic upgrades aren't
+happening" alongside the K8s version-upgrade Job-chain (PM
+2026-05-11). Fixed in one commit; took the opportunity to also
+eliminate three latent drain-time hazards (Anubis single-replica PDB
+deadlock, kured unbounded drain timeout, CNPG-only-2-instances).
+
+## Impact
+
+- **User-facing**: None. Existing kernels, libc, and userspace kept running. CVEs queued in `/var/run/reboot-required.pkgs` on every node but were never exploited.
+- **Backlog**: All 5 nodes accumulated `linux-image-*` + `libc6` queued for reboot. Largest gap was master at ~6 days. Workers also 5–6 days.
+- **Detection gap**: kured exposes no Prometheus signal for "I checked but said no". The hourly "Reboot not required" line in stdout is the only trace, and nobody was tailing it. The architecture had two layers (sentinel-gate gate + kured sentinel check) but no verification that the two layers were looking at the same path.
+- **Side discovery**: 8 Anubis instances would have stalled drain anyway via single-replica + `PDB minAvailable=1` (the same trap that stalled the manual K8s upgrade on 2026-05-11). Even if the kured path bug were fixed in isolation, Monday's first reboot would have hit the Anubis trap and idled forever (kured default `--drain-timeout=0` = unlimited).
+
+## Timeline (UTC)
+
+| Time | Event |
+|------|-------|
+| **Mar 16 21:26** | kured-sentinel-gate DaemonSet introduced after the 26h overlayfs cascade outage. Original sentinel cool-down 30m. |
+| **May 10 ~16:57** | Last successful kured pod restart picked up new Helm values. `rebootSentinel = "/sentinel/gated-reboot-required"`. Same commit re-enabled unattended-upgrades in cloud_init and stretched the sentinel cool-down 30m → 24h. |
+| **May 10 ~17:00 → May 15 06:16** | unattended-upgrades on every node successfully installs kernel + libc patches, writes `/var/run/reboot-required`. |
+| **May 10–15** | sentinel-gate Check 1–4 all pass every 5 min on every host. Touches `/var/run/gated-reboot-required`. Logs "ALL CHECKS PASSED". |
+| **May 10–15** | kured polls `/sentinel/gated-reboot-required` (empty dir, file does not exist). Returns "Reboot not required" every hour. No reboots happen. |
+| **May 11 20:40–21:00** | Separate K8s-version-upgrade incident (master upgraded to v1.34.7, workers stalled mid-rollout because the upgrade agent drained its own host). Manual recovery 5/11–5/12. **kured stall noticed but not investigated**: cluster healthy, K8sVersionSkew firing was tracked as the urgent issue. |
+| **May 11 22:47 → May 12 00:01** | Manual worker drains hit the Anubis single-replica PDB trap (drain loops). Resolved by direct-deleting Anubis pods to bypass eviction API. This was the first signal that single-replica `minAvailable=1` patterns deadlock drains. |
+| **May 16 10:56 UTC** | While auditing "what runs the upgrades" for the user, the kured + sentinel-gate log/path mismatch became visible. |
+| **May 16 11:13 UTC** | `stacks/kured/main.tf`: `rebootSentinel = "/sentinel/..."` → `"/var/run/gated-reboot-required"`. Re-init, plan, apply. |
+| **May 16 11:14 UTC** | kured DaemonSet rolls out the new spec. Volume hostPath becomes `/var/run`. kured pod can now see `/sentinel/reboot-required` (32B, from uu) AND `/sentinel/gated-reboot-required` (0B, from gate). Confirmed via `kubectl exec` listing. |
+| **May 16 11:44 UTC** | Anubis HA module change deployed: `shared_store_url` variable → `store: { backend: valkey }` block appended to policy YAML, default replicas 2, PDB `maxUnavailable=1`, topology `DoNotSchedule`. Cyberchef applied as canary. Confirmed: Redis DB 5 starts receiving challenge state. |
+| **May 16 11:48–11:53 UTC** | Remaining 7 Anubis stacks applied (DBs 6–12). 8/8 deployments at 2/2 Ready, replicas spread on different nodes. Smoke-tested 6 of 8 public URLs return 200. |
+| **May 16 12:05 UTC** | kured `drainTimeout: "30m"` added + applied. pg-cluster bumped from 2 → 3 instances. |
+| **May 16 12:11 UTC** | pg-cluster phase = "Cluster in healthy state", 3/3 ready. |
+
+## Root Cause
+
+The Helm chart `kured-5.11.0` computes:
+```
+{{- $sentinel_dir := dir .Values.configuration.rebootSentinel -}}
+# template renders both volume mount and hostPath using $sentinel_dir
+```
+
+So `rebootSentinel` is doubly-purposed: it's both the **CLI arg path inside
+the pod** AND the **hostPath on the node**. Setting it to `/sentinel/...`
+caused:
+- pod arg: `--reboot-sentinel=/sentinel/gated-reboot-required` (looks at `/sentinel/` inside the pod)
+- hostPath: `/sentinel/` (auto-created empty directory by `type: Directory`)
+- mountPath inside pod: `/sentinel/` (mapped from hostPath above)
+
+Meanwhile the gate DaemonSet was configured with hostPath `/var/run` →
+mountPath `/host/var-run`, and wrote `gated-reboot-required` to its local
+`/host/var-run/` which became the host's `/var/run/gated-reboot-required`.
+
+The two daemons never touched the same directory.
+
+**Why this was hard to spot**:
+
+1. Both layers logged success: sentinel-gate said "ALL CHECKS PASSED", kured said "Reboot not required". Neither claimed an error.
+2. No Prometheus alert exists for "kured polled, gate is open, kured still didn't act". The Upgrade Gates alert group catches firing-alert-during-rollout, not silently-skipped-rollout.
+3. The Helm chart's auto-derivation of hostPath from a config value is undocumented surprising behavior. The mental model is "rebootSentinel is just the in-pod path"; the hostPath co-mutation is invisible.
+
+## Remediation
+
+### Primary fix
+- `stacks/kured/main.tf`: `rebootSentinel = "/var/run/gated-reboot-required"`. Both the chart-derived hostPath and the kured CLI arg now align with where the gate writes.
+
+### Defensive companion changes (same session)
+
+| Change | Purpose | Stack |
+|---|---|---|
+| `drainTimeout = "30m"` on kured | Fail closed instead of looping forever if a future PDB or finalizer stalls drain. Node stays Schedulable (no silent capacity loss). | `stacks/kured/main.tf` |
+| Anubis: shared-state Valkey/Redis backend | Eliminate the single-replica drain deadlock + provide real HA. PDB changed `minAvailable=1` → `maxUnavailable=1`. Replicas 1 → 2 with `topologySpreadConstraint: DoNotSchedule`. | `modules/kubernetes/anubis_instance/main.tf` + 8 callers |
+| pg-cluster: 2 → 3 instances | Failover during primary's node drain no longer depends on the lone replica being caught up. CNPG always has a fully-current candidate. | `stacks/dbaas/modules/dbaas/main.tf` |
+| Orphan `mysql-standalone` PDB deleted | Helm-stamped leftover (selector required 4 labels, pod has 3 → matched 0 pods). Was dead code; deletion is safe. | `kubectl` (not TF-managed) |
+
+### Verified post-fix
+
+- `kubectl -n kured exec deploy/kured -- ls /sentinel/` lists both `reboot-required` and `gated-reboot-required` on every node.
+- 8 Anubis Deployments at 2/2 Ready; pods spread across different nodes (verified via `kubectl get pods -o wide`).
+- Redis DBs 5, 7, 8, 10 receiving challenge state from real public traffic post-apply (Palo Alto Networks scanner hit blog).
+- pg-cluster 3/3 healthy, phase = "Cluster in healthy state".
+- kured args show `--drain-timeout=30m`.
+
+## Lessons
+
+1. **Auto-derivation in Helm charts is invisible drift surface.** The chart's
+   habit of deriving hostPath from a CLI-arg-shaped value is the kind of
+   "convenient default" that hides during normal review. Mitigation:
+   pin `hostFilePath` explicitly in `configuration` so the host path is
+   declared, not derived. (Did not do this in the fix because the
+   single-config approach is now correct; flagging as future improvement.)
+
+2. **"Silently skipped" needs a Prometheus signal.** The Upgrade Gates
+   alerts cover "rollout in progress + something went wrong". They don't
+   cover "we haven't rolled in 7 days when we should have". Suggested:
+   add `KuredRebootBacklog` — fires when `kured_reboot_required ==
+   1` (kured exposes this) for more than 24h continuously. The kured
+   chart already serves `/metrics`; just needs a rule. (Deferred.)
+
+3. **Single-replica `PDB: minAvailable=1` is a deadlock pattern.** It
+   reads as "protect this pod" but actually means "block all voluntary
+   disruption forever". Manifested in 9 places (8 Anubis + mysql-standalone
+   with broken selector). The Anubis fix is now in place via shared-store
+   replicas=2; the `mysql-standalone` selector was already broken so it
+   matched 0 pods (and was deleted as cruft). Worth auditing the cluster
+   periodically for any new pattern of the same shape.
+
+4. **k8s-node1 containerd source drift** (Ubuntu archive's `containerd`
+   vs Docker's `containerd.io`) is benign but should be documented.
+   Audited during this session: not a blocker for kured because both
+   variants are in the Package-Blacklist and both are apt-held. The
+   version skew with master (1.6.22 vs 1.7.24/1.7.27) is what the
+   K8s version-upgrade Stage 3 "containerd bump" exists to fix.
+
+5. **CNPG drain handling at 2 replicas is fragile.** Switchover works
+   but the lone replica must be caught up; in practice this means
+   on a busy cluster, a primary-node drain could stall for tens of
+   seconds while CNPG promotes. 3 instances eliminates this. Worth
+   considering for every long-running multi-instance stateful workload.
+
+## Detection / Prevention Followups
+
+- [ ] `KuredRebootBacklog` Prometheus alert. Spec: `kured_reboot_required == 1 and (time() - timestamp(kured_reboot_required)) > 86400`.
+- [ ] Add a `hostFilePath` value to the kured Helm release for explicit declaration (current setup is correct but undocumented).
+- [ ] Audit periodically for new single-replica + `minAvailable=1` PDB patterns (could be a Kyverno warn policy).
+- [ ] Phase 4: clean up the InnoDB Cluster CR + remaining `mysql-cluster-pdb` once the bitnami legacy is fully decommissioned.
+
+## File pointers
+
+| What | Where | Commit |
+|---|---|---|
+| kured sentinel path fix | `infra/stacks/kured/main.tf` | c17d87e1 |
+| Anubis HA (module + 8 callers) | `infra/modules/kubernetes/anubis_instance/` + 8 `stacks/<app>/main.tf` | 6e920f96 |
+| kured drainTimeout + CNPG 3-replica | `infra/stacks/kured/main.tf` + `infra/stacks/dbaas/modules/dbaas/main.tf` | a726e963 |
+| K8s version-upgrade Job-chain (related context) | `infra/stacks/k8s-version-upgrade/` | 01bc16d5 (5/11) |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` | (updated 5/11) |
+| Runbook | `infra/docs/runbooks/k8s-version-upgrade.md` | (updated 5/11) |
+| Deprecated agent prompt (self-preemption history) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` | 01bc16d5 |
--- a/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
+++ b/docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md
@ -0,0 +1,160 @@
+# Post-Mortem: GPU Driver Crashloop after Ubuntu 26.04 Upgrade on k8s-node1
+
+**Date:** 2026-05-17
+**Author:** Viktor Barzin / Claude (incident response)
+**Severity:** SEV-3 (GPU workloads unavailable: frigate, immich-ml, llama-swap, ytdlp/yt-highlights all Pending; no impact to non-GPU services)
+**Beads:** `code-8vr0` (P1)
+**Status:** Blocked on upstream — NVIDIA has not published Ubuntu 26.04 driver images yet
+
+## Summary
+
+`nvidia-driver-daemonset-sg22g` on k8s-node1 went into CrashLoopBackOff
+with 76+ restarts. Root cause: k8s-node1 was upgraded to **Ubuntu 26.04
+LTS (Resolute Raccoon)** at some point, putting the running kernel at
+`7.0.0-15-generic`. The NVIDIA driver daemonset's installer container
+runs `apt-get install linux-headers-<kernel>` against Ubuntu 24.04's
+noble repositories (the container's base OS), which don't carry
+`linux-headers-7.0.0-15-generic`, so the build aborts with:
+
+    Could not resolve Linux kernel version
+
+Attempted fix (chart upgrade v25.10.1 → v26.3.1 with driver 580.105.08
+and `kernelModuleType: open`) succeeded at the chart level but produced
+a worse outcome: the v26.3.1 operator auto-detects the host OS via NFD
+and constructs the image tag `<version>-ubuntu26.04`, which 404s on
+pull. `skopeo list-tags docker://nvcr.io/nvidia/driver` confirms zero
+ubuntu26.04 tags exist (vs 779 ubuntu22.04 and 206 ubuntu24.04 tags).
+
+Rolled the chart back to v25.10.1 (pinned in TF) to restore the closest-
+to-working state pending an upstream fix or kernel rollback.
+
+## Impact
+
+- GPU resource `nvidia.com/gpu` = 0 on k8s-node1 (only GPU node)
+- All GPU-bound workloads Pending or 0/N Ready:
+  - `frigate/frigate`
+  - `immich/immich-machine-learning`
+  - `llama-cpp/llama-swap`
+  - `nvidia/nvidia-exporter`
+  - `ytdlp/yt-highlights`
+- Downstream alerts firing: `NvidiaExporterDown`, 5× Uptime Kuma monitors
+  (Frigate, Immich ML, nvidia-exporter, …), `GPUNodeUnschedulable` not
+  firing (node is schedulable, just no GPU advertised)
+- No data loss; no user-facing service degradation outside the GPU stack
+
+## Timeline (Europe/Sofia, UTC+3)
+
+- pre-incident — `apt-get dist-upgrade` (or `do-release-upgrade`) bumped
+  k8s-node1 from Ubuntu 24.04 → 26.04. Apt history.log doesn't capture
+  the upgrade (rotated by `do-release-upgrade`).
+- ~2026-05-11 — node rebooted into kernel `7.0.0-15-generic`. NFD
+  reports `system-os_release.VERSION_ID = 26.04`,
+  `kernel-version.full = 7.0.0-15-generic`.
+- 2026-05-17 04:00 (approx) — driver daemonset enters CrashLoopBackOff
+  on every kubelet restart cycle. Error: "Could not resolve Linux kernel
+  version".
+- 2026-05-17 13:35 — chart upgrade attempt v25.10.1 → v26.3.1, driver
+  570.195.03 → 580.105.08, `kernelModuleType: open`. Helm applies
+  cleanly but driver pod ImagePullBackOff on
+  `driver:580.105.08-ubuntu26.04`.
+- 2026-05-17 ~13:45 — skopeo confirms zero ubuntu26.04 tags on
+  nvcr.io/nvidia/driver. Decision: roll chart back, pin in TF, document
+  the gotcha, file the kernel rollback as the next step.
+
+## Root Causes
+
+1. **Host OS upgraded to Ubuntu 26.04** ahead of NVIDIA's driver image
+   support window. NVIDIA typically lags new Ubuntu LTS releases by
+   weeks-to-months on the driver-container front.
+2. **gpu-operator chart was not pinned** prior to today. The TF
+   `helm_release` had `version` commented out, so any apply could
+   re-resolve to the latest chart and follow its OS-auto-detection
+   logic. With v25.10.1, the operator fell back to ubuntu24.04 image
+   suffix (which pulls successfully but fails to compile against kernel
+   7.0). With v26.3.1, the operator picks the correct (per-NFD)
+   ubuntu26.04 suffix — which doesn't exist.
+3. **No alert for "GPU device count = 0 on a GPU node"** — the cluster
+   had 14+ hours of silent GPU outage before noticing. `NvidiaExporterDown`
+   fires only when the metrics exporter itself stops scraping, not when
+   the operator's driver pod is unhealthy.
+
+## What We Changed in This Session
+
+- `stacks/nvidia/modules/nvidia/main.tf` — pinned
+  `helm_release.nvidia-gpu-operator.version = "v25.10.1"` so future
+  applies don't surprise us with v26.3.1's stricter OS detection.
+- `stacks/nvidia/modules/nvidia/values.yaml` — comment block explaining
+  the situation; driver version stays at `570.195.03` as the last-known
+  config that produced a pullable image.
+- `docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md` —
+  this file.
+
+## What We Did NOT Do (Pending User Decision)
+
+- **Roll back the host kernel** on k8s-node1 from `7.0.0-15-generic`
+  to `6.8.0-117-generic`. The 6.8 kernel is still installed at
+  `/lib/modules/6.8.0-117-generic` and the matching headers at
+  `/usr/src/linux-headers-6.8.0-117-generic`, so GRUB can boot it and
+  the driver image's apt sources (Ubuntu 24.04 noble) carry
+  `linux-headers-6.8.0-117-generic`. This would require draining the
+  node, editing GRUB defaults, `apt-mark hold` to prevent future drift,
+  and rebooting — needs explicit user OK.
+- **Add a probe + alert** for `nvidia.com/gpu` resource count on the
+  GPU node. Should fire within 10 minutes of the operator failing to
+  publish the resource, regardless of which sub-pod failed.
+
+## Recovery Procedure (next time)
+
+### If the driver-installer fails with "Could not resolve Linux kernel version"
+
+1. Identify the running kernel: `uname -r` on the affected node.
+2. Check whether NVIDIA ships an image for that kernel/distro combo:
+
+       docker run --rm quay.io/skopeo/stable list-tags \
+           docker://nvcr.io/nvidia/driver \
+         | python3 -c "import json,sys; d=json.load(sys.stdin); \
+             print([t for t in d['Tags'] if '<distro>' in t][:5])"
+
+3. If yes, point the chart at the right version + ensure NFD reports
+   the matching OS.
+4. If no (and a kernel rollback is acceptable):
+   - `kubectl cordon <node>` then `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`
+   - `nsenter -t 1 -m -p -u sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-117-generic"/' /etc/default/grub`
+   - `nsenter -t 1 -m -p -u update-grub`
+   - `nsenter -t 1 -m -p -u apt-mark hold linux-image-6.8.0-117-generic linux-headers-6.8.0-117-generic linux-generic linux-image-generic linux-headers-generic`
+   - Reboot: `nsenter -t 1 -m -p -u systemctl reboot`
+   - After boot: `kubectl uncordon <node>` and wait for the GPU
+     daemonset to come Ready
+
+## Action Items
+
+- [x] Pin gpu-operator chart to v25.10.1 in TF
+- [x] Document situation in this post-mortem
+- [ ] Roll back k8s-node1 host kernel to 6.8.0-117-generic + apt-mark
+      hold (needs user authorization for node reboot)
+- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
+      labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
+      of 0 for >10m
+- [ ] Periodically re-check NVIDIA's NGC catalog for ubuntu26.04 driver
+      tags — file as a quarterly checkup once we see the first 26.04
+      tag, unpin the chart and revert this post-mortem's mitigation
+- [ ] Audit ALL host packages with `apt-mark hold` semantics. The
+      memory of the March 2026 outage says we disabled
+      `unattended-upgrades` — `do-release-upgrade` is a separate path
+      that should be gated too
+
+## Lessons
+
+- **Operator-style charts that auto-detect host OS can silently break
+  when the host fleet leapfrogs upstream image support.** Pin the chart
+  version + driver version, and treat upstream support gaps as a hard
+  blocker rather than a guaranteed-to-resolve race condition.
+- **Drain-and-revert host kernel is the right escape hatch when
+  upstream image lags.** Make sure the previous kernel and its headers
+  stay installed (don't aggressively purge old kernels in apt
+  autoremove).
+- **NFD labels are authoritative for the operator's image-tag
+  construction.** If you need to lie about OS version (e.g., to force a
+  24.04 image on a 26.04 host), edit the NFD label — but only as a last
+  resort; the chart upgrade made clear the operator will eventually
+  reconcile this.
--- a/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
+++ b/docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
@ -0,0 +1,133 @@
+# Post-Mortem: nfs-csi Keel-Triggered Upgrade Broke Master Node CSI
+
+**Date:** 2026-05-17
+**Author:** Viktor Barzin / Claude (incident response)
+**Severity:** SEV-3 (1 of 5 CSI node DaemonSet pods stuck CrashLoopBackOff; controller pair flapping)
+**Duration:** ~2 hours from first detection to all-green
+
+## Summary
+
+The Keel auto-update operator polled the `csi-driver-nfs` Helm chart and rolled
+`v4.13.1 → v4.13.2`. The new chart's controller Deployment scheduled both
+replicas onto `k8s-master` (no built-in control-plane exclusion). Both replicas
+used `hostNetwork: true` and tried to bind the same host ports
+(`19809` for `node-driver-registrar`, `29653` for `liveness-probe`), so one
+controller pod CrashLoopBackOff'd with `bind: address already in use`. The
+upgrade also left behind multiple orphan controller pods in containerd that
+kubelet could no longer reconcile — they held the host ports even after the
+helm rollback removed them from K8s state.
+
+The `csi-nfs-node` DaemonSet pod on master then could not start either: its
+own `node-driver-registrar` and `liveness-probe` containers tried to bind
+the same host ports and lost to the zombies.
+
+## Impact
+
+- 1× `csi-nfs-node` pod on `k8s-master` stuck CrashLoopBackOff (16+ restarts)
+- CSI plugin unregistered on master → no NFS volumes could be mounted on
+  master-hosted pods (calico-typha cert mount failed, etcd backup CronJob
+  failed)
+- Controller flap (2 replicas fighting) → intermittent
+  `csi-resizer`/`csi-snapshotter` failure for the whole cluster
+- Cascade: kured-sentinel, node-local-dns, prometheus-node-exporter,
+  csi-node-driver (Calico) all bounced on master while kubelet thrashed
+
+No data loss; no production-facing outages observed (CSI mounts on the four
+worker nodes kept working).
+
+## Timeline (Europe/Sofia, UTC+3)
+
+- ~07:46 — Keel polls forgejo + DockerHub manifests, sees a new digest under
+  the `csi-driver-nfs` `4.13.x` channel, triggers Helm upgrade
+- 07:46:16 — `helm upgrade csi-driver-nfs` runs; new controller Deployment
+  scheduled (no `affinity` block → both replicas land on `k8s-master`)
+- ~07:50 — Controller replicas fight for ports `19809`, `29653`; one stays in
+  CrashLoopBackOff
+- ~08:00 — User notices "CSI issue ... due to the upgrade"; investigation
+  begins
+- 08:15 — `helm rollback csi-driver-nfs` to revision 8 (v4.13.1) — controllers
+  on master deleted via K8s, but containerd retains them as live sandboxes
+- 08:30 — Live `podAntiAffinity` + `nodeAffinity: control-plane DoesNotExist`
+  added to the controller Deployment via patch (controllers now correctly
+  schedule on node1+node3)
+- 08:40 — `csi-nfs-node` master pod still CrashLoopBackOff; ports 19809/29653
+  held by orphan PIDs (livenessprobe PID 1816, csi-node-driver PID 1944,
+  plus 5× csi-provisioner from zombie controller pods)
+- 09:00 — Privileged pkill via `hostPID: true` pod failed
+  (`permission denied` from runc — containerd refused to signal init in the
+  zombie containers)
+- 09:03 — `nsenter -t 1 -m -p -u systemctl restart kubelet` on master cleared
+  the orphan containers via cgroup GC; ports freed
+- 09:04 — `csi-nfs-node` master pod reaches 3/3 Ready; cluster green
+- 09:09 — Terraform `apply`: pin `helm_release.version = "4.13.1"`, add
+  `controller.affinity` to values
+
+## Root Causes
+
+1. **`csi-driver-nfs` Helm chart in TF was unpinned.** The `helm_release` had
+   no `version = ...` field, so it floated to whatever the chart repo
+   advertised. Keel polled this and rolled forward.
+2. **Chart v4.13.2 dropped the implicit control-plane exclusion** that v4.13.1
+   shipped with. Without it, the K8s scheduler chose master for both
+   controller replicas.
+3. **Two controller replicas + hostNetwork = port conflict on the same node.**
+   The chart did not add `podAntiAffinity` between the replicas. Live state
+   has it now; TF now does too.
+4. **Helm rollback does not always clean containerd sandboxes.** When the
+   prior revision's pods are abandoned mid-flight (image-pull-pending, etc.),
+   containerd can keep multiple sandbox instances for the same pod-UID.
+   Kubelet GC is the only thing that reliably reaps these — restarting it
+   forces a reconciliation pass that drops orphans.
+
+## What We Fixed
+
+- **`stacks/nfs-csi/modules/nfs-csi/main.tf`** (this commit):
+  - `version = "4.13.1"` pin on the `helm_release` (defense in depth — namespace
+    is already excluded from Kyverno-Keel injection, but the chart could still
+    drift on a `terraform apply` without a pin)
+  - `controller.affinity` block with `podAntiAffinity` (different hosts for
+    replicas) and `nodeAffinity` (exclude `node-role.kubernetes.io/control-plane`)
+  - Inline comments explaining both decisions
+- **Kyverno keel-annotations**: `nfs-csi` was already in the namespace exclude
+  list (decision from authentik incident 2026-05-17). Verified still there
+  in `stacks/kyverno/modules/kyverno/keel-annotations.tf:91`.
+
+## Recovery Procedure (next time)
+
+If `csi-nfs-node` on a node CrashLoopBackOff with `bind: address already in use`:
+
+1. **Find which host ports are bound** — `lsof -i :19809`, `lsof -i :29653`
+   (from a privileged hostPID pod on the affected node).
+2. **Try `crictl rmp -f <pod-id>`** on zombie pods (those K8s no longer
+   tracks). Will fail with `unable to signal init: permission denied` if
+   the containers are sufficiently stuck.
+3. **Restart kubelet on the affected node** via `nsenter -t 1 -m -p -u
+   systemctl restart kubelet` (privileged hostPID pod). Kubelet's GC
+   reconciles containerd state and reaps the orphans.
+4. **Force-delete the DaemonSet pod** to clear the back-off
+   (`kubectl delete pod -n nfs-csi csi-nfs-node-XXXX --force --grace-period=0`).
+   DaemonSet recreates it; with the ports free, containers start cleanly.
+
+## Action Items
+
+- [x] Pin `csi-driver-nfs` chart version in TF
+- [x] Add `controller.affinity` to TF (podAntiAffinity + control-plane exclude)
+- [x] Document recovery procedure (this post-mortem)
+- [ ] Audit other unpinned `helm_release` blocks — every chart used in
+      Kyverno-excluded namespaces should still be pinned to prevent
+      `terraform apply` drift. (Filed as follow-up — not blocking.)
+- [ ] Consider adding a `kured` or daily script that detects orphan
+      containerd sandboxes whose pod-UID is unknown to the apiserver and
+      reaps them automatically. (Filed as follow-up — not blocking.)
+
+## Lessons
+
+- **Keel exclusion ≠ chart pin.** The namespace was already excluded from
+  Keel injection, but the helm_release was unpinned — so a `terraform apply`
+  alone could re-trigger the same break. Both layers needed locking down.
+- **`crictl rmp -f` is not always sufficient.** When containerd refuses to
+  signal init, kubelet restart is the next escalation step before SSH/reboot.
+- **The Keel rollout phase 2-6 design ASSUMED stateful operators were
+  excluded.** CSI was correctly excluded — but the chart version itself was
+  still a moving target via plain `terraform apply`. The exclude-list catches
+  Keel; the version pin catches everything else.
--- a/docs/runbooks/k8s-node-auto-upgrades.md
+++ b/docs/runbooks/k8s-node-auto-upgrades.md
@ -0,0 +1,207 @@
+# K8s Node Auto-Upgrades
+
+## Overview
+
+OS-level package upgrades on the 5 K8s VMs (master + 4 workers) are driven by `unattended-upgrades` and rebooted by `kured`, with multiple safety gates layered on top to prevent the failure mode that caused the March 2026 26h cluster outage.
+
+## Architecture
+
+```
+apt-daily.timer (random within window)
+  │ apt-get update
+  │
+  ▼
+apt-daily-upgrade.timer (random within window)
+  │ unattended-upgrades runs
+  │   - Allowed-Origins: -security, -updates, ESM
+  │   - Package-Blacklist: containerd*, runc, calico-*, cni-plugins-*, docker-ce
+  │   - apt-mark hold on kubelet, kubeadm, kubectl, containerd*, runc
+  │   - Automatic-Reboot=false (kured handles reboots)
+  │
+  ▼ if kernel/glibc/systemd updated
+/var/run/reboot-required appears on the host
+  │
+  ▼ (sentinel-gate DaemonSet polls every 5min)
+kured-sentinel-gate checks:
+  ├── 1. Host has /var/run/reboot-required
+  ├── 2. ALL nodes Ready
+  ├── 3. ALL calico-node pods Running
+  └── 4. NO node Ready-transition in last 24h (soak window)
+  │
+  ▼ all pass
+touch /var/run/gated-reboot-required
+  │
+  ▼ (kured polls every 1h within 02:00-06:00 London, any day of the week)
+kured checks Prometheus before draining:
+  │ http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/alerts
+  │ ANY firing alert (except ignore-list) blocks the drain
+  │ Ignore-list: ^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$
+  │
+  ▼ no blockers
+kured drains the node (priority-ordered, 310s budget)
+kured runs /bin/systemctl reboot
+  │
+  ▼ node returns
+kured uncordons + posts Slack notification (configuration.notifyUrl)
+  │
+  ▼ 24h cool-down begins (sentinel-gate Check 4)
+```
+
+## Components
+
+### unattended-upgrades (in-guest)
+- **Config**: `/etc/apt/apt.conf.d/52unattended-upgrades-k8s` + `/etc/apt/apt.conf.d/20auto-upgrades`
+- **Source of truth**: `infra/modules/create-template-vm/cloud_init.yaml` (lines for `is_k8s_template`)
+- **Day-2 push**: SSH-based — see "Restore / re-apply config" below
+
+### kured (Helm release)
+- **Stack**: `infra/stacks/kured/main.tf`
+- **Helm chart**: `kured-5.11.0` (image `ghcr.io/kubereboot/kured:1.21.0`)
+- **Window**: 02:00-06:00 Europe/London, every day of the week (was Mon-Fri until 2026-05-16), period=1h, concurrency=1
+- **Sentinel**: `/sentinel/gated-reboot-required` (created by sentinel-gate DaemonSet)
+- **Slack hook**: Vault `secret/kured` → `slack_kured_webhook`
+
+### kured-sentinel-gate (DaemonSet)
+- **Source**: `kubernetes_daemon_set_v1.kured_sentinel_gate` in `infra/stacks/kured/main.tf` (lines ~120-260)
+- **Image**: `bitnami/kubectl:latest`
+- **Loop period**: every 300s
+- **Gate logic**: 4 checks — see Architecture diagram
+
+### Upgrade Gates Prometheus alerts
+- **Source**: `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` group `Upgrade Gates`
+- **10 alerts**: KubeAPIServerDown, KubeStateMetricsDown, PrometheusRuleEvaluationFailing, PVCStuckPending, RecentNodeReboot, MysqlStandaloneDown, ClusterPodReadyRatioDropped, NodeMemoryPressure, NodeDiskPressure, KubeQuotaAlmostFull
+- **Effect**: kured `--prometheus-url` polls Prometheus before each drain — any non-ignored firing alert halts the rollout
+
+## Common Operations
+
+### Verify the system is healthy
+```bash
+# kured pods + sentinel-gate Running on all 5 nodes
+kubectl -n kured get pods
+
+# kured can reach Prometheus
+kubectl -n kured exec ds/kured -- /usr/bin/kured --help | grep prometheus
+
+# Upgrade Gates rules loaded + state
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
+  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
+
+# Per-node unattended-upgrades status
+for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  echo "=== $n ==="
+  ssh $n "systemctl is-active unattended-upgrades; apt list --upgradable 2>/dev/null | wc -l"
+done
+```
+
+### Halt rollout in an emergency
+```bash
+# Option 1: scale kured to 0 (most decisive)
+kubectl -n kured scale ds kured --replicas=0
+# When ready: kubectl -n kured scale ds kured --replicas=5
+
+# Option 2: silence the gate via Alertmanager (allows kured to retry once silence expires)
+# Use Alertmanager UI at https://prometheus.viktorbarzin.me/alertmanager/
+```
+
+### Force halt by adding a custom blocker alert
+- Add a PrometheusRule expression that's always-1 (e.g. `vector(1)`) to the `Upgrade Gates` group temporarily.
+- Apply, wait for sync (~120s), kured will block on the next poll.
+- Remove when ready.
+
+### Pause apt upgrades on a single node
+```bash
+ssh <node> sudo systemctl stop unattended-upgrades
+ssh <node> sudo systemctl disable unattended-upgrades
+# Re-enable when ready:
+ssh <node> sudo systemctl enable --now unattended-upgrades
+```
+
+### Restore / re-apply unattended-upgrades config to existing nodes
+Cloud-init only runs on first boot. To bring existing nodes into compliance with the IaC:
+
+```bash
+# Per node — installs uu, drops apt config, holds k8s/runtime packages, enables service
+for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+  ssh $n sudo bash -s <<'EOF'
+set -e
+systemctl unmask unattended-upgrades 2>/dev/null || true
+DEBIAN_FRONTEND=noninteractive apt-get install -y unattended-upgrades update-notifier-common
+cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'CONF'
+Unattended-Upgrade::Allowed-Origins {
+    "${distro_id}:${distro_codename}";
+    "${distro_id}:${distro_codename}-security";
+    "${distro_id}:${distro_codename}-updates";
+    "${distro_id}ESMApps:${distro_codename}-apps-security";
+    "${distro_id}ESM:${distro_codename}-infra-security";
+};
+Unattended-Upgrade::Package-Blacklist {
+    "^containerd(\.io)?$";
+    "^runc$";
+    "^cri-tools$";
+    "^kubernetes-cni$";
+    "^calico-.*";
+    "^cni-plugins-.*";
+    "^docker-ce$";
+};
+Unattended-Upgrade::DevRelease "false";
+Unattended-Upgrade::Automatic-Reboot "false";
+CONF
+cat > /etc/apt/apt.conf.d/20auto-upgrades <<'CONF'
+APT::Periodic::Update-Package-Lists "1";
+APT::Periodic::Unattended-Upgrade "1";
+CONF
+apt-mark hold kubelet kubeadm kubectl
+apt-mark hold containerd containerd.io runc 2>/dev/null || true
+systemctl enable --now unattended-upgrades
+EOF
+done
+```
+
+### Roll back a bad apt upgrade
+1. Identify the package(s) that broke things from `/var/log/apt/history.log` on the affected node.
+2. Hold them: `sudo apt-mark hold <pkg>`.
+3. Downgrade: `sudo apt-get install -y --allow-downgrades <pkg>=<previous-version>` (find versions via `apt-cache madison <pkg>`).
+4. Reboot the node manually if the package needs it.
+5. Add the package to the `Unattended-Upgrade::Package-Blacklist` in `cloud_init.yaml` AND drop the holds via the SSH push above so future apt runs skip it.
+
+### kured halted — investigate which alert is blocking
+```bash
+# Show kured logs — it logs "blocking alerts" when halting
+kubectl -n kured logs ds/kured --tail=100 | grep -i alert
+
+# List currently firing alerts (any of these blocks kured):
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/alerts' | \
+  jq -r '.data.alerts[] | select(.state == "firing") | "  \(.labels.alertname) (\(.labels.severity // "info"))"' | sort -u
+```
+
+The alert is either:
+- One of the 10 `Upgrade Gates` (genuine cluster-health issue — fix it),
+- A pre-existing alert (any of the ~211 in the library — investigate),
+- Or `RecentNodeReboot` — expected for 24h after each node reboot. This is the soak window.
+
+### Verify the 24h soak is enforcing
+```bash
+# Sentinel-gate logs Check 4 outcome
+kubectl -n kured logs ds/kured-sentinel-gate --tail=20 | grep -E "soak|cool-down|24"
+
+# kured won't drain another node until the most recent Ready-transition is >24h ago.
+# If you need to override (e.g. emergency security patch), shorten the cool-down by
+# editing infra/stacks/kured/main.tf (sentinel script: 86400 → smaller) and applying.
+```
+
+## Past Incidents
+
+- **2026-03-16 SEV-1**: Kured + Containerd Cascade Outage (26h). See `docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html`. Root cause: unattended-upgrades pushed a kernel update → kured rebooted nodes → containerd's overlayfs snapshotter corrupted → image pulls failed → calico broke → cascading outage. Remediations now baked into this system: 24h soak, Prometheus halt-on-alert, Package-Blacklist for runtime components, sentinel-gate health checks.
+
+## File Pointers
+
+| What | Where |
+|------|-------|
+| kured Helm + sentinel-gate | `infra/stacks/kured/main.tf` |
+| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
+| Cloud-init for new nodes | `infra/modules/create-template-vm/cloud_init.yaml` |
+| Slack webhook | Vault `secret/kured` → `slack_kured_webhook` |
+| Post-mortem | `infra/docs/post-mortems/2026-03-16-kured-containerd-cascade-outage.html` |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (OS section) |
--- a/docs/runbooks/k8s-version-upgrade.md
+++ b/docs/runbooks/k8s-version-upgrade.md
@ -0,0 +1,323 @@
+# K8s Version Upgrade Pipeline
+
+## Overview
+
+Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s
+VMs are upgraded automatically by a weekly detection CronJob that seeds a
+chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
+drain target** — so no pod in the chain can preempt itself.
+
+The chain (Sun 12:00 UTC weekly):
+
+```
+detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job
+```
+
+This is **independent** of the OS-side `unattended-upgrades + kured`
+pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts.
+Schedules can overlap (kured runs daily 02:00-06:00 London; detection
+here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the
+Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates
+group blocks the version-upgrade preflight, so the chain self-defers
+to the next Sunday rather than rolling on top of a half-fresh node.
+
+## Architecture
+
+```
+k8s-version-check CronJob   (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job)
+  │ kubectl get nodes  → running version
+  │ ssh master 'apt-cache madison kubeadm'  → latest patch (within current minor)
+  │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release  → next minor available?
+  │ push k8s_upgrade_available{kind,running,target} → Pushgateway
+  │
+  ▼ if a target is detected
+envsubst on /template/job-template.yaml  | kubectl apply -f -
+  │ creates k8s-upgrade-preflight-<target_version>
+  ▼
+
+Job 0 — preflight       (pinned: k8s-node1)
+  ├── All nodes Ready + no Mem/Disk pressure
+  ├── halt-on-alert (kured-style ignore-list)
+  ├── 24h-quiet baseline (no Ready transitions <24h ago)
+  ├── kubeadm upgrade plan matches target
+  ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
+  ├── Trigger backup-etcd Job, wait, verify snapshot byte count
+  ├── SSH master: containerd skew fix (if master < workers)
+  ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor)
+  └── spawn_next → k8s-upgrade-master-<target_version>
+  ▼
+
+Job 1 — master upgrade  (pinned: k8s-node1)
+  ├── halt-on-alert recheck (no firing alerts)
+  ├── drain k8s-master (predrain_unstick deletes PDB-blocked pods)
+  ├── ssh wizard@k8s-master 'bash -s' < /scripts/update_k8s.sh -- --role master --release X.Y.Z
+  ├── kubectl uncordon k8s-master; wait Ready + version match
+  ├── verify control-plane pods Running
+  ├── halt-on-alert recheck (allows RecentNodeReboot)
+  └── spawn_next → k8s-upgrade-worker-<v>-k8s-node4
+  ▼
+
+Job 2 — worker k8s-node4 (pinned: k8s-node1)
+Job 3 — worker k8s-node3 (pinned: k8s-node1)
+Job 4 — worker k8s-node2 (pinned: k8s-node1)
+  (identical pattern: halt-on-alert wait 30m → drain → ssh script → uncordon → 10-min soak → spawn_next)
+  ▼
+
+Job 5 — worker k8s-node1 (pinned: k8s-master + control-plane toleration)
+  └── spawn_next → k8s-upgrade-postflight-<target_version>
+  ▼
+
+Job 6 — postflight       (no pinning)
+  ├── Verify all 5 nodes at target version
+  ├── Verify no firing Upgrade Gates alerts
+  ├── Compute pod-ready ratio (should be ≥ 0.9)
+  ├── Clear k8s-upgrade-* annotations on namespace
+  ├── Push k8s_upgrade_in_flight=0, k8s_upgrade_snapshot_taken=0, k8s_upgrade_started_timestamp=0
+  └── Slack: ✅ K8s upgrade complete
+```
+
+**Pin choices summarised:**
+- k8s-node1 hosts every Job that drains master or another worker. k8s-node1
+  itself is upgraded **last**.
+- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a
+  toleration for `node-role.kubernetes.io/control-plane:NoSchedule`.
+- If anyone reorders the worker sequence, the pin for Job 5 needs to track
+  whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh`
+  → the `case "${PHASE}:${TARGET_NODE:-}"` block.
+
+## Components
+
+### Shared resources (one-time, Terraform-managed)
+
+| Resource | Purpose |
+|---|---|
+| **ConfigMap `k8s-upgrade-scripts`** | Mounts `/scripts/upgrade-step.sh` (universal phase body, dispatches on `$PHASE`) and `/scripts/update_k8s.sh` (per-node kubeadm/kubelet/kubectl upgrade body — same script the old manual loop used) in every Job pod. |
+| **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. |
+| **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. |
+| **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. |
+| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. |
+
+### Pushgateway metrics
+
+Pushed by upgrade-step.sh during phase execution; observed by the
+`Upgrade Gates` alert group in `stacks/monitoring/.../prometheus_chart_values.tpl`:
+
+| Metric | Pushed by | Cleared by |
+|---|---|---|
+| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
+| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
+| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
+| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
+| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
+
+### Upgrade Gates alerts (`Upgrade Gates` group in prometheus_chart_values.tpl)
+
+- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
+- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
+- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
+- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
+
+### Vault secrets
+
+- `secret/k8s-upgrade/ssh_key` — ed25519 PRIVATE key, used by Jobs to SSH `wizard@<node>`
+- `secret/k8s-upgrade/ssh_key_pub` — matching PUBLIC key, deployed to nodes' `~/.ssh/authorized_keys`
+- `secret/k8s-upgrade/slack_webhook` — Slack incoming-webhook URL
+
+Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` namespace. The previous `api_bearer_token` entry is GONE — the chain does not POST to `claude-agent-service`.
+
+## Common Operations
+
+### Verify the pipeline is healthy
+```bash
+# CronJob present + not suspended
+kubectl -n k8s-upgrade get cronjob k8s-version-check
+
+# Latest detection run output
+kubectl -n k8s-upgrade get jobs -l app=k8s-version-upgrade
+kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200
+
+# Chain Jobs from the last run (retained 7 days via ttlSecondsAfterFinished)
+kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
+
+# Pushgateway — running detection metric
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://prometheus-prometheus-pushgateway.monitoring:9091/metrics' | \
+  grep -E '^(k8s_upgrade_(available|in_flight|started_timestamp|snapshot_taken)|k8s_version_check_last_run_timestamp)'
+
+# Upgrade Gates rules loaded
+kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+  wget -q -O- 'http://localhost:9090/api/v1/rules' | \
+  jq -r '.data.groups[] | select(.name == "Upgrade Gates") | .rules[] | "  \(.name): \(.state)"'
+```
+
+### Manually trigger detection (no upgrade)
+Use `detection_dry_run=true` to short-circuit before spawning Job 0:
+
+```bash
+# Toggle var in TF, apply, and trigger
+# (in stacks/k8s-version-upgrade/main.tf)
+#   variable "detection_dry_run" { default = true }
+# scripts/tg apply
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check version-check-test
+kubectl -n k8s-upgrade logs -l job-name=version-check-test -f
+# When done, flip back to false.
+```
+
+### Manually trigger the chain (skip detection)
+Useful for testing or to force a specific target. Render Job 0 directly:
+
+```bash
+TARGET=1.34.7
+KIND=patch
+IMAGE=$(kubectl -n k8s-upgrade get cronjob k8s-version-check \
+  -o jsonpath='{.spec.jobTemplate.spec.template.spec.containers[0].image}')
+
+cat <<EOF | envsubst | kubectl apply -f -
+$(kubectl -n k8s-upgrade get cm k8s-upgrade-job-template -o jsonpath='{.data.job-template\.yaml}')
+EOF
+# Note: export JOB_NAME, PHASE_NEXT, etc. first — see the CronJob's command for
+# the full env block. Easier: just trigger detection with the right inputs.
+```
+
+### Kill a stuck Job (chain halted mid-flight)
+The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled`
+fires after 90 min. Recovery:
+
+```bash
+# 1. Identify the failed Job
+kubectl -n k8s-upgrade get jobs -l app=k8s-upgrade-chain
+kubectl -n k8s-upgrade describe job/<failed-job-name> | tail -50
+kubectl -n k8s-upgrade logs job/<failed-job-name>
+
+# 2. Diagnose. Common causes:
+#    - drain stuck on PDB-violating pod (predrain_unstick should handle this;
+#      but a brand-new PDB pattern could escape it — manually delete the pod)
+#    - SSH from Job pod failing (node restarted? known_hosts mismatch?)
+#    - kubeadm upgrade failed on a node (check journalctl + apt history on that node)
+
+# 3. Fix the root cause first.
+
+# 4. Delete the failed Job + re-spawn it. Naming is deterministic so
+#    `kubectl apply` of the same name reconciles to a single Job.
+kubectl -n k8s-upgrade delete job/<failed-job-name>
+
+# 5. Manually render + apply the same Job. Pull the template + spec from the
+#    next-Job-creation block in upgrade-step.sh — easiest is to copy from a
+#    sibling Job's YAML:
+kubectl -n k8s-upgrade get job/<sibling-job-name> -o yaml \
+  | yq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.managedFields, .status)' \
+  | yq '.metadata.name = "<failed-job-name>"' \
+  | yq '.spec.template.spec.containers[0].env[] | select(.name=="PHASE") .value = "<right-phase>"' \
+  | kubectl apply -f -
+
+# The chain will continue from there. The next-Job-creation step in upgrade-step.sh
+# is idempotent (deterministic name) so re-running won't duplicate downstream.
+```
+
+### Skip a phase (advanced; use sparingly)
+If you've already done the work for a phase manually and want the chain to
+jump past it, manually create the NEXT phase's Job with the deterministic
+name. The previous phase's spawn-next will see the Job already exists and
+short-circuit. Example: master already on target; jump straight to worker:
+
+```bash
+TARGET=1.34.7
+TGT_LBL=${TARGET//./-}
+# (compose Job from upgrade-step.sh spawn_next code, name=k8s-upgrade-worker-$TGT_LBL-k8s-node4, run on k8s-node1)
+```
+
+### Halt the pipeline in an emergency
+
+```bash
+# Option 1: suspend the detection CronJob (won't stop an in-flight chain)
+kubectl -n k8s-upgrade patch cronjob k8s-version-check \
+  -p '{"spec":{"suspend":true}}' --type=merge
+# Re-enable: -p '{"spec":{"suspend":false}}'
+
+# Option 2: delete all in-flight chain Jobs
+kubectl -n k8s-upgrade delete jobs -l app=k8s-upgrade-chain
+# This leaves the in-flight annotation + Pushgateway gauge intact —
+# K8sUpgradeStalled will fire to surface the halt.
+
+# Option 3: force a blocker alert (same regex kured uses)
+# — see k8s-node-auto-upgrades.md "Force halt by adding a custom blocker alert"
+```
+
+### Clear orphaned in-flight state
+After deciding NOT to retry a halted chain:
+
+```bash
+kubectl annotate ns k8s-upgrade \
+  viktorbarzin.me/k8s-upgrade-in-flight- \
+  viktorbarzin.me/k8s-upgrade-target- \
+  viktorbarzin.me/k8s-upgrade-snapshot-path-
+
+# Reset Pushgateway gauges so K8sUpgradeStalled / EtcdPreUpgradeSnapshotMissing clear:
+kubectl -n monitoring port-forward svc/prometheus-prometheus-pushgateway 9091:9091 &
+printf '# TYPE k8s_upgrade_in_flight gauge\nk8s_upgrade_in_flight 0\n# TYPE k8s_upgrade_snapshot_taken gauge\nk8s_upgrade_snapshot_taken 0\n# TYPE k8s_upgrade_started_timestamp gauge\nk8s_upgrade_started_timestamp 0\n' \
+  | curl --data-binary @- http://localhost:9091/metrics/job/k8s-version-upgrade
+kill %1
+```
+
+### Rollback paths
+`kubeadm` does **not** support in-place downgrade. If a run fails:
+
+#### Master broke during/after kubeadm upgrade
+1. Identify the etcd snapshot: `kubectl get ns k8s-upgrade -o jsonpath='{.metadata.annotations.viktorbarzin\.me/k8s-upgrade-snapshot-path}'`
+2. Restore etcd per `infra/docs/runbooks/restore-etcd.md`.
+3. Manually downgrade master `kubeadm`/`kubelet`/`kubectl` to the pre-upgrade version. Find versions in `/var/log/apt/history.log` on the node:
+   ```bash
+   ssh wizard@k8s-master 'sudo cat /var/log/apt/history.log | tail -40'
+   # Pre-upgrade versions are in the most recent "Commandline: apt-get install"
+   sudo apt-mark unhold kubeadm kubelet kubectl
+   sudo apt-get install --allow-downgrades -y \
+     kubeadm=<OLD>-1.1 kubelet=<OLD>-1.1 kubectl=<OLD>-1.1
+   sudo apt-mark hold kubeadm kubelet kubectl
+   sudo systemctl daemon-reload && sudo systemctl restart kubelet
+   ```
+
+#### Worker broke
+1. `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force --grace-period=300`
+2. Downgrade apt packages on that node only (see above)
+3. `kubectl uncordon <node>`
+4. The cluster continues running on the master + remaining workers throughout
+
+### One-shot SSH key rotation
+1. Generate new keypair: `ssh-keygen -t ed25519 -f /tmp/k8s-upgrade -N ""`
+2. Update Vault:
+   ```bash
+   vault kv patch secret/k8s-upgrade \
+     ssh_key=@/tmp/k8s-upgrade \
+     ssh_key_pub=@/tmp/k8s-upgrade.pub
+   ```
+3. Push the new pubkey to every node:
+   ```bash
+   for n in k8s-master k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
+     ssh wizard@$n 'sed -i "/k8s-upgrade-key$/d" ~/.ssh/authorized_keys'
+     ssh wizard@$n 'echo "$(cat /tmp/k8s-upgrade.pub) k8s-upgrade-key" >> ~/.ssh/authorized_keys'
+   done
+   ```
+4. ESO refreshes within 15 min — or force: `kubectl -n k8s-upgrade annotate externalsecret k8s-upgrade-creds force-sync=$(date +%s) --overwrite`
+
+## Past Incidents
+
+### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
+- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
+- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
+- The bash process died after the drain but before the SSH-pipe to install kubeadm on node4.
+- Node4 was left cordoned; cluster stuck at master v1.34.7, workers v1.34.2 until manual recovery.
+- **Mitigation**: rewrote the pipeline as a chain of Jobs, each `nodeSelector`-pinned to a non-target node. New `predrain_unstick` step deletes PDB-blocked single-replica pods (Anubis pattern) before drain so they don't loop forever. Added `K8sUpgradeStalled` alert (in-flight + started_timestamp > 90 min).
+
+## File Pointers
+
+| What | Where |
+|------|-------|
+| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
+| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
+| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
+| Per-node upgrade script | `infra/scripts/update_k8s.sh` |
+| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |
+| Vault secrets | `secret/k8s-upgrade/{ssh_key, ssh_key_pub, slack_webhook}` |
+| Architecture doc | `infra/docs/architecture/automated-upgrades.md` (K8s Version Upgrades section) |
+| Related (OS reboots) | `infra/docs/runbooks/k8s-node-auto-upgrades.md` |
+| Deprecated agent prompt (reference) | `infra/.claude/agents/k8s-version-upgrade.deprecated.md` |
--- a/docs/runbooks/kms-public-exposure.md
+++ b/docs/runbooks/kms-public-exposure.md
@ -9,15 +9,36 @@ how to tune the rate limit, how to revoke if abused.

 ## Architecture

- **K8s service**: `windows-kms` in namespace `kms`, MetalLB shared LB IP
-  `10.0.20.200:1688`. ETP=Cluster, so client IPs in vlmcsd logs are SNAT'd
-  k8s node IPs (not real-world client IPs). Trade-off accepted —
-  preserving real client IPs would require a dedicated MetalLB IP with
-  ETP=Local or a PROXY-protocol bounce; vlmcsd doesn't speak PROXY-v2.
- **pfSense WAN forward**: `WAN TCP/1688 → k8s_shared_lb:1688`
-  (alias = `10.0.20.200`). Description: `KMS public — kms.viktorbarzin.me`.
- **Filter rule** on the WAN interface, TCP/1688, with state-table
-  per-source caps:
+- **K8s service**: `windows-kms` in namespace `kms`, MetalLB **dedicated**
+  LB IP `10.0.20.202:1688`. ETP=Local, so vlmcsd sees real WAN client IPs
+  in its log (pfSense WAN forwards do DNAT-only, no SNAT; ETP=Local skips
+  the kube-proxy SNAT too). Same pattern mailserver used pre-2026-04-19.
+  Sharing `10.0.20.200` isn't an option — all 10 services there are
+  ETP=Cluster and MetalLB requires a single ETP per shared IP.
+- **Native DNS auto-discovery for LAN clients**: any Windows client with
+  DNS suffix `viktorbarzin.lan` activates with zero config — Windows
+  queries `_vlmcs._tcp.viktorbarzin.lan` SRV by default, the SRV target
+  resolves to `vlmcs.viktorbarzin.lan` → `10.0.20.202`, and `slmgr /ato`
+  succeeds. Records:
+  - `_vlmcs._tcp.viktorbarzin.lan` SRV 0 0 1688 vlmcs.viktorbarzin.lan
+  - `vlmcs.viktorbarzin.lan` A `10.0.20.202`
+  - `kms.viktorbarzin.lan` A `10.0.20.200` (Traefik — for the user-facing
+    website at `https://kms.viktorbarzin.lan/`; **not** the KMS server)
+  Manual override (e.g., for clients without the suffix or for clients
+  on the public internet): `slmgr /skms kms.viktorbarzin.me:1688` (WAN
+  path via pfSense forward) or `slmgr /skms 10.0.20.202:1688` (direct).
+  To revert a manually-overridden client back to auto-discovery:
+  `slmgr /ckms`.
+- **Pod fluidity**: deployment has `replicas=1` (notifier dedup state is
+  per-pod) with no node affinity. TCP readiness/liveness probes on 1688
+  gate Pod Ready on the listener actually being up, so MetalLB only
+  advertises `10.0.20.202` from a node where vlmcsd is serving.
+- **pfSense WAN forward**: `WAN TCP/1688 → k8s_kms_lb:1688`
+  (alias = `10.0.20.202`, dedicated to KMS). Description: `KMS public —
+  kms.viktorbarzin.me`. Other forwards using `k8s_shared_lb` (WireGuard,
+  HTTPS, shadowsocks, smtps, etc.) are unaffected.
+- **Filter rule** on the WAN interface, TCP/1688 destination
+  `<k8s_kms_lb>`, with state-table per-source caps:
  - `max-src-conn 50` — concurrent connections per source IP
  - `max-src-conn-rate 10/60` — 10 new connections per 60 seconds per
    source
@ -26,6 +47,13 @@ how to tune the rate limit, how to revoke if abused.
    flushed. (`virusprot` is the only table pfSense's filter generator
    targets for `overload`; see `/etc/inc/filter.inc`. Don't try to point
    it at a custom table — the schema doesn't expose that knob.)
+- **Probe filter in slack-notifier**: a bare TCP open/close (no
+  Application/Activation block from vlmcsd) is treated as a probe — Uptime
+  Kuma's port-type monitor on `windows-kms.kms.svc:1688` and the kubelet
+  readiness/liveness probes both hit this path. Probes increment
+  `kms_connection_probes_total{source}` (`source` ∈ `internal_pod`,
+  `cluster_node`, `external`) and log to stdout, but never post to Slack.
+  Real activations still post.

 ## Where the logs are

@ -39,8 +67,11 @@ kubectl logs -n kms -l app=kms-service -c windows-kms --tail=50 -f
 kubectl logs -n kms -l app=kms-service -c windows-kms | grep "Incoming KMS request"
 ```

-Source IPs in this log are the SNAT'd node IPs because the LB Service uses
-ETP=Cluster on a shared MetalLB IP. Don't expect real WAN client IPs here.
+Source IPs from the WAN are real client IPs (pfSense DNAT-only + ETP=Local
+preserve them through the chain). LAN clients hitting the LB IP directly
+appear as their own IP. Pod-source probes (Uptime Kuma) appear as a Calico
+pod IP in `10.10.0.0/16`. Kubelet readiness/liveness probes appear as the
+hosting node IP in `10.0.20.0/24`.

 ### Slack notifier (kms namespace, k8s)

@ -53,6 +84,17 @@ also increment the Prometheus counter `kms_activations_total{product,status}`
 exposed on the same pod at `:9101/metrics` (scraped by the cluster-wide
 `kubernetes-pods` job; query via Prometheus or Grafana directly).

+Probe-only TCP connections (open+close, no KMS RPC) are silently filtered
+out of Slack and counted in `kms_connection_probes_total{source}`. Useful
+queries:
+```promql
+# Probe rate by source
+rate(kms_connection_probes_total[5m])
+# Probes from the public WAN (a non-zero rate here means real port-scans
+# are reaching us, not just internal monitoring)
+rate(kms_connection_probes_total{source="external"}[5m])
+```
+
 ### pfSense — virusprot table and filter hits

 ```bash
@ -93,18 +135,19 @@ The `overload` table entry survives pf reloads. Running
 If the activation surface needs to come down (abuse, legal, audit):

 1. **pfSense web UI** → `Firewall → NAT → Port Forward` → find
-   `WAN TCP/1688 → k8s_shared_lb` → **delete** (or disable). Apply.
+   `WAN TCP/1688 → k8s_kms_lb` → **delete** (or disable). Apply.
 2. **pfSense web UI** → `Firewall → Rules → WAN` → find
   `KMS public — kms.viktorbarzin.me` → **delete** (or disable). Apply.
 3. Verify externally: from a phone tether, `nc -zw3 kms.viktorbarzin.me 1688`
   should now fail.

 The k8s service stays reachable on the LAN
-(`10.0.20.200:1688` and the internal `kms.viktorbarzin.lan` ingress for
-the webpage) — only the WAN port-forward is removed.
+(`10.0.20.202:1688` directly, and the website at `kms.viktorbarzin.lan`
+via Traefik on `10.0.20.200:443`) — only the WAN port-forward is removed.

-To put it back, recreate the NAT rule (target alias `k8s_shared_lb`,
-port `1688`) and the filter rule with the same per-source caps.
+To put it back, recreate the NAT rule (target alias `k8s_kms_lb`,
+port `1688`) and the filter rule with the same per-source caps. The alias
+itself is independent of any forward and persists across delete/restore.

 ## Related

--- a/docs/runbooks/restore-mysql.md
+++ b/docs/runbooks/restore-mysql.md
@ -1,166 +1,256 @@
-# Restore MySQL (InnoDB Cluster)
+# Restore MySQL (Standalone)

-Last updated: 2026-04-06
+Last updated: 2026-05-18 (after the 8.4.9 DD-upgrade disaster recovery)
+
+Applies to the `mysql-standalone` StatefulSet in the `dbaas` namespace
+(raw `kubernetes_stateful_set_v1`, migrated from InnoDB Cluster on
+2026-04-16). The historic InnoDB-Cluster recovery flow is gone.

 ## Prerequisites
- `kubectl` access to the cluster
- MySQL root password (from `cluster-secret` in `dbaas` namespace, key `ROOT_PASSWORD`)
- Backup dump available on NFS at `/mnt/main/mysql-backup/`
+- `kubectl` against the cluster
+- Root password: `kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d`
+- A backup dump on NFS at `/srv/nfs/mysql-backup/` (exported via
+  `dbaas-mysql-backup-host` PVC inside the cluster)

-## Backup Location
- NFS: `/mnt/main/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz`
- Mirrored to sda: `/mnt/backup/nfs-mirror/mysql-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/mysql-backup/`
- Retention: 14 days (on NFS), latest only (on sda), unlimited (on Synology)
- Size: ~11MB per dump
+## Backup Locations

-## Restore Procedure
+| Location | Purpose | Retention |
+|---|---|---|
+| `/srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz` | Full daily dump (CronJob `mysql-backup`, daily 00:30 UTC) | 14 days |
+| `/srv/nfs/mysql-backup/per-db/<dbname>/dump_*.sql.gz` | Per-DB dumps (CronJob `mysql-backup-per-db`, daily 00:45 UTC) | 14 days |
+| Synology `Backup/Viki/nfs/mysql-backup/` | Offsite mirror via inotify-tracked rsync | unlimited |
+
+Latest full dump is ~230MB compressed (~3GB uncompressed). Restore
+of a full dump into a fresh MySQL pod takes ~3 minutes.
+
+## Scenario A — Single database restored alongside the others
+
+When one DB is corrupted but MySQL is otherwise fine.

-### 1. Identify the backup to restore
 ```bash
-# List available backups
-kubectl run mysql-ls --rm -it --image=mysql \
-  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-ls","image":"mysql","volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["ls","-lt","/backup/"]}]}}' \
-  -n dbaas
+ROOT_PWD=$(kubectl -n dbaas get secret cluster-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+
+# List per-db dumps for the affected database
+kubectl -n dbaas exec mysql-standalone-0 -- ls -lt /backup/per-db/<dbname>/
+
+# Pipe a chosen dump into MySQL (REPLACE existing data in <dbname>):
+kubectl -n dbaas exec -i mysql-standalone-0 -- \
+    sh -c "zcat /backup/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -uroot -p\"$ROOT_PWD\" <dbname>"
+
+# Restart consumers
+kubectl -n <ns> rollout restart deployment
 ```

-### 2. Get the root password
+## Scenario B — Full disaster: data dictionary corrupt or PVC unsalvageable
+
+This is the path executed on 2026-05-18 when a Keel-driven bump to
+`mysql:8.4.9` left the data dictionary half-upgraded and 8.4.8 refused
+to start (`Server upgrade of version 80408 is still pending` —
+MY-013379). Wipes the PVC and rehydrates from the daily dump.
+
+**Estimated downtime: 25 minutes.** Plan accordingly — Forgejo +
+registry + every MySQL app go offline during this.
+
+### B.1 Stop the failing MySQL pod
+
 ```bash
-kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d
+kubectl -n dbaas scale statefulset mysql-standalone --replicas=0
 ```

-### 3. Option A: Restore via port-forward (from outside cluster)
+### B.2 Verify the dump you intend to restore is healthy
+
 ```bash
-# Port-forward to MySQL primary
-kubectl port-forward svc/mysql -n dbaas 3307:3306 &
-
-# Get root password
-ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
-
-# Restore (decompress and pipe to mysql, use --host to avoid unix socket, specify non-default port)
-zcat /path/to/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307
+ssh root@192.168.1.127 'ls -la /srv/nfs/mysql-backup/dump_*.sql.gz | tail -5'
+# Sanity-check the header
+ssh root@192.168.1.127 'zcat /srv/nfs/mysql-backup/dump_YYYY_MM_DD_HH_MM.sql.gz | head -20'
+# Should show "MySQL dump 10.13 ... Server version 8.4.X"
 ```

-### 3. Option B: Restore via in-cluster pod
-```bash
-ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
+### B.3 Pin MySQL image in Terraform (if it auto-bumped)

-kubectl run mysql-restore --rm -it --image=mysql \
-  --overrides='{"spec":{"volumes":[{"name":"backup","persistentVolumeClaim":{"claimName":"dbaas-mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}]}}' \
-  -n dbaas
+If the upgrade was triggered by a Keel bump on a floating tag
+(`mysql:8.4`), edit `stacks/dbaas/modules/dbaas/main.tf` to pin to a
+known-good exact version (`mysql:8.4.8`). Commit but don't apply yet.
+
+### B.4 Wipe the corrupted PVC
+
+The PV reclaim policy defaults to **Retain** on
+`proxmox-lvm-encrypted` — `kubectl delete pvc` alone leaves the PV
+attached to the (corrupted) disk. Flip to `Delete` first so the CSI
+driver actually cleans up the underlying LV.
+
+```bash
+PV=$(kubectl -n dbaas get pvc data-mysql-standalone-0 -o jsonpath='{.spec.volumeName}')
+kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
+kubectl -n dbaas delete pvc data-mysql-standalone-0
 ```

-### 4. Verify restoration
+The PV transitions to `Released` then gets cleaned up by the CSI
+controller; confirm with `kubectl get pv | grep <PV>` (eventually
+disappears).
+
+### B.5 Scale MySQL back up via Terraform
+
 ```bash
-# Check databases exist
-mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SHOW DATABASES;"
+cd stacks/dbaas && /home/wizard/code/infra/scripts/tg apply
+```

-# Check InnoDB Cluster status
-mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT * FROM performance_schema.replication_group_members;"
+This recreates the PVC fresh (5Gi initial; pvc-autoresizer grows it
+on demand) and starts a brand-new MySQL pod. The pod initializes an
+empty datadir using `MYSQL_ROOT_PASSWORD` from the `cluster-secret`
+K8s Secret — ~30s to ready.

-# Check table counts for key databases
-for db in speedtest wrongmove codimd nextcloud shlink grafana technitium; do
-  echo "=== $db ==="
-  mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='$db' ORDER BY TABLE_ROWS DESC LIMIT 5;"
+### B.6 Restore the full dump via a one-shot Job
+
+```bash
+cat <<'YAML' | kubectl apply -f -
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: mysql-restore-$(date +%Y-%m-%d)
+  namespace: dbaas
+spec:
+  ttlSecondsAfterFinished: 3600
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: restore
+        image: mysql:8.4.8
+        command: ["bash","-c"]
+        args:
+        - |
+          set -euo pipefail
+          gunzip -c /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | \
+            mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD"
+          mysql -h mysql.dbaas.svc.cluster.local -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW DATABASES;'
+        env:
+        - name: MYSQL_ROOT_PASSWORD
+          valueFrom:
+            secretKeyRef: { name: cluster-secret, key: ROOT_PASSWORD }
+        volumeMounts:
+        - { name: backup, mountPath: /backup, readOnly: true }
+      volumes:
+      - name: backup
+        persistentVolumeClaim: { claimName: dbaas-mysql-backup-host, readOnly: true }
+YAML
+```
+
+Watch progress: `kubectl -n dbaas logs -f job/<name>`. Takes ~3 min
+for a 230MB compressed dump.
+
+### B.7 Reset static MySQL users with passwords from Vault
+
+**This step is mandatory.** `mysqldump` restores rows in `mysql.user`
+verbatim, including password hashes. But `null_resource.mysql_static_user`
+in Terraform writes the **current Vault password** to `forgejo` and
+`roundcubemail` — and that current password rarely matches the dump's
+hash. The apps will fail auth (forgejo logs `Error 1045 (28000): Access
+denied for user 'forgejo'@'...'`) until you reset them.
+
+```bash
+FORGEJO_PW=$(vault kv get -field=mysql_forgejo_password secret/viktor)
+RC_PW=$(vault kv get -field=mysql_roundcubemail_password secret/viktor)
+
+kubectl -n dbaas exec -i mysql-standalone-0 -- bash -c 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<SQL
+DROP USER IF EXISTS 'forgejo'@'%';
+DROP USER IF EXISTS 'roundcubemail'@'%';
+CREATE USER 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '$FORGEJO_PW';
+CREATE USER 'roundcubemail'@'%' IDENTIFIED WITH caching_sha2_password BY '$RC_PW';
+GRANT ALL PRIVILEGES ON \`forgejo\`.* TO 'forgejo'@'%';
+GRANT ALL PRIVILEGES ON \`roundcubemail\`.* TO 'roundcubemail'@'%';
+FLUSH PRIVILEGES;
+SQL
+```
+
+`ALTER USER` sometimes hits `ERROR 1396 Operation ALTER USER failed`
+on freshly-restored DBs (stale grant-table cache); `DROP USER` +
+`CREATE USER` is the reliable form.
+
+Vault-rotated app users (nextcloud, codimd, grafana, paperless,
+phpipam, etc.) are managed by Vault DB engine and their dump password
+already matches the live K8s secret, so they need no manual fixup.
+
+### B.8 Restart MySQL-dependent apps
+
+The dump restore brings MySQL up, but app pods still hold stale
+connections (and forgejo has been crash-looping). Roll the
+deployments to force fresh connections:
+
+```bash
+for ns_app in \
+    "forgejo:deploy/forgejo" \
+    "nextcloud:deploy/nextcloud" \
+    "hackmd:deploy/hackmd" \
+    "monitoring:deploy/grafana" \
+    "paperless-ngx:deploy/paperless-ngx" \
+    "uptime-kuma:deploy/uptime-kuma" \
+    "url:deploy/shlink" \
+    "realestate-crawler:deploy/realestate-crawler-api" \
+    "realestate-crawler:deploy/realestate-crawler-celery" \
+    "realestate-crawler:deploy/realestate-crawler-celery-beat" \
+    "realestate-crawler:deploy/realestate-crawler-ui"; do
+  ns=${ns_app%%:*}; app=${ns_app##*:}
+  kubectl -n "$ns" rollout restart "$app" &
 done
+wait
 ```

-### 5. Verify application MySQL users exist
-
-After any cluster rebuild or PVC recreation, the MySQL operator only recreates its own system users. Application users may be lost.
+If any deployments stay stuck in `ImagePullBackOff` (e.g.
+`chrome-service`, `fire-planner`, `freedify`), those rely on the
+Forgejo registry — once forgejo is back, just delete their pods to
+force a fresh pull:

 ```bash
-ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
-
-# Check all expected application users exist
-kubectl exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"$ROOT_PWD" \
-  -e "SELECT user, host FROM mysql.user WHERE user IN ('nextcloud','forgejo','crowdsec','grafana','speedtest','wrongmove','codimd','shlink','technitium','uptimekuma');"
-
-# If users are missing, force Vault to re-rotate their credentials:
-# vault write -f database/rotate-role/mysql-<app>
-# This will recreate the user with the correct password.
-#
-# For technitium specifically, also run the password sync CronJob:
-# kubectl create job --from=cronjob/technitium-password-sync technitium-pw-resync -n technitium
-#
-# Note: forgejo and uptimekuma may be legacy users not managed by Vault rotation.
+kubectl -n chrome-service delete pod --all
+kubectl -n fire-planner delete pod --all
+kubectl -n freedify delete pod --all
 ```

-### 6. InnoDB Cluster Recovery
-If the InnoDB Cluster itself is broken (not just data loss):
-```bash
-# Check cluster status via MySQL Shell
-kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster status
-
-# Force rejoin a member
-kubectl exec -it mysql-cluster-0 -n dbaas -c mysql -- mysqlsh root@localhost --password="$ROOT_PWD" -- cluster rejoinInstance root@mysql-cluster-1:3306
-```
-
-## Restore Single Database (from per-db backup)
-
-Per-database backups are stored at `/mnt/main/mysql-backup/per-db/<dbname>/` as gzipped SQL dumps.
-
-### 1. List available per-db backups
-```bash
-ls -lt /mnt/main/mysql-backup/per-db/<dbname>/
-```
-
-### 2. Restore a single database
-```bash
-# Port-forward to MySQL
-kubectl port-forward svc/mysql -n dbaas 3307:3306 &
-ROOT_PWD=$(kubectl get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
-
-# Restore single database (this replaces only the target database)
-zcat /path/to/per-db/<dbname>/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 <dbname>
-```
-
-### 3. Verify
-```bash
-mysql -u root -p"$ROOT_PWD" --host 127.0.0.1 --port 3307 -e \
-  "SELECT TABLE_NAME, TABLE_ROWS FROM information_schema.TABLES WHERE TABLE_SCHEMA='<dbname>' ORDER BY TABLE_ROWS DESC LIMIT 10;"
-```
-
-### 4. Restart the affected service only
-```bash
-kubectl rollout restart deployment -n <namespace>
-```
-
-**Advantages over full restore**: Only the target database is affected. All other databases continue running with their current data.
-
-## Alternative: Restore from sda Backup
-
-If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:
+### B.9 Verify recovery

 ```bash
-# 1. SSH to PVE host
-ssh root@192.168.1.127
+# All workloads ready
+kubectl get deploy,sts -A -o json | jq -r '.items[] | select(.spec.replicas != .status.readyReplicas and .spec.replicas > 0) | "\(.metadata.namespace)/\(.metadata.name)"'
+# (empty output = healthy)

-# 2. Find the latest backup
-ls -lt /mnt/backup/nfs-mirror/mysql-backup/
+# Database integrity — table counts per schema
+kubectl -n dbaas exec mysql-standalone-0 -- mysql -uroot -p"$ROOT_PWD" \
+    -e "SELECT table_schema, COUNT(*) FROM information_schema.tables \
+        WHERE table_schema NOT IN ('information_schema','performance_schema','sys') \
+        GROUP BY table_schema;"

-# 3. Copy backup to a location accessible from cluster (e.g., via kubectl cp)
-# Or mount sda backup on a pod:
-kubectl run mysql-restore --rm -it --image=mysql \
-  --overrides='{"spec":{"volumes":[{"name":"backup","hostPath":{"path":"/mnt/backup/nfs-mirror/mysql-backup"}}],"containers":[{"name":"mysql-restore","image":"mysql","env":[{"name":"MYSQL_PWD","value":"'$ROOT_PWD'"}],"volumeMounts":[{"name":"backup","mountPath":"/backup"}],"command":["/bin/sh","-c","zcat /backup/dump_YYYY_MM_DD_HH_MM.sql.gz | mysql -u root --host mysql.dbaas.svc.cluster.local"]}],"nodeName":"k8s-master"}}' \
-  -n dbaas
+# Forgejo's registry catalog (catches the cascade alert)
+kubectl -n monitoring create job --from=cronjob/forgejo-integrity-probe manual-postrestore-$(date +%s)
+kubectl -n monitoring logs job/manual-postrestore-<timestamp> --tail=10
+# Expect "Probe complete: 0 failures across N repos / M tags / K indexes"
+
+# Cluster-health re-run
+bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --quiet
 ```

-## Alternative: Restore from Synology (if PVE host is down)
-
-If the PVE host itself is unavailable:
+### B.10 Clean up failed CronJob pods from the outage window

 ```bash
-# 1. SSH to Synology NAS
-ssh Administrator@192.168.1.13
-
-# 2. Navigate to backup directory
-cd /volume1/Backup/Viki/nfs/mysql-backup/
-
-# 3. Copy dump to a temporary location accessible from cluster
-# (e.g., via rsync to a surviving node, or restore PVE host first)
+kubectl delete pods -A --field-selector=status.phase=Failed
 ```

-## Estimated Time
- Data restore: ~5 minutes (11MB dump)
- InnoDB Cluster recovery: ~15-20 minutes (init containers are slow)
+## Why the 8.4.9 upgrade got us — and the version pin
+
+The MySQL 8.4.9 data-dictionary upgrade from 80408 → 80409 stalls
+reliably on this hardware. ~24s of writes to `mysql.ibd` and the redo
+log, then no further progress, no CPU, no completion. We bumped the
+liveness probe to 600s (`initial_delay_seconds`) and still no
+progress. Hypothesised root cause: `innodb_io_capacity=100` combined
+with `innodb_page_cleaners=1` — the upgrade's spatial-reference-system
+flush phase is IO-starved. **Don't retry 8.4.9 without first bumping
+IO capacity and pinning a proper maintenance window.**
+
+Until then, the StatefulSet pins to `mysql:8.4.8` exactly, not the
+floating `mysql:8.4` tag. Keel will not silently bump it.
+
+## See also
+- `docs/runbooks/forgejo-registry-breakglass.md` — companion runbook
+  for when the cascade has reached the registry layer.
+- Beads `code-eme8` / `code-k40p` — incident tracker entries (closed
+  in commit ea475c3d).
--- a/docs/runbooks/security-incident.md
+++ b/docs/runbooks/security-incident.md
@ -0,0 +1,191 @@
+# Security Incident Response
+
+What to do when a wave-1 security alert fires. Each alert links to a Loki query for investigation and concrete remediation steps.
+
+**Status: planned, not yet implemented.** Beads epic: `code-8ywc`. This runbook is the response playbook for when wave 1 ships.
+
+## General workflow
+
+1. **Acknowledge in Alertmanager.** Silence only after triage starts.
+2. **Pull context from Loki** (queries below). Get the actor, source IP, timestamp.
+3. **Decide: real or false-positive?** Use the "false-positive cases" notes below.
+4. **If real:** revoke credentials (Vault token revoke, K8s SA token rotate, SSH key remove, OIDC session invalidate), then post-mortem.
+5. **If false-positive:** tune the alert (extend allowlist, refine LogQL query).
+
+## Allowlist CIDRs
+
+All source-IP-based alerts (K2, K9, V7, S1) reference this list. Update in one place: Terraform variable `security_source_ip_allowlist` in `stacks/monitoring`.
+
+- `10.0.20.0/22` — VLAN 20 (cluster + main LAN)
+- `192.168.1.0/24` — Proxmox + Sofia LAN
+- K8s pod CIDR (verify at implementation time)
+- K8s service CIDR
+- Headscale tailnet
+
+**Anything outside = alert.** No public-IP exceptions.
+
+## Viktor's identity
+
+`me@viktorbarzin.me` is the ONLY allowlisted human identity. NOT `viktor@viktorbarzin.me`. NOT `emo@viktorbarzin.me`. emo's identity scheme is separate and must be added explicitly if/when needed.
+
+---
+
+## K-alerts (K8s API audit)
+
+### K2 — ServiceAccount token used from outside cluster
+
+**Meaning:** A K8s ServiceAccount token authenticated a request whose `sourceIPs[0]` is not in the pod CIDR or trusted LAN. Stolen SA token used externally.
+
+```logql
+{job="kube-audit"} | json | user_username =~ "system:serviceaccount:.*" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*"
+```
+
+**Action:** Identify the SA. Rotate its token (`kubectl delete secret <sa-token-name>` if old-style, or recreate the SA if projected token). Audit the SA's permissions and tighten.
+
+**False positives:** Pod-to-apiserver traffic that egresses and re-enters via NodePort/LB (rare). Investigate the originating workload.
+
+### K3 — Secret read in sensitive namespace by unexpected actor
+
+**Meaning:** A Secret in `vault`, `sealed-secrets`, or `external-secrets` namespace was read by an SA NOT in the allowlist (ESO controller, sealed-secrets controller, Vault SA, `me@viktorbarzin.me`).
+
+```logql
+{job="kube-audit"} | json | verb =~ "get|list" | objectRef_resource = "secrets" | objectRef_namespace =~ "vault|sealed-secrets|external-secrets" | user_username !~ "(me@viktorbarzin.me|system:serviceaccount:external-secrets:.*|system:serviceaccount:sealed-secrets:.*|system:serviceaccount:vault:.*)"
+```
+
+**Action:** Identify the actor. If a service account, audit its bindings — it shouldn't have RBAC to read those secrets. Revoke the binding. Rotate any secrets that were read.
+
+### K4 — Exec into sensitive pod
+
+**Meaning:** Someone `kubectl exec`'d into a pod in `vault`, `kube-system`, `dbaas`, or `cnpg-system`.
+
+```logql
+{job="kube-audit"} | json | verb = "create" | objectRef_resource = "pods" | objectRef_subresource = "exec" | objectRef_namespace =~ "vault|kube-system|dbaas|cnpg-system" | user_username != "me@viktorbarzin.me"
+```
+
+**Action:** Determine if Viktor authorized the exec. If unrecognized actor, revoke their access and rotate any credentials they could have read inside the pod.
+
+**False positives:** Break-glass SAs used during incident response — extend the allowlist to include them by SA name.
+
+### K5 — Mass delete
+
+**Meaning:** Single actor deleted >5 Pods, Secrets, or ConfigMaps in 60 seconds. Either a script gone wrong or destructive intrusion.
+
+```logql
+sum by (user_username) (count_over_time({job="kube-audit"} | json | verb = "delete" | objectRef_resource =~ "pods|secrets|configmaps" [1m])) > 5
+```
+
+**Action:** Identify actor. If a Terraform apply or known cleanup job, false positive. If unrecognized, suspend the actor's credentials immediately and audit what was deleted.
+
+### K6 — Audit policy modified
+
+**Meaning:** Someone changed the kube-apiserver audit policy. Should only happen via Terraform.
+
+**Action:** Verify the change came from a planned Terraform apply (check recent commits to `stacks/infra`). If not, treat as critical compromise — attacker disabling visibility.
+
+### K7 — New ClusterRole with full wildcards
+
+**Meaning:** A new ClusterRole was created with `verbs: ["*"]` and `resources: ["*"]`. Privilege escalation primitive.
+
+```logql
+{job="kube-audit"} | json | verb = "create" | objectRef_resource = "clusterroles" | requestObject_rules_0_verbs_0 = "*" | requestObject_rules_0_resources_0 = "*"
+```
+
+**Action:** Verify the change is intentional (some operators install such roles — calico, kyverno). If unrecognized, delete the ClusterRole and audit the creator.
+
+### K8 — Anonymous binding
+
+**Meaning:** A RoleBinding or ClusterRoleBinding was created referencing `system:anonymous` or `system:unauthenticated`. Catastrophic — allows unauthenticated cluster access.
+
+**Action:** Delete the binding immediately. Audit who created it. Treat as full cluster compromise — rotate all secrets, force kubeconfig re-issue.
+
+### K9 — Viktor's identity from unexpected source IP
+
+**Meaning:** A request authenticated as `me@viktorbarzin.me` arrived from a source IP outside the allowlist. Stolen OIDC token / kubeconfig.
+
+```logql
+{job="kube-audit"} | json | user_username = "me@viktorbarzin.me" | sourceIPs_0 !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<pod-cidr>|<headscale-cidr>"
+```
+
+**Action:** Revoke Viktor's OIDC session in Authentik. Rotate Vault OIDC tokens. Audit recent activity from that IP. Verify Viktor's devices for compromise.
+
+**False positives:** Viktor's machine on a new network without VPN — should not happen per the "no public IP access" policy. If it does, the policy needs revisiting, not the alert.
+
+---
+
+## V-alerts (Vault audit)
+
+### V1 — Root token created
+
+```logql
+{job="vault-audit"} | json | request_path = "auth/token/create" | response_auth_policies = "root"
+```
+
+**Action:** Verify against Terraform / planned operation. Root tokens should ONLY be created during initial Vault setup or break-glass.
+
+### V2 — Audit device disabled/modified
+
+**Action:** Attacker silencing visibility. Re-enable immediately. Treat as critical compromise.
+
+### V3 — Seal status changed
+
+**Action:** Verify whether this is a planned operation (unseal during upgrade). If unplanned, treat as critical.
+
+### V4 — Policy modified
+
+**Action:** Confirm change came from a Terraform apply. Allowlist Terraform's source IP / token role. Otherwise: review the policy diff, revert if malicious.
+
+### V5 — Auth failure spike
+
+**Action:** Identify the auth method and source. If CI token rotation, false positive. If unknown source brute-forcing, block the source IP at pfSense.
+
+### V6 — Token with policies different from parent
+
+**Action:** Privilege escalation attempt. Revoke the new token. Audit the parent token's policies.
+
+### V7 — Viktor's Vault identity from unexpected source IP
+
+**Meaning:** A Vault operation authenticated as Viktor's entity_id arrived from an IP not in the allowlist. Requires `x_forwarded_for_authorized_addrs` to be configured (Vault sits behind Traefik so `remote_addr` is Traefik's pod IP without XFF trust).
+
+**Action:** Revoke Viktor's Vault OIDC tokens. Force OIDC re-auth. Audit Vault access from that IP.
+
+---
+
+## S-alerts (Host)
+
+### S1 — PVE sshd auth success from unexpected IP
+
+```logql
+{job="sshd-pve"} |= "Accepted" | regexp "Accepted (?P<method>\\S+) for (?P<user>\\S+) from (?P<ip>\\S+)" | ip !~ "10\\.0\\.20\\..*|192\\.168\\.1\\..*|<headscale-cidr>"
+```
+
+**Action:** Remove the user's SSH key from `/root/.ssh/authorized_keys` if it's still there. Audit recent sudo/login history (`last`, `sudo -i; journalctl _COMM=sudo`). Consider PVE as compromised — rotate root password, audit `/root/.luks-backup-key`, audit `/usr/local/bin/lvm-pvc-snapshot` and backup scripts for tampering.
+
+---
+
+## False-positive triage decision tree
+
+```
+Did the alert fire from a known operational event?
+├─ Terraform apply at the same time?       → likely V4 (policy modified)
+├─ Keel auto-roll?                          → not a security path
+├─ CI/CD pipeline running?                  → check V5 / K5
+└─ Viktor doing recovery work?              → K4, K9, S1 candidates
+                                              Extend allowlist if persistent
+```
+
+## Escalation
+
+For SEV1 (multiple alerts, cluster-admin grants, anonymous bindings, mass deletes):
+
+1. Cordon all nodes (`kubectl cordon`) to prevent further pod scheduling — but be aware this also stops legitimate recovery work
+2. Revoke all OIDC sessions in Authentik
+3. Rotate Vault root keys + reseal
+4. Restore from a pre-incident backup if data integrity is questionable
+5. Post-mortem per `incident-response.md`
+
+## Related
+
+- [Security architecture](../architecture/security.md)
+- [Monitoring architecture](../architecture/monitoring.md)
+- [Incident response (general)](../architecture/incident-response.md)
+- Beads epic: `code-8ywc`
--- a/modules/create-template-vm/cloud_init.yaml
+++ b/modules/create-template-vm/cloud_init.yaml
@ -67,11 +67,44 @@ runcmd:
  - sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf
  - systemctl restart systemd-journald
  %{if is_k8s_template}
-  # Disable unattended-upgrades to prevent unexpected kernel updates that can break containerd/kubelet
-  # (Root cause of 26h cluster outage: unattended-upgrades → kernel update → containerd failure)
-  - systemctl disable --now unattended-upgrades || true
-  - apt-get remove -y unattended-upgrades || true
+  # Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight
+  # Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico,
+  # and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a
+  # 24h-soaked rolling window, gated by Prometheus alerts).
+  # Original outage (March 2026) was kernel update → containerd overlayfs corruption.
+  # Mitigations: 24h cool-down between node reboots, Prometheus halt-on-alert,
+  # apt-mark hold on k8s components, Package-Blacklist for runtime components.
+  - apt-get install -y unattended-upgrades update-notifier-common
+  - |
+    cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'EOF'
+    Unattended-Upgrade::Allowed-Origins {
+        "$${distro_id}:$${distro_codename}";
+        "$${distro_id}:$${distro_codename}-security";
+        "$${distro_id}:$${distro_codename}-updates";
+        "$${distro_id}ESMApps:$${distro_codename}-apps-security";
+        "$${distro_id}ESM:$${distro_codename}-infra-security";
+    };
+    Unattended-Upgrade::Package-Blacklist {
+        "^containerd(\.io)?$$";
+        "^runc$$";
+        "^cri-tools$$";
+        "^kubernetes-cni$$";
+        "^calico-.*";
+        "^cni-plugins-.*";
+        "^docker-ce$$";
+    };
+    Unattended-Upgrade::DevRelease "false";
+    Unattended-Upgrade::Automatic-Reboot "false";
+    EOF
+  - |
+    cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
+    APT::Periodic::Update-Package-Lists "1";
+    APT::Periodic::Unattended-Upgrade "1";
+    EOF
+  - systemctl unmask unattended-upgrades 2>/dev/null || true
+  - systemctl enable --now unattended-upgrades
  - apt-mark hold kubelet kubeadm kubectl
+  - apt-mark hold containerd containerd.io runc 2>/dev/null || true
  - systemctl stop kubelet
  - containerd config default | sudo tee /etc/containerd/config.toml
  - ${containerd_config_update_command}
--- a/modules/create-vm/main.tf
+++ b/modules/create-vm/main.tf
@ -192,9 +192,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
        for_each = var.disk_slot == "scsi0" ? [1] : []
        content {
          disk {
-            storage  = "local-lvm"
-            size     = var.vm_disk_size
-            discard  = true # Enable TRIM passthrough to LVM thin pool — reduces CoW overhead
+            storage = "local-lvm"
+            size    = var.vm_disk_size
+            discard = true # Enable TRIM passthrough to LVM thin pool — reduces CoW overhead
          }
        }
      }
@ -202,9 +202,9 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
        for_each = var.disk_slot == "scsi1" ? [1] : []
        content {
          disk {
-            storage  = "local-lvm"
-            size     = var.vm_disk_size
-            discard  = true
+            storage = "local-lvm"
+            size    = var.vm_disk_size
+            discard = true
          }
        }
      }
--- a/modules/kubernetes/anubis_instance/main.tf
+++ b/modules/kubernetes/anubis_instance/main.tf
@ -56,8 +56,24 @@ variable "image_tag" {

 variable "replicas" {
  type        = number
-  default     = 1
-  description = "Replica count. Default 1 because Anubis stores in-flight challenges in process memory — with N>1 a challenge issued by pod A and solved against pod B fails with `store: key not found` (HTTP 500). For HA, configure a shared store (Redis) and bump this. Per-pod 128Mi @ idle is cheap, single-pod restart is sub-second, so 1 is fine for content sites."
+  default     = null
+  description = "Optional replica count override. When null, defaults to 1 if shared_store_url is null and 2 otherwise. Capped at 2 — Redis can handle more but anti-affinity assumes ≤2 replicas per Anubis instance on a 5-node cluster."
+
+  validation {
+    condition     = var.replicas == null || (var.replicas >= 1 && var.replicas <= 2)
+    error_message = "replicas must be 1 or 2 (or null to auto-pick from shared_store_url presence)."
+  }
+}
+
+variable "shared_store_url" {
+  type        = string
+  default     = null
+  description = "If set, Anubis stores in-flight challenge state in this Valkey/Redis-protocol URL instead of in-process memory, enabling HA across replicas. Format: redis://host:port/<db-index>. The DB index MUST be unique per Anubis instance (this module assumes 16 DBs available, common in standalone Redis). Cluster Redis is redis-master.redis.svc.cluster.local:6379 with HA via Sentinel + haproxy. Without this, replicas>1 causes ~50% PoW failures (challenge issued by pod A, solved against pod B → 500)."
+
+  validation {
+    condition     = var.shared_store_url == null || can(regex("^redis://[a-zA-Z0-9_.-]+:[0-9]+/[0-9]+$", var.shared_store_url))
+    error_message = "shared_store_url must look like redis://host:port/<db-index> (explicit DB index required)."
+  }
 }

 variable "memory" {
@ -88,6 +104,21 @@ locals {
    "app.kubernetes.io/managed-by" = "terraform"
  }

+  # Effective replicas: caller-override > shared-store-aware default.
+  effective_replicas = coalesce(var.replicas, var.shared_store_url == null ? 1 : 2)
+
+  # Anubis store config. With backend=valkey, multiple Anubis pods can share
+  # in-flight PoW state and a challenge issued by pod A is verifiable by pod
+  # B. Default backend is in-process memory which only works at replicas=1.
+  store_yaml_block = var.shared_store_url == null ? "" : <<-EOT
+
+
+    store:
+      backend: valkey
+      parameters:
+        url: "${var.shared_store_url}"
+  EOT
+
  # Strict bot policy. Default Anubis policy only WEIGHs Mozilla|Opera UAs
  # and lets unmatched UAs (curl, wget, Python-requests, scrapy, headless
  # CLI scrapers) fall through to ALLOW. We import the same upstream
@ -95,7 +126,8 @@ locals {
  # capability is filtered.
  default_policy_yaml = <<-EOT
    bots:
-      # Hard-deny known-bad bots first.
+      # Hard-deny known-bad bots first — runs before the method bypass so
+      # a declared bad bot can't sneak through by sending a POST.
      - import: (data)/bots/_deny-pathological.yaml
      - import: (data)/bots/aggressive-brazilian-scrapers.yaml
      # Hard-deny declared AI/LLM crawlers (ClaudeBot, GPTBot, Bytespider, …).
@ -107,13 +139,29 @@ locals {
      # Allow /.well-known, /robots.txt, /favicon.*, /sitemap.xml — keeps
      # the internet working for benign crawlers and discovery clients.
      - import: (data)/common/keep-internet-working.yaml
-      # Catch-all: every remaining request must solve the challenge. This
-      # closes the "unmatched UA falls through to ALLOW" gap that lets
-      # curl/wget/Python-requests scrape non-CDN-fronted hosts.
+      # Allow every non-GET request through. Rationale: AI scrapers steal
+      # the body of GETs (page content) — they don't POST. State-mutating
+      # methods come from app XHRs (PrivateBin paste creation, Komga
+      # uploads, SPA actions) and CORS preflight (OPTIONS). Challenging
+      # those breaks the app, because the JS expects JSON and gets the
+      # Anubis HTML challenge page. CrowdSec + rate-limit + per-app auth
+      # already cover abuse on these methods.
+      - name: allow-non-get-methods
+        action: ALLOW
+        expression: method != "GET"
+      # Catch-all: every remaining (GET) request must solve the challenge.
+      # This closes the "unmatched UA falls through to ALLOW" gap that
+      # lets curl/wget/Python-requests scrape non-CDN-fronted hosts.
      - name: catchall-challenge
        path_regex: .*
        action: CHALLENGE
  EOT
+
+  # Final policy YAML: defaults (or caller override) plus an optional store
+  # block when shared_store_url is set. Store block is module-managed and
+  # appended universally — callers passing a custom policy_yaml shouldn't
+  # include their own `store:` block (they would collide).
+  rendered_policy_yaml = "${coalesce(var.policy_yaml, local.default_policy_yaml)}${local.store_yaml_block}"
 }

 # Bot policy ConfigMap. Mounted into the pod and referenced by POLICY_FNAME.
@ -124,7 +172,7 @@ resource "kubernetes_config_map" "policy" {
    labels    = local.labels
  }
  data = {
-    "botPolicies.yaml" = coalesce(var.policy_yaml, local.default_policy_yaml)
+    "botPolicies.yaml" = local.rendered_policy_yaml
  }
 }

@ -168,7 +216,7 @@ resource "kubernetes_deployment" "anubis" {
  }

  spec {
-    replicas = var.replicas
+    replicas = local.effective_replicas

    selector {
      match_labels = { app = local.full_name }
@ -185,14 +233,26 @@ resource "kubernetes_deployment" "anubis" {
    template {
      metadata {
        labels = local.labels
+        annotations = {
+          # Roll the deployment whenever the policy YAML changes — Anubis
+          # reads the policy at startup, so a ConfigMap update alone
+          # doesn't take effect until pods restart.
+          "checksum/policy" = sha256(local.rendered_policy_yaml)
+        }
      }

      spec {
        # Spread replicas across nodes to survive a single node failure.
+        # DoNotSchedule (not ScheduleAnyway) so 2 replicas are forced onto
+        # different hosts — otherwise the scheduler may pile them on the
+        # same node and a single node reboot takes the whole Anubis instance
+        # down despite replicas=2. On a 5-node cluster the spread is always
+        # satisfiable; the worst case (4 nodes unavailable) leaves one
+        # replica Pending, but the other keeps serving.
        topology_spread_constraint {
          max_skew           = 1
          topology_key       = "kubernetes.io/hostname"
-          when_unsatisfiable = "ScheduleAnyway"
+          when_unsatisfiable = "DoNotSchedule"
          label_selector {
            match_labels = { app = local.full_name }
          }
@ -388,7 +448,15 @@ resource "kubernetes_pod_disruption_budget_v1" "anubis" {
    namespace = var.namespace
  }
  spec {
-    min_available = "1"
+    # max_unavailable=1 means: at most one pod can be voluntarily disrupted
+    # at a time. With replicas=2 this allows clean rolling drains (one pod
+    # goes down → other serves traffic → first recreates elsewhere). With
+    # replicas=1 (no shared store) this is functionally equivalent to no
+    # PDB — drain proceeds, brief outage, new pod schedules elsewhere.
+    # Was min_available=1 before 2026-05-16 which deadlocked drains on
+    # single-replica instances (eviction API can never satisfy the
+    # constraint at replicas=1). See PM-2026-05-11.
+    max_unavailable = "1"
    selector {
      match_labels = { app = local.full_name }
    }
--- a/modules/kubernetes/ingress_factory/main.tf
+++ b/modules/kubernetes/ingress_factory/main.tf
@ -31,9 +31,53 @@ variable "tls_secret_name" {}
 variable "backend_protocol" {
  default = "HTTP"
 }
-variable "protected" {
-  type    = bool
-  default = false
+variable "auth" {
+  type        = string
+  default     = "required"
+  description = <<-EOT
+    Auth posture for this ingress. Pick by asking "what gates the app?":
+
+      * "required" (default, fail-closed): Authentik forward-auth gates every
+        request. Pick this when the backend has NO built-in user auth and
+        Authentik is the only thing standing between strangers and the app.
+        Examples: prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any
+        admin UI shipped without its own login.
+
+      * "app": the backend handles its own user authentication (NextAuth,
+        Django sessions, OAuth, bearer-token API, etc.) and Authentik would
+        only get in the way. No Authentik middleware is attached; the app's
+        own login is the gate. Examples: immich, linkwarden, tandoor,
+        freshrss, affine, actualbudget, audiobookshelf, novelapp.
+        **Functionally identical to "none"** — the distinct name exists to
+        record intent at the call site so future readers don't have to guess.
+
+      * "public": Authentik anonymous binding via the `public` outpost.
+        Strangers are auto-bound to the `guest` Authentik user; logged-in
+        users keep their identity in X-authentik-username. Only works for
+        top-level browser navigation — CORS preflight rejects XHR/fetch and
+        automation can't replay the cookie dance. Audit trail, not a gate.
+
+      * "none": no Authentik middleware, no own-auth claim — explicitly
+        public or unauthenticated-by-design. Use for: Anubis-fronted content
+        sites (where Anubis is the gate), native-client APIs that auth
+        themselves (Git, /v2/, WebDAV/CalDAV, CardDAV), webhook receivers,
+        OAuth callbacks, and Authentik outposts themselves.
+
+    **Anti-exposure rule** (the reason "app" exists as a distinct mode):
+    only pick "app" or "none" AFTER you have verified the app has its own
+    user auth (for "app") OR the endpoint is intentionally public (for
+    "none"). Picking either of these on a naked admin UI exposes it to the
+    internet. The default is "required" specifically so accidental omission
+    fails closed.
+
+    **Convention**: when using "app" or "none", add a comment line above
+    the `auth = "..."` line stating what gates the app or why it's public.
+    Future-you reads the call site, not the module description.
+  EOT
+  validation {
+    condition     = contains(["required", "app", "public", "none"], var.auth)
+    error_message = "auth must be one of: required, app, public, none."
+  }
 }
 variable "ingress_path" {
  type    = list(string)
@ -142,8 +186,23 @@ variable "homepage_enabled" {
 }

 locals {
-  effective_host    = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
-  effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : !var.protected
+  effective_host = var.full_host != null ? var.full_host : "${var.host != null ? var.host : var.name}.${var.root_domain}"
+  # Anti-AI default: ON when no Authentik auth fronts the ingress (auth =
+  # "none" or auth = "app" — either the app gates users itself or the site
+  # is intentionally public). When Authentik gates the request
+  # (required/public), the auth flow already discourages bots.
+  effective_anti_ai = var.anti_ai_scraping != null ? var.anti_ai_scraping : (var.auth == "none" || var.auth == "app")
+
+  # Auth middleware selection. "app" and "none" both attach no Authentik
+  # middleware — "app" signals "the backend has its own user auth", "none"
+  # signals "intentionally public / native-client API / webhook". The
+  # distinction lives at the call site for human readers; the runtime
+  # effect is identical.
+  auth_middleware = (
+    var.auth == "required" ? "traefik-authentik-forward-auth@kubernetescrd" :
+    var.auth == "public" ? "traefik-authentik-forward-auth-public@kubernetescrd" :
+    null
+  )

  # External monitor enabled by default when the ingress has a public DNS
  # record (either CF-proxied or direct A/AAAA). Explicit bool overrides.
@ -254,7 +313,7 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
        var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
        local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null,
        local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null,
-        var.protected ? "traefik-authentik-forward-auth@kubernetescrd" : null,
+        local.auth_middleware,
        var.allow_local_access_only ? "traefik-local-only@kubernetescrd" : null,
        var.custom_content_security_policy != null ? "${var.namespace}-custom-csp-${var.name}@kubernetescrd" : null,
        var.max_body_size != null ? "${var.namespace}-buffering-${var.name}@kubernetescrd" : null,
--- a/scripts/check-ingress-auth-comments.py
+++ b/scripts/check-ingress-auth-comments.py
@ -0,0 +1,124 @@
+#!/usr/bin/env python3
+"""Enforce the inline-comment convention for ingress_factory auth tiers.
+
+Every `auth = "app"` or `auth = "none"` line under a stack must have an
+immediately-preceding comment block containing `# auth = "<tier>":`
+that documents what gates the app (for "app") or why the endpoint is
+intentionally public (for "none").
+
+This is the static guard for the anti-exposure rule documented in
+`infra/.claude/CLAUDE.md` "Auth" section. It's invoked by `scripts/tg`
+before every plan/apply/destroy/refresh, so it fires regardless of who
+or what is running terragrunt — local laptop, CI, headless agent.
+
+Stack-scoped by design: only checks the .tf files under the stack
+being acted on. Other stacks' historical violations don't block work
+on the current stack; each stack documents itself the next time it's
+edited.
+
+Usage:
+  check-ingress-auth-comments.py <stack-path>     # scan one stack
+  check-ingress-auth-comments.py --all            # scan every stack
+"""
+
+import argparse
+import os
+import re
+import sys
+
+AUTH_LINE = re.compile(r'^\s*auth\s*=\s*"(app|none)"\s*$')
+COMMENT_LINE = re.compile(r'^\s*#')
+COMMENT_TIER = re.compile(r'auth\s*=\s*"(app|none)"')
+
+
+def scan_dir(path):
+    violations = []
+    for root, _, files in os.walk(path):
+        for f in files:
+            if not f.endswith('.tf'):
+                continue
+            full = os.path.join(root, f)
+            try:
+                with open(full) as fh:
+                    lines = fh.readlines()
+            except OSError:
+                continue
+            for i, line in enumerate(lines):
+                m = AUTH_LINE.match(line)
+                if not m:
+                    continue
+                tier = m.group(1)
+                # Walk backwards through contiguous comment lines.
+                # Pass if ANY of them documents the matching tier.
+                ok = False
+                j = i - 1
+                while j >= 0 and COMMENT_LINE.match(lines[j]):
+                    cm = COMMENT_TIER.search(lines[j])
+                    if cm and cm.group(1) == tier:
+                        ok = True
+                        break
+                    j -= 1
+                if not ok:
+                    violations.append((full, i + 1, tier))
+    return violations
+
+
+def main():
+    ap = argparse.ArgumentParser(description=__doc__.splitlines()[0])
+    g = ap.add_mutually_exclusive_group(required=True)
+    g.add_argument('path', nargs='?', help='Stack directory to scan')
+    g.add_argument('--all', action='store_true', help='Scan every stack under stacks/')
+    args = ap.parse_args()
+
+    if args.all:
+        scan_paths = ['stacks']
+    else:
+        if not os.path.isdir(args.path):
+            print(f"ERROR: {args.path} is not a directory", file=sys.stderr)
+            sys.exit(2)
+        scan_paths = [args.path]
+
+    violations = []
+    for p in scan_paths:
+        violations.extend(scan_dir(p))
+
+    if not violations:
+        return
+
+    print(
+        "\n"
+        "==============================================================\n"
+        "ingress_factory auth-comment convention violated\n"
+        "==============================================================\n"
+        "\n"
+        "Every `auth = \"app\"` or `auth = \"none\"` line must have a\n"
+        "preceding comment line documenting what gates the app (for\n"
+        "\"app\") or why the endpoint is intentionally public (for\n"
+        "\"none\"). This guard prevents accidentally exposing private\n"
+        "services. See infra/.claude/CLAUDE.md Auth section.\n"
+        "\n"
+        "Add a comment line directly above the auth line:\n"
+        "\n"
+        "  # auth = \"app\":  <what gates the app, e.g. NextAuth + OAuth>\n"
+        "  auth = \"app\"\n"
+        "\n"
+        "or:\n"
+        "\n"
+        "  # auth = \"none\": <why public, e.g. webhook receiver, CalDAV>\n"
+        "  auth = \"none\"\n"
+        "\n"
+        "Violations:",
+        file=sys.stderr,
+    )
+    for path, line_no, tier in violations:
+        print(
+            f"  {path}:{line_no}: auth = \"{tier}\" missing preceding "
+            f"`# auth = \"{tier}\":` comment",
+            file=sys.stderr,
+        )
+    print(file=sys.stderr)
+    sys.exit(1)
+
+
+if __name__ == '__main__':
+    main()
--- a/scripts/cluster_healthcheck.sh
+++ b/scripts/cluster_healthcheck.sh
@ -23,10 +23,11 @@ FAIL_COUNT=0
 FIX=false
 QUIET=false
 JSON=false
-KUBECONFIG_PATH="$(pwd)/config"
+KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
+[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=42
+TOTAL_CHECKS=44

 # --- Helpers ---
 info()  { [[ "$JSON" == true ]] && return 0; echo -e "${BLUE}[INFO]${NC} $*"; }
@ -195,6 +196,19 @@ check_pods() {
    section 4 "Problematic Pods"
    local bad count detail="" status="PASS"

+    # Skip pods owned by Jobs (which are owned by CronJobs). A failed CronJob
+    # retry isn't a problematic pod — the next CronJob fire will replace it.
+    # Real problems are deployments / statefulsets / daemonsets in trouble.
+    local job_owned_pods
+    job_owned_pods=$($KUBECTL get pods -A -o json 2>/dev/null | python3 -c '
+import json, sys
+d = json.load(sys.stdin)
+for p in d["items"]:
+    owners = p["metadata"].get("ownerReferences", [])
+    if any(o.get("kind") == "Job" for o in owners):
+        print(f"{p[\"metadata\"][\"namespace\"]} {p[\"metadata\"][\"name\"]}")
+' 2>/dev/null || true)
+
    bad=$( {
        $KUBECTL get pods -A --no-headers --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null \
            | grep -E 'CrashLoopBackOff|Error|Pending|Init:|ImagePullBackOff|ErrImagePull' || true
@ -202,6 +216,14 @@ check_pods() {
            | grep -E 'CrashLoopBackOff|ImagePullBackOff|ErrImagePull' || true
    } | awk '!seen[$1,$2]++' | sed '/^$/d') || true

+    # Filter out Job-owned pods
+    if [[ -n "$job_owned_pods" && -n "$bad" ]]; then
+        bad=$(echo "$bad" | awk -v jp="$job_owned_pods" '
+            BEGIN { n = split(jp, lines, "\n"); for (i=1;i<=n;i++) skip[lines[i]] = 1 }
+            { key = $1 " " $2; if (!(key in skip)) print }
+        ')
+    fi
+
    count=$(count_lines "$bad")

    if [[ "$count" -eq 0 ]]; then
@ -228,7 +250,21 @@ check_evicted() {
    section 5 "Evicted/Failed Pods"
    local evicted count detail="" status="PASS"

-    evicted=$($KUBECTL get pods -A --no-headers --field-selector=status.phase=Failed 2>/dev/null || true)
+    # Exclude pods owned by Jobs — those are CronJob retries that K8s leaves
+    # behind for log inspection. They're not "evicted" in the cluster-health
+    # sense and the next CronJob fire replaces them.
+    evicted=$($KUBECTL get pods -A -o json --field-selector=status.phase=Failed 2>/dev/null | python3 -c '
+import json, sys
+try:
+    d = json.load(sys.stdin)
+except Exception:
+    sys.exit(0)
+for p in d.get("items", []):
+    owners = p["metadata"].get("ownerReferences", [])
+    if any(o.get("kind") == "Job" for o in owners):
+        continue
+    print(f"{p[\"metadata\"][\"namespace\"]}\t{p[\"metadata\"][\"name\"]}\t{p.get(\"status\",{}).get(\"reason\",\"\")}")
+' 2>/dev/null || true)
    count=$(count_lines "$evicted")

    if [[ "$count" -eq 0 ]]; then
@ -539,18 +575,25 @@ check_alerts() {
        return 0
    fi

+    # Only count warning + critical alerts. Info-level alerts (RecentNodeReboot,
+    # PVAutoExpanding, etc.) are informational by design and shouldn't be
+    # treated as a script-level WARN — the alert rules themselves already
+    # encode the severity.
    firing_count=$(echo "$alerts" | python3 -c '
 import json, sys
+ACTIONABLE = {"warning", "critical"}
+def actionable(labels):
+    return labels.get("severity", "info").lower() in ACTIONABLE
 try:
    data = json.load(sys.stdin)
    if isinstance(data, list):
-        active = [a for a in data if a.get("status", {}).get("state") == "active"]
+        active = [a for a in data if a.get("status", {}).get("state") == "active" and actionable(a.get("labels", {}))]
        count = len(active)
        names = [a.get("labels", {}).get("alertname", "?") for a in active]
        print(f"{count}:" + ",".join(names) if count > 0 else "0:")
    elif isinstance(data, dict) and "data" in data:
        alerts_list = data["data"].get("alerts", [])
-        firing = [a for a in alerts_list if a.get("state") == "firing"]
+        firing = [a for a in alerts_list if a.get("state") == "firing" and actionable(a.get("labels", {}))]
        count = len(firing)
        names = [a.get("labels", {}).get("alertname", "?") for a in firing]
        print(f"{count}:" + ",".join(names) if count > 0 else "0:")
@ -598,17 +641,55 @@ check_uptime_kuma() {
        return 0
    fi

-    result=$(UPTIME_KUMA_PASSWORD="$uk_pass" ~/.venvs/claude/bin/python3 -c '
-import sys, os
+    # Connect via kubectl port-forward to the internal Service. The public
+    # URL (uptime.viktorbarzin.me) is behind Authentik forward-auth, which
+    # 302-redirects the Socket.IO handshake the library uses — there's no
+    # way for an unauthenticated script to complete the OAuth dance.
+    # Port-forward gives us a direct path to the in-cluster ClusterIP
+    # service and works from any host with kubectl access.
+    local pf_port=18444 pf_pid
+    $KUBECTL port-forward -n uptime-kuma svc/uptime-kuma "$pf_port:80" >/dev/null 2>&1 &
+    pf_pid=$!
+    # Detach from job control so bash doesn't print "Killed" to stderr
+    # when we SIGKILL the port-forward at the end of this check — that
+    # message corrupts stdout when stderr is merged for JSON parsing.
+    disown "$pf_pid" 2>/dev/null || true
+    # Wait up to 5s for the local listener to come up.
+    local i
+    for i in 1 2 3 4 5; do
+        if (echo >"/dev/tcp/127.0.0.1/$pf_port") 2>/dev/null; then break; fi
+        sleep 1
+    done
+
+    result=$(UPTIME_KUMA_PASSWORD="$uk_pass" UK_URL="http://127.0.0.1:$pf_port" \
+        ~/.venvs/claude/bin/python3 -c '
+import sys, os, time
 try:
    from uptime_kuma_api import UptimeKumaApi
 except ImportError:
    print("ERROR:uptime-kuma-api not installed")
    sys.exit(0)

+# Retry up to 3 times — the Socket.IO handshake is occasionally flaky
+# even against the internal service during cluster churn.
+last_exc = None
+api = None
+for attempt in range(3):
+    try:
+        api = UptimeKumaApi(os.environ["UK_URL"], timeout=120, wait_events=0.2)
+        api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])
+        break
+    except Exception as e:
+        last_exc = e
+        try: api.disconnect()
+        except Exception: pass
+        api = None
+        time.sleep(2 * (attempt + 1))
+if api is None:
+    print(f"CONN_ERROR:{last_exc}")
+    sys.exit(0)
+
 try:
-    api = UptimeKumaApi("https://uptime.viktorbarzin.me", timeout=120, wait_events=0.2)
-    api.login("admin", os.environ["UPTIME_KUMA_PASSWORD"])

    monitors = api.get_monitors()
    heartbeats = api.get_heartbeats()
@ -663,6 +744,13 @@ except Exception as e:
    print(f"CONN_ERROR:{e}")
 ' 2>/dev/null) || result="CONN_ERROR:python execution failed"

+    # Always tear down the port-forward. Use SIGKILL directly — kubectl
+    # port-forward sometimes ignores SIGTERM during teardown and we don't
+    # need a graceful exit for a localhost listener. Skip `wait` because
+    # in `set -m` mode the backgrounded child may not be reapable here,
+    # causing the script to hang indefinitely; the shell reaps it on exit.
+    kill -9 "$pf_pid" 2>/dev/null || true
+
    if [[ "$result" == "ERROR:"* ]]; then
        [[ "$QUIET" == true ]] && section_always 14 "Uptime Kuma Monitors"
        warn "Uptime Kuma: ${result#ERROR:}"
@ -1074,9 +1162,14 @@ for item in data.get("items", []):
                    expiry = datetime.strptime(date_str.strip(), "%b %d %H:%M:%S %Y %Z")
                    expiry = expiry.replace(tzinfo=timezone.utc)
                    days_left = (expiry - datetime.now(timezone.utc)).days
+                    # Threshold rationale (lowered from 30d):
+                    # - cnpg-webhook-cert: CNPG operator auto-rotates at 7d before expiry
+                    # - kyverno-*-tls-pair: Kyverno auto-rotates at 15d before expiry
+                    # - viktorbarzin.me Lets Encrypt wildcard: renewed weekly via Woodpecker
+                    # Anything still <14d at check time is genuinely worth surfacing.
                    if days_left <= 7:
                        print(f"FAIL:{ns}/{name}:{days_left}d")
-                    elif days_left <= 30:
+                    elif days_left <= 14:
                        print(f"WARN:{ns}/{name}:{days_left}d")
                except ValueError:
                    pass
@ -1085,8 +1178,8 @@ for item in data.get("items", []):
 ' 2>/dev/null) || true

    if [[ -z "$cert_issues" ]]; then
-        pass "All TLS certificates valid for >30 days"
-        json_add "tls_certs" "PASS" "All valid >30d"
+        pass "All TLS certificates valid for >14 days"
+        json_add "tls_certs" "PASS" "All valid >14d"
    else
        [[ "$QUIET" == true ]] && section_always 22 "TLS Certificate Expiry"
        while IFS= read -r line; do
@ -1332,12 +1425,59 @@ check_ha_entities() {
    local result
    result=$(export HA_CACHE_DIR; python3 << 'PYEOF'
 import os, json
+from datetime import datetime, timezone, timedelta
+
+# Noise filter rationale:
+# * The HA "unavailable" state covers everything from "the iDRAC scrape failed
+#   30 seconds ago" to "this iPhone hasn't checked in in 6 hours" to
+#   "this YAML rest sensor has been broken for a week". Counting all of them
+#   produces 400+ alerts that are mostly expected (phones in standby, lights
+#   off, TVs idle).
+# * Three filters dramatically cut noise without hiding real outages:
+#     1. SKIP_DOMAINS — domains that go unavailable transiently by design
+#        (mobile_app on backgrounded apps, notify per-device, button/scene/
+#        event are momentary).
+#     2. STALE_HOURS — only count entities that have been unavailable for
+#        this long. A flapping integration that recovers in <24h is noise;
+#        one stuck for >24h is real.
+#     3. SKIP_DEVICE_HINTS — friendly-name substrings for things that come
+#        and go (laptops, phones, TVs, vacuums, washers).
+SKIP_DOMAINS = {"mobile_app", "device_tracker", "notify", "button", "scene",
+                "event", "image", "update"}
+SKIP_DEVICE_HINTS = ("iphone", "ipad", "macbook", "mac mini", "tv", "bravia",
+                     "playstation", "switch", "roomba", "vacuum", "rumi",
+                     "ipad", "laptop", "phone", "перална", "сушилня",
+                     "миялна", "laptop2")
+STALE_HOURS = 24

 cache = os.environ["HA_CACHE_DIR"]
 with open(f"{cache}/states.json") as f:
    states = json.load(f)

-unavail = [s for s in states if s.get("state") in ("unavailable", "unknown")]
+now = datetime.now(timezone.utc)
+threshold = now - timedelta(hours=STALE_HOURS)
+
+def is_stale(s):
+    if s.get("state") not in ("unavailable", "unknown"):
+        return False
+    domain = s["entity_id"].split(".")[0]
+    if domain in SKIP_DOMAINS:
+        return False
+    name = (s.get("attributes", {}).get("friendly_name") or "").lower()
+    if any(h in name for h in SKIP_DEVICE_HINTS):
+        return False
+    # last_changed = when the state last flipped. If it flipped to unavailable
+    # >24h ago and stayed there, the integration is genuinely broken.
+    lc = s.get("last_changed") or s.get("last_updated")
+    if not lc:
+        return True  # no timestamp = treat as old
+    try:
+        dt = datetime.fromisoformat(lc.replace("Z", "+00:00"))
+    except ValueError:
+        return True
+    return dt < threshold
+
+unavail = [s for s in states if is_stale(s)]
 domains = {}
 for s in unavail:
    d = s["entity_id"].split(".")[0]
@ -1496,24 +1636,42 @@ with open(f"{cache}/states.json") as f:

 autos = [s for s in states if s["entity_id"].startswith("automation.")]
 total = len(autos)
-disabled = [a["entity_id"] for a in autos if a["state"] == "off"]
-disabled_count = len(disabled)
+
+# Noise filter rationale (was: any disabled OR not-triggered-in-30d):
+# * "Disabled" alone is fine — Viktor disables automations intentionally
+#   (seasonal, holiday-only, paused). Only flag when ABANDONED, i.e.
+#   disabled for >180 days AND never triggered recently.
+# * "Stale" alone is fine for low-frequency automations (annual reminders,
+#   manual triggers). Raise the bar to 180d (was 30d).
+DISABLED_STALE_DAYS = 180
+STALE_DAYS = 180

 now = datetime.now(timezone.utc)
+
+def days_since(ts):
+    if not ts:
+        return None
+    try:
+        return (now - datetime.fromisoformat(ts.replace("Z", "+00:00"))).days
+    except Exception:
+        return None
+
+disabled = []
 stale = []
 for a in autos:
+    lt_days = days_since(a.get("attributes", {}).get("last_triggered"))
+    changed_days = days_since(a.get("last_changed"))
    if a["state"] == "off":
-        continue
-    lt = a.get("attributes", {}).get("last_triggered")
-    if lt:
-        try:
-            t = datetime.fromisoformat(lt.replace("Z", "+00:00"))
-            days = (now - t).days
-            if days > 30:
-                stale.append(a["entity_id"] + "=" + str(days) + "d")
-        except:
-            pass
+        # Only flag a disabled automation if it has ALSO been untouched for
+        # the threshold — i.e. genuinely abandoned, not "paused for now".
+        # Use last_changed as a proxy for "user-touched recently".
+        if changed_days is None or changed_days > DISABLED_STALE_DAYS:
+            disabled.append(a["entity_id"])
+    else:
+        if lt_days is not None and lt_days > STALE_DAYS:
+            stale.append(f"{a['entity_id']}={lt_days}d")

+disabled_count = len(disabled)
 stale_count = len(stale)
 disabled_names = "; ".join(disabled)
 stale_names = "; ".join(stale[:10])
@ -2307,6 +2465,107 @@ except Exception as e:
 }

 # --- 42. External Reachability: Traefik 5xx Rate ---
+check_pve_thermals() {
+    section 43 "PVE Host Thermals — Xeon E5-2699v4 package + per-core temps"
+    local raw status="PASS"
+
+    # Read all hwmon temp inputs in one SSH round-trip. Output: one line per
+    # sensor, "<sensor_label> <celsius>". Falls back gracefully on missing
+    # labels (Xeon coretemp driver exposes both `Package id 0` and `Core N`).
+    raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
+        root@192.168.1.127 '
+        cd /sys/class/hwmon/hwmon0 2>/dev/null || exit 1
+        for tfile in temp*_input; do
+            [[ -e "$tfile" ]] || continue
+            base=${tfile%_input}
+            label=$(cat "${base}_label" 2>/dev/null || echo "$base")
+            val=$(cat "$tfile" 2>/dev/null)
+            [[ -n "$val" ]] && echo "$label $((val/1000))"
+        done
+        ' 2>/dev/null || true)
+
+    if [[ -z "$raw" ]]; then
+        [[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
+        warn "Could not read hwmon temps from 192.168.1.127 (SSH BatchMode failed or path missing)"
+        json_add "pve_thermals" "WARN" "SSH failed or hwmon path missing"
+        return 0
+    fi
+
+    local pkg_temp max_core_temp max_core_label
+    pkg_temp=$(echo "$raw" | awk '/^Package id/{print $NF; exit}')
+    max_core_temp=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print m}')
+    max_core_label=$(echo "$raw" | awk '/^Core/{if($NF>m){m=$NF; lbl=$1" "$2}} END{print lbl}')
+
+    # Healthy baseline for this R730 (verified Apr 20-May 8 2026 from
+    # Prometheus): peak 61-69°C, avg 51-55°C. Treat anything above 65°C
+    # as a signal that some VM/workload is using too much CPU and warrants
+    # investigation, even though the Xeon E5-2699v4 has TjMax=83°C /
+    # Tcrit=93°C. This catches load creep early, well before throttling.
+    #   PASS  < 65°C package    (within baseline 55-65 °C band)
+    #   WARN  65-82°C package   (elevated — investigate top CPU consumer)
+    #   FAIL  >= 83°C package   (at/above TjMax — throttling imminent)
+    local detail="package=${pkg_temp}°C max_core=${max_core_temp}°C (${max_core_label})"
+    if [[ -z "$pkg_temp" ]]; then
+        [[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
+        warn "Package temp not found in hwmon output"
+        json_add "pve_thermals" "WARN" "$detail"
+    elif [[ "$pkg_temp" -ge 83 ]]; then
+        [[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
+        fail "PVE package temp ${pkg_temp}°C >= TjMax (83°C) — throttling imminent. $detail"
+        json_add "pve_thermals" "FAIL" "$detail"
+        status="FAIL"
+    elif [[ "$pkg_temp" -ge 65 ]]; then
+        [[ "$QUIET" == true ]] && section_always 43 "PVE Host Thermals"
+        warn "PVE package temp ${pkg_temp}°C above baseline (>65°C) — some VM is using too much CPU; check top kvm processes. $detail"
+        json_add "pve_thermals" "WARN" "$detail"
+    else
+        pass "PVE package ${pkg_temp}°C, hottest core ${max_core_temp}°C (${max_core_label}) — within 55-65°C baseline"
+        json_add "pve_thermals" "PASS" "$detail"
+    fi
+}
+
+check_pve_load() {
+    section 44 "PVE Host Load — load avg vs 44-thread capacity"
+    local raw load_1 load_5 load_15
+
+    raw=$(ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
+        root@192.168.1.127 'cat /proc/loadavg' 2>/dev/null || true)
+
+    if [[ -z "$raw" ]]; then
+        [[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
+        warn "Could not read /proc/loadavg from 192.168.1.127"
+        json_add "pve_load" "WARN" "SSH failed"
+        return 0
+    fi
+
+    load_1=$(echo "$raw" | awk '{print $1}')
+    load_5=$(echo "$raw" | awk '{print $2}')
+    load_15=$(echo "$raw" | awk '{print $3}')
+    # Round load_5 down for integer comparison (avoid bc dep)
+    local load_5_int
+    load_5_int=$(printf '%.0f' "$load_5")
+
+    # R730: 44 hw threads (22c × HT). Healthy avg ~ 15-22 (~30-50% utilisation
+    # of thread count). Warn when sustained 5-min above 30 (~70% threads
+    # busy). Fail when 5-min above 38 (~85% — close to scheduler saturation).
+    #   PASS  load_5 < 30
+    #   WARN  30 <= load_5 < 38
+    #   FAIL  load_5 >= 38
+    local detail="1m=${load_1} 5m=${load_5} 15m=${load_15}"
+    if [[ "$load_5_int" -ge 38 ]]; then
+        [[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
+        fail "PVE 5-min load ${load_5} >= 38 of 44 threads — saturation. $detail"
+        json_add "pve_load" "FAIL" "$detail"
+    elif [[ "$load_5_int" -ge 30 ]]; then
+        [[ "$QUIET" == true ]] && section_always 44 "PVE Host Load"
+        warn "PVE 5-min load ${load_5} in warn band (30-37 of 44 threads). $detail"
+        json_add "pve_load" "WARN" "$detail"
+    else
+        pass "PVE load avg $detail (< 30/44 threads)"
+        json_add "pve_load" "PASS" "$detail"
+    fi
+}
+
 check_external_traefik_5xx() {
    section 42 "External — Traefik 5xx Rate (15m)"
    local query_result detail="" status="PASS"
@ -2463,6 +2722,8 @@ main() {
    check_monitoring_css
    check_external_replicas
    check_external_divergence
+    check_pve_thermals
+    check_pve_load
    check_external_traefik_5xx
    print_summary

--- a/scripts/daily-backup.sh
+++ b/scripts/daily-backup.sh
@ -207,7 +207,15 @@ else
            dst="${BACKUP_ROOT}/pvc-data/${WEEK}/${ns_pvc}"
            mkdir -p "${dst}"
            rsync_rc=0
-            rsync -az --delete \
+            # Per-PVC rsync timeout (30 min). Without this, a single hung
+            # PVC blocks the entire backup until systemd's TimeoutStartSec
+            # kills the script (4h ceiling), leaving every later PVC
+            # unbacked and silently triggering WeeklyBackupFailing. Picked
+            # 30 min as well above the largest PVC's normal copy time
+            # (immich-postgres ~10 GiB, ~3 min on local ext4) and well
+            # below the unit-level budget so we still have headroom to
+            # finish the rest.
+            timeout 1800 rsync -az --delete \
                ${PREV:+--link-dest="${PREV}/${ns_pvc}/"} \
                "${PVC_MOUNT}/" "${dst}/" 2>&1 || rsync_rc=$?
            if [ "$rsync_rc" -eq 0 ]; then
@ -217,6 +225,12 @@ else
                # (in-flight writes have corrupt metadata from skipped journal replay)
                PVC_COUNT=$((PVC_COUNT + 1))
                log "  partial rsync (LUKS noload) for ${ns_pvc} — OK"
+            elif [ "$rsync_rc" -eq 124 ]; then
+                # `timeout` exit 124 = wall-clock killed the rsync. Track
+                # separately so the next run still produces a metric and
+                # doesn't pretend nothing happened.
+                warn "rsync timed out for ${ns_pvc} after 30 min — moving on"
+                PVC_FAIL=$((PVC_FAIL + 1))
            else
                warn "rsync failed for ${ns_pvc} (rc=$rsync_rc)"
                PVC_FAIL=$((PVC_FAIL + 1))
@ -232,7 +246,11 @@ else
                        relpath="${dbfile#${PVC_MOUNT}/}"
                        dest_file="${BACKUP_ROOT}/sqlite-backup/${WEEK}/${ns_pvc}/${relpath}"
                        mkdir -p "$(dirname "${dest_file}")"
-                        if sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
+                        # 5-min sqlite timeout — same hang-prevention idea
+                        # as rsync above. A corrupted SQLite or one held
+                        # open by a writer in the snapshot can otherwise
+                        # block .backup indefinitely.
+                        if timeout 300 sqlite3 "file://${dbfile}?mode=ro" ".backup '${dest_file}'" 2>/dev/null; then
                            log "    SQLite: ${ns_pvc}/${relpath}"
                        else
                            cp "${dbfile}" "${dest_file}" 2>/dev/null || true
@ -326,7 +344,7 @@ fi
 # ============================================================
 log "--- Step 4: PVE host config ---"
 mkdir -p "${BACKUP_ROOT}/pve-config/scripts"
-rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
+timeout 300 rsync -az --delete /etc/pve/ "${BACKUP_ROOT}/pve-config/etc-pve/" 2>&1 || { warn "Failed to sync /etc/pve"; STATUS=1; }
 for script in /usr/local/bin/lvm-pvc-snapshot /usr/local/bin/daily-backup /usr/local/bin/offsite-sync-backup; do
    [ -f "${script}" ] && cp "${script}" "${BACKUP_ROOT}/pve-config/scripts/" 2>/dev/null || true
 done
--- a/scripts/tg
+++ b/scripts/tg
@ -102,6 +102,30 @@ for arg in "$@"; do
  esac
 done

+# Detect if this is a plan/apply/destroy/refresh — anything that reads or
+# writes infra state. Cheap pre-flight check below scans only the current
+# stack's .tf files for the ingress_factory auth-comment convention. Other
+# tg verbs (init, fmt, validate) skip the check.
+is_tf_op=false
+for arg in "$@"; do
+  case "$arg" in
+    plan|apply|destroy|refresh) is_tf_op=true ;;
+  esac
+done
+
+# Anti-exposure guard: every `auth = "app"` or `auth = "none"` in this stack
+# must have a preceding `# auth = "<tier>":` comment documenting what gates
+# the app or why the endpoint is intentionally public. See:
+# - infra/modules/kubernetes/ingress_factory/main.tf (variable description)
+# - infra/.claude/CLAUDE.md "Auth" section
+# Stack-scoped: untouched stacks aren't blocked from future applies until
+# they're actually edited, at which point the convention applies.
+if $is_tf_op && [ -n "$STACK_NAME" ]; then
+  if ! "$REPO_ROOT/scripts/check-ingress-auth-comments.py" "$REPO_ROOT/stacks/$STACK_NAME"; then
+    exit 1
+  fi
+fi
+
 # Acquire lock for mutating operations (Tier 0 only — Tier 1 uses pg_advisory_lock)
 if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
  if command -v vault &>/dev/null && [ -n "${VAULT_TOKEN:-}" ]; then
--- a/scripts/update_k8s.sh
+++ b/scripts/update_k8s.sh
@ -1,36 +1,114 @@
 #!/usr/bin/env bash
+#
+# K8s component upgrader. Run on a single node (master OR worker) at a time.
+# The caller is responsible for:
+#   - draining + uncordoning the node (this script does not touch kubectl)
+#   - sequencing nodes (master first, then workers one at a time)
+#   - pre-flight checks (etcd snapshot, halt-on-alert, etc)
+#
+# Used by:
+#   - the k8s-version-upgrade agent (infra/.claude/agents/k8s-version-upgrade.md)
+#   - manual operators following the runbook (infra/docs/runbooks/k8s-version-upgrade.md)
+#
+# Old manual orchestration loop (kept for reference — the agent does the
+# equivalent now):
+#   for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do
+#     kb drain $n --ignore-daemonsets --delete-emptydir-data
+#     s wizard@$n 'bash -s' < update_k8s.sh --role worker --release 1.34.5
+#     kb uncordon $n
+#   done

-# run for all nodes using :
-# for n in $(kbn | grep 'k8s-node' | awk '{print $1}'); do echo $n; kb drain $n --ignore-daemonsets --delete-emptydir-data; s wizard@$n 'bash -s' <update_k8s.sh; kb uncordon $n; done
+set -euo pipefail

-set -e
-export stable_version='1.34'  # change me
-export release="$stable_version.2"  # change me
+ROLE=""
+RELEASE=""

-echo "Upgrading to $stable_version"
+usage() {
+    cat <<EOF
+Usage: $0 --role <master|worker> --release <X.Y.Z>

-echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
-sudo mkdir -p /etc/apt/keyrings
-curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$stable_version/deb/Release.key" | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
+  --role     master|worker  (required)
+  --release  kubeadm/kubelet/kubectl target patch version, e.g. 1.34.5

-sudo apt-mark unhold kubeadm kubelet kubectl
-sudo apt-get update 
-sudo apt-get install -y kubeadm="$release-*" 
+Behavior:
+  - Rewrites /etc/apt/sources.list.d/kubernetes.list to the v\$MINOR/deb repo
+    derived from --release (so a 1.34.x release uses v1.34/deb, 1.35.x uses
+    v1.35/deb, etc).
+  - apt-get install kubeadm=<release>-* (apt-mark unhold first).
+  - master: kubeadm upgrade plan && kubeadm upgrade apply v<release> -y
+  - worker: kubeadm upgrade node
+  - apt-get install kubelet=<release>-* kubectl=<release>-* then re-hold.
+  - systemctl daemon-reload && systemctl restart kubelet
+EOF
+}

-HOSTNAME=$(hostname)
-SEARCH_STR="master"
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --role)    ROLE="$2"; shift 2;;
+        --release) RELEASE="$2"; shift 2;;
+        -h|--help) usage; exit 0;;
+        *) echo "Unknown arg: $1" >&2; usage; exit 2;;
+    esac
+done

-if [[ "$HOSTNAME" == *"$SEARCH_STR"* ]]; then
-    echo "Upgrading master"
-    sudo kubeadm upgrade plan && sudo kubeadm upgrade apply v$release -y
-else
-    echo "Upgrading worker"
-    sudo kubeadm upgrade node 
+if [[ -z "$ROLE" || -z "$RELEASE" ]]; then
+    echo "ERROR: --role and --release are required" >&2
+    usage
+    exit 2
 fi

-sudo apt-get install -y kubelet="$release-*" kubectl="$release-*"
-sudo apt-mark hold kubeadm kubelet kubectl
+if [[ "$ROLE" != "master" && "$ROLE" != "worker" ]]; then
+    echo "ERROR: --role must be 'master' or 'worker' (got: $ROLE)" >&2
+    exit 2
+fi

+# Derive minor track (e.g. 1.34.5 → 1.34)
+STABLE_VERSION="$(echo "$RELEASE" | awk -F. '{print $1"."$2}')"
+
+echo "==> Upgrading $(hostname) ($ROLE) to v$RELEASE (track v$STABLE_VERSION)"
+
+# Apt repo URL is pinned per minor track. Rewrite + re-import the signing key
+# every run — cheap, idempotent, and handles the minor-bump case where the
+# old track's repo no longer carries the target version.
+echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/ /" \
+    | sudo tee /etc/apt/sources.list.d/kubernetes.list
+sudo mkdir -p /etc/apt/keyrings
+curl -fsSL "https://pkgs.k8s.io/core:/stable:/v$STABLE_VERSION/deb/Release.key" \
+    | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg --batch --yes
+
+sudo apt-mark unhold kubeadm kubelet kubectl
+sudo apt-get update
+sudo apt-get install -y "kubeadm=$RELEASE-*"
+
+if [[ "$ROLE" == "master" ]]; then
+    echo "==> Master path: kubeadm upgrade plan + apply"
+    sudo kubeadm upgrade plan
+    # The first apply may fail with "static Pod hash for component <X> did
+    # not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload
+    # a static pod is too tight on our cluster (apiserver-to-kubelet status
+    # sync latency post-master-reboot can exceed it). The etcd image IS
+    # actually updated by then, so a 2nd attempt sees etcd already on
+    # target and skips it. Up to 3 attempts with a 30s delay between.
+    attempt=1
+    while ! sudo kubeadm upgrade apply "v$RELEASE" -y; do
+        if (( attempt >= 3 )); then
+            echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2
+            exit 1
+        fi
+        echo "==> kubeadm apply attempt $attempt failed (likely static-pod-hash 5m timeout). Sleeping 30s then retrying — the previous attempt's manifest writes usually take hold on the 2nd try."
+        sleep 30
+        attempt=$(( attempt + 1 ))
+    done
+    echo "==> kubeadm upgrade apply succeeded on attempt $attempt"
+else
+    echo "==> Worker path: kubeadm upgrade node"
+    sudo kubeadm upgrade node
+fi
+
+sudo apt-get install -y "kubelet=$RELEASE-*" "kubectl=$RELEASE-*"
+sudo apt-mark hold kubeadm kubelet kubectl

 sudo systemctl daemon-reload
 sudo systemctl restart kubelet
+
+echo "==> Done: $(hostname) is on v$RELEASE"
--- a/scripts/update_node.sh
+++ b/scripts/update_node.sh
@ -1,8 +1,14 @@
 #!/usr/bin/env bash
+#
+# OS-major upgrade (Ubuntu do-release-upgrade). NOT in the auto-upgrade
+# pipeline — minor apt patches are handled by unattended-upgrades + kured;
+# K8s component bumps are handled by the k8s-version-upgrade agent. Run this
+# script manually when bumping Ubuntu LTS major versions.
+#
+# See:
+#   - infra/docs/runbooks/k8s-node-auto-upgrades.md  (apt + reboot)
+#   - infra/docs/runbooks/k8s-version-upgrade.md     (kubeadm/kubelet/kubectl)

 # sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
 sudo do-release-upgrade
 sudo apt update && sudo apt autoremove -y && sudo apt upgrade -y
-
-
-
--- a/scripts/upgrade_state.sh
+++ b/scripts/upgrade_state.sh
@ -0,0 +1,619 @@
+#!/usr/bin/env bash
+#
+# upgrade_state.sh — survey the three autonomous-upgrade pipelines.
+#
+# Companion to cluster_healthcheck.sh, surfaced via the /upgrade-state skill.
+# Read-only by design — no --fix.
+#
+# The three pipelines:
+#   1. Apps  — Keel polls registries hourly and rolls Deployments tagged
+#              keel.sh/policy. Metrics on container :9300/metrics.
+#   2. OS    — unattended-upgrades patches in-release per node; kured
+#              reboots within a daily 02:00-06:00 London window.
+#   3. K8s   — k8s-version-check CronJob (Sun 12:00 UTC) detects new
+#              kubeadm patch/minor releases; Job-chain drains+upgrades
+#              node-by-node. Pushgateway holds k8s_upgrade_* gauges.
+#
+# Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
+
+set -euo pipefail
+
+# --- Colors ---
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[0;33m'
+BLUE='\033[0;34m'
+BOLD='\033[1m'
+NC='\033[0m'
+
+# --- Globals ---
+JSON=false
+KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
+[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="/home/wizard/code/infra/config"
+KUBECTL=""
+NODES=(k8s-master:10.0.20.100 k8s-node1:10.0.20.101 k8s-node2:10.0.20.102 k8s-node3:10.0.20.103 k8s-node4:10.0.20.104)
+SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no)
+NOW_EPOCH=$(date -u +%s)
+HIGHEST_EXIT=0  # 0 healthy, 1 attention, 2 stalled
+
+# Results — collectors fill these.
+APPS_STATUS_ICON=""; APPS_STATUS_TEXT=""
+APPS_LAST_CHECK=""; APPS_NEXT=""; APPS_NOTES=""
+APPS_ENROLLED=0; APPS_PENDING=0; APPS_UPDATES_LINE=""; APPS_ERROR_LINE=""
+
+OS_STATUS_ICON=""; OS_STATUS_TEXT=""
+OS_LAST_CHECK=""; OS_NEXT=""; OS_NOTES=""
+OS_DISTRO_SUMMARY=""; OS_KERNEL_SUMMARY=""
+OS_PENDING_REBOOT_NODES=""; OS_HELD_DETAIL=""
+OS_LAST_UU=""; OS_LAST_KURED=""
+
+K8S_STATUS_ICON=""; K8S_STATUS_TEXT=""
+K8S_LAST_CHECK=""; K8S_NEXT=""; K8S_NOTES=""
+K8S_RUNNING=""; K8S_PATCH=""; K8S_MINOR=""
+K8S_LAST_DETECT_LINE=""; K8S_IN_FLIGHT="no"; K8S_LAST_CHAIN=""
+
+# --- Helpers ---
+log() { [[ "$JSON" == true ]] && return 0; echo -e "$*"; }
+
+raise_exit() {
+    local n="$1"
+    if [[ "$n" -gt "$HIGHEST_EXIT" ]]; then HIGHEST_EXIT="$n"; fi
+    return 0
+}
+
+usage() {
+    cat <<EOF
+Usage: $0 [--json] [--kubeconfig <path>]
+
+Read-only audit of the three autonomous-upgrade pipelines (apps, OS, k8s).
+
+  --json              machine-readable JSON
+  --kubeconfig PATH   override kubeconfig
+
+Exit codes: 0 healthy, 1 attention warranted, 2 something stalled.
+EOF
+}
+
+parse_args() {
+    while [[ $# -gt 0 ]]; do
+        case "$1" in
+            --json)       JSON=true; shift ;;
+            --kubeconfig) KUBECONFIG_PATH="$2"; shift 2 ;;
+            -h|--help)    usage; exit 0 ;;
+            *) echo "Unknown option: $1" >&2; exit 1 ;;
+        esac
+    done
+    KUBECTL="kubectl --kubeconfig $KUBECONFIG_PATH"
+}
+
+# Prometheus query — Prometheus + reload + backup share a network namespace,
+# so reaching localhost:9090 works from any of the three sidecars.
+prom_q() {
+    local q="$1"
+    $KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+        wget -qO- "http://localhost:9090/api/v1/query?query=${q}" 2>/dev/null || true
+}
+
+pg_metrics() {
+    $KUBECTL -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
+        wget -qO- "http://prometheus-prometheus-pushgateway:9091/metrics" 2>/dev/null || true
+}
+
+ssh_node() {
+    local ip="$1"; shift
+    ssh "${SSH_OPTS[@]}" "wizard@$ip" "$@" 2>/dev/null || true
+}
+
+human_age() {
+    local secs="$1"
+    if   [[ "$secs" -lt 60    ]]; then printf '%ds ago' "$secs"
+    elif [[ "$secs" -lt 3600  ]]; then printf '%dm ago' $((secs/60))
+    elif [[ "$secs" -lt 86400 ]]; then printf '%dh ago' $((secs/3600))
+    else                               printf '%dd ago' $((secs/86400))
+    fi
+}
+
+# Pushgateway emits floats and scientific notation — coerce to integer
+# epoch seconds. Returns 0 if the input is empty / zero / unparseable.
+to_epoch_int() {
+    local v="${1:-}"
+    if [[ -z "$v" || "$v" == "0" ]]; then echo 0; return; fi
+    python3 -c "import sys; v=sys.argv[1]; print(int(float(v)))" "$v" 2>/dev/null || echo 0
+}
+
+# --- 1. Apps (Keel) ---
+collect_apps() {
+    local pending tracked enrolled updates_24h errors
+
+    # Enrolled: count Deployments with keel.sh/policy != never (Keel itself
+    # is policy=never). The Kyverno auto-injection labels namespaces
+    # keel.sh/enrolled=true, but the annotation is what Keel watches.
+    enrolled=$($KUBECTL get deploy -A -o json 2>/dev/null | python3 -c '
+import json, sys
+data = json.load(sys.stdin)
+n = sum(1 for d in data["items"]
+        if (d["metadata"].get("annotations") or {}).get("keel.sh/policy", "never") != "never")
+print(n)
+' 2>/dev/null || echo 0)
+    APPS_ENROLLED="$enrolled"
+
+    # Pending approvals (sum across Keel pods).
+    pending=$(prom_q 'sum(pending_approvals)' | python3 -c '
+import json, sys
+try:
+    r = json.load(sys.stdin)["data"]["result"]
+    print(int(float(r[0]["value"][1])) if r else 0)
+except Exception:
+    print(0)
+' 2>/dev/null || echo 0)
+    APPS_PENDING="$pending"
+
+    # Tracked images — proxy for "is the scrape live?".
+    tracked=$(prom_q 'count(count by (image) (registries_scanned_total))' | python3 -c '
+import json, sys
+try:
+    r = json.load(sys.stdin)["data"]["result"]
+    print(int(float(r[0]["value"][1])) if r else 0)
+except Exception:
+    print(0)
+' 2>/dev/null || echo 0)
+
+    # Last scrape age — `up{job="kubernetes-pods", app="keel"}` is 1 if the
+    # most recent scrape succeeded. We surface the wallclock age via a tiny
+    # `time() - timestamp(up{...})` query.
+    APPS_LAST_CHECK=$(prom_q 'time()-timestamp(up{job="kubernetes-pods",app="keel"})' | python3 -c '
+import json, sys
+try:
+    r = json.load(sys.stdin)["data"]["result"]
+    if not r: print("scrape not live")
+    else:
+        secs = int(float(r[0]["value"][1]))
+        if secs < 60:  print(f"{secs}s ago")
+        elif secs < 3600: print(f"{secs//60}m ago")
+        else: print(f"{secs//3600}h ago")
+except Exception:
+    print("?")
+' 2>/dev/null || echo "?")
+
+    # Recent updates: count lines in Keel logs that report a successful
+    # rollout. Keel logs an "update completed" message per rollout.
+    local log_24h
+    log_24h=$($KUBECTL -n keel logs deploy/keel --since=24h --tail=2000 2>/dev/null || true)
+    updates_24h=$(echo "$log_24h" | grep -cE 'update completed|successfully updated|deployment updated' 2>/dev/null || true)
+    [[ -z "$updates_24h" ]] && updates_24h=0
+    APPS_UPDATES_LINE="$updates_24h in last 24h (tracked images: $tracked)"
+
+    # Known-benign Keel error patterns to suppress. Each is a real error
+    # line Keel emits, but the surrounding behaviour is fine, so flagging
+    # them in /upgrade-state is just noise.
+    #   - `bot.Run(): can not get configuration for bot [slack]` — Keel
+    #     1.2.0 registers a Slack socket-mode bot whenever SLACK_BOT_TOKEN
+    #     is set, then fails because we don't supply an `xapp-` app-level
+    #     token. We don't want the interactive bot (no approvals; opt-out
+    #     auto-update). The Slack NOTIFICATION sender works independently
+    #     of the bot, so rollout messages still post to #general.
+    #   - `failed to check digest` with a transient network error —
+    #     Keel polls ~175 image manifests against public registries
+    #     hourly. Occasional `i/o timeout` / `connection refused` /
+    #     `TLS handshake timeout` / `no such host` / `EOF` /
+    #     `context deadline exceeded` are inherent to public-internet
+    #     polling at that scale and auto-recover on the next poll.
+    #     Actionable digest-check failures surface as HTTP 401/404
+    #     (auth, removed-tag) — those are NOT filtered.
+    #   - `failed to check digest` with HTTP 5xx — upstream registry
+    #     having a problem (DockerHub maintenance, Forgejo restart,
+    #     etc.). Same recovery pattern as network errors: next hourly
+    #     poll succeeds once upstream is back. Persistent 5xx for >24h
+    #     would indicate a real registry-side issue, but that surfaces
+    #     via the registry's own monitoring (e.g. forgejo-integrity-probe
+    #     + RegistryCatalogInaccessible), not via Keel logs.
+    local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]'
+    benign_re+='|SLACK_APP_TOKEN must have the (previf|prefix)'
+    benign_re+='|failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|: EOF)'
+    benign_re+='|failed to check digest.*non-successful response \(status=5[0-9][0-9]'
+    errors=$(echo "$log_24h" | grep -iE '"level":"(error|fatal)"|level=error' | grep -vE "$benign_re" | tail -3 || true)
+    if [[ -z "$errors" ]]; then
+        APPS_ERROR_LINE="(none in last 24h)"
+    else
+        APPS_ERROR_LINE="$(echo "$errors" | wc -l | tr -d ' ') error(s); newest: $(echo "$errors" | tail -1 | cut -c1-120)"
+    fi
+
+    # Keel pod state.
+    local pod_status
+    pod_status=$($KUBECTL -n keel get pods -l app=keel -o jsonpath='{.items[*].status.phase}' 2>/dev/null || true)
+
+    if [[ "$pod_status" != *"Running"* ]]; then
+        APPS_STATUS_ICON="✗"; APPS_STATUS_TEXT="down"
+        APPS_NOTES="Keel pod not Running ($pod_status)"
+        raise_exit 2
+    elif [[ "$pending" -gt 0 || -n "$errors" ]]; then
+        APPS_STATUS_ICON="⚠"; APPS_STATUS_TEXT="attn"
+        APPS_NOTES="$enrolled enrolled; $pending pending; $(echo "$errors" | wc -l | tr -d ' ') recent error(s)"
+        raise_exit 1
+    else
+        APPS_STATUS_ICON="✓"; APPS_STATUS_TEXT="healthy"
+        APPS_NOTES="$enrolled enrolled, 0 pending, 0 errors"
+    fi
+
+    APPS_NEXT="rolling, hourly poll"
+}
+
+# --- 2. OS (apt + kured) ---
+collect_os() {
+    local distros kernels distro_uniq kernel_uniq
+    distros=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.osImage}{"\n"}{end}' 2>/dev/null)
+    kernels=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' 2>/dev/null)
+    distro_uniq=$(echo "$distros" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
+    kernel_uniq=$(echo "$kernels" | sort -u | tr '\n' ',' | sed 's/,$//; s/,/, /g')
+    OS_DISTRO_SUMMARY="$distro_uniq"
+    OS_KERNEL_SUMMARY="$kernel_uniq"
+
+    # SSH fan-out — parallel background subshells, write per-node results to tmp files.
+    local tmpdir; tmpdir=$(mktemp -d)
+    trap 'rm -rf "$tmpdir"' RETURN
+    local entry name ip
+    for entry in "${NODES[@]}"; do
+        name="${entry%%:*}"; ip="${entry##*:}"
+        (
+            local out reboot held upgradable uu_log
+            reboot=$(ssh_node "$ip" 'test -f /var/run/reboot-required && echo yes || echo no')
+            held=$(ssh_node "$ip" 'apt-mark showhold 2>/dev/null')
+            upgradable=$(ssh_node "$ip" 'apt list --upgradable 2>/dev/null | tail -n +2')
+            uu_log=$(ssh_node "$ip" 'tail -1 /var/log/unattended-upgrades/unattended-upgrades.log 2>/dev/null')
+            printf 'reboot=%s\n' "$reboot"      >  "$tmpdir/$name"
+            printf 'held<<<EOF\n%s\nEOF\n' "$held"         >> "$tmpdir/$name"
+            printf 'upgradable<<<EOF\n%s\nEOF\n' "$upgradable" >> "$tmpdir/$name"
+            printf 'uu_log=%s\n' "$uu_log"     >> "$tmpdir/$name"
+        ) &
+    done
+    wait
+
+    # Aggregate.
+    local pending_reboots=() held_with_bumps_lines=() newest_uu_ts=0 newest_uu_iso=""
+    for entry in "${NODES[@]}"; do
+        name="${entry%%:*}"
+        [[ -f "$tmpdir/$name" ]] || continue
+        local reboot held upgradable uu_log uu_ts
+        reboot=$(awk -F= '/^reboot=/{print $2}' "$tmpdir/$name")
+        held=$(awk '/^held<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
+        upgradable=$(awk '/^upgradable<<<EOF$/,/^EOF$/' "$tmpdir/$name" | sed '1d;$d')
+        uu_log=$(awk -F= '/^uu_log=/{sub(/^uu_log=/,""); print}' "$tmpdir/$name")
+
+        [[ "$reboot" == "yes" ]] && pending_reboots+=("$name")
+
+        # Held + upgradable, excluding k8s components (managed by k8s pipeline).
+        local pkg from to bump
+        while IFS= read -r line; do
+            [[ -z "$line" ]] && continue
+            pkg=$(echo "$line" | awk -F/ '{print $1}')
+            # Skip k8s and kernel/linux-image — the chain handles those.
+            case "$pkg" in
+                kubeadm|kubectl|kubelet) continue ;;
+                linux-image-*|linux-headers-*|linux-modules-*|linux-generic|linux-headers-generic|linux-image-generic) continue ;;
+            esac
+            # Only flag if the package is held.
+            if echo "$held" | grep -qx "$pkg"; then
+                to=$(echo "$line" | awk '{print $2}')
+                from=$(echo "$line" | sed -n 's/.*from: \([^ ]*\).*/\1/p')
+                bump="$pkg ${from%-*}→${to%-*}"
+                held_with_bumps_lines+=("$name: $bump")
+            fi
+        done <<<"$upgradable"
+
+        # Newest uu timestamp (ISO at start of log line).
+        uu_ts=$(echo "$uu_log" | sed -E 's/^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}).*/\1/')
+        if [[ -n "$uu_ts" ]]; then
+            local epoch; epoch=$(date -u -d "$uu_ts" +%s 2>/dev/null || echo 0)
+            if [[ "$epoch" -gt "$newest_uu_ts" ]]; then
+                newest_uu_ts="$epoch"; newest_uu_iso="$uu_ts"
+            fi
+        fi
+    done
+
+    OS_PENDING_REBOOT_NODES="${pending_reboots[*]:-}"
+    if [[ ${#held_with_bumps_lines[@]} -gt 0 ]]; then
+        OS_HELD_DETAIL=$(printf '%s\n' "${held_with_bumps_lines[@]}" | sort -u | paste -sd '; ' -)
+    fi
+
+    if [[ "$newest_uu_ts" -gt 0 ]]; then
+        local age=$((NOW_EPOCH - newest_uu_ts))
+        OS_LAST_UU="$newest_uu_iso UTC ($(human_age "$age"))"
+        OS_LAST_CHECK="$(human_age "$age") (uu daily)"
+    else
+        OS_LAST_UU="(no uu log accessible)"
+        OS_LAST_CHECK="?"
+    fi
+
+    # Last kured reboot — newest Ready transition across worker nodes.
+    # `Ready -> True` is what kured causes when the node returns; we surface
+    # the most recent timestamp and the node it belongs to.
+    local kured_raw kured_iso kured_node kured_ep kured_age
+    kured_raw=$($KUBECTL get nodes -o json 2>/dev/null | python3 -c '
+import json, sys
+from datetime import datetime
+data = json.load(sys.stdin)
+best = (0, "", "")
+for n in data["items"]:
+    name = n["metadata"]["name"]
+    for c in n["status"].get("conditions", []):
+        if c["type"] == "Ready":
+            dt = datetime.strptime(c["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ")
+            ep = int(dt.timestamp())
+            if ep > best[0]:
+                best = (ep, name, c["lastTransitionTime"])
+print(f"{best[0]}|{best[1]}|{best[2]}")
+' 2>/dev/null || echo "0||")
+    kured_ep="${kured_raw%%|*}"
+    kured_node=$(echo "$kured_raw" | cut -d'|' -f2)
+    kured_iso=$(echo "$kured_raw" | cut -d'|' -f3)
+    if [[ "$kured_ep" -gt 0 ]]; then
+        kured_age=$((NOW_EPOCH - kured_ep))
+        OS_LAST_KURED="$kured_iso ($kured_node, $(human_age "$kured_age"))"
+    else
+        OS_LAST_KURED="?"
+    fi
+
+    OS_NEXT="daily 02:00-06:00 London"
+
+    # Kured pod health.
+    local kured_pods kured_unhealthy
+    kured_pods=$($KUBECTL -n kured get pods -l app.kubernetes.io/name=kured -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' 2>/dev/null)
+    kured_unhealthy=$(echo "$kured_pods" | grep -cv '^Running$' 2>/dev/null || true)
+
+    local notes=()
+    [[ -n "$OS_HELD_DETAIL" ]]            && notes+=("held with bumps: $OS_HELD_DETAIL")
+    [[ -n "$OS_PENDING_REBOOT_NODES" ]]   && notes+=("pending reboot: $OS_PENDING_REBOOT_NODES")
+
+    if [[ "$kured_unhealthy" -gt 0 ]]; then
+        OS_STATUS_ICON="✗"; OS_STATUS_TEXT="kured down"
+        OS_NOTES="kured pods not all Running"
+        raise_exit 2
+    elif [[ ${#notes[@]} -gt 0 ]]; then
+        OS_STATUS_ICON="⚠"; OS_STATUS_TEXT="attn"
+        OS_NOTES="${notes[*]}"
+        raise_exit 1
+    else
+        OS_STATUS_ICON="✓"; OS_STATUS_TEXT="healthy"
+        OS_NOTES="distros uniform; no held bumps; no pending reboots"
+    fi
+}
+
+# --- 3. K8s (kubeadm/kubelet/kubectl) ---
+collect_k8s() {
+    local kver_list kver_uniq metrics target_patch target_minor last_run in_flight started
+
+    kver_list=$($KUBECTL get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' 2>/dev/null)
+    kver_uniq=$(echo "$kver_list" | sort -u)
+    local n_uniq; n_uniq=$(echo "$kver_uniq" | wc -l | tr -d ' ')
+    if [[ "$n_uniq" -eq 1 ]]; then
+        K8S_RUNNING="$kver_uniq across $(echo "$kver_list" | wc -l | tr -d ' ')/$(echo "$kver_list" | wc -l | tr -d ' ') nodes"
+    else
+        K8S_RUNNING="mixed: $(echo "$kver_uniq" | paste -sd', ' -)"
+    fi
+    local running_ver; running_ver=$(echo "$kver_uniq" | head -1)
+
+    metrics=$(pg_metrics)
+    # All five may legitimately be absent (cluster never ran the upgrade
+    # chain, kind="minor" not detected, etc.) — `|| true` keeps pipefail
+    # from killing the script on no-match.
+    target_patch=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="patch"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
+    target_minor=$(echo "$metrics" | { grep -E '^k8s_upgrade_available\{[^}]*kind="minor"' || true; } | sed -n 's/.*target="\([^"]*\)".*/\1/p' | head -1)
+    # Pushgateway emits these with `{instance="",job="..."}` labels — the
+    # `awk '$1 ~ /^name(\{|$)/'` form matches both bare and labelled metrics.
+    last_run=$(echo "$metrics"  | awk '$1 ~ /^k8s_version_check_last_run_timestamp(\{|$)/{print $2}' | head -1 || true)
+    in_flight=$(echo "$metrics" | awk '$1 ~ /^k8s_upgrade_in_flight(\{|$)/{print $2}' | head -1 || true)
+    started=$(echo "$metrics"   | awk '$1 ~ /^k8s_upgrade_started_timestamp(\{|$)/{print $2}' | head -1 || true)
+
+    # Pushgateway timestamps come back in scientific notation
+    # (e.g. 1.779052159e+09) — convert to plain integer seconds.
+    local last_run_int started_int
+    last_run_int=$(to_epoch_int "$last_run")
+    started_int=$(to_epoch_int "$started")
+
+    if [[ "$last_run_int" -gt 0 ]]; then
+        local age=$((NOW_EPOCH - last_run_int))
+        K8S_LAST_CHECK="$(human_age "$age") (daily cron)"
+        if [[ -n "$target_patch" ]]; then
+            K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_patch (patch)"
+        elif [[ -n "$target_minor" ]]; then
+            K8S_LAST_DETECT_LINE="last run $(human_age "$age"): available v$target_minor (minor)"
+        else
+            K8S_LAST_DETECT_LINE="last run $(human_age "$age"): no upgrade available"
+        fi
+    else
+        K8S_LAST_CHECK="(metric missing)"
+        K8S_LAST_DETECT_LINE="(no k8s_version_check_last_run_timestamp in Pushgateway)"
+    fi
+    K8S_PATCH="${target_patch:-none}"
+    K8S_MINOR="${target_minor:-none}"
+
+    # In-flight / last chain.
+    if [[ "${in_flight:-0}" == "1" ]]; then
+        K8S_IN_FLIGHT="yes"
+        local since=0
+        [[ "$started_int" -gt 0 ]] && since=$((NOW_EPOCH - started_int))
+        K8S_LAST_CHAIN="in-flight (started $(human_age "$since"))"
+    else
+        K8S_IN_FLIGHT="no"
+        if [[ "$started_int" -gt 0 ]]; then
+            local age=$((NOW_EPOCH - started_int))
+            K8S_LAST_CHAIN="$(human_age "$age")"
+        else
+            K8S_LAST_CHAIN="never (or zeroed)"
+        fi
+    fi
+
+    K8S_NEXT="$(next_daily_noon_utc)"
+
+    # Status logic.
+    local stalled=0
+    if [[ "${in_flight:-0}" == "1" && "$started_int" -gt 0 ]]; then
+        # K8sUpgradeStalled fires after 5400s (90m) per monitoring stack.
+        local since=$((NOW_EPOCH - started_int))
+        [[ "$since" -gt 5400 ]] && stalled=1
+    fi
+    local last_run_age=999999999
+    [[ "$last_run_int" -gt 0 ]] && last_run_age=$((NOW_EPOCH - last_run_int))
+
+    if [[ "$stalled" == "1" ]]; then
+        K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="stalled"
+        K8S_NOTES="K8sUpgradeStalled would fire — chain in-flight >90m"
+        raise_exit 2
+    elif [[ "$last_run_age" -gt $((9*86400)) ]]; then
+        K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="detection stale"
+        K8S_NOTES="last detection >9d ago"
+        raise_exit 2
+    elif [[ "${in_flight:-0}" == "1" ]]; then
+        K8S_STATUS_ICON="…"; K8S_STATUS_TEXT="in-flight"
+        K8S_NOTES="upgrade chain running"
+        raise_exit 1
+    elif [[ -n "$target_patch" ]]; then
+        K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_patch"
+        K8S_NOTES="running $running_ver → v$target_patch (patch) available"
+        raise_exit 1
+    elif [[ -n "$target_minor" ]]; then
+        K8S_STATUS_ICON="→"; K8S_STATUS_TEXT="$target_minor"
+        K8S_NOTES="running $running_ver → v$target_minor (minor) available"
+        raise_exit 1
+    else
+        K8S_STATUS_ICON="✓"; K8S_STATUS_TEXT="current"
+        K8S_NOTES="running $running_ver, nothing newer"
+    fi
+}
+
+# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was
+# weekly Sunday until 2026-05-18; now `0 12 * * *` in the
+# k8s-version-upgrade stack. If we're still before today's 12:00 UTC,
+# the next run is today; otherwise it's tomorrow.
+next_daily_noon_utc() {
+    local hr days_ahead
+    hr=$(date -u +%H)
+    if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi
+    date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC"
+}
+
+# --- Renderers ---
+# The table uses `column -t` so we don't have to compute visual widths
+# manually (the status icons are multi-byte UTF-8 and ANSI escapes don't
+# play nice with `printf %-Xs`). Trade-off: no in-cell colour, but the
+# icon character already carries the signal.
+render_table() {
+    echo
+    printf "${BOLD}Upgrade state — %s${NC}\n" "$(date -u +'%Y-%m-%d %H:%M UTC')"
+    echo
+    {
+        echo "Layer|Status|Last check|Next upgrade|Notes"
+        echo "-----|------|----------|------------|-----"
+        printf 'Apps|%s %s|%s|%s|%s\n' "$APPS_STATUS_ICON" "$APPS_STATUS_TEXT" "$APPS_LAST_CHECK" "$APPS_NEXT" "$APPS_NOTES"
+        printf 'OS  |%s %s|%s|%s|%s\n' "$OS_STATUS_ICON"   "$OS_STATUS_TEXT"   "$OS_LAST_CHECK"   "$OS_NEXT"   "$OS_NOTES"
+        printf 'K8s |%s %s|%s|%s|%s\n' "$K8S_STATUS_ICON"  "$K8S_STATUS_TEXT"  "$K8S_LAST_CHECK"  "$K8S_NEXT"  "$K8S_NOTES"
+    } | column -t -s '|' -o ' | '
+
+    echo
+    printf "${BOLD}--- Apps (Keel) ---${NC}\n"
+    echo "Enrolled deployments: $APPS_ENROLLED"
+    echo "Recent rollouts: $APPS_UPDATES_LINE"
+    echo "Pending approvals: $APPS_PENDING"
+    echo "Last Keel error: $APPS_ERROR_LINE"
+
+    echo
+    printf "${BOLD}--- OS (apt + kured) ---${NC}\n"
+    echo "Ubuntu per node: $OS_DISTRO_SUMMARY"
+    echo "Kernel per node: $OS_KERNEL_SUMMARY"
+    echo "Pending reboot: ${OS_PENDING_REBOOT_NODES:-none}"
+    echo "Held packages with upstream bumps: ${OS_HELD_DETAIL:-none (excluding k8s components)}"
+    echo "Last uu run (newest across nodes): $OS_LAST_UU"
+    echo "Last kured reboot (newest Ready transition): $OS_LAST_KURED"
+    echo "Next kured window: $OS_NEXT"
+
+    echo
+    printf "${BOLD}--- K8s (kubeadm/kubelet/kubectl) ---${NC}\n"
+    echo "Running: $K8S_RUNNING"
+    echo "Latest patch (apt): ${K8S_PATCH}"
+    echo "Next minor available: ${K8S_MINOR}"
+    echo "Detection: $K8S_LAST_DETECT_LINE"
+    echo "In-flight: $K8S_IN_FLIGHT  |  Last chain start: $K8S_LAST_CHAIN"
+    echo "Next detection: $K8S_NEXT"
+    echo
+}
+
+render_json() {
+    # Pipe values into Python via env vars so we don't need to worry about
+    # embedded quotes/backslashes in error lines.
+    APPS_STATUS_ICON="$APPS_STATUS_ICON" APPS_STATUS_TEXT="$APPS_STATUS_TEXT" \
+    APPS_LAST_CHECK="$APPS_LAST_CHECK" APPS_NEXT="$APPS_NEXT" APPS_NOTES="$APPS_NOTES" \
+    APPS_ENROLLED="$APPS_ENROLLED" APPS_PENDING="$APPS_PENDING" \
+    APPS_UPDATES_LINE="$APPS_UPDATES_LINE" APPS_ERROR_LINE="$APPS_ERROR_LINE" \
+    OS_STATUS_ICON="$OS_STATUS_ICON" OS_STATUS_TEXT="$OS_STATUS_TEXT" \
+    OS_LAST_CHECK="$OS_LAST_CHECK" OS_NEXT="$OS_NEXT" OS_NOTES="$OS_NOTES" \
+    OS_DISTRO_SUMMARY="$OS_DISTRO_SUMMARY" OS_KERNEL_SUMMARY="$OS_KERNEL_SUMMARY" \
+    OS_PENDING_REBOOT_NODES="$OS_PENDING_REBOOT_NODES" OS_HELD_DETAIL="$OS_HELD_DETAIL" \
+    OS_LAST_UU="$OS_LAST_UU" OS_LAST_KURED="$OS_LAST_KURED" \
+    K8S_STATUS_ICON="$K8S_STATUS_ICON" K8S_STATUS_TEXT="$K8S_STATUS_TEXT" \
+    K8S_LAST_CHECK="$K8S_LAST_CHECK" K8S_NEXT="$K8S_NEXT" K8S_NOTES="$K8S_NOTES" \
+    K8S_RUNNING="$K8S_RUNNING" K8S_PATCH="$K8S_PATCH" K8S_MINOR="$K8S_MINOR" \
+    K8S_LAST_DETECT_LINE="$K8S_LAST_DETECT_LINE" K8S_IN_FLIGHT="$K8S_IN_FLIGHT" K8S_LAST_CHAIN="$K8S_LAST_CHAIN" \
+    HIGHEST_EXIT="$HIGHEST_EXIT" \
+    python3 -c '
+import json, os
+from datetime import datetime, timezone
+def env(k): return os.environ.get(k, "")
+out = {
+    "as_of_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+    "highest_exit": int(env("HIGHEST_EXIT")),
+    "apps": {
+        "status": env("APPS_STATUS_ICON"),
+        "status_text": env("APPS_STATUS_TEXT"),
+        "last_check": env("APPS_LAST_CHECK"),
+        "next_upgrade": env("APPS_NEXT"),
+        "notes": env("APPS_NOTES"),
+        "enrolled": int(env("APPS_ENROLLED") or 0),
+        "pending_approvals": int(env("APPS_PENDING") or 0),
+        "updates_line": env("APPS_UPDATES_LINE"),
+        "errors_line": env("APPS_ERROR_LINE"),
+    },
+    "os": {
+        "status": env("OS_STATUS_ICON"),
+        "status_text": env("OS_STATUS_TEXT"),
+        "last_check": env("OS_LAST_CHECK"),
+        "next_upgrade": env("OS_NEXT"),
+        "notes": env("OS_NOTES"),
+        "distros": env("OS_DISTRO_SUMMARY"),
+        "kernels": env("OS_KERNEL_SUMMARY"),
+        "pending_reboot_nodes": env("OS_PENDING_REBOOT_NODES"),
+        "held_with_bumps": env("OS_HELD_DETAIL"),
+        "last_uu_run": env("OS_LAST_UU"),
+        "last_kured_reboot": env("OS_LAST_KURED"),
+    },
+    "k8s": {
+        "status": env("K8S_STATUS_ICON"),
+        "status_text": env("K8S_STATUS_TEXT"),
+        "last_check": env("K8S_LAST_CHECK"),
+        "next_upgrade": env("K8S_NEXT"),
+        "notes": env("K8S_NOTES"),
+        "running": env("K8S_RUNNING"),
+        "patch_target": env("K8S_PATCH"),
+        "minor_target": env("K8S_MINOR"),
+        "last_detection_line": env("K8S_LAST_DETECT_LINE"),
+        "in_flight": env("K8S_IN_FLIGHT"),
+        "last_chain": env("K8S_LAST_CHAIN"),
+    },
+}
+print(json.dumps(out, indent=2))
+'
+}
+
+main() {
+    parse_args "$@"
+    collect_apps
+    collect_os
+    collect_k8s
+    if [[ "$JSON" == true ]]; then
+        render_json
+    else
+        render_table
+    fi
+    exit "$HIGHEST_EXIT"
+}
+
+main "$@"
--- a/secrets/fullchain.pem
+++ b/secrets/fullchain.pem
--- a/secrets/privkey.pem
+++ b/secrets/privkey.pem
--- a/stacks/_template/main.tf.example
+++ b/stacks/_template/main.tf.example
@ -87,5 +87,5 @@ module "ingress" {
  name            = "<app-name>"
  tls_secret_name = var.tls_secret_name
  dns_type        = "proxied"  # "proxied" (Cloudflare CDN), "non-proxied" (direct A/AAAA), or "none"
-  protected       = false      # Set true to require Authentik login
+  auth            = "required" # "required" (Authentik login), "public" (anonymous bound to guest), or "none" (no auth)
 }
--- a/stacks/actualbudget/.terraform.lock.hcl
+++ b/stacks/actualbudget/.terraform.lock.hcl
@ -29,6 +29,20 @@ provider "registry.terraform.io/goauthentik/authentik" {
  constraints = "~> 2024.10"
  hashes = [
    "h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
+    "zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
+    "zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
+    "zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
+    "zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
+    "zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
+    "zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
+    "zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
+    "zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
+    "zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
+    "zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
+    "zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
+    "zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
+    "zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
+    "zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
  ]
 }

@ -56,6 +70,18 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
  version = "3.1.0"
  hashes = [
    "h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
+    "zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
+    "zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
+    "zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
+    "zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
+    "zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
+    "zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
+    "zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
+    "zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
+    "zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
+    "zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
+    "zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
+    "zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
  ]
 }

--- a/stacks/actualbudget/backend.tf
+++ b/stacks/actualbudget/backend.tf
@ -1,7 +1,7 @@
 # Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
 terraform {
  backend "pg" {
-    conn_str    = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
+    conn_str    = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
    schema_name = "actualbudget"
  }
 }
--- a/stacks/actualbudget/factory/main.tf
+++ b/stacks/actualbudget/factory/main.tf
@ -18,6 +18,11 @@ variable "budget_encryption_password" {
 # and are unknown at plan time on first apply, so we cannot base `count` on
 # them directly. Callers pass these booleans as hardcoded plan-time constants
 # that reflect whether the corresponding credentials are expected to exist.
+variable "enabled" {
+  type        = bool
+  default     = true
+  description = "Deploy this instance. When false, only the PVC is kept (data preservation); deployment, service, ingress, http-api, and cronjob are not created. Flip back to true to bring the instance back."
+}
 variable "enable_http_api" {
  type        = bool
  default     = false
@ -44,7 +49,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
    name      = "actualbudget-${var.name}-data-encrypted"
    namespace = "actualbudget"
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -58,9 +63,17 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "actualbudget" {
+  count = var.enabled ? 1 : 0
  metadata {
    name      = "actualbudget-${var.name}"
    namespace = "actualbudget"
@ -127,6 +140,7 @@ resource "kubernetes_deployment" "actualbudget" {
 }

 resource "kubernetes_service" "actualbudget" {
+  count = var.enabled ? 1 : 0
  metadata {
    name      = "budget-${var.name}"
    namespace = "actualbudget"
@ -148,7 +162,12 @@ resource "kubernetes_service" "actualbudget" {
 }

 module "ingress" {
-  source            = "../../../modules/kubernetes/ingress_factory"
+  count  = var.enabled ? 1 : 0
+  source = "../../../modules/kubernetes/ingress_factory"
+  # auth = "app": Actual Budget enforces a server password + per-user login
+  # on its own sync API. Authentik forward-auth was 302-ing the mobile/web
+  # sync clients; Actual's own auth gates users.
+  auth              = "app"
  namespace         = "actualbudget"
  name              = "budget-${var.name}"
  tls_secret_name   = var.tls_secret_name
@ -163,7 +182,7 @@ resource "random_string" "api-key" {
 }

 resource "kubernetes_deployment" "actualbudget-http-api" {
-  count = var.enable_http_api ? 1 : 0
+  count = var.enabled && var.enable_http_api ? 1 : 0
  metadata {
    name      = "actualbudget-http-api-${var.name}"
    namespace = "actualbudget"
@ -229,6 +248,7 @@ resource "kubernetes_deployment" "actualbudget-http-api" {
 }

 resource "kubernetes_service" "actualbudget-http-api" {
+  count = var.enabled && var.enable_http_api ? 1 : 0
  metadata {
    name      = "budget-http-api-${var.name}"
    namespace = "actualbudget"
@ -250,7 +270,7 @@ resource "kubernetes_service" "actualbudget-http-api" {
 }

 resource "kubernetes_cron_job_v1" "bank-sync" {
-  count = var.enable_bank_sync ? 1 : 0
+  count = var.enabled && var.enable_bank_sync ? 1 : 0
  metadata {
    name      = "bank-sync-${var.name}"
    namespace = "actualbudget"
@ -271,48 +291,93 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
          spec {
            container {
              name  = "bank-sync"
-              image = "curlimages/curl"
+              image = "alpine:3.20"
              command = ["/bin/sh", "-c", <<-EOT
-              PUSHGATEWAY="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-${var.name}"
+              set -u
+              apk add --no-cache curl jq >/dev/null 2>&1
+
+              USER_NAME='${var.name}'
+              SYNC_ID='${var.sync_id}'
+              API_KEY='${random_string.api-key.result}'
+              PW='${var.budget_encryption_password}'
+              PG="http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/bank-sync-$USER_NAME"
+              API="http://budget-http-api-$USER_NAME"
+
              START=$(date +%s)

-              HTTP_CODE=$(curl -s -o /tmp/response.txt -w '%%{http_code}' \
-                -X POST --location \
-                'http://budget-http-api-${var.name}/v1/budgets/${var.sync_id}/accounts/banksync' \
-                --header 'accept: application/json' \
-                --header 'budget-encryption-password: ${var.budget_encryption_password}' \
-                --header 'x-api-key: ${random_string.api-key.result}')
+              # Enumerate active accounts: open + on-budget.
+              ACCOUNTS=$(curl -fsS "$API/v1/budgets/$SYNC_ID/accounts" \
+                -H "x-api-key: $API_KEY" \
+                -H "budget-encryption-password: $PW" \
+                | jq -c '.data[] | select(.closed == false and .offbudget == false) | {id, name}')

-              END=$(date +%s)
-              DURATION=$((END - START))
-
-              if [ "$HTTP_CODE" = "200" ]; then
-                SUCCESS=1
-                LAST_SUCCESS=$END
-              else
-                SUCCESS=0
-                echo "Bank sync failed with HTTP $HTTP_CODE:"
-                cat /tmp/response.txt
-                echo ""
+              if [ -z "$ACCOUNTS" ]; then
+                echo "ERROR: GET /accounts returned no eligible accounts; aborting"
+                exit 1
              fi

-              # Pushgateway POST preserves metrics not in the payload, so on
-              # failure we omit bank_sync_last_success_timestamp to keep the
-              # prior success value — this prevents BankSyncStale from firing
-              # alongside BankSyncFailing after a single failed run.
-              {
-                printf '# HELP bank_sync_success Whether the last bank sync succeeded (1=ok, 0=fail)\n'
-                printf '# TYPE bank_sync_success gauge\n'
-                printf 'bank_sync_success %s\n' "$SUCCESS"
-                printf '# HELP bank_sync_duration_seconds Duration of the last bank sync run\n'
-                printf '# TYPE bank_sync_duration_seconds gauge\n'
-                printf 'bank_sync_duration_seconds %s\n' "$DURATION"
-                if [ "$SUCCESS" = "1" ]; then
-                  printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the last successful sync\n'
-                  printf '# TYPE bank_sync_last_success_timestamp gauge\n'
-                  printf 'bank_sync_last_success_timestamp %s\n' "$LAST_SUCCESS"
+              : > /tmp/payload
+              rm -f /tmp/any_success
+
+              # Per-account sync. Each account has its own PSD2/GoCardless
+              # quota (4 successful pulls per 24h), so we treat them
+              # independently — one rate-limited account doesn't mark the
+              # run as a failure.
+              echo "$ACCOUNTS" | while IFS= read -r ACCT; do
+                [ -z "$ACCT" ] && continue
+                ID=$(echo "$ACCT" | jq -r '.id')
+                NAME=$(echo "$ACCT" | jq -r '.name')
+                LABEL=$(echo "$NAME" | sed -E 's/[^a-zA-Z0-9]+/_/g')
+
+                HTTP_CODE=$(curl -s -o /tmp/r.txt -w '%%{http_code}' \
+                  -X POST "$API/v1/budgets/$SYNC_ID/accounts/$ID/banksync" \
+                  -H 'accept: application/json' \
+                  -H "x-api-key: $API_KEY" \
+                  -H "budget-encryption-password: $PW") || HTTP_CODE=0
+
+                NOW=$(date +%s)
+                if [ "$HTTP_CODE" = "200" ]; then
+                  echo "OK account=$NAME"
+                  printf 'bank_sync_account_success{account="%s"} 1\n' "$LABEL" >> /tmp/payload
+                  printf 'bank_sync_account_last_success_timestamp{account="%s"} %s\n' "$LABEL" "$NOW" >> /tmp/payload
+                  : > /tmp/any_success
+                else
+                  echo "FAIL account=$NAME http=$HTTP_CODE body=$(cat /tmp/r.txt)"
+                  printf 'bank_sync_account_success{account="%s"} 0\n' "$LABEL" >> /tmp/payload
                fi
-              } | curl -s --data-binary @- "$PUSHGATEWAY"
+              done
+
+              END=$(date +%s)
+              DUR=$((END - START))
+
+              if [ -f /tmp/any_success ]; then
+                ANY=1
+              else
+                ANY=0
+              fi
+
+              # Pushgateway POST preserves prior values for label sets not
+              # in the payload, so per-account last_success_timestamp values
+              # for accounts that failed this run keep their prior good
+              # values — that's what BankSyncAccountStale alerts on.
+              {
+                printf '# HELP bank_sync_account_success Per-account sync result (1=ok, 0=fail)\n'
+                printf '# TYPE bank_sync_account_success gauge\n'
+                printf '# HELP bank_sync_account_last_success_timestamp Per-account Unix timestamp of last successful sync\n'
+                printf '# TYPE bank_sync_account_last_success_timestamp gauge\n'
+                cat /tmp/payload
+                printf '# HELP bank_sync_success 1 if at least one account synced this run\n'
+                printf '# TYPE bank_sync_success gauge\n'
+                printf 'bank_sync_success %s\n' "$ANY"
+                printf '# HELP bank_sync_duration_seconds Total duration of the cron run\n'
+                printf '# TYPE bank_sync_duration_seconds gauge\n'
+                printf 'bank_sync_duration_seconds %s\n' "$DUR"
+                if [ "$ANY" = "1" ]; then
+                  printf '# HELP bank_sync_last_success_timestamp Unix timestamp of the most recent successful sync of any account\n'
+                  printf '# TYPE bank_sync_last_success_timestamp gauge\n'
+                  printf 'bank_sync_last_success_timestamp %s\n' "$END"
+                fi
+              } | curl -fsS --data-binary @- "$PG"
              EOT
              ]
            }
@ -326,3 +391,24 @@ resource "kubernetes_cron_job_v1" "bank-sync" {
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
 }
+
+# State migration for the new `enabled` toggle (2026-05-13): adding
+# count to these resources shifts their addresses to [0]. Without
+# moved {}, Terraform would destroy+recreate. Existing http-api / bank-sync
+# resources already had count, so no migration needed there.
+moved {
+  from = kubernetes_deployment.actualbudget
+  to   = kubernetes_deployment.actualbudget[0]
+}
+moved {
+  from = kubernetes_service.actualbudget
+  to   = kubernetes_service.actualbudget[0]
+}
+moved {
+  from = kubernetes_service.actualbudget-http-api
+  to   = kubernetes_service.actualbudget-http-api[0]
+}
+moved {
+  from = module.ingress
+  to   = module.ingress[0]
+}
--- a/stacks/actualbudget/main.tf
+++ b/stacks/actualbudget/main.tf
@ -57,6 +57,7 @@ resource "kubernetes_namespace" "actualbudget" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -120,6 +121,10 @@ module "anca" {
 }

 # https://budget-emo.viktorbarzin.me/
+# Disabled 2026-05-13: Emo isn't using this instance. PVC is preserved so
+# we can flip enabled back to true to bring the instance back as-was.
+# The empty accounts list (vs. anca/viktor) was causing the daily bank-sync
+# CronJob to fail and trigger BankSyncStale.
 module "emo" {
  source                     = "./factory"
  name                       = "emo"
@ -128,16 +133,10 @@ module "emo" {
  nfs_server                 = var.nfs_server
  depends_on                 = [kubernetes_namespace.actualbudget]
  tier                       = local.tiers.edge
-  enable_http_api            = true
-  enable_bank_sync           = true
+  enabled                    = false
+  enable_http_api            = false
+  enable_bank_sync           = false
  budget_encryption_password = lookup(local.credentials["emo"], "password", null)
  sync_id                    = lookup(local.credentials["emo"], "sync_id", null)
-  homepage_annotations = {
-    "gethomepage.dev/enabled"      = "true"
-    "gethomepage.dev/name"         = "Budget Emo"
-    "gethomepage.dev/description"  = "Personal budget"
-    "gethomepage.dev/icon"         = "actual-budget.png"
-    "gethomepage.dev/group"        = "Finance & Personal"
-    "gethomepage.dev/pod-selector" = ""
-  }
+  homepage_annotations       = {}
 }
--- a/stacks/affine/main.tf
+++ b/stacks/affine/main.tf
@ -88,6 +88,7 @@ resource "kubernetes_namespace" "affine" {
    name = "affine"
    labels = {
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -155,7 +156,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
    name      = "affine-data-encrypted"
    namespace = kubernetes_namespace.affine.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -169,6 +170,13 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "affine" {
@ -324,8 +332,12 @@ resource "kubernetes_deployment" "affine" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -351,7 +363,11 @@ resource "kubernetes_service" "affine" {
 }

 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "app": AFFiNE has its own workspace auth + bearer-token API
+  # used by desktop/mobile sync clients. Authentik forward-auth was 302-ing
+  # those API callers; AFFiNE's own auth gates users.
+  auth            = "app"
  dns_type        = "non-proxied"
  namespace       = kubernetes_namespace.affine.metadata[0].name
  name            = "affine"
--- a/stacks/authentik/authentik_provider.tf
+++ b/stacks/authentik/authentik_provider.tf
@ -53,11 +53,130 @@ resource "authentik_provider_proxy" "catchall" {
  # doesn't require an HCL edit.
  authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
  invalidation_flow  = data.authentik_flow.default_provider_invalidation.id
+  # Cookie / proxysession TTL. Drives `Max-Age` on `authentik_proxy_*`
+  # cookies and the `expires` column in `authentik_providers_proxy_proxysession`.
+  # See note on the embedded outpost below — bumping this requires an outpost
+  # pod restart for the gorilla session store to rebind.
+  access_token_validity = "weeks=4"
  lifecycle {
-    ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth, access_token_validity]
+    ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
  }
 }

+# -----------------------------------------------------------------------------
+# Embedded outpost record. Adopted into Terraform 2026-05-10 as part of the
+# postgres-session-backend fix:
+#   - `managed` is set server-side to `goauthentik.io/outposts/embedded` so
+#     the outpost binary's `IsEmbedded()` check returns true → it loads the
+#     PostgreSQL session backend (PR #16628). The Terraform provider does
+#     NOT expose `managed` in the schema, so the field is preserved across
+#     applies (TF only writes fields it knows about).
+#   - kubernetes_json_patches.deployment carries:
+#       * dshm 2Gi tmpfs (covers the 2026-04-18 ENOSPC class of issues)
+#       * resources requests/limits
+#       * `app.kubernetes.io/component=server` pod label so the K8s service
+#         selector lights up endpoints (works around goauthentik 2026.2.2
+#         service.py:52 selector mismatch on standalone embedded outposts).
+#       * AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME} envFrom the
+#         shared `goauthentik` Secret so the postgres session backend has
+#         credentials to connect to the dbaas cluster.
+#   - kubernetes_json_patches.service replaces the controller-set selector
+#     (which incorrectly targets `app.kubernetes.io/name=authentik`, i.e.
+#     the goauthentik-server pods) with the outpost's own labels.
+# -----------------------------------------------------------------------------
+
+resource "authentik_outpost" "embedded" {
+  name               = "authentik Embedded Outpost"
+  type               = "proxy"
+  protocol_providers = [authentik_provider_proxy.catchall.id]
+  service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
+  config = jsonencode({
+    log_level                        = "trace"
+    docker_labels                    = null
+    authentik_host                   = "https://authentik.viktorbarzin.me/"
+    docker_network                   = null
+    container_image                  = null
+    docker_map_ports                 = true
+    refresh_interval                 = "minutes=5"
+    kubernetes_replicas              = 1
+    kubernetes_namespace             = "authentik"
+    authentik_host_browser           = ""
+    object_naming_template           = "ak-outpost-%(name)s"
+    authentik_host_insecure          = false
+    kubernetes_service_type          = "ClusterIP"
+    kubernetes_ingress_path_type     = null
+    kubernetes_image_pull_secrets    = []
+    kubernetes_ingress_class_name    = null
+    kubernetes_disabled_components   = []
+    kubernetes_ingress_annotations   = {}
+    kubernetes_ingress_secret_name   = "authentik-outpost-tls"
+    kubernetes_httproute_annotations = {}
+    kubernetes_httproute_parent_refs = []
+    kubernetes_json_patches = {
+      deployment = [
+        {
+          op    = "add"
+          path  = "/spec/template/spec/volumes"
+          value = [{ name = "dshm", emptyDir = { medium = "Memory", sizeLimit = "2Gi" } }]
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/volumeMounts"
+          value = [{ name = "dshm", mountPath = "/dev/shm" }]
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/resources"
+          value = { limits = { memory = "2560Mi" }, requests = { cpu = "100m", memory = "128Mi" } }
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/metadata/labels/app.kubernetes.io~1component"
+          value = "server"
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/env/-"
+          value = { name = "AUTHENTIK_POSTGRESQL__HOST", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__HOST" } } }
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/env/-"
+          value = { name = "AUTHENTIK_POSTGRESQL__PORT", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PORT" } } }
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/env/-"
+          value = { name = "AUTHENTIK_POSTGRESQL__USER", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__USER" } } }
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/env/-"
+          value = { name = "AUTHENTIK_POSTGRESQL__PASSWORD", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__PASSWORD" } } }
+        },
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/env/-"
+          value = { name = "AUTHENTIK_POSTGRESQL__NAME", valueFrom = { secretKeyRef = { name = "goauthentik", key = "AUTHENTIK_POSTGRESQL__NAME" } } }
+        },
+      ]
+      service = [
+        {
+          op   = "replace"
+          path = "/spec/selector"
+          value = {
+            "app.kubernetes.io/managed-by" = "goauthentik.io"
+            "app.kubernetes.io/name"       = "authentik-outpost-proxy"
+            "goauthentik.io/outpost-name"  = "authentik-embedded-outpost"
+            "goauthentik.io/outpost-type"  = "proxy"
+            "goauthentik.io/outpost-uuid"  = "0eecac0797c7443c892505f2f4fe3e47"
+          }
+        },
+      ]
+    }
+  })
+}
+
 # -----------------------------------------------------------------------------
 # Default User Login stage — bound to default-authentication-flow.
 # Adopted into Terraform 2026-05-01 to set session_duration=weeks=4 so users
--- a/stacks/authentik/guest.tf
+++ b/stacks/authentik/guest.tf
@ -0,0 +1,217 @@
+# =============================================================================
+# Public Guest user + auto-login flow + public proxy provider + dedicated
+# outpost.
+#
+# Backs the `auth = "public"` tier of the ingress_factory module. Architecture:
+#
+#   * `guest` user (in `Public Guests` group, NOT `Allow Login Users`).
+#   * `public-auto-login` flow: anonymous user enters → expression policy sets
+#     `pending_user = guest` → user_login stage logs them in. No UI shown.
+#   * `Provider for Public` proxy provider (forward_domain, cookie_domain
+#     `viktorbarzin.me`) with `authentication_flow = public-auto-login`.
+#   * Dedicated `Public Outpost` Deployment+Service (managed by Authentik's
+#     K8s controller). Bound to the public provider only — there is no other
+#     provider claiming `viktorbarzin.me` on this outpost, so every request
+#     it sees runs the public flow regardless of host.
+#   * `public-auth.viktorbarzin.me` ingress exposes the public outpost's
+#     `/outpost.goauthentik.io/*` path so OAuth callbacks land on it (the
+#     embedded outpost doesn't know about the public provider, so callbacks
+#     can't go to authentik.viktorbarzin.me).
+#
+# Traffic flow for a stranger hitting an `auth = "public"` ingress:
+#   1. Traefik's `authentik-forward-auth-public` middleware → public outpost.
+#   2. No session cookie → 302 to `https://authentik.viktorbarzin.me/...`
+#      with redirect_uri = `https://public-auth.viktorbarzin.me/.../callback`.
+#   3. Authentik runs `public-auto-login` (no UI), issues session.
+#   4. 302 → public-auth.viktorbarzin.me callback → public outpost validates
+#      state and sets `authentik_proxy_<public-hash>` cookie on `viktorbarzin.me`.
+#   5. 302 → original URL → Traefik retries forward_auth → public outpost
+#      validates cookie → 200 with `X-authentik-username: guest`.
+#
+# A user already logged into anything else on viktorbarzin.me (the catchall)
+# still gets recognised here — Authentik prefers an existing session and the
+# public provider's authorization_flow auto-approves anyone, so their real
+# username shows up in `X-authentik-username`. Strangers get `guest`.
+# =============================================================================
+
+resource "authentik_user" "guest" {
+  username  = "guest"
+  name      = "Guest"
+  path      = "users/system"
+  is_active = true
+  type      = "internal"
+  # No password set: the user_login stage in `public_auto_login` logs the
+  # request in via pending_user pre-set by an expression policy. There is no
+  # UI path for `guest` to authenticate via password — the user is also kept
+  # out of `Allow Login Users`, so even a leaked password cannot be used to
+  # complete the standard login flow.
+  lifecycle {
+    ignore_changes = [attributes, email]
+  }
+}
+
+resource "authentik_group" "public_guests" {
+  name  = "Public Guests"
+  users = [authentik_user.guest.id]
+  # NOT a child of "Allow Login Users" — keeps a hypothetical leaked password
+  # from promoting `guest` to a real user via the standard login flow.
+}
+
+# Pre-stage policy: sets pending_user = guest before user_login stage runs.
+# Mutates `request.context["flow_plan"].context["pending_user"]` — the
+# canonical pattern (the user_login stage reads pending_user from
+# `flow_plan.context`). Direct `request.context["pending_user"]` mutations
+# don't propagate, since policy request.context is not the same dict as
+# flow_plan.context.
+resource "authentik_policy_expression" "set_guest_user" {
+  name = "set-public-guest-user"
+  expression = trimspace(<<-EOT
+    request.context["flow_plan"].context["pending_user"] = ak_user_by(username="guest")
+    return True
+  EOT
+  )
+}
+
+# Dedicated user_login stage for the public flow. 4-week session matches the
+# default authentication stage; means a stranger only goes through the auto-
+# bind once per ~month per device.
+resource "authentik_stage_user_login" "public_guest_login" {
+  name             = "public-guest-login"
+  session_duration = "weeks=4"
+}
+
+# `authentication = "none"` lets anonymous requests run the flow.
+# `designation = "authentication"` because the flow's outcome is "request is
+# now authenticated as guest"; the public proxy provider's authorization_flow
+# then runs implicit consent.
+resource "authentik_flow" "public_auto_login" {
+  name           = "Public Auto Login"
+  slug           = "public-auto-login"
+  title          = "Public Guest Login"
+  designation    = "authentication"
+  authentication = "none"
+}
+
+resource "authentik_flow_stage_binding" "public_login" {
+  target = authentik_flow.public_auto_login.uuid
+  stage  = authentik_stage_user_login.public_guest_login.id
+  order  = 10
+  # Re-evaluate at stage runtime: at plan time, flow_plan may not yet be in
+  # request.context, so the expression policy's mutation would no-op. With
+  # evaluate_on_plan=false + re_evaluate_policies=true, the policy fires
+  # right before the stage runs, when flow_plan is fully populated.
+  evaluate_on_plan     = false
+  re_evaluate_policies = true
+}
+
+resource "authentik_policy_binding" "set_guest_before_login" {
+  target = authentik_flow_stage_binding.public_login.id
+  policy = authentik_policy_expression.set_guest_user.id
+  order  = 0
+}
+
+# -----------------------------------------------------------------------------
+# Public proxy provider — forward_domain so it claims any host on
+# viktorbarzin.me. Used only on the dedicated `public` outpost (where it is
+# the sole bound provider), so there's no dispatch ambiguity with the
+# catchall (which lives on the embedded outpost).
+# -----------------------------------------------------------------------------
+resource "authentik_provider_proxy" "public" {
+  name          = "Provider for Public"
+  mode          = "forward_domain"
+  external_host = "https://public-auth.viktorbarzin.me"
+  cookie_domain = "viktorbarzin.me"
+
+  # When a request hits with NO Authentik session, this flow runs first and
+  # auto-binds the request to the `guest` user (no UI prompt).
+  authentication_flow = authentik_flow.public_auto_login.uuid
+  # Once authenticated (or already authenticated), implicit-consent auto-approves.
+  authorization_flow = data.authentik_flow.default_authorization_implicit_consent.id
+  invalidation_flow  = data.authentik_flow.default_provider_invalidation.id
+
+  access_token_validity = "weeks=4"
+
+  lifecycle {
+    ignore_changes = [property_mappings, jwt_federation_sources, skip_path_regex, internal_host, basic_auth_enabled, basic_auth_password_attribute, basic_auth_username_attribute, intercept_header_auth]
+  }
+}
+
+resource "authentik_application" "public" {
+  name              = "Public"
+  slug              = "public"
+  protocol_provider = authentik_provider_proxy.public.id
+  # No bound policies. policy_engine_mode = "any" + zero bindings = everyone
+  # passes (the auto-login flow has already established `guest` as the user).
+  policy_engine_mode = "any"
+
+  lifecycle {
+    ignore_changes = [meta_description, meta_launch_url, meta_icon, group, backchannel_providers, open_in_new_tab]
+  }
+}
+
+# Dedicated outpost so the public provider can claim viktorbarzin.me without
+# colliding with the catchall (which already claims viktorbarzin.me on the
+# embedded outpost). Authentik's K8s controller deploys this as
+# `ak-outpost-public` (Deployment + Service in the `authentik` namespace).
+resource "authentik_outpost" "public" {
+  name               = "public"
+  type               = "proxy"
+  protocol_providers = [authentik_provider_proxy.public.id]
+  service_connection = "99e227a7-4562-4888-9660-4c27da678c50"
+  config = jsonencode({
+    log_level                        = "info"
+    docker_labels                    = null
+    authentik_host                   = "https://authentik.viktorbarzin.me/"
+    docker_network                   = null
+    container_image                  = null
+    docker_map_ports                 = true
+    refresh_interval                 = "minutes=5"
+    kubernetes_replicas              = 1
+    kubernetes_namespace             = "authentik"
+    authentik_host_browser           = ""
+    object_naming_template           = "ak-outpost-%(name)s"
+    authentik_host_insecure          = false
+    kubernetes_service_type          = "ClusterIP"
+    kubernetes_ingress_path_type     = null
+    kubernetes_image_pull_secrets    = []
+    kubernetes_ingress_class_name    = null
+    kubernetes_disabled_components   = []
+    kubernetes_ingress_annotations   = {}
+    kubernetes_ingress_secret_name   = "authentik-outpost-tls"
+    kubernetes_httproute_annotations = {}
+    kubernetes_httproute_parent_refs = []
+    kubernetes_json_patches = {
+      deployment = [
+        {
+          op    = "add"
+          path  = "/spec/template/spec/containers/0/resources"
+          value = { limits = { memory = "256Mi" }, requests = { cpu = "10m", memory = "64Mi" } }
+        },
+      ]
+    }
+  })
+}
+
+# Ingress for `public-auth.viktorbarzin.me` — exposes the public outpost's
+# /outpost.goauthentik.io/* path so OAuth callbacks land on it. The
+# `Provider for Public` external_host points here, so all redirect_uris in
+# the OAuth flow resolve to this hostname.
+module "ingress_public_outpost" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # Public-tier outpost callback — the OAuth flow's redirect_uris all resolve
+  # here; gating it with forward-auth would loop the public outpost onto itself.
+  # auth = "none": Public outpost callback path for OAuth flow; protecting with forward-auth creates circular dependency.
+  auth             = "none"
+  namespace        = "authentik"
+  name             = "public-outpost"
+  host             = "public-auth"
+  service_name     = "ak-outpost-public"
+  port             = 9000
+  ingress_path     = ["/outpost.goauthentik.io"]
+  tls_secret_name  = var.tls_secret_name
+  dns_type         = "proxied"
+  anti_ai_scraping = false
+  exclude_crowdsec = true
+  homepage_enabled = false
+  depends_on       = [authentik_outpost.public]
+}
--- a/stacks/authentik/modules/authentik/main.tf
+++ b/stacks/authentik/modules/authentik/main.tf
@ -29,6 +29,7 @@ resource "kubernetes_namespace" "authentik" {
    labels = {
      tier                               = var.tier
      "resource-governance/custom-quota" = "true"
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -70,8 +71,12 @@ resource "helm_release" "authentik" {


 module "ingress" {
-  source           = "../../../../modules/kubernetes/ingress_factory"
-  dns_type        = "proxied"
+  source = "../../../../modules/kubernetes/ingress_factory"
+  # Authentik's own UI cannot be gated by Authentik forward-auth — that
+  # creates a chicken-and-egg loop (users can't reach the login page).
+  # auth = "none": Authentik UI cannot be gated by Authentik forward-auth (chicken-and-egg loop prevents login).
+  auth             = "none"
+  dns_type         = "proxied"
  namespace        = kubernetes_namespace.authentik.metadata[0].name
  name             = "authentik"
  service_name     = "goauthentik-server"
@ -91,7 +96,11 @@ module "ingress" {
 }

 module "ingress-outpost" {
-  source           = "../../../../modules/kubernetes/ingress_factory"
+  source = "../../../../modules/kubernetes/ingress_factory"
+  # Authentik forward-auth outpost callback path — protecting this with
+  # forward-auth would loop the outpost back onto itself.
+  # auth = "none": Authentik outpost callback path for forward-auth flow; protecting with forward-auth creates circular dependency.
+  auth             = "none"
  namespace        = kubernetes_namespace.authentik.metadata[0].name
  name             = "authentik-outpost"
  host             = "authentik"
--- a/stacks/authentik/modules/authentik/pgbouncer.tf
+++ b/stacks/authentik/modules/authentik/pgbouncer.tf
@ -66,9 +66,13 @@ resource "kubernetes_deployment" "pgbouncer" {
          }
        }
        container {
-          name              = "pgbouncer"
-          image             = "edoburu/pgbouncer:latest"
-          image_pull_policy = "IfNotPresent"
+          name  = "pgbouncer"
+          image = "edoburu/pgbouncer:latest"
+          # `:latest` tag — keep `Always` so pod restarts pick up upstream
+          # updates. The previous `IfNotPresent` value was declared at module
+          # creation but the live cluster has reconciled to `Always` (likely
+          # via a Helm/operator default). Match reality to drop the drift.
+          image_pull_policy = "Always"

          port {
            container_port = 6432
--- a/stacks/authentik/modules/authentik/values.yaml
+++ b/stacks/authentik/modules/authentik/values.yaml
@ -78,7 +78,10 @@ global:
  addPrometheusAnnotations: true

 worker:
-  replicas: 3
+  # 2 replicas: workers handle background tasks (LDAP sync, email,
+  # certificate renewal) — no user-facing traffic, so 2-of-3 isn't
+  # needed for availability. Drop saves ~100m sustained CPU.
+  replicas: 2
  # Same unauthenticated_age cap as server — both the server (Django session
  # middleware) and worker (cleanup tasks) need to see the value.
  env:
--- a/stacks/beads-server/main.tf
+++ b/stacks/beads-server/main.tf
@ -29,6 +29,7 @@ resource "kubernetes_namespace" "beads" {
    name = "beads-server"
    labels = {
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -43,7 +44,7 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
    name      = "dolt-data"
    namespace = kubernetes_namespace.beads.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "10Gi"
    }
@ -55,6 +56,13 @@ resource "kubernetes_persistent_volume_claim" "dolt_data" {
      requests = { storage = "2Gi" }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_config_map" "dolt_init" {
@ -67,6 +75,23 @@ resource "kubernetes_config_map" "dolt_init" {
      CREATE USER IF NOT EXISTS 'beads'@'%' IDENTIFIED BY '';
      GRANT ALL PRIVILEGES ON *.* TO 'beads'@'%' WITH GRANT OPTION;
    EOT
+    "02-create-presence-table.sql" = <<-EOT
+      CREATE DATABASE IF NOT EXISTS beads;
+      USE beads;
+      CREATE TABLE IF NOT EXISTS presence_claims (
+        session_id      VARCHAR(128)  NOT NULL,
+        resource_label  VARCHAR(255)  NOT NULL,
+        purpose         TEXT          NOT NULL,
+        claimed_at      DATETIME(3)   NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
+        expires_at      DATETIME(3)   NOT NULL,
+        host            VARCHAR(128)  NOT NULL,
+        user            VARCHAR(64)   NOT NULL,
+        agent_name      VARCHAR(64)   DEFAULT 'claude-code',
+        PRIMARY KEY (session_id, resource_label),
+        INDEX idx_resource (resource_label),
+        INDEX idx_expires  (expires_at)
+      );
+    EOT
  }
 }

@ -78,6 +103,16 @@ resource "kubernetes_deployment" "dolt" {
      app  = "dolt"
      tier = local.tiers.aux
    }
+    annotations = {
+      # Keel is namespace-enrolled (keel.sh/enrolled=true on the namespace),
+      # but this deployment opts OUT of auto-updates: dolthub/dolt-sql-server:latest
+      # currently resolves to a broken 0.50.10 build. Pinned image lives in the
+      # container spec below. Codified here so TF state matches live, no drift.
+      "keel.sh/policy"       = "never"
+      "keel.sh/match-tag"    = "true"
+      "keel.sh/trigger"      = "poll"
+      "keel.sh/pollSchedule" = "@every 1h"
+    }
  }
  spec {
    replicas = 1
@ -98,7 +133,12 @@ resource "kubernetes_deployment" "dolt" {
      spec {
        container {
          name  = "dolt"
-          image = "dolthub/dolt-sql-server:latest"
+          # Pinned to 2.0.3 — :latest currently resolves to 0.50.10 on dolthub
+          # (different versioning stream) whose docker-entrypoint.sh references
+          # an undefined docker_process_sql function and crash-loops on every
+          # init script in /docker-entrypoint-initdb.d. Keel can upgrade this
+          # tag in-cluster; the lifecycle.ignore_changes below preserves that.
+          image = "dolthub/dolt-sql-server:2.0.3"

          port {
            name           = "mysql"
@ -170,7 +210,59 @@ resource "kubernetes_deployment" "dolt" {
  }
  lifecycle {
    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
+      # Keel annotations are codified in metadata.annotations above (policy=never
+      # opts this deployment out of auto-updates — see the comment there).
+    ]
+  }
+}
+
+# One-shot Job to apply the presence_claims schema to the running Dolt server.
+# The dolt_init ConfigMap only fires on fresh PVCs; since Dolt already exists
+# with persistent state, this Job is the only path to update the live schema.
+# The job name is hashed off the SQL content so a new Job runs whenever the
+# schema changes; the SQL itself is idempotent (CREATE ... IF NOT EXISTS).
+resource "kubernetes_job" "presence_schema_migrate" {
+  metadata {
+    name      = "presence-schema-${substr(sha256(kubernetes_config_map.dolt_init.data["02-create-presence-table.sql"]), 0, 8)}"
+    namespace = kubernetes_namespace.beads.metadata[0].name
+  }
+  spec {
+    backoff_limit = 3
+    template {
+      metadata {}
+      spec {
+        restart_policy = "OnFailure"
+        container {
+          name    = "migrate"
+          image   = "mysql:8.4"
+          command = ["sh", "-c"]
+          args = [
+            "mysql -h dolt.beads-server.svc.cluster.local -P 3306 -u root < /sql/02-create-presence-table.sql"
+          ]
+          volume_mount {
+            name       = "sql"
+            mount_path = "/sql"
+          }
+        }
+        volume {
+          name = "sql"
+          config_map {
+            name = kubernetes_config_map.dolt_init.metadata[0].name
+          }
+        }
+      }
+    }
+  }
+  wait_for_completion = true
+  timeouts {
+    create = "5m"
+  }
+  depends_on = [kubernetes_deployment.dolt]
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
    ]
  }
 }
@ -374,7 +466,11 @@ resource "kubernetes_deployment" "workbench" {
  }
  lifecycle {
    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
    ]
  }
 }
@ -416,7 +512,8 @@ module "ingress" {
  namespace        = kubernetes_namespace.beads.metadata[0].name
  name             = "dolt-workbench"
  tls_secret_name  = var.tls_secret_name
-  protected        = false
+  # auth = "none": Dolt Workbench is client-side encrypted task database; no backend user auth required; Anubis PoW fronts ingress.
+  auth             = "none"
  exclude_crowdsec = true
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
@ -566,7 +663,7 @@ resource "kubernetes_deployment" "beadboard" {
        }

        container {
-          name  = "beadboard"
+          name = "beadboard"
          # Phase 3 cutover 2026-05-07 — Forgejo registry consolidation.
          image = "forgejo.viktorbarzin.me/viktor/beadboard:${var.beadboard_image_tag}"

@ -646,7 +743,11 @@ resource "kubernetes_deployment" "beadboard" {
  }
  lifecycle {
    ignore_changes = [
-      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE
    ]
  }
 }
@ -677,7 +778,7 @@ module "beadboard_ingress" {
  namespace        = kubernetes_namespace.beads.metadata[0].name
  name             = "beadboard"
  tls_secret_name  = var.tls_secret_name
-  protected        = true
+  auth             = "required"
  exclude_crowdsec = true
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
--- a/stacks/blog/.terraform.lock.hcl
+++ b/stacks/blog/.terraform.lock.hcl
@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

+provider "registry.terraform.io/goauthentik/authentik" {
+  version     = "2024.12.1"
+  constraints = "~> 2024.10"
+  hashes = [
+    "h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
+  ]
+}
+
 provider "registry.terraform.io/hashicorp/helm" {
  version = "3.1.1"
  hashes = [
--- a/stacks/blog/backend.tf
+++ b/stacks/blog/backend.tf
@ -1,7 +1,7 @@
 # Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
 terraform {
  backend "pg" {
-    conn_str    = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
+    conn_str    = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
    schema_name = "blog"
  }
 }
--- a/stacks/blog/main.tf
+++ b/stacks/blog/main.tf
@ -10,6 +10,7 @@ resource "kubernetes_namespace" "website" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -76,8 +77,12 @@ resource "kubernetes_deployment" "blog" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -116,23 +121,25 @@ resource "kubernetes_service" "blog" {
 # tiny PoW (~250ms desktop), get a 30-day cookie, and pass through. Replaces
 # the global ai-bot-block forwardAuth for this site.
 module "anubis" {
-  source     = "../../modules/kubernetes/anubis_instance"
-  name       = "blog"
-  namespace  = kubernetes_namespace.website.metadata[0].name
-  target_url = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
+  source           = "../../modules/kubernetes/anubis_instance"
+  name             = "blog"
+  namespace        = kubernetes_namespace.website.metadata[0].name
+  target_url       = "http://${kubernetes_service.blog.metadata[0].name}.${kubernetes_namespace.website.metadata[0].name}.svc.cluster.local"
+  shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/10"
 }

 module "ingress" {
  source            = "../../modules/kubernetes/ingress_factory"
+  auth              = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
  namespace         = kubernetes_namespace.website.metadata[0].name
  name              = "blog"
  service_name      = module.anubis.service_name
  port              = module.anubis.service_port
  extra_middlewares = ["traefik-x402@kubernetescrd"]
-  full_host        = "viktorbarzin.me"
-  dns_type         = "proxied"
-  tls_secret_name  = var.tls_secret_name
-  anti_ai_scraping = false # Anubis is the gatekeeper now — drop the redundant ai-bot-block forwardAuth.
+  full_host         = "viktorbarzin.me"
+  dns_type          = "proxied"
+  tls_secret_name   = var.tls_secret_name
+  anti_ai_scraping  = false # Anubis is the gatekeeper now — drop the redundant ai-bot-block forwardAuth.
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Blog"
@ -145,12 +152,24 @@ module "ingress" {

 module "ingress-www" {
  source            = "../../modules/kubernetes/ingress_factory"
+  auth              = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
  namespace         = kubernetes_namespace.website.metadata[0].name
  name              = "blog-www"
  service_name      = module.anubis.service_name
  port              = module.anubis.service_port
  extra_middlewares = ["traefik-x402@kubernetescrd"]
-  full_host        = "www.viktorbarzin.me"
-  tls_secret_name  = var.tls_secret_name
-  anti_ai_scraping = false
+  full_host         = "www.viktorbarzin.me"
+  tls_secret_name   = var.tls_secret_name
+  anti_ai_scraping  = false
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
+
+# CI retrigger v3 2026-05-16T14:06:39Z
+
+# CI retrigger v4 2026-05-16T14:13:59Z
+
+# CI retrigger v5 2026-05-16T23:10:38Z
+
+# CI retrigger v6 2026-05-16T23:18:58Z
--- a/stacks/blog/providers.tf
+++ b/stacks/blog/providers.tf
@ -9,6 +9,10 @@ terraform {
      source  = "cloudflare/cloudflare"
      version = "~> 4"
    }
+    authentik = {
+      source  = "goauthentik/authentik"
+      version = "~> 2024.10"
+    }
  }
 }

--- a/stacks/broker-sync/main.tf
+++ b/stacks/broker-sync/main.tf
@ -12,6 +12,7 @@ resource "kubernetes_namespace" "broker_sync" {
    labels = {
      "istio-injection" = "disabled"
      tier              = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -61,7 +62,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
    name      = "broker-sync-data-encrypted"
    namespace = kubernetes_namespace.broker_sync.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -73,6 +74,13 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
      requests = { storage = "1Gi" }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 locals {
@ -660,8 +668,13 @@ resource "kubernetes_cron_job_v1" "fidelity" {
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 3
    failed_jobs_history_limit     = 5
-    # Suspended until the broker-sync image ships with Playwright + Chromium.
-    suspend = true
+    # Unsuspended 2026-05-17 after the delta gains-offset emission landed
+    # (broker-sync @98c4729). Manual trigger:
+    #   kubectl -n broker-sync create job fid-now \
+    #     --from=cronjob/broker-sync-fidelity
+    # NB: storage_state expires every 30-90 days — see code-r9n for the
+    # chrome-service-driven re-seed runbook.
+    suspend = false
    job_template {
      metadata {}
      spec {
--- a/stacks/calico/main.tf
+++ b/stacks/calico/main.tf
@ -22,6 +22,9 @@ resource "kubernetes_namespace" "calico_system" {
    name = "calico-system"
    labels = {
      name = "calico-system"
+# calico-system namespace is managed by tigera-operator — auto-update is
+      # incompatible (operator reverts DaemonSet image from its Installation CR).
+      # "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -65,3 +68,66 @@ resource "kubernetes_namespace" "tigera_operator" {
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
 }
+
+# Wave 1 W1.6 (beads code-8ywc): observation phase via Calico GlobalNetworkPolicy
+# `action: Log`. This is the supported primitive on Calico OSS v3.26 — the
+# Calico-Enterprise FelixConfiguration.flowLogsFileEnabled approach is NOT
+# accepted by the OSS CRD (verified 2026-05-19: "strict decoding error").
+#
+# How it works:
+#   - GNP selects pods by namespaceSelector
+#   - egress rule action=Log writes an iptables NFLOG entry that lands in the
+#     kernel log / journald with prefix "calico-packet:" on each node
+#   - Alloy DaemonSet already ships node-journal to Loki (job=node-journal)
+#   - LogQL query: {job="node-journal"} |= "calico-packet" surfaces egress flows
+#   - After ~1 week of observation, build the empirical per-namespace egress
+#     allowlist; then flip the same GNP to [Allow specific dests, Deny rest]
+#
+# Started with `recruiter-responder` as the pilot on 2026-05-19; expanded
+# 2026-05-19 to all tier 3+4 namespaces (per locked plan — tier 3-edge has
+# 17 ns, tier 4-aux has 65 ns, all use Calico's WorkloadEndpoint policy
+# path). Tier 0/1/2 stay out of observation in wave 1 (cluster infra +
+# GPU workloads, deferred per the plan).
+#
+# `apply_only = true` on the kubectl_manifest means renaming the TF resource
+# does NOT destroy the old GNP via TF — we kubectl delete the legacy pilot
+# GNP after this applies to clean it up. (Tracked manually.)
+resource "kubectl_manifest" "wave1_egress_observe_tier34" {
+  yaml_body = yamlencode({
+    apiVersion = "projectcalico.org/v3"
+    kind       = "GlobalNetworkPolicy"
+    metadata = {
+      name = "wave1-egress-observe-tier34"
+      annotations = {
+        "security.viktorbarzin.me/wave"    = "1"
+        "security.viktorbarzin.me/purpose" = "observe-then-enforce egress for tier 3-edge + 4-aux"
+      }
+    }
+    spec = {
+      order             = 2000
+      selector          = "all()"
+      namespaceSelector = "tier in {\"3-edge\", \"4-aux\"}"
+      types             = ["Egress"]
+      egress = [
+        # Rule 1: log every egress packet (LOG target writes to kernel/journal,
+        # alloy ships to Loki with job=node-journal,transport=kernel).
+        # LogQL: {job="node-journal"} |~ "calico-packet"
+        { action = "Log" },
+        # Rule 2: allow everything (observation must NOT break workloads).
+        { action = "Allow" },
+      ]
+    }
+  })
+  apply_only = true
+}
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
+
+# CI retrigger v3 2026-05-16T14:06:39Z
+
+# CI retrigger v4 2026-05-16T14:13:59Z
+
+# CI retrigger v5 2026-05-16T23:10:38Z
+
+# CI retrigger v6 2026-05-16T23:18:58Z
--- a/stacks/changedetection/main.tf
+++ b/stacks/changedetection/main.tf
@ -9,6 +9,7 @@ resource "kubernetes_namespace" "changedetection" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -68,7 +69,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "changedetection-data-proxmox"
    namespace = kubernetes_namespace.changedetection.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "8Gi"
    }
@ -82,6 +83,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "changedetection" {
@ -187,8 +195,13 @@ resource "kubernetes_deployment" "changedetection" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -218,7 +231,7 @@ module "ingress" {
  namespace       = kubernetes_namespace.changedetection.metadata[0].name
  name            = "changedetection"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Changedetection"
--- a/stacks/chrome-service/main.tf
+++ b/stacks/chrome-service/main.tf
@ -24,6 +24,7 @@ resource "kubernetes_namespace" "chrome_service" {
      "istio-injection"                       = "disabled"
      tier                                    = local.tiers.aux
      "chrome-service.viktorbarzin.me/server" = "true"
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -74,7 +75,7 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
    name      = "chrome-service-profile-encrypted"
    namespace = kubernetes_namespace.chrome_service.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "10Gi"
    }
@ -88,6 +89,13 @@ resource "kubernetes_persistent_volume_claim" "profile_encrypted" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 # --- NFS backup target ---
@ -107,6 +115,12 @@ resource "kubernetes_deployment" "chrome_service" {
    namespace = kubernetes_namespace.chrome_service.metadata[0].name
    labels = merge(local.labels, {
      tier = local.tiers.aux
+      # Deliberate pin: chrome-service's playwright image MUST match
+      # the playwright Python version in f1-stream (see local.image
+      # comment above). Opt out of Keel auto-update via this label —
+      # the inject-keel-annotations ClusterPolicy excludes workloads
+      # selector-matching keel.sh/policy=never.
+      "keel.sh/policy" = "never"
    })
    annotations = {
      "reloader.stakater.com/auto" = "true"
@ -304,8 +318,12 @@ resource "kubernetes_deployment" "chrome_service" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -354,7 +372,7 @@ module "ingress" {
  namespace       = kubernetes_namespace.chrome_service.metadata[0].name
  name            = "chrome"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  # noVNC defaults to /vnc.html — auto-redirect / there.
  ingress_path = ["/"]
  extra_annotations = {
--- a/stacks/city-guesser/main.tf
+++ b/stacks/city-guesser/main.tf
@ -10,6 +10,7 @@ resource "kubernetes_namespace" "city-guesser" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -67,8 +68,13 @@ resource "kubernetes_deployment" "city-guesser" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -99,7 +105,7 @@ module "ingress" {
  namespace       = "city-guesser"
  name            = "city-guesser"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "City Guesser"
--- a/stacks/claude-agent-service/main.tf
+++ b/stacks/claude-agent-service/main.tf
@ -12,7 +12,7 @@ locals {
  namespace = "claude-agent"
  # Phase 3 cutover 2026-05-07 — see infra/docs/plans/2026-05-07-forgejo-registry-consolidation-plan.md.
  image     = "forgejo.viktorbarzin.me/viktor/claude-agent-service"
-  image_tag = "2fd7670d"
+  image_tag = "191ed5dd"
  labels = {
    app = "claude-agent-service"
  }
@ -191,27 +191,25 @@ resource "kubernetes_cluster_role_binding" "claude_agent" {
 }

 # --- Storage ---
-
-resource "kubernetes_persistent_volume_claim" "workspace" {
-  wait_until_bound = false
-  metadata {
-    name      = "claude-agent-workspace-encrypted"
-    namespace = kubernetes_namespace.claude_agent.metadata[0].name
-    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
-      "resize.topolvm.io/increase"      = "100%"
-      "resize.topolvm.io/storage_limit" = "20Gi"
-    }
-  }
-  spec {
-    access_modes       = ["ReadWriteOnce"]
-    storage_class_name = "proxmox-lvm-encrypted"
-    resources {
-      requests = {
-        storage = "10Gi"
-      }
-    }
-  }
+#
+# The `workspace` volume in the deployment is intentionally emptyDir — agent
+# jobs do fresh git clones each run, so a per-pod scratch dir on node disk
+# is faster and isolated. The 10Gi `claude-agent-workspace-encrypted` PVC
+# that previously sat next to this comment was created but never wired
+# into the deployment (sat idle from 2026-04-15 to 2026-05-11).
+#
+# For cases where the agent DOES need to persist state across pod restarts
+# (caches, ad-hoc outputs, anything that should survive a pod reschedule),
+# `module.persistent` below provides a 5Gi NFS-backed RWX volume mounted
+# at /persistent. RWX so all 3 replicas can read/write the same dir;
+# sequential job mutex in the service prevents concurrent writes.
+module "persistent" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "claude-agent-persistent"
+  namespace  = kubernetes_namespace.claude_agent.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs/claude-agent-persistent"
+  storage    = "5Gi"
 }

 # --- Deployment ---
@ -251,11 +249,15 @@ resource "kubernetes_deployment" "claude_agent" {
          fs_group     = 1000
        }

-        # Fix workspace ownership (PVC may have root-owned files from prior run)
+        # Fix workspace ownership. Kubelet creates the Dockerfile WORKDIR
+        # (/workspace/infra) inside the emptyDir as root:gid=fsGroup with
+        # the setgid bit — uid 1000 can't write into it without explicit
+        # chown + chmod. Pre-create so the path is guaranteed, then chown
+        # recursively and chmod the infra subdir for safety.
        init_container {
          name    = "fix-perms"
          image   = "busybox:1.37"
-          command = ["sh", "-c", "chown -R 1000:1000 /workspace"]
+          command = ["sh", "-c", "mkdir -p /workspace/infra /persistent && chown -R 1000:1000 /workspace /persistent && chmod 0775 /workspace/infra /persistent"]
          security_context {
            run_as_user = 0
          }
@ -263,6 +265,10 @@ resource "kubernetes_deployment" "claude_agent" {
            name       = "workspace"
            mount_path = "/workspace"
          }
+          volume_mount {
+            name       = "persistent"
+            mount_path = "/persistent"
+          }
          resources {
            requests = {
              memory = "32Mi"
@ -368,6 +374,7 @@ resource "kubernetes_deployment" "claude_agent" {
            mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
            cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json
            cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
+            cp /usr/share/agent-seed/recruiter-triage.md /home/agent/.claude/agents/recruiter-triage.md
          EOT
          ]

@ -431,6 +438,10 @@ resource "kubernetes_deployment" "claude_agent" {
            name       = "workspace"
            mount_path = "/workspace"
          }
+          volume_mount {
+            name       = "persistent"
+            mount_path = "/persistent"
+          }
          volume_mount {
            name       = "sops-age-key"
            mount_path = "/home/agent/.config/sops/age"
@ -453,8 +464,16 @@ resource "kubernetes_deployment" "claude_agent" {

        volume {
          name = "workspace"
+          # Per-pod ephemeral scratch — agent does fresh git clones each
+          # job, so node-disk emptyDir is faster than a network-backed PVC
+          # and avoids RWO contention across the 3 replicas.
+          empty_dir {}
+        }
+
+        volume {
+          name = "persistent"
          persistent_volume_claim {
-            claim_name = kubernetes_persistent_volume_claim.workspace.metadata[0].name
+            claim_name = module.persistent.claim_name
          }
        }

--- a/stacks/claude-agent-service/terragrunt.hcl
+++ b/stacks/claude-agent-service/terragrunt.hcl
@ -0,0 +1,18 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+
+dependency "vault" {
+  config_path  = "../vault"
+  skip_outputs = true
+}
+
+dependency "external-secrets" {
+  config_path  = "../external-secrets"
+  skip_outputs = true
+}
--- a/stacks/claude-memory/main.tf
+++ b/stacks/claude-memory/main.tf
@ -6,6 +6,7 @@ variable "postgresql_host" { type = string }
 variable "claude_memory_db_password" {
  type      = string
  sensitive = true
+  default   = "" # falls back to Vault `secret/claude-memory.db_password` below
 }

 data "vault_kv_secret_v2" "secrets" {
@ -18,6 +19,7 @@ resource "kubernetes_namespace" "claude-memory" {
    name = "claude-memory"
    labels = {
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -112,11 +114,13 @@ resource "kubernetes_job" "db_init" {
            "sh", "-c",
            <<-EOT
              set -e
-              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
-                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${var.claude_memory_db_password}'"
-              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
-                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "CREATE DATABASE claude_memory OWNER claude_memory"
-              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
+              # -d postgres: psql defaults database name to username; root user
+              # doesn't have a root-named database, so be explicit.
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_roles WHERE rolname='claude_memory'" | grep -q 1 || \
+                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE ROLE claude_memory WITH LOGIN PASSWORD '${coalesce(var.claude_memory_db_password, data.vault_kv_secret_v2.secrets.data["db_password"])}'"
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -tc "SELECT 1 FROM pg_database WHERE datname='claude_memory'" | grep -q 1 || \
+                PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "CREATE DATABASE claude_memory OWNER claude_memory"
+              PGPASSWORD='${data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]}' psql -h ${var.postgresql_host} -U root -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE claude_memory TO claude_memory"
              echo "Database init complete"
            EOT
          ]
@ -246,6 +250,9 @@ resource "kubernetes_deployment" "claude-memory" {
    ignore_changes = [
      spec[0].template[0].spec[0].container[0].image,
      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
    ]
  }
 }
@ -274,7 +281,11 @@ resource "kubernetes_service" "claude-memory" {
 }

 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # MCP server — called by Claude Code (and other tools/agents) via app-layer
+  # bearer-token auth; forward-auth would break programmatic clients.
+  # auth = "none": MCP server called by Claude Code via bearer-token auth; forward-auth would break programmatic clients.
+  auth            = "none"
  dns_type        = "proxied"
  namespace       = kubernetes_namespace.claude-memory.metadata[0].name
  name            = "claude-memory"
--- a/stacks/cloudflared/modules/cloudflared/cloudflare.tf
+++ b/stacks/cloudflared/modules/cloudflared/cloudflare.tf
@ -50,6 +50,22 @@ locals {
  }
 }

+# Zone-level Bot Management. ai_bots_protection was "block" — CF returned
+# 403 to declared AI bot UAs at the edge, so the in-cluster x402 gateway
+# never got a chance to issue HTTP 402 with a payment offer. Flipped to
+# "disabled" so AI bots reach Traefik → x402, which returns 402 with the
+# wallet address. Generic Bot Fight Mode + crawler protection stay on.
+# (import {} stanza for adoption lives in the root stack — TF restriction.)
+resource "cloudflare_bot_management" "zone" {
+  zone_id            = var.cloudflare_zone_id
+  enable_js          = true
+  fight_mode         = true
+  ai_bots_protection = "disabled"
+  # crawler_protection / is_robots_txt_managed are settable only via newer
+  # provider versions; they retain whatever the API currently has
+  # (crawler_protection=enabled, is_robots_txt_managed=true).
+}
+
 resource "cloudflare_zero_trust_tunnel_cloudflared_config" "sof" {
  account_id = var.cloudflare_account_id
  tunnel_id  = var.cloudflare_tunnel_id
@ -152,57 +168,57 @@ resource "cloudflare_record" "mail_spf" {
 }

 resource "cloudflare_record" "mail_domainkey_rspamd" {
-  content  = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
-  name     = "mail._domainkey.viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "TXT"
-  zone_id  = var.cloudflare_zone_id
+  content = "\"v=DKIM1; h=sha256; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs9XHeFBKhUAEJSikXx+P49Q3nEBbnaSpn6h/9TqIhKaZWSVa2uGUGYQieNdon7DEJZ0VFo0Tvm3/UFsy2qF7ZmF+E/+N8EmkcPrMlxgJT281dpk5DxrZ+kbzw/DosfHH71K6vCLB4rSexzxJHaAx0AUddI3bFUJGjMgCXXCMZF+p8YCx+DDGPIXz2FOTtlJlR7aeZ2xXavwE/lBfI3MLnsq7X+GhPjQEax070nndOdZI0S8HpZkVxdGWl1N2Ec6LukYm2RiUkEMMQHSYX7WF3JBc+CGqUyd706Iy/5oeC3UGwZSM2uLkrp8YBjmw/h1rAeyv/ITt6ZXraP/cIMRiVQIDAQAB\""
+  name    = "mail._domainkey.viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "TXT"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "brevo_domainkey1" {
-  content  = "b1.viktorbarzin-me.dkim.brevo.com."
-  name     = "brevo1._domainkey.viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "CNAME"
-  zone_id  = var.cloudflare_zone_id
+  content = "b1.viktorbarzin-me.dkim.brevo.com."
+  name    = "brevo1._domainkey.viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "CNAME"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "brevo_domainkey2" {
-  content  = "b2.viktorbarzin-me.dkim.brevo.com."
-  name     = "brevo2._domainkey.viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "CNAME"
-  zone_id  = var.cloudflare_zone_id
+  content = "b2.viktorbarzin-me.dkim.brevo.com."
+  name    = "brevo2._domainkey.viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "CNAME"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "brevo_code" {
-  content  = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
-  name     = "viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "TXT"
-  zone_id  = var.cloudflare_zone_id
+  content = "\"brevo-code:a6ef1dd91b248559900246eb4e7ceebd\""
+  name    = "viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "TXT"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "mail_mta_sts" {
-  content  = "\"v=STSv1; id=20260412\""
-  name     = "_mta-sts.viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "TXT"
-  zone_id  = var.cloudflare_zone_id
+  content = "\"v=STSv1; id=20260412\""
+  name    = "_mta-sts.viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "TXT"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "mail_tlsrpt" {
-  content  = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
-  name     = "_smtp._tls.viktorbarzin.me"
-  proxied  = false
-  ttl      = 1
-  type     = "TXT"
-  zone_id  = var.cloudflare_zone_id
+  content = "\"v=TLSRPTv1; rua=mailto:postmaster@viktorbarzin.me\""
+  name    = "_smtp._tls.viktorbarzin.me"
+  proxied = false
+  ttl     = 1
+  type    = "TXT"
+  zone_id = var.cloudflare_zone_id
 }

 resource "cloudflare_record" "mail_dmarc" {
--- a/stacks/cloudflared/modules/cloudflared/main.tf
+++ b/stacks/cloudflared/modules/cloudflared/main.tf
@ -6,7 +6,8 @@ resource "kubernetes_namespace" "cloudflared" {
  metadata {
    name = "cloudflared"
    labels = {
-      tier = var.tier
+      tier               = var.tier
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
--- a/stacks/coturn/main.tf
+++ b/stacks/coturn/main.tf
@ -52,6 +52,7 @@ resource "kubernetes_namespace" "coturn" {
    name = "coturn"
    labels = {
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -194,8 +195,13 @@ resource "kubernetes_deployment" "coturn" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

--- a/stacks/crowdsec/modules/crowdsec/main.tf
+++ b/stacks/crowdsec/modules/crowdsec/main.tf
@ -29,6 +29,7 @@ resource "kubernetes_namespace" "crowdsec" {
    labels = {
      tier                               = var.tier
      "resource-governance/custom-quota" = "true"
+      "keel.sh/enrolled"                 = "true"
    }
  }
  lifecycle {
@ -282,7 +283,7 @@ module "ingress" {
  dns_type         = "proxied"
  namespace        = kubernetes_namespace.crowdsec.metadata[0].name
  name             = "crowdsec-web"
-  protected        = true
+  auth             = "required"
  tls_secret_name  = var.tls_secret_name
  exclude_crowdsec = true
 }
--- a/stacks/cyberchef/.terraform.lock.hcl
+++ b/stacks/cyberchef/.terraform.lock.hcl
@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

+provider "registry.terraform.io/goauthentik/authentik" {
+  version     = "2024.12.1"
+  constraints = "~> 2024.10"
+  hashes = [
+    "h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
+  ]
+}
+
 provider "registry.terraform.io/hashicorp/helm" {
  version = "3.1.1"
  hashes = [
--- a/stacks/cyberchef/backend.tf
+++ b/stacks/cyberchef/backend.tf
@ -1,7 +1,7 @@
 # Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
 terraform {
  backend "pg" {
-    conn_str    = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
+    conn_str    = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
    schema_name = "cyberchef"
  }
 }
--- a/stacks/cyberchef/main.tf
+++ b/stacks/cyberchef/main.tf
@ -9,6 +9,7 @@ resource "kubernetes_namespace" "cyberchef" {
    name = "cyberchef"
    labels = {
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -77,8 +78,12 @@ resource "kubernetes_deployment" "cyberchef" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -105,22 +110,24 @@ resource "kubernetes_service" "cyberchef" {


 module "anubis" {
-  source     = "../../modules/kubernetes/anubis_instance"
-  name       = "cc"
-  namespace  = kubernetes_namespace.cyberchef.metadata[0].name
-  target_url = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
+  source           = "../../modules/kubernetes/anubis_instance"
+  name             = "cc"
+  namespace        = kubernetes_namespace.cyberchef.metadata[0].name
+  target_url       = "http://${kubernetes_service.cyberchef.metadata[0].name}.${kubernetes_namespace.cyberchef.metadata[0].name}.svc.cluster.local"
+  shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/5"
 }

 module "ingress" {
  source            = "../../modules/kubernetes/ingress_factory"
+  auth              = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
  dns_type          = "proxied"
  namespace         = kubernetes_namespace.cyberchef.metadata[0].name
  name              = "cc"
  service_name      = module.anubis.service_name
  port              = module.anubis.service_port
  extra_middlewares = ["traefik-x402@kubernetescrd"]
-  tls_secret_name  = var.tls_secret_name
-  anti_ai_scraping = false
+  tls_secret_name   = var.tls_secret_name
+  anti_ai_scraping  = false
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "CyberChef"
@ -130,3 +137,14 @@ module "ingress" {
    "gethomepage.dev/pod-selector" = ""
  }
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
+
+# CI retrigger v3 2026-05-16T14:06:39Z
+
+# CI retrigger v4 2026-05-16T14:13:59Z
+
+# CI retrigger v5 2026-05-16T23:10:38Z
+
+# CI retrigger v6 2026-05-16T23:18:58Z
--- a/stacks/cyberchef/providers.tf
+++ b/stacks/cyberchef/providers.tf
@ -9,6 +9,10 @@ terraform {
      source  = "cloudflare/cloudflare"
      version = "~> 4"
    }
+    authentik = {
+      source  = "goauthentik/authentik"
+      version = "~> 2024.10"
+    }
  }
 }

--- a/stacks/dashy/main.tf
+++ b/stacks/dashy/main.tf
@ -16,6 +16,7 @@ resource "kubernetes_namespace" "dashy" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -100,8 +101,13 @@ resource "kubernetes_deployment" "dashy" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -132,5 +138,5 @@ module "ingress" {
  namespace       = kubernetes_namespace.dashy.metadata[0].name
  name            = "dashy"
  tls_secret_name = var.tls_secret_name
-  protected       = true # hidden as we use homepage now
+  auth            = "required" # hidden as we use homepage now
 }
--- a/stacks/dawarich/main.tf
+++ b/stacks/dawarich/main.tf
@ -17,6 +17,7 @@ resource "kubernetes_namespace" "dawarich" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
 }
@ -325,7 +326,13 @@ resource "kubernetes_deployment" "dawarich" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -432,7 +439,13 @@ resource "kubernetes_service" "dawarich" {
 #   }
 # }
 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # owntracks bridge hook posts to /api/v1/owntracks/points?api_key=... from
+  # outside the cluster; mobile location apps also POST programmatically with
+  # an api_key. Forward-auth would 302 these clients into a login they can't
+  # complete. Dawarich enforces api_key at app layer.
+  # auth = "none": Location tracking API — mobile apps + OwnTracks bridge POST via api_key; forward-auth 302s break programmatic clients.
+  auth            = "none"
  dns_type        = "proxied"
  namespace       = kubernetes_namespace.dawarich.metadata[0].name
  name            = "dawarich"
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -131,6 +131,18 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
      "app.kubernetes.io/instance"  = "mysql-standalone"
      "app.kubernetes.io/component" = "primary"
    }
+    # Explicit Keel opt-out. The dbaas namespace is already excluded
+    # from the `inject-keel-annotations` Kyverno ClusterPolicy, but the
+    # StatefulSet historically picked up Keel annotations anyway (from
+    # an earlier version of that policy that didn't have the exclusion
+    # list). `keel.sh/policy: never` makes Keel skip this resource even
+    # if those legacy annotations are still present, so we cannot be
+    # silently bumped to a new MySQL version again.
+    #
+    # Lifting this MUST go through docs/plans/2026-05-19-mysql-8.4.9-upgrade-*.
+    annotations = {
+      "keel.sh/policy" = "never"
+    }
  }
  spec {
    service_name = "mysql-standalone"
@ -167,8 +179,28 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
        }

        container {
-          name  = "mysql"
-          image = "mysql:8.4"
+          name = "mysql"
+          # ─────────────────────────────────────────────────────────────
+          # ⚠️  DO NOT BUMP THIS IMAGE WITHOUT FOLLOWING THE PLAN  ⚠️
+          # ─────────────────────────────────────────────────────────────
+          # Pinned to mysql:8.4.8 EXACTLY. The in-server DD upgrade from
+          # 80408 → 80409 stalls reliably on this hardware (24s of writes
+          # then no progress, no CPU, never completes). The 2026-05-18
+          # recovery from the failed auto-bump took ~25 min of full
+          # MySQL downtime + Forgejo/registry/7 apps cascade.
+          #
+          # To go to 8.4.9 (or any later version), follow:
+          #   docs/plans/2026-05-19-mysql-8.4.9-upgrade-design.md
+          #   docs/plans/2026-05-19-mysql-8.4.9-upgrade-plan.md
+          #   Beads: code-963q
+          #
+          # The upgrade path is wipe + re-init (NOT in-place DD upgrade).
+          # Requires: maintenance window, fresh dump, Vault user reset.
+          #
+          # History: code-eme8 (initial outage), code-k40p (recovery).
+          # See also: docs/runbooks/restore-mysql.md.
+          # ─────────────────────────────────────────────────────────────
+          image = "mysql:8.4.8"

          port {
            container_port = 3306
@ -240,7 +272,7 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
      metadata {
        name = "data"
        annotations = {
-          "resize.topolvm.io/threshold"     = "80%"
+          "resize.topolvm.io/threshold"     = "10%"
          "resize.topolvm.io/increase"      = "100%"
          "resize.topolvm.io/storage_limit" = "50Gi"
        }
@ -346,7 +378,7 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
    name      = "dbaas-pgadmin-encrypted"
    namespace = kubernetes_namespace.dbaas.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -360,6 +392,13 @@ resource "kubernetes_persistent_volume_claim" "pgadmin_encrypted" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 module "nfs_postgresql_backup_host" {
@ -791,7 +830,7 @@ module "ingress" {
  namespace         = kubernetes_namespace.dbaas.metadata[0].name
  name              = "pma"
  tls_secret_name   = var.tls_secret_name
-  protected         = true
+  auth              = "required"
  extra_annotations = {}
 }

@ -1043,12 +1082,12 @@ module "ingress" {
 # Ensure the CNPG cluster manifest exists (idempotent kubectl apply)
 resource "null_resource" "pg_cluster" {
  triggers = {
-    instances     = "2"
+    instances     = "3"
    image         = "ghcr.io/cloudnative-pg/postgis:16"
    storage_size  = "20Gi"
    storage_class = "proxmox-lvm-encrypted"
-    memory_limit  = "2Gi"
-    pg_params     = "v2-shared512-walcomp-workmem16"
+    memory_limit  = "3Gi"
+    pg_params     = "v3-shared1024-walcomp-workmem16-max200"
  }

  provisioner "local-exec" {
@ -1060,13 +1099,26 @@ resource "null_resource" "pg_cluster" {
        name: pg-cluster
        namespace: dbaas
      spec:
-        instances: 2
+        # 3 instances (1 primary + 2 replicas) so a single-node drain (e.g.
+        # kured's weekly OS-reboot wave) still leaves a primary candidate
+        # immediately available for switchover. Previously 2; CNPG would
+        # still failover with 2 but only if the lone replica was caught up
+        # — during a long WAL backlog the failover would stall the drain.
+        # Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
+        instances: 3
        imageName: ghcr.io/cloudnative-pg/postgis:16
        postgresql:
          parameters:
            search_path: '"$user", public'
-            shared_buffers: "512MB"
-            effective_cache_size: "1536MB"
+            # Cluster grew past the 100-conn default ceiling (~90/100 idle
+            # steady-state in May 2026; authentik+matrix alone hold ~55).
+            # Bumped to 200 with shared_buffers/effective_cache_size/memory
+            # scaled proportionally. work_mem stays at 16MB — that's per
+            # sort/hash op, not per connection, so 16MB * 200 isn't the
+            # worst case.
+            max_connections: "200"
+            shared_buffers: "1024MB"
+            effective_cache_size: "2560MB"
            work_mem: "16MB"
            wal_compression: "on"
            random_page_cost: "4"
@ -1075,7 +1127,9 @@ resource "null_resource" "pg_cluster" {
        enableSuperuserAccess: true
        inheritedMetadata:
          annotations:
-            resize.topolvm.io/threshold: "80%"
+            # threshold = free-space % below which autoresizer expands.
+            # 10% means "expand when 90% used" (the conventional knob).
+            resize.topolvm.io/threshold: "10%"
            resize.topolvm.io/increase: "20%"
            resize.topolvm.io/storage_limit: "100Gi"
        storage:
@ -1084,9 +1138,9 @@ resource "null_resource" "pg_cluster" {
        resources:
          requests:
            cpu: "50m"
-            memory: "2Gi"
+            memory: "3Gi"
          limits:
-            memory: "2Gi"
+            memory: "3Gi"
      EOF
    EOT
  }
@ -1149,7 +1203,8 @@ resource "null_resource" "pg_terraform_state_db" {

  provisioner "local-exec" {
    command = <<-EOT
-      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
        bash -c '
          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'terraform_state'"'"'" | grep -q 1 || \
            psql -U postgres -c "CREATE ROLE terraform_state WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1173,7 +1228,8 @@ resource "null_resource" "pg_payslip_ingest_db" {

  provisioner "local-exec" {
    command = <<-EOT
-      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
        bash -c '
          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'payslip_ingest'"'"'" | grep -q 1 || \
            psql -U postgres -c "CREATE ROLE payslip_ingest WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1197,7 +1253,8 @@ resource "null_resource" "pg_job_hunter_db" {

  provisioner "local-exec" {
    command = <<-EOT
-      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
        bash -c '
          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'job_hunter'"'"'" | grep -q 1 || \
            psql -U postgres -c "CREATE ROLE job_hunter WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
@ -1209,6 +1266,35 @@ resource "null_resource" "pg_job_hunter_db" {
  }
 }

+# Postiz: 3 databases (postiz, temporal, temporal_visibility) all owned by the
+# `postiz` role. Bundled bitnami PostgreSQL was retired 2026-05-09 in favour of
+# this CNPG cluster — covered by postgresql-backup-per-db automatically.
+# Role password placeholder; Vault static role `pg-postiz` rotates 7d.
+resource "null_resource" "pg_postiz_dbs" {
+  depends_on = [null_resource.pg_cluster]
+
+  triggers = {
+    role = "postiz"
+    dbs  = "postiz,temporal,temporal_visibility"
+  }
+
+  provisioner "local-exec" {
+    command = <<-EOT
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
+        bash -c '
+          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'postiz'"'"'" | grep -q 1 || \
+            psql -U postgres -c "CREATE ROLE postiz WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
+          for db in postiz temporal temporal_visibility; do
+            psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'$db'"'"'" | grep -q 1 || \
+              psql -U postgres -c "CREATE DATABASE $db OWNER postiz"
+            psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE $db TO postiz"
+          done
+        '
+    EOT
+  }
+}
+
 # Create wealthfolio_sync database for the SQLite→PG ETL sidecar that mirrors
 # Wealthfolio's daily_account_valuation/accounts/activities into PG so Grafana
 # can chart net worth, contributions, and growth.
@ -1264,6 +1350,35 @@ resource "null_resource" "pg_fire_planner_db" {
  }
 }

+# Create instagram_poster database for the IG-curation pipeline. Initial use:
+# benchmark_score table written by `instagram_poster.benchmark` CLI (vision-LLM
+# scoring per Immich asset). Future: migrate story_queue/decision/ig_posted_media
+# off the pod's sqlite PVC into this DB so the pod is fully stateless.
+# Role password is managed by Vault Database Secrets Engine
+# (static role `pg-instagram-poster`, 7d rotation).
+resource "null_resource" "pg_instagram_poster_db" {
+  depends_on = [null_resource.pg_cluster]
+
+  triggers = {
+    db_name  = "instagram_poster"
+    username = "instagram_poster"
+  }
+
+  provisioner "local-exec" {
+    command = <<-EOT
+      PRIMARY=$(kubectl --kubeconfig ${var.kube_config_path} get cluster -n dbaas pg-cluster -o jsonpath='{.status.currentPrimary}')
+      kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas $PRIMARY -c postgres -- \
+        bash -c '
+          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
+            psql -U postgres -c "CREATE ROLE instagram_poster WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
+          psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'instagram_poster'"'"'" | grep -q 1 || \
+            psql -U postgres -c "CREATE DATABASE instagram_poster OWNER instagram_poster"
+          psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE instagram_poster TO instagram_poster"
+        '
+    EOT
+  }
+}
+
 # Old PostgreSQL deployment — kept commented for rollback reference
 # resource "kubernetes_deployment" "postgres" {
 #   metadata {
@ -1400,7 +1515,7 @@ module "ingress-pgadmin" {
  namespace       = kubernetes_namespace.dbaas.metadata[0].name
  name            = "pgadmin"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
 }


--- a/stacks/descheduler/main.tf
+++ b/stacks/descheduler/main.tf
@ -4,7 +4,8 @@ resource "kubernetes_namespace" "descheduler" {
  metadata {
    name = "descheduler"
    labels = {
-      tier = local.tiers.cluster
+      tier               = local.tiers.cluster
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -94,3 +95,14 @@ resource "helm_release" "descheduler" { # rename me

  values = [templatefile("${path.module}/values.yaml", {})]
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
+
+# CI retrigger v3 2026-05-16T14:06:39Z
+
+# CI retrigger v4 2026-05-16T14:13:59Z
+
+# CI retrigger v5 2026-05-16T23:10:38Z
+
+# CI retrigger v6 2026-05-16T23:18:58Z
--- a/stacks/diun/main.tf
+++ b/stacks/diun/main.tf
@ -10,6 +10,7 @@ resource "kubernetes_namespace" "diun" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -91,7 +92,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "diun-data-proxmox"
    namespace = kubernetes_namespace.diun.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -105,6 +106,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "diun" {
@ -230,6 +238,12 @@ resource "kubernetes_deployment" "diun" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }
--- a/stacks/ebook2audiobook/main.tf
+++ b/stacks/ebook2audiobook/main.tf
@ -17,6 +17,7 @@ resource "kubernetes_namespace" "ebook2audiobook" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.gpu
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -120,8 +121,13 @@ resource "kubernetes_deployment" "ebook2audiobook" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -254,7 +260,7 @@ module "ingress" {
  namespace       = kubernetes_namespace.ebook2audiobook.metadata[0].name
  name            = "ebook2audiobook"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Ebook2Audiobook"
@ -322,8 +328,13 @@ resource "kubernetes_deployment" "audiblez" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -412,8 +423,13 @@ resource "kubernetes_deployment" "audiblez-web" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -445,7 +461,7 @@ module "audiblez-web-ingress" {
  host            = "audiblez"
  dns_type        = "non-proxied"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  max_body_size   = "500m" # Allow large EPUB uploads
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
--- a/stacks/ebooks/main.tf
+++ b/stacks/ebooks/main.tf
@ -9,6 +9,7 @@ resource "kubernetes_namespace" "ebooks" {
    name = "ebooks"
    labels = {
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -150,7 +151,7 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
    name      = "ebooks-calibre-config-proxmox"
    namespace = kubernetes_namespace.ebooks.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "50%"
      "resize.topolvm.io/storage_limit" = "10Gi"
    }
@ -164,6 +165,13 @@ resource "kubernetes_persistent_volume_claim" "calibre_config_iscsi" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 module "nfs_calibre_ingest_host" {
@ -205,7 +213,7 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
    name      = "ebooks-abs-config-proxmox"
    namespace = kubernetes_namespace.ebooks.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -219,6 +227,13 @@ resource "kubernetes_persistent_volume_claim" "abs_config_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 module "nfs_audiobookshelf_metadata_host" {
@ -350,7 +365,13 @@ resource "kubernetes_deployment" "calibre-web-automated" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -378,6 +399,7 @@ resource "kubernetes_service" "calibre" {

 module "calibre_ingress" {
  source          = "../../modules/kubernetes/ingress_factory"
+  auth            = "required"
  dns_type        = "proxied"
  namespace       = kubernetes_namespace.ebooks.metadata[0].name
  name            = "calibre"
@ -470,7 +492,13 @@ resource "kubernetes_deployment" "annas-archive-stacks" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -502,7 +530,7 @@ module "stacks_ingress" {
  name            = "stacks"
  service_name    = "annas-archive-stacks"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled" = "false"
  }
@ -619,7 +647,13 @@ resource "kubernetes_deployment" "audiobookshelf" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -646,7 +680,11 @@ resource "kubernetes_service" "audiobookshelf" {
 }

 module "audiobookshelf_ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "app": Audiobookshelf has its own user/password login + API
+  # tokens used by the iOS/Android Audiobookshelf app. Authentik forward-auth
+  # was 302-ing the mobile clients; ABS's own auth gates users.
+  auth            = "app"
  dns_type        = "non-proxied"
  namespace       = kubernetes_namespace.ebooks.metadata[0].name
  name            = "audiobookshelf"
@ -890,7 +928,13 @@ resource "kubernetes_deployment" "book_search" {
    }
  }
  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -921,7 +965,7 @@ module "book_search_ingress" {
  namespace       = kubernetes_namespace.ebooks.metadata[0].name
  name            = "book-search"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Book Search"
@ -940,6 +984,7 @@ module "book_search_api_ingress" {
  host            = "book-search"
  service_name    = "book-search"
  tls_secret_name = var.tls_secret_name
-  protected       = false
+  # auth = "none": Book Search API endpoints — API key auth handled by backend; forward-auth would block downloads.
+  auth            = "none"
  ingress_path    = ["/api/download-url", "/api/download-status", "/api/send-to-kindle", "/shortcut"]
 }
--- a/stacks/echo/main.tf
+++ b/stacks/echo/main.tf
@ -10,6 +10,7 @@ resource "kubernetes_namespace" "echo" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -74,8 +75,13 @@ resource "kubernetes_deployment" "echo" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -101,7 +107,11 @@ resource "kubernetes_service" "echo" {
 }

 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # echo is a header-reflecting diagnostic — public so it's reachable for
+  # forward-auth smoke-testing. Anyone visiting echo.viktorbarzin.me sees
+  # exactly which X-authentik-* headers Traefik forwarded to backends.
+  auth            = "public"
  dns_type        = "proxied"
  namespace       = kubernetes_namespace.echo.metadata[0].name
  name            = "echo"
--- a/stacks/excalidraw/main.tf
+++ b/stacks/excalidraw/main.tf
@ -11,6 +11,7 @@ resource "kubernetes_namespace" "excalidraw" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -32,7 +33,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "excalidraw-data-proxmox"
    namespace = kubernetes_namespace.excalidraw.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -46,6 +47,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "excalidraw" {
@ -117,8 +125,13 @@ resource "kubernetes_deployment" "excalidraw" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -149,7 +162,7 @@ module "ingress" {
  namespace       = kubernetes_namespace.excalidraw.metadata[0].name
  name            = "draw"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Excalidraw"
--- a/stacks/external-secrets/main.tf
+++ b/stacks/external-secrets/main.tf
@ -3,6 +3,7 @@ resource "kubernetes_namespace" "external_secrets" {
    name = "external-secrets"
    labels = {
      tier = local.tiers.cluster
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@ -13,6 +13,7 @@ resource "kubernetes_namespace" "f1-stream" {
      "istio-injection" : "disabled"
      tier                                    = local.tiers.aux
      "chrome-service.viktorbarzin.me/client" = "true"
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -83,7 +84,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "f1-stream-data-proxmox"
    namespace = kubernetes_namespace.f1-stream.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -97,6 +98,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_deployment" "f1-stream" {
@ -195,8 +203,12 @@ resource "kubernetes_deployment" "f1-stream" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -237,11 +249,12 @@ module "tls_secret" {
 # (which load before any user has a chance to solve PoW), CHALLENGE
 # everything else — the HTML pages.
 module "anubis" {
-  source     = "../../modules/kubernetes/anubis_instance"
-  name       = "f1"
-  namespace  = kubernetes_namespace.f1-stream.metadata[0].name
-  target_url = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
-  policy_yaml = <<-EOT
+  source           = "../../modules/kubernetes/anubis_instance"
+  name             = "f1"
+  namespace        = kubernetes_namespace.f1-stream.metadata[0].name
+  target_url       = "http://${kubernetes_service.f1-stream.metadata[0].name}.${kubernetes_namespace.f1-stream.metadata[0].name}.svc.cluster.local"
+  shared_store_url = "redis://redis-master.redis.svc.cluster.local:6379/6"
+  policy_yaml      = <<-EOT
    bots:
      - import: (data)/bots/_deny-pathological.yaml
      - import: (data)/bots/aggressive-brazilian-scrapers.yaml
@ -262,6 +275,11 @@ module "anubis" {
      - name: f1-data-routes
        path_regex: ^/(embed|embed-asset|extract|extractors|health|proxy|relay|schedule|streams)(/|\?|$)
        action: ALLOW
+      # Allow non-GET methods unconditionally — AI scrapers GET the body,
+      # they don't POST. Mutating XHRs and CORS preflight need to bypass.
+      - name: allow-non-get-methods
+        action: ALLOW
+        expression: method != "GET"
      - name: catchall-challenge
        path_regex: .*
        action: CHALLENGE
@ -270,6 +288,7 @@ module "anubis" {

 module "ingress" {
  source            = "../../modules/kubernetes/ingress_factory"
+  auth              = "none" # Anubis-fronted; PoW challenge gates bots, no Authentik
  dns_type          = "non-proxied"
  namespace         = kubernetes_namespace.f1-stream.metadata[0].name
  name              = "f1"
@ -288,3 +307,14 @@ module "ingress" {
    "gethomepage.dev/pod-selector" = ""
  }
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
+
+# CI retrigger v3 2026-05-16T14:06:39Z
+
+# CI retrigger v4 2026-05-16T14:13:59Z
+
+# CI retrigger v5 2026-05-16T23:10:38Z
+
+# CI retrigger v6 2026-05-16T23:18:58Z
--- a/stacks/fire-planner/main.tf
+++ b/stacks/fire-planner/main.tf
@ -33,6 +33,8 @@ resource "kubernetes_namespace" "fire_planner" {
      # for headless verification (NetworkPolicy in chrome-service ns admits
      # any namespace carrying this label).
      "chrome-service.viktorbarzin.me/client" = "true"
+      # Opt into Keel auto-update (inject-keel-annotations ClusterPolicy).
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -230,9 +232,10 @@ resource "kubernetes_deployment" "fire_planner" {
        }

        init_container {
-          name    = "alembic-migrate"
-          image   = local.image
-          command = ["python", "-m", "fire_planner", "migrate"]
+          name              = "alembic-migrate"
+          image             = local.image
+          image_pull_policy = "Always"
+          command           = ["python", "-m", "fire_planner", "migrate"]

          env_from {
            secret_ref {
@ -310,7 +313,12 @@ resource "kubernetes_deployment" "fire_planner" {
  }

  lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }

  depends_on = [
@ -420,6 +428,77 @@ resource "kubernetes_cron_job_v1" "fire_planner_recompute" {
  ]
 }

+# Weekly refresh of the COL cache: walks col_snapshot for rows
+# expiring within 7 days, re-scrapes Numbeo + Expatistan, upserts. With
+# the user-chosen 1-year TTL, a healthy cache has 0 stale rows on most
+# Sundays — the job is a no-op until rows age out. Schedule Sunday 04:00
+# UTC so Numbeo's contributor activity (mostly weekday) doesn't race
+# our reads.
+resource "kubernetes_cron_job_v1" "fire_planner_col_refresh" {
+  metadata {
+    name      = "fire-planner-col-refresh"
+    namespace = kubernetes_namespace.fire_planner.metadata[0].name
+  }
+  spec {
+    schedule                      = "0 4 * * 0"
+    concurrency_policy            = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit     = 5
+    starting_deadline_seconds     = 600
+
+    job_template {
+      metadata {
+        labels = local.labels
+      }
+      spec {
+        backoff_limit              = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = local.labels
+          }
+          spec {
+            restart_policy = "OnFailure"
+            image_pull_secrets {
+              name = "registry-credentials"
+            }
+            container {
+              name    = "col-refresh"
+              image   = local.image
+              command = ["python", "-m", "fire_planner", "col-refresh-stale", "--within-days", "7"]
+
+              env_from {
+                secret_ref {
+                  name = "fire-planner-db-creds"
+                }
+              }
+
+              resources {
+                requests = {
+                  cpu    = "100m"
+                  memory = "256Mi"
+                }
+                limits = {
+                  memory = "512Mi"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
+  }
+
+  depends_on = [
+    kubernetes_manifest.db_external_secret,
+  ]
+}
+
 # Public ingress at fire-planner.viktorbarzin.me. Authentik-protected
 # (forward-auth at the Traefik layer); Cloudflare-proxied for CDN +
 # DDoS shielding. Backend FastAPI serves the SPA at / and the API
@ -431,7 +510,7 @@ module "ingress" {
  name            = "fire-planner"
  port            = 8080
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"     = "true"
    "gethomepage.dev/name"        = "FIRE Planner"
@ -443,11 +522,14 @@ module "ingress" {

 # Second ingress at the same host for the /api/ prefix WITHOUT Authentik
 # forward-auth. The SPA loads under Authentik (main ingress at /), then its
-# fetch() XHRs hit /api/* directly — forward-auth on /api/* would 302 the
-# XHR to a cross-origin Authentik login page, which fetch().json() can't
-# parse. App-layer bearer auth still gates writes (POST/PATCH/DELETE on
-# scenarios, /recompute, /simulate); read endpoints are open. Acceptable
-# for a personal tool whose only data is anonymous numeric projections.
+# fetch() XHRs hit /api/* directly — ANY forward-auth here (required OR
+# public-tier auto-bind) would 302 the XHR to a cross-origin Authentik
+# login page, which fetch() rejects under CORS preflight rules. Even the
+# `auth = "public"` flow needs a 302+cookie dance on first visit to set
+# the guest session cookie, so it doesn't help XHR APIs. App-layer bearer
+# auth still gates writes (POST/PATCH/DELETE on scenarios, /recompute,
+# /simulate); read endpoints are open. Acceptable for a personal tool
+# whose only data is anonymous numeric projections.
 module "ingress_api" {
  source          = "../../modules/kubernetes/ingress_factory"
  dns_type        = "none"
@ -458,7 +540,8 @@ module "ingress_api" {
  port            = 8080
  ingress_path    = ["/api/"]
  tls_secret_name = var.tls_secret_name
-  protected       = false
+  # auth = "none": XHR-based API endpoints; forward-auth 302+cookie-dance breaks CORS preflight and browser fetch().
+  auth            = "none"
 }

 # Plan-time read of the ESO-created K8s Secret for Grafana datasource
@ -514,3 +597,6 @@ resource "kubernetes_config_map" "grafana_fire_planner_datasource" {
    })
  }
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
--- a/stacks/foolery/main.tf
+++ b/stacks/foolery/main.tf
@ -9,6 +9,7 @@ resource "kubernetes_namespace" "foolery" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -65,7 +66,7 @@ module "ingress" {
  namespace       = kubernetes_namespace.foolery.metadata[0].name
  name            = "foolery"
  tls_secret_name = var.tls_secret_name
-  protected       = true
+  auth            = "required"
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Foolery"
--- a/stacks/forgejo/main.tf
+++ b/stacks/forgejo/main.tf
@ -10,6 +10,7 @@ resource "kubernetes_namespace" "forgejo" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.edge
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -30,7 +31,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
    name      = "forgejo-data-encrypted"
    namespace = kubernetes_namespace.forgejo.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "50%"
      "resize.topolvm.io/storage_limit" = "50Gi"
    }
@ -140,6 +141,16 @@ resource "kubernetes_deployment" "forgejo" {
            name  = "FORGEJO__packages__ENABLED"
            value = "true"
          }
+          # Disable source archive ZIP/TAR generation. Bots crawling
+          # /<owner>/<repo>/archive/<sha>.zip on dot_files (and similar
+          # vim-plugin trees) caused 9.9s 500s and chewed ~440m sustained
+          # CPU. Git clone / OCI registry / API are unaffected — only
+          # /archive/* URLs return 404 now. Toggle back to "false" if a
+          # legitimate consumer needs source ZIPs.
+          env {
+            name  = "FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES"
+            value = "true"
+          }
          volume_mount {
            name       = "data"
            mount_path = "/data"
@ -169,8 +180,13 @@ resource "kubernetes_deployment" "forgejo" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      spec[0].template[0].spec[0].container[0].image, # KEEL_IGNORE_IMAGE — Keel manages tag updates
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -194,7 +210,12 @@ resource "kubernetes_service" "forgejo" {
  }
 }
 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # Git + OCI registry (/v2/) — native clients (git, docker/podman) use HTTP
+  # basic-auth / bearer tokens, NOT browser sessions. Forward-auth would 302
+  # them into a redirect they can't follow.
+  # auth = "none": Git + OCI registry clients use HTTP Basic auth / bearer tokens; native CLI tools cannot follow forward-auth redirects.
+  auth            = "none"
  dns_type        = "non-proxied"
  namespace       = kubernetes_namespace.forgejo.metadata[0].name
  name            = "forgejo"
--- a/stacks/freedify/factory/main.tf
+++ b/stacks/freedify/factory/main.tf
@ -225,7 +225,7 @@ module "ingress" {
  name              = "music-${var.name}"
  tls_secret_name   = var.tls_secret_name
  dns_type          = "non-proxied"
-  protected         = var.protected
+  auth              = var.protected ? "required" : "none"
  extra_annotations = var.extra_annotations
 }

@ -235,9 +235,9 @@ resource "kubernetes_ingress_v1" "stream-noauth" {
    name      = "music-${var.name}-stream"
    namespace = "freedify"
    annotations = {
-      "traefik.ingress.kubernetes.io/router.middlewares"  = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
-      "traefik.ingress.kubernetes.io/router.entrypoints"  = "websecure"
-      "traefik.ingress.kubernetes.io/router.priority"     = "100"
+      "traefik.ingress.kubernetes.io/router.middlewares" = "traefik-retry@kubernetescrd,traefik-rate-limit@kubernetescrd"
+      "traefik.ingress.kubernetes.io/router.entrypoints" = "websecure"
+      "traefik.ingress.kubernetes.io/router.priority"    = "100"
    }
  }
  spec {
--- a/stacks/freedify/main.tf
+++ b/stacks/freedify/main.tf
@ -55,6 +55,7 @@ resource "kubernetes_namespace" "freedify" {
    labels = {
      "istio-injection" : "disabled"
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -98,14 +99,14 @@ module "viktor" {

 # https://music-emo.viktorbarzin.me/
 module "emo" {
-  source          = "./factory"
-  name            = "emo"
-  tag             = "latest"
-  tls_secret_name = var.tls_secret_name
-  depends_on      = [kubernetes_namespace.freedify]
-  tier            = local.tiers.aux
-  protected       = true
-  genius_token    = lookup(local.credentials["emo"], "genius_token", null)
+  source             = "./factory"
+  name               = "emo"
+  tag                = "latest"
+  tls_secret_name    = var.tls_secret_name
+  depends_on         = [kubernetes_namespace.freedify]
+  tier               = local.tiers.aux
+  protected          = true
+  genius_token       = lookup(local.credentials["emo"], "genius_token", null)
  gemini_api_key     = lookup(local.credentials["emo"], "gemini_api_key", null)
  navidrome_scan_url = data.kubernetes_secret.eso_secrets.data["navidrome_scan_url"]
  ha_sofia_url       = lookup(data.kubernetes_secret.eso_secrets.data, "ha_sofia_url", "")
--- a/stacks/freshrss/.terraform.lock.hcl
+++ b/stacks/freshrss/.terraform.lock.hcl
@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
  ]
 }

+provider "registry.terraform.io/goauthentik/authentik" {
+  version     = "2024.12.1"
+  constraints = "~> 2024.10"
+  hashes = [
+    "h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
+  ]
+}
+
 provider "registry.terraform.io/hashicorp/helm" {
  version = "3.1.1"
  hashes = [
--- a/stacks/freshrss/main.tf
+++ b/stacks/freshrss/main.tf
@ -8,6 +8,7 @@ resource "kubernetes_namespace" "immich" {
    name = "freshrss"
    labels = {
      tier = local.tiers.aux
+      "keel.sh/enrolled" = "true"
    }
  }
  lifecycle {
@ -67,7 +68,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
    name      = "freshrss-data-proxmox"
    namespace = kubernetes_namespace.immich.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -81,6 +82,13 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }

 resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
@ -89,7 +97,7 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
    name      = "freshrss-extensions-proxmox"
    namespace = kubernetes_namespace.immich.metadata[0].name
    annotations = {
-      "resize.topolvm.io/threshold"     = "80%"
+      "resize.topolvm.io/threshold"     = "10%"
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
@ -103,6 +111,13 @@ resource "kubernetes_persistent_volume_claim" "extensions_proxmox" {
      }
    }
  }
+  lifecycle {
+    # The autoresizer expands requests.storage up to storage_limit and
+    # PVCs can't shrink. Without this, every TF apply tries to revert
+    # to the spec value, K8s rejects the shrink, and the PVC ends up
+    # in Terminating-but-in-use limbo.
+    ignore_changes = [spec[0].resources[0].requests]
+  }
 }


@ -189,8 +204,12 @@ resource "kubernetes_deployment" "freshrss" {
    }
  }
  lifecycle {
-    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      metadata[0].annotations["keel.sh/policy"],
+      metadata[0].annotations["keel.sh/trigger"],
+      metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
+    ]
  }
 }

@ -214,7 +233,11 @@ resource "kubernetes_service" "freshrss" {
  }
 }
 module "ingress" {
-  source          = "../../modules/kubernetes/ingress_factory"
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "app": FreshRSS has built-in user login and exposes Fever +
+  # GReader APIs (/api/fever.php, /api/greader.php) used by mobile RSS
+  # readers like Reeder/FeedMe. Authentik forward-auth was 302-ing those.
+  auth            = "app"
  dns_type        = "proxied"
  namespace       = "freshrss"
  name            = "rss"
@ -233,3 +256,6 @@ module "ingress" {
    "gethomepage.dev/widget.password" = local.homepage_credentials["freshrss"]["password"]
  }
 }
+
+# CI retrigger 2026-05-16T13:42:57+00:00 — bulk enrollment apply (pipeline #689 killed)
+# CI retrigger v2 2026-05-16T13:46:35+00:00
--- a/stacks/freshrss/providers.tf
+++ b/stacks/freshrss/providers.tf
@ -9,6 +9,10 @@ terraform {
      source  = "cloudflare/cloudflare"
      version = "~> 4"
    }
+    authentik = {
+      source  = "goauthentik/authentik"
+      version = "~> 2024.10"
+    }
  }
 }

--- a/Show more
+++ b/Show more