infra/stacks
Viktor Barzin 448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)
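
A minimal sketch of the pinning rule in the chain above, assuming it
reduces to "a drain Job never runs on the node it drains". `runner_for`
is a hypothetical helper name, not taken from scripts/upgrade-step.sh:

```shell
# Hypothetical helper: print the node a drain Job should be pinned to.
# Assumption: k8s-node1 hosts every drain Job except its own, which is
# pinned to k8s-master (and needs the control-plane toleration).
runner_for() {
  local target="$1"
  if [ "$target" = "k8s-node1" ]; then
    printf 'k8s-master\n'
  else
    printf 'k8s-node1\n'
  fi
}
```

For example, `runner_for k8s-node4` prints `k8s-node1`, and
`runner_for k8s-node1` prints `k8s-master`.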

Each Job runs scripts/upgrade-step.sh, which dispatches on $PHASE, and
ends by rendering job-template.yaml with envsubst into the next Job.
Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make
`kubectl apply` idempotent: a failed Job can be re-created without
duplicating downstream Jobs.
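
A rough sketch of the hand-off described above, not the actual contents
of scripts/upgrade-step.sh. The function names and the template
variables (JOB_NAME, PHASE, TARGET_VERSION, TARGET_NODE) are assumptions
made for illustration:

```shell
# Hypothetical helper: build the deterministic Job name
# k8s-upgrade-<phase>-<target_version>[-<node>].
job_name() {
  local phase="$1" version="$2" node="${3:-}"
  local name="k8s-upgrade-${phase}-${version}"
  if [ -n "$node" ]; then
    name="${name}-${node}"
  fi
  printf '%s\n' "$name"
}

# Hypothetical helper: render job-template.yaml for the next phase and
# apply it. Only the listed variables are substituted by envsubst.
submit_next_job() {
  local phase="$1" version="$2" node="${3:-}"
  JOB_NAME="$(job_name "$phase" "$version" "$node")" \
  PHASE="$phase" TARGET_VERSION="$version" TARGET_NODE="$node" \
    envsubst '${JOB_NAME} ${PHASE} ${TARGET_VERSION} ${TARGET_NODE}' \
      < job-template.yaml | kubectl apply -f -
}
```

Because the name depends only on phase, version and node, re-running
the same step targets the same Job object rather than creating a
second one.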

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).
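
One way the `predrain_unstick` idea could look, as a sketch only: the
real function may differ. Assumptions: PDB selectors use matchLabels
only (matchExpressions are ignored here), and `blocked_pdbs` and
`predrain_unstick_sketch` are hypothetical names:

```shell
# Reads "<namespace> <pdb> <disruptionsAllowed>" lines on stdin and
# prints "<namespace> <pdb>" for PDBs that currently allow 0 disruptions.
blocked_pdbs() {
  local ns name allowed
  while read -r ns name allowed; do
    if [ "$allowed" = "0" ]; then
      printf '%s %s\n' "$ns" "$name"
    fi
  done
}

# Deletes pods on the given node that are covered by a blocked PDB.
predrain_unstick_sketch() {
  local node="$1" ns name selector
  kubectl get pdb --all-namespaces \
    -o go-template='{{range .items}}{{.metadata.namespace}} {{.metadata.name}} {{.status.disruptionsAllowed}}{{"\n"}}{{end}}' |
    blocked_pdbs |
    while read -r ns name; do
      # Turn matchLabels into a "k=v,k=v" label selector.
      selector="$(kubectl get pdb "$name" -n "$ns" \
        -o go-template='{{range $k, $v := .spec.selector.matchLabels}}{{$k}}={{$v}},{{end}}' </dev/null)"
      selector="${selector%,}"
      # No matchLabels to build a -l selector from; skip in this sketch.
      [ -n "$selector" ] || continue
      kubectl delete pod -n "$ns" -l "$selector" \
        --field-selector "spec.nodeName=${node}" --wait=false </dev/null
    done
}
```

Deleting the pod directly bypasses the eviction API, which is what lets
a single-replica workload move off the node despite its PDB.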

Adds K8sUpgradeStalled alert (fires when an upgrade is in_flight and its
started_timestamp is more than 90 min old). Deprecates the agent prompt
(renamed to *.deprecated.md with a header pointer to the new code).

Apply order: k8s-version-upgrade first (creates the new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
_template ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
actualbudget infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
affine infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
authentik infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
beads-server infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
blog ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
broker-sync fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
calico [infra] Partial Calico adoption: namespaces only (Wave 5b) 2026-04-18 22:52:56 +00:00
changedetection fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
chrome-service fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
city-guesser ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
claude-agent-service claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent 2026-05-22 14:16:45 +00:00
claude-memory infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
cloudflared cloudflare: disable AI bot edge-block so x402 can issue payment offers 2026-05-22 14:16:42 +00:00
cnpg [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
coturn [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
crowdsec ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
cyberchef ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
dashy ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
dawarich infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
dbaas dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts 2026-05-22 14:16:44 +00:00
descheduler [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
diun fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
ebook2audiobook ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
ebooks infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
echo state(dbaas): update encrypted state 2026-05-22 14:16:43 +00:00
excalidraw fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
external-secrets [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
f1-stream fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
fire-planner infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
foolery ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
forgejo infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
freedify ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
freshrss infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
frigate infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
grampsweb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
hackmd fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
headscale infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
health fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
hermes-agent fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
homepage ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
immich infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
infra [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
infra-maintenance [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
insta2spotify infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
instagram-poster infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
isponsorblocktv fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
job-hunter grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards 2026-05-10 11:12:39 +00:00
jsoncrack ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
k8s-dashboard ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
k8s-portal infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
k8s-version-upgrade k8s-version-upgrade: decompose into Job chain to fix self-preemption 2026-05-22 14:16:45 +00:00
kms ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
kured kured(sentinel-gate): fix auth + write-perm so safety checks actually run 2026-05-22 14:16:41 +00:00
kyverno [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
linkwarden infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
llama-cpp infra/llama-cpp: benchmark report + -fa flag fix 2026-05-22 14:16:41 +00:00
local-path [infra] Adopt local-path-provisioner into Terraform (Wave 5c) 2026-04-18 22:39:55 +00:00
mailserver fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
matrix infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
meshcentral fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
metallb [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
metrics-server [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
monitoring k8s-version-upgrade: decompose into Job chain to fix self-preemption 2026-05-22 14:16:45 +00:00
n8n infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
navidrome infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
netbox ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
networking-toolbox ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
nextcloud infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
nfs-csi [infra] TrueNAS decommission — remove active references from Terraform + configs 2026-04-19 16:57:05 +00:00
nodelocal-dns [dns] NodeLocal DNSCache — deploy DaemonSet to all nodes (WS C) 2026-04-19 15:46:41 +00:00
novelapp infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
ntfy infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
nvidia infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
onlyoffice infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
openclaw fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
osm_routing [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
owntracks infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
paperless-ngx Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
payslip-ingest grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards 2026-05-10 11:12:39 +00:00
phpipam ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
platform [infra] Add Cloudflare provider to all stack lock files and generated providers 2026-04-16 16:31:36 +00:00
plotting-book fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
poison-fountain ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
postiz infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
priority-pass fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
privatebin infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
proxmox-csi proxmox-csi: opt SCs into pvc-autoresizer (resize.topolvm.io/enabled=true) 2026-05-22 14:16:41 +00:00
pvc-autoresizer [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
rbac [infra] Migrate Terraform state from local SOPS to PostgreSQL backend 2026-04-16 19:33:12 +00:00
real-estate-crawler real-estate-crawler: populate SCRAPE_SCHEDULES (daily RENT + weekly BUY, London 1-2 bed) 2026-05-22 14:16:44 +00:00
redis fix: pvc-autoresizer threshold should be 10%, not 80% 2026-05-22 14:16:43 +00:00
reloader [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
resume Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:44 +00:00
reverse-proxy chore: remove decommissioned registry.viktorbarzin.me ingress 2026-05-10 11:12:37 +00:00
rybbit infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
sealed-secrets [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
send infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
servarr fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
shadowsocks [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] 2026-04-18 21:19:48 +00:00
speedtest fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
status-page [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip] 2026-04-18 14:15:51 +00:00
stirling-pdf fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
tandoor infra/ingress_factory: add auth = "app" mode for self-authed backends 2026-05-22 14:16:44 +00:00
technitium fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
terminal ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
tor-proxy fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
trading-bot ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
traefik ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
travel_blog ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
tuya-bridge infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
uptime-kuma fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
url ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
vault vault: move audit-PVC autoresizer annotations to kubernetes_annotations 2026-05-22 14:16:44 +00:00
vaultwarden infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
vpa ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00
wealthfolio wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs 2026-05-22 14:16:45 +00:00
webhook_handler infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
whisper fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes 2026-05-22 14:16:43 +00:00
wireguard [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] 2026-04-18 21:15:27 +00:00
woodpecker infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
xray infra: document auth = "app|none" tier on every legacy ingress 2026-05-22 14:16:44 +00:00
ytdlp ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks 2026-05-22 14:16:42 +00:00