infra

Viktor Barzin 448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption

The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
  → master (k8s-node1) drains k8s-master
  → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
  → worker (k8s-master + control-plane toleration) drains k8s-node1
  → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent: a failed Job can be re-created without duplicating downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance; discovered the hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-22 14:16:45 +00:00
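
The Job chain and naming scheme described in the commit message above can be sketched roughly as follows. This is only an illustration, not the repository's code: the real logic lives in scripts/upgrade-step.sh and job-template.yaml. The `Step` type and the `job_name` and `chain` helpers are hypothetical, and it is an assumption that only worker Jobs carry the optional `-<node>` suffix and that the version string is used verbatim in the name.

```python
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    phase: str
    runs_on: Optional[str]  # nodeSelector host; None means no pinning
    drains: Optional[str]   # node drained by this step, if any


def job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build k8s-upgrade-<phase>-<target_version>[-<node>] (pattern from the commit)."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


def chain(target_version: str) -> List[Tuple[str, Step]]:
    """Return (job name, step) pairs in the order given in the commit message."""
    steps = [
        Step("preflight", "k8s-node1", None),
        Step("master", "k8s-node1", "k8s-master"),
    ]
    # Three workers are drained from k8s-node1 ...
    steps += [Step("worker", "k8s-node1", n)
              for n in ("k8s-node4", "k8s-node3", "k8s-node2")]
    # ... and k8s-node1 itself is drained from the control-plane node.
    steps.append(Step("worker", "k8s-master", "k8s-node1"))
    steps.append(Step("postflight", None, None))
    # Assumption: only worker Jobs include the node suffix.
    return [
        (job_name(s.phase, target_version, s.drains if s.phase == "worker" else None), s)
        for s in steps
    ]


if __name__ == "__main__":
    for name, step in chain("v1.34.7"):
        print(f"{name}: runs on {step.runs_on}, drains {step.drains}")
```

The property that matters for the self-preemption fix is that no step is pinned to the node it drains, and because each name is a pure function of phase, version, and node, re-applying a failed step yields the same Job name rather than a duplicate.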
agent-task-tracking.md	Add agent task tracking documentation	2026-04-15 17:11:26 +00:00
authentication.md	docs/auth: sync to current `auth` enum (required/app/public/none)	2026-05-22 14:16:44 +00:00
automated-upgrades.md	k8s-version-upgrade: decompose into Job chain to fix self-preemption	2026-05-22 14:16:45 +00:00
backup-dr.md	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile	2026-05-10 11:12:39 +00:00
chrome-service.md	chrome-service: open NP for Traefik → noVNC sidecar (port 6080)	2026-05-07 23:29:34 +00:00
ci-cd.md	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep	2026-05-07 23:29:34 +00:00
compute.md	infra/compute: bump k8s-node1 RAM 32 -> 48 GiB	2026-05-22 14:16:41 +00:00
databases.md	[redis] stabilise against node-crash flap cascade — RC1-RC5 fixes	2026-04-22 15:59:00 +00:00
dns.md	phpipam-pfsense-import: every 5min → hourly	2026-04-26 22:48:43 +00:00
homepage.md	add homepage auto-discovery documentation [ci skip]	2026-03-25 13:06:43 +02:00
incident-response.md	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP	2026-04-18 10:12:02 +00:00
llama-cpp.md	infra/llama-cpp: benchmark report + -fa flag fix	2026-05-22 14:16:41 +00:00
mailserver.md	monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m	2026-04-21 22:39:46 +00:00
monitoring.md	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep	2026-05-07 23:29:34 +00:00
multi-tenancy.md	add architecture documentation for all infrastructure subsystems [ci skip]	2026-03-24 00:55:25 +02:00
networking.md	kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure	2026-05-10 11:12:39 +00:00
overview.md	gpu: schedule off NFD label, not k8s-node1 hostname	2026-04-22 13:43:07 +00:00
secrets.md	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]	2026-04-06 13:21:05 +03:00
security.md	[docs] Update anti-AI and rybbit docs after rewrite-body removal	2026-04-17 21:43:13 +00:00
storage.md	mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB	2026-05-10 11:12:38 +00:00
vpn.md	[docs] TrueNAS decommission cleanup — remove references from active docs	2026-04-19 16:55:43 +00:00