infra/.claude/agents
Viktor Barzin 01bc16d592 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and evicted itself when it tried to drain its own host
(k8s-node4 on 2026-05-11). The cluster was left half-upgraded (master
v1.34.7, workers v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)
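
The chain above can be sketched as ordered data. This is a minimal
illustration, not the repository's code: the "phase|pinned-to|target"
record format and the names CHAIN, next_step and self_drain_records are
assumptions made for the example. The property it checks is the one that
fixes the original failure: no Job is pinned to the node it drains.

```shell
# Hypothetical sketch: one "phase|pinned-to|drain-target" record per Job,
# in execution order. Empty fields mean "no pinning" / "no drain".
CHAIN='preflight|k8s-node1|
master|k8s-node1|k8s-master
worker|k8s-node1|k8s-node4
worker|k8s-node1|k8s-node3
worker|k8s-node1|k8s-node2
worker|k8s-master|k8s-node1
postflight||'

# Print the record that follows record $1.
# Prints nothing and returns 1 if $1 is the last record or is unknown.
next_step() {
  printf '%s\n' "$CHAIN" | awk -v cur="$1" '
    found     { print; ok = 1; exit }
    $0 == cur { found = 1 }
    END       { exit (ok ? 0 : 1) }'
}

# Print every record whose Job would drain the node it runs on.
# An empty result means the chain cannot evict itself.
self_drain_records() {
  printf '%s\n' "$CHAIN" | awk -F'|' '$2 != "" && $2 == $3'
}
```

For example, `next_step 'worker|k8s-node1|k8s-node2'` prints the record
for the final worker Job, which runs on k8s-master and drains k8s-node1.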

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent: a failed Job can be re-created without duplicating the
downstream Jobs.
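
A minimal sketch of the naming and hand-off described above, under stated
assumptions: the function names job_name and spawn_next are hypothetical,
and so are the template variables JOB_NAME, PHASE, TARGET_VERSION and
NODE. Only job-template.yaml, envsubst, `kubectl apply` and the name
pattern come from this commit.

```shell
# Build k8s-upgrade-<phase>-<target_version>[-<node>].
# $1: phase, $2: target version, $3: node (optional).
job_name() {
  if [ -n "${3:-}" ]; then
    printf 'k8s-upgrade-%s-%s-%s\n' "$1" "$2" "$3"
  else
    printf 'k8s-upgrade-%s-%s\n' "$1" "$2"
  fi
}

# Render job-template.yaml for the next step and apply it.
# The variable names substituted into the template are assumptions.
spawn_next() {
  JOB_NAME=$(job_name "$1" "$2" "${3:-}") PHASE=$1 TARGET_VERSION=$2 NODE=${3:-} \
    envsubst '$JOB_NAME $PHASE $TARGET_VERSION $NODE' < job-template.yaml |
    kubectl apply -f -
}
```

Because the name depends only on phase, version and node, re-running a
step yields the same Job name, so applying it again should address the
existing downstream Job rather than create a second one.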

Also lands `predrain_unstick`, which runs before the drain and deletes
pods on the target node whose PDB has 0 disruptionsAllowed. Drain goes
through the eviction API, which such a PDB refuses, so without this step
drain loops indefinitely on single-replica deployments (e.g. every Anubis
instance, discovered the hard way during the 2026-05-11 manual recovery
of k8s-node3).
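
One way the `predrain_unstick` idea could look, as a hedged sketch rather
than the code in scripts/upgrade-step.sh. It assumes PDB selectors use
matchLabels only (matchExpressions are ignored) and skips PDBs with an
empty selector so it never deletes every pod in a namespace.

```shell
# Sketch only. $1: node that is about to be drained.
predrain_unstick() {
  node=$1
  # One "namespace name" line per PDB that currently allows 0 disruptions.
  kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
  while read -r ns pdb; do
    [ -n "$pdb" ] || continue
    # Flatten matchLabels into "k=v,k=v," and strip the trailing comma.
    sel=$(kubectl get pdb "$pdb" -n "$ns" -o go-template='{{range $k, $v := .spec.selector.matchLabels}}{{$k}}={{$v}},{{end}}')
    sel=${sel%,}
    [ -n "$sel" ] || continue
    # A plain delete is not subject to the PDB, unlike an eviction.
    kubectl delete pod -n "$ns" -l "$sel" \
      --field-selector "spec.nodeName=$node" --wait=false
  done
}
```

The owning controller then recreates the deleted pods. For this to help,
the node presumably has to be cordoned first so the replacements are
scheduled elsewhere.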

Adds the K8sUpgradeStalled alert (fires when in_flight is set and
started_timestamp is more than 90 min old). Deprecates the agent prompt
(renamed to *.deprecated.md with a header pointer to the new code).
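
The K8sUpgradeStalled condition can be read as a simple predicate. The
sketch below is one interpretation written in shell for illustration. The
actual alert is loaded by the monitoring stack, and the function name and
argument layout here are assumptions.

```shell
# 90 minutes, expressed in seconds.
STALL_AFTER_SECONDS=$((90 * 60))

# Succeeds when an upgrade is in flight and started more than 90 min ago.
# $1: in_flight (1 while an upgrade is running, else 0)
# $2: start time, epoch seconds
# $3: current time, epoch seconds
upgrade_stalled() {
  [ "$1" -eq 1 ] && [ $(( $3 - $2 )) -gt "$STALL_AFTER_SECONDS" ]
}
```

For instance, `upgrade_stalled 1 "$started" "$(date +%s)"` would succeed
only once the run has been going for more than 5400 seconds.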

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 23:54:22 +00:00
issue-responder.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
k8s-version-upgrade.deprecated.md k8s-version-upgrade: decompose into Job chain to fix self-preemption 2026-05-11 23:54:22 +00:00
payslip-extractor.md [payslip-ingest] Update extractor agent + dashboard for v2 regex parser 2026-04-19 10:54:33 +00:00
post-mortem.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
postmortem-todo-resolver.md feat: post-mortem automation pipeline 2026-04-14 15:34:42 +00:00
service-upgrade.md [service-upgrade] Drop vault-CLI assumptions + check default workflow only 2026-04-19 13:15:06 +00:00
sev-historian.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
sev-report-writer.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00
sev-triage.md feat: add incident management system with user reporting 2026-04-14 20:00:31 +00:00