Commit graph

10 commits

Author SHA1 Message Date
Viktor Barzin
467460cccd k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
02ea5da8dc k8s-version-upgrade: skip phase_master/phase_worker if node already on target
The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.

Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.

Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.

In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:53:57 +00:00
Viktor Barzin
ad9f6c8f41 k8s-version-upgrade: halt_on_alert allowlist (severity=critical only)
Refactored halt_on_alert_query from denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
  - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
  - IngressTTFBHigh (Traefik latency, transient)
  - NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
  - RecentNodeReboot (chain causes this itself)

severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).

Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 09:14:39 +00:00
Viktor Barzin
4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore
Three changes unblocking the autonomous chain for k8s patch upgrades:

1. **phase_master quiesces tigera-operator before drain, restores after.**
   Tigera crashes immediately if apiserver is unreachable (no retry logic)
   and crashlooping it during master static-pod swaps generates ~500MB/s
   disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
   limit. Quiesce removes the storm contributor; calico data plane keeps
   running unchanged (data plane is the DaemonSet+Typha, operator is just
   the reconciler).

2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
   For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
   writes an identical manifest, hash doesn't update, watch times out and
   rolls back forever. The skip-etcd retry sidesteps it for the legitimate
   no-change case while still doing a full etcd upgrade on the first
   attempt (correct for minor-version bumps).

3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
   Both are symptoms-not-causes: ingress latency spikes briefly during any
   pod-restart wave; high IOwait is exactly what upgrade activity causes
   (chicken-and-egg). The inline quiet-baseline check (Ready transition
   <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:40:11 +00:00
Viktor Barzin
fc0510aa67 k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window
Three changes from today's autonomous-pipeline validation session:

1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
   ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
   start of version-check. Existence halts the chain (exit 0) with a Slack
   message. Single-command emergency stop:
       kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
           --from-literal=reason="storm response"
   Resume:  kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
   Role rule for `configmaps` get/list/watch added (resourceName-scoped).

2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
   chain itself causes reboots. The pre-drain master check, post-upgrade
   worker check, postflight check, and preflight halt-on-alert all now
   pass `RecentNodeReboot` as the extra-ignore. Previously only worker
   phase's post-upgrade gate did this. Master Failed silently this morning
   on the pre-drain check after my own master reboot.

3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
   Ready transition meant the chain refused to run for an hour after
   every kured reboot. 10 min is enough for kubelet/control-plane to
   settle; the 24h-between-cluster-reboots invariant lives in
   kured-sentinel-gate, not here.

Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 09:23:41 +00:00
Viktor Barzin
0c8b46df55 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` returns 1 when all pods are Running
    (the happy path), aborting the chain right after master upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` returns 1 when all nodes
    are on the target version (the happy path, exactly when postflight
    should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Wrapping each grep in `{ ... || true; }` so a no-matches result
returns success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 20:59:10 +00:00
Viktor Barzin
10b261d2db k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:19:06 +00:00
Viktor Barzin
8a6ec72039 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 22:22:01 +00:00
Viktor Barzin
4f2959866d k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst
Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run ran post-26.04 upgrade today:

1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
   svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
   Unqualified `k8s-master` falls through all of those and then queries
   upstream Technitium for the bare name → NXDOMAIN. The FQDN
   `k8s-master.viktorbarzin.lan` is what Technitium actually serves.
   Suffix every node SSH target with `$NODE_DOMAIN`.

2. **envsubst missing**: claude-agent-service image doesn't ship
   `gettext-base`. Replace `envsubst <template | apply` with
   `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
   sys.stdin.read()))' <template | apply`. Same semantics, image
   already has python3. Multi-line $SCHEDULING_BLOCK is preserved
   correctly through expandvars.

Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and
started; killed before it touched the cluster (will land on Sunday
2026-05-24 12:00 UTC like the schedule says).

Root cause of why these bugs lay dormant: yesterday's first
manual-test detection found "no upgrade needed" so neither code path
exercised SSH or envsubst. Today's apt-source restore (do-release-
upgrade had mangled them) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 21:10:58 +00:00
Viktor Barzin
01bc16d592 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 23:54:22 +00:00