infra

Author	SHA1	Message	Date
Viktor Barzin	b931d9fb20	k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap) All checks were successful ci/woodpecker/push/default Pipeline was successful Details phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around the master upgrade so it can't crashloop during the apiserver blip + I/O-storm kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore was a plain line at the end of the phase, so any abort AFTER quiescing left the operator at 0 — and the idempotent retry then skipped the already-on-target master phase and never restored it. Observed 2026-06-17: a post-upgrade gate aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane fine — calico-node keeps running — but no Calico reconciliation). Fix: - Drain first (drain doesn't blip the apiserver), THEN quiesce right before `kubeadm upgrade apply`, and install an EXIT trap that restores the operator no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success). Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared after the explicit happy-path restore. - postflight also force-restores replicas=1 as a final guarantee (covers the skip-on-retry path that never quiesces or restores). Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:25:54 +00:00
Viktor Barzin	ed53b34bf4	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:56:02 +00:00
Viktor Barzin	bfb86e653f	k8s-version-upgrade: ignore CoreDNS preflight on `kubeadm upgrade plan` too All checks were successful ci/woodpecker/push/default Pipeline was successful Details The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's kubeadm binary is on the target version (the first attempt's apt step already bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under set -euo pipefail and abort: - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the target version; the failing pipeline killed the whole preflight. - update_k8s.sh master path line 85 — bare `plan` before the apply. Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins. Verified read-only on master: plan exits 0 and still emits "kubeadm upgrade apply v1.34.9". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:49:06 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix All checks were successful ci/woodpecker/push/default Pipeline was successful Details The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	dfa1a12a86	k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall) The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing. On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the devvm) was firing when the daily detection ran; the preflight's "halt on any critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1). Two design gaps then turned that blip into a multi-day wedge: * the detection guard and spawn_next only checked whether the phase Job EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via ttlSecondsAfterFinished, so every daily run skipped re-spawning it; * the abort happens before the in-flight metric is pushed, so neither K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported "never ran" while actually being stuck. Fixes: D1 retry-on-failure: detection CronJob (main.tf) and spawn_next (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job instead of skipping it, so a transient gate self-corrects next cycle rather than wedging the pipeline for a week. D2 WebterminalTtydUnreachable critical -> warning: a devvm developer web-terminal is not cluster infrastructure and must not block upgrades. D3 observability: new K8sUpgradeChainJobFailed alert (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags a Failed chain Job as "chain failed" — closing the pre-in-flight blind spot so a wedge is visible immediately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:07:36 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	467460cccd	k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical because of NAS-side latency that is unrelated to k8s upgrades. The chain was halting indefinitely waiting for it to clear. Add it alongside RecentNodeReboot to the per-call ignore regex so the chain can proceed autonomously without manual silences. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	02ea5da8dc	k8s-version-upgrade: skip phase_master/phase_worker if node already on target The chain wasn't idempotent — re-running on a partially-upgraded cluster would re-drain + re-kubeadm + re-apt an already-upgraded node, causing unnecessary disruption (5-10 min per no-op node) and risking alert re-fires during the unnecessary drain. Today's chain hit this twice: after fixing the version-detection bug (commit `a0f3e155`), the chain correctly resumed but re-did master AND node4 even though both were already on v1.34.8. node4 got cordoned, drained, and is now soaking for 10 min for no reason. Fix: at the top of phase_master and phase_worker, read the node's current kubelet version. If it equals TARGET_VERSION, skip the whole phase (return 0 — spawn_next will fire downstream). Chain advances without disturbing the already-upgraded node. In-flight effect: the current node4 worker pod has the old script mounted from configmap snapshot, so it'll continue. If it fails and retries, the new pod will see node4 on v1.34.8 and short-circuit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:53:57 +00:00
Viktor Barzin	ad9f6c8f41	k8s-version-upgrade: halt_on_alert allowlist (severity=critical only) Refactored halt_on_alert_query from denylist ("ignore these noisy alerts") to an allowlist ("only halt on severity=critical"). Today's blocking alerts were all warning/info-level and not actual upgrade blockers: - PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing) - IngressTTFBHigh (Traefik latency, transient) - NodeHighIOWait (chicken-and-egg with our own upgrade I/O) - RecentNodeReboot (chain causes this itself) severity=critical filtering is more robust than maintaining a denylist of every noisy alert that crops up. extra_ignore parameter kept for backwards compatibility but is rarely needed now (critical alerts are the only ones that should actually halt the chain). Tested end-to-end this session — master successfully upgraded to v1.34.8 via the autonomous chain after the apiserver state-repair (apiserver manifest had been pinned at v1.34.2 from a previous month's rollback; required a one-time manual edit + kubelet reload to bring back to v1.34.7, after which the chain ran cleanly). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 09:14:39 +00:00
Viktor Barzin	4713c3a6d9	k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore Three changes unblocking the autonomous chain for k8s patch upgrades: 1. phase_master quiesces tigera-operator before drain, restores after. Tigera crashes immediately if apiserver is unreachable (no retry logic) and crashlooping it during master static-pod swaps generates ~500MB/s disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its limit. Quiesce removes the storm contributor; calico data plane keeps running unchanged (data plane is the DaemonSet+Typha, operator is just the reconciler). 2. update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt. For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm writes an identical manifest, hash doesn't update, watch times out and rolls back forever. The skip-etcd retry sidesteps it for the legitimate no-change case while still doing a full etcd upgrade on the first attempt (correct for minor-version bumps). 3. halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait. Both are symptoms-not-causes: ingress latency spikes briefly during any pod-restart wave; high IOwait is exactly what upgrade activity causes (chicken-and-egg). The inline quiet-baseline check (Ready transition <10min) is the real cluster-churn gate. RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale subresource so the chain can do the scale-to-0/back-to-1 on tigera. These three together get the chain past the cascade that's been blocking 1.34.7→1.34.8 for a week. Long-term fix is still HA control plane (beads code-n0ow); these are the bridge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 08:40:11 +00:00
Viktor Barzin	fc0510aa67	k8s-version-upgrade: kill-switch + ignore RecentNodeReboot + shorter quiet window Three changes from today's autonomous-pipeline validation session: 1. Kill-switch ConfigMap — chain checks for `k8s-upgrade-killswitch` ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the start of version-check. Existence halts the chain (exit 0) with a Slack message. Single-command emergency stop: kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \ --from-literal=reason="storm response" Resume: kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch Role rule for `configmaps` get/list/watch added (resourceName-scoped). 2. Ignore RecentNodeReboot in halt_on_alert_query everywhere — the chain itself causes reboots. The pre-drain master check, post-upgrade worker check, postflight check, and preflight halt-on-alert all now pass `RecentNodeReboot` as the extra-ignore. Previously only worker phase's post-upgrade gate did this. Master Failed silently this morning on the pre-drain check after my own master reboot. 3. Preflight quiet-baseline 3600s → 600s — the 1h cooldown after any Ready transition meant the chain refused to run for an hour after every kured reboot. 10 min is enough for kubelet/control-plane to settle; the 24h-between-cluster-reboots invariant lives in kured-sentinel-gate, not here. Validated by running the chain end-to-end: preflight passed in 5s, master phase now in drain. Today's storm post-mortem (snapshot CoW amplification + tigera-operator crashloop feedback loop) drove the kill-switch design. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 09:23:41 +00:00
Viktor Barzin	0c8b46df55	k8s-version-upgrade: fix two more grep-pipefail bugs Same `grep -v` / `set -o pipefail` interaction as commit `10b261d2`, in two more callsites the previous fix didn't cover: Line 354 (phase_master): control-plane Running check — `grep -v Running \| wc -l` returns 1 when all pods are Running (the happy path), aborting the chain right after master upgrades. Line 419 (phase_postflight): on-target node check — `grep -v ":v$TARGET_VERSION$" \| wc -l` returns 1 when all nodes are on the target version (the happy path, exactly when postflight should succeed). Aborts at the moment of victory. Forensics on yesterday's master Job failure (see commit message of `10b261d2` for context): the master Job spawned 16s after the previous fix's TF apply, before configmap propagation completed on the kubelet. With those two latent bugs also looming, the chain would have died post-master-upgrade and again at postflight even if propagation had been timely. Wrapping each grep in `{ ... \|\| true; }` so a no-matches result returns success. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-20 20:59:10 +00:00
Viktor Barzin	10b261d2db	k8s-version-upgrade: fix pipefail abort when no alerts are firing halt_on_alert_query() ends with `grep -vE "$regex" \| sort -u`. When zero alerts are firing (the desired healthy state), grep matches nothing and exits 1. Under `set -o pipefail`, the whole pipeline returns 1; under `set -e`, the caller's `alerts=$(...)` assignment fails and aborts the script in ~1s with no diagnostic output. The chain effectively required at least one non-meta alert to be firing to make any forward progress. Today (2026-05-19) the cluster is fully clean post-MySQL recovery, the daily 12:00 UTC detection spawned the preflight Job, and it died instantly — blocking the 1.34.7 → 1.34.8 patch chain. Fix: wrap the grep in `{ ... \|\| true; }` so a no-matches result returns success. Preflight verified end-to-end after the fix — the chain is now in flight (preflight ✓, master phase running). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 22:19:06 +00:00
Viktor Barzin	8a6ec72039	RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 22:22:01 +00:00
Viktor Barzin	4f2959866d	k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst Two latent bugs in the K8s-version-upgrade pipeline surfaced when a real detection run ran post-26.04 upgrade today: 1. DNS: pod's CoreDNS search path is `<ns>.svc.cluster.local svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation). Unqualified `k8s-master` falls through all of those and then queries upstream Technitium for the bare name → NXDOMAIN. The FQDN `k8s-master.viktorbarzin.lan` is what Technitium actually serves. Suffix every node SSH target with `$NODE_DOMAIN`. 2. envsubst missing: claude-agent-service image doesn't ship `gettext-base`. Replace `envsubst <template \| apply` with `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars( sys.stdin.read()))' <template \| apply`. Same semantics, image already has python3. Multi-line $SCHEDULING_BLOCK is preserved correctly through expandvars. Verified by manually triggering `k8s-version-check` post-fix: detection now reads `Latest patch: v1.34.8` (currently running 1.34.7) and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and started; killed before it touched the cluster (will land on Sunday 2026-05-24 12:00 UTC like the schedule says). Root cause of why these bugs lay dormant: yesterday's first manual-test detection found "no upgrade needed" so neither code path exercised SSH or envsubst. Today's apt-source restore (do-release- upgrade had mangled them) unmasked the v1.34.8 candidate, which made detection finally proceed past the SSH step. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 21:10:58 +00:00
Viktor Barzin	01bc16d592	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 23:54:22 +00:00

17 commits