Commit graph

17 commits

Author SHA1 Message Date
Viktor Barzin
94ca849379 k8s-version-upgrade: grant get/list on apps resources for drain
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
2026-05-22 14:17:00 +00:00
Viktor Barzin
8aff0ba1a2 k8s-version-upgrade: fix two more grep-pipefail bugs
Same `grep -v` / `set -o pipefail` interaction as commit 10b261d2,
in two more callsites the previous fix didn't cover:

  Line 354 (phase_master): control-plane Running check —
    `grep -v Running | wc -l` exits with status 1 when all pods are
    Running (the happy path), aborting the chain right after the master
    upgrades.

  Line 419 (phase_postflight): on-target node check —
    `grep -v ":v$TARGET_VERSION$" | wc -l` exits with status 1 when all
    nodes are on the target version (the happy path, exactly when
    postflight should succeed). Aborts at the moment of victory.

Forensics on yesterday's master Job failure (see commit message of
10b261d2 for context): the master Job spawned 16s after the previous
fix's TF apply, before configmap propagation completed on the kubelet.
With those two latent bugs also looming, the chain would have died
post-master-upgrade and again at postflight even if propagation had
been timely.

Fix: wrap each grep in `{ ... || true; }` so that a no-match result
returns success.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
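The counting case from 8aff0ba1a2 can be sketched as below. This is an illustration under assumptions, not the code at lines 354/419: it assumes bash, `count_not_matching` is a hypothetical helper name, and the sample pod lines are made up.

```shell
#!/usr/bin/env bash
set -o pipefail

# count_not_matching PATTERN (hypothetical helper): print how many stdin
# lines do NOT contain PATTERN. The `|| true` keeps grep's "nothing
# selected" status (1) from failing the pipeline under pipefail.
count_not_matching() {
  { grep -v -- "$1" || true; } | wc -l
}

# Happy path: every sample pod is Running, so the count is zero and the
# assignment succeeds instead of tripping `set -e` in a caller.
not_running=$(printf 'etcd Running\nkube-apiserver Running\n' | count_not_matching Running)
echo "pods not Running: $((not_running))"
```

The count itself was never wrong in the unfixed form; only the pipeline's exit status was, which is why the bug stays invisible until the healthy state is reached.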
Viktor Barzin
83fc15c22b k8s-version-upgrade: fix pipefail abort when no alerts are firing
halt_on_alert_query() ends with `grep -vE "$regex" | sort -u`. When
zero alerts are firing (the desired healthy state), grep matches
nothing and exits 1. Under `set -o pipefail`, the whole pipeline
returns 1; under `set -e`, the caller's `alerts=$(...)` assignment
fails and aborts the script in ~1s with no diagnostic output.

The chain effectively required at least one non-meta alert to be
firing to make any forward progress. Today (2026-05-19) the cluster
is fully clean post-MySQL recovery, the daily 12:00 UTC detection
spawned the preflight Job, and it died instantly — blocking the
1.34.7 → 1.34.8 patch chain.

Fix: wrap the grep in `{ ... || true; }` so a no-matches result
returns success. Preflight verified end-to-end after the fix — the
chain is now in flight (preflight ✓, master phase running).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
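A minimal sketch of the failure mode described in 83fc15c22b, assuming bash and a POSIX grep. `filter_unsafe` and `filter_safe` are hypothetical names standing in for the tail of `halt_on_alert_query()`, and `Watchdog` is only an example of an alert the regex might exclude.

```shell
#!/usr/bin/env bash
set -o pipefail

# Unfixed shape: drop alert names matching the ignore regex, de-duplicate.
filter_unsafe() { grep -vE "$1" | sort -u; }

# Fixed shape: a grep that selects nothing no longer fails the pipeline.
filter_safe() { { grep -vE "$1" || true; } | sort -u; }

# Healthy cluster: only an ignorable alert is firing, so grep selects
# no lines and exits 1; pipefail propagates that through `sort -u`.
rc=0; printf 'Watchdog\n' | filter_unsafe '^Watchdog$' || rc=$?
echo "unfixed filter exit status: $rc"

rc=0; printf 'Watchdog\n' | filter_safe '^Watchdog$' || rc=$?
echo "fixed filter exit status: $rc"
```

Under `set -e`, an assignment such as `alerts=$(... | filter_unsafe ...)` inherits that non-zero status, which matches the silent ~1s abort the commit describes.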
Viktor Barzin
3d43d96a5e k8s-version-upgrade: switch detection cron from weekly to daily
Was `0 12 * * 0` (Sun 12:00 UTC) — patch releases waited up to 6 days
before the chain picked them up. Now `0 12 * * *` (daily 12:00 UTC,
still outside kured's 02:00-06:00 London window). Concurrency is
bounded by `concurrencyPolicy: Forbid` plus deterministic job-name
idempotency (the detection job exits early if a preflight Job for the
same target already exists), so back-to-back days can't pile up
parallel runs.

- stacks/k8s-version-upgrade/main.tf: var.schedule default + rationale comment
- scripts/upgrade_state.sh: rename next_sunday_noon_utc -> next_daily_noon_utc
  (now returns "Tue 2026-05-19 12:00 UTC" form); change "(Sun cron)" label
  to "(daily cron)"
- .claude/skills/upgrade-state/SKILL.md: cadence column + frontmatter

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
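One way a helper like `next_daily_noon_utc` could produce the "Tue 2026-05-19 12:00 UTC" form is sketched below. This is a hypothetical re-implementation, not the code in scripts/upgrade_state.sh. It assumes bash and a `date` that accepts `-d "@EPOCH"` (GNU coreutils does), and it assumes that a call made exactly at 12:00 should report the following day.

```shell
#!/usr/bin/env bash

# next_daily_noon_utc [EPOCH]: print the next 12:00 UTC strictly after
# EPOCH (default: now). The optional argument exists only for testing.
next_daily_noon_utc() {
  local now noon
  now=${1:-$(date -u +%s)}
  noon=$(( now - now % 86400 + 43200 ))      # today's 12:00 UTC
  if [ "$now" -ge "$noon" ]; then
    noon=$(( noon + 86400 ))                 # already past: use tomorrow
  fi
  LC_ALL=C date -u -d "@$noon" '+%a %Y-%m-%d %H:%M UTC'
}

next_daily_noon_utc
```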
Viktor Barzin
5482f46125 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
Viktor Barzin
e4e2babd6a k8s-version-upgrade: FQDN SSH targets + python3 in place of envsubst
Two latent bugs in the K8s-version-upgrade pipeline surfaced when a
real detection run executed after the 26.04 upgrade today:

1. **DNS**: pod's CoreDNS search path is `<ns>.svc.cluster.local
   svc.cluster.local cluster.local` (+ ndots=2 via Kyverno mutation).
   Unqualified `k8s-master` falls through all of those and then queries
   upstream Technitium for the bare name → NXDOMAIN. The FQDN
   `k8s-master.viktorbarzin.lan` is what Technitium actually serves.
   Suffix every node SSH target with `$NODE_DOMAIN`.

2. **envsubst missing**: claude-agent-service image doesn't ship
   `gettext-base`. Replace `envsubst <template | apply` with
   `python3 -c 'import os,sys; sys.stdout.write(os.path.expandvars(
   sys.stdin.read()))' <template | apply`. Same semantics for variables
   that are set (expandvars leaves unset references untouched, where
   envsubst would blank them), and the image already has python3.
   Multi-line $SCHEDULING_BLOCK is preserved correctly through
   expandvars.

Verified by manually triggering `k8s-version-check` post-fix:
detection now reads `Latest patch: v1.34.8` (currently running 1.34.7)
and spawns `k8s-upgrade-preflight-1-34-8`. The Job pod scheduled and
started; killed before it touched the cluster (will land on Sunday
2026-05-24 12:00 UTC like the schedule says).

Root cause of why these bugs lay dormant: yesterday's first
manual-test detection found "no upgrade needed" so neither code path
exercised SSH or envsubst. Today's apt-source restore (do-release-
upgrade had mangled them) unmasked the v1.34.8 candidate, which made
detection finally proceed past the SSH step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
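The envsubst replacement from e4e2babd6a can be tried in isolation as follows. This is a sketch: `render_template` is a hypothetical wrapper name, the template text is invented, and it assumes `python3` is on PATH.

```shell
#!/usr/bin/env bash

# render_template: expand $VAR / ${VAR} references on stdin from the
# environment, using the one-liner quoted in the commit message.
render_template() {
  python3 -c 'import os, sys; sys.stdout.write(os.path.expandvars(sys.stdin.read()))'
}

export TARGET_VERSION=1.34.8
unset UNSET_EXAMPLE_VAR

# The quoted heredoc delimiter stops the shell from expanding anything,
# so all substitution is done by expandvars.
render_template <<'EOF'
target: $TARGET_VERSION
node: ${UNSET_EXAMPLE_VAR}
EOF
```

The second line comes back unchanged because `os.path.expandvars` leaves references to unset variables in place, so a template that relies on blanking unset variables would need them exported as empty strings first.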
Viktor Barzin
cdeb89d5f1 final wave: enroll immich + status-page, retrigger 17 pending Bucket A
  * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1
    skipped — has non-standard lifecycle from earlier work).
  * status-page: enrolled (was missing from original sweep).
  * v6 retrigger marker on 17 stacks that never reached terragrunt
    apply (#704 exit-1 halted mid-loop).

After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks.
The remaining ~22 are operator/Helm-managed and intentionally excluded
(same fight-loop risk as Calico — bump via Helm chart version, not
Keel).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
b57596d930 Bucket A retrigger + Bucket D enrollment (5 module-nested stacks)
After fixing the postgresql-lb MetalLB flap (deleted stuck
ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined
commit:

  * Bucket A (16 stacks): re-append CI retrigger marker so the
    previously-pending applies pick up:
      blog calico cyberchef descheduler f1-stream homepage jsoncrack
      k8s-dashboard k8s-version-upgrade kms local-path osm_routing
      real-estate-crawler travel_blog vault webhook_handler

  * Bucket D (5 module-nested stacks): keel.sh/enrolled label on
    namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module:
      postiz instagram-poster k8s-portal uptime-kuma vaultwarden

Bucket C (raw-deploy apps without V1 marker on their Deployment
lifecycles) deferred — needs per-Deployment lifecycle block additions
that the bulk script can't safely automate:
  beads-server immich llama-cpp novelapp plotting-book trading-bot

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:55 +00:00
Viktor Barzin
b48ddc09d6 ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)
2026-05-22 14:16:54 +00:00
Viktor Barzin
978237441e ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks
2026-05-22 14:16:54 +00:00
Viktor Barzin
0c73974362 ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698)
2026-05-22 14:16:53 +00:00
Viktor Barzin
6769526e1e ci: retrigger apply for pending Keel enrollment (~58 stacks)
Bulk enrollment commit 8f4b1956 had its CI pipeline #689 killed before
terragrunt apply ran. The enrollment label + V2 lifecycle changes are
in master but never reached the cluster. Appending a one-line marker
to each pending stack's main.tf so Woodpecker's diff-detection picks
them up and applies them serially.

Idempotent — re-applying a stack whose state already matches is a no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:52 +00:00
3027ab85a8 recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:49 +00:00
Viktor Barzin
448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
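The deterministic naming scheme from 448bc0c0f6 might look like the sketch below. `job_name` is a hypothetical helper. The dots-to-dashes conversion is inferred from the `k8s-upgrade-preflight-1-34-8` name that appears elsewhere in this log, and the exact form of the node suffix is an assumption.

```shell
#!/usr/bin/env bash

# job_name PHASE TARGET_VERSION [NODE]
# Build k8s-upgrade-<phase>-<target_version>[-<node>]. Kubernetes object
# names cannot contain dots here, so 1.34.8 becomes 1-34-8.
job_name() {
  local phase=$1 version=${2#v} node=${3:-}
  local name="k8s-upgrade-${phase}-${version//./-}"
  if [ -n "$node" ]; then
    name="${name}-${node}"
  fi
  printf '%s\n' "$name"
}

job_name preflight v1.34.8
job_name worker v1.34.8 k8s-node4
```

Because the name depends only on phase, target version, and node, re-running a step yields the same Job name, which is what lets `kubectl apply` treat a repeat as a no-op rather than a second Job.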
Viktor Barzin
dc4ce46411 k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE
Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while
apt-cache madison (without prior apt-get update) was reporting v1.34.5
— so the CronJob would have dispatched the agent against a stale
target. Now do `sudo apt-get update -qq` for just the kubernetes repo
before querying madison.

Also add a DRY_RUN_OVERRIDE env precedence so future test invocations
can override DRY_RUN without an apply cycle — but Job spec env is
immutable post-create, so this is only useful for CronJob spec edits
(suspend, then add env, then resume). Documented in the runbook.
2026-05-22 14:16:43 +00:00
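The precedence described in dc4ce46411 could be expressed with nested parameter expansion, as sketched here. `effective_dry_run` is a hypothetical helper; the real script may resolve the value differently, and no default is assumed when both variables are unset.

```shell
#!/usr/bin/env bash

# effective_dry_run: DRY_RUN_OVERRIDE wins when set and non-empty,
# otherwise fall back to DRY_RUN (empty if neither is set).
effective_dry_run() {
  printf '%s\n' "${DRY_RUN_OVERRIDE:-${DRY_RUN:-}}"
}

DRY_RUN=true
DRY_RUN_OVERRIDE=false
echo "dry run: $(effective_dry_run)"
```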
Viktor Barzin
ae6dde45c2 k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC
Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed
PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to
ssh into master and run etcdctl against a non-existent /mnt/main mount.
The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to
10 min, then parses the backup-manage container log for "Backup done"
line + byte count.

Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works
end-to-end at the planning level.

Expanded the claude-agent ServiceAccount's privileges via a sibling
ClusterRole (claude-agent-upgrade-ops):
  - patch namespaces/k8s-upgrade (in-flight annotation)
  - create batch/jobs (trigger etcd snapshot Job)
  - patch nodes (cordon/uncordon)
  - create pods/eviction (drain)
  - delete pods (drain fallback)
2026-05-22 14:16:43 +00:00
Viktor Barzin
e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00