infra

Author	SHA1	Message	Date
Viktor Barzin	8b9727ac3e	final wave: enroll immich + status-page, retrigger 17 pending Bucket A * immich: extended 3 V1 lifecycles to V2 (1 Deployment without V1 skipped — has non-standard lifecycle from earlier work). * status-page: enrolled (was missing from original sweep). * v6 retrigger marker on 17 stacks that never reached terragrunt apply (#704 exit-1 halted mid-loop). After this lands, expected live enrollment: ~96 / 118 Tier 1 stacks. The remaining ~22 are operator/Helm-managed and intentionally excluded (same fight-loop risk as Calico — bump via Helm chart version, not Keel). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:19:20 +00:00
Viktor Barzin	eb99ee5635	Bucket A retrigger + Bucket D enrollment (5 module-nested stacks) After fixing the postgresql-lb MetalLB flap (deleted stuck ServiceL2Status CR l2-rgt9d), Tier 1 CI can apply again. Combined commit: * Bucket A (16 stacks): re-append CI retrigger marker so the previously-pending applies pick up: blog calico cyberchef descheduler f1-stream homepage jsoncrack k8s-dashboard k8s-version-upgrade kms local-path osm_routing real-estate-crawler travel_blog vault webhook_handler * Bucket D (5 module-nested stacks): keel.sh/enrolled label on namespace + KYVERNO_LIFECYCLE_V2 on Deployments inside the module: postiz instagram-poster k8s-portal uptime-kuma vaultwarden Bucket C (raw-deploy apps without V1 marker on their Deployment lifecycles) deferred — needs per-Deployment lifecycle block additions that the bulk script can't safely automate: beads-server immich llama-cpp novelapp plotting-book trading-bot Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:10:38 +00:00
Viktor Barzin	57e3059f5b	ci: retrigger v4 — remaining 16 Keel stacks (#701 failed one of them)	2026-05-16 14:13:59 +00:00
Viktor Barzin	cbbb394e88	ci: retrigger v3 — apply remaining 22 Keel-enrolled stacks	2026-05-16 14:06:39 +00:00
Viktor Barzin	7bcd3f745c	ci: retrigger v2 — apply pending Keel-enrolled stacks (#697 was cancelled by #698)	2026-05-16 13:47:13 +00:00
Viktor Barzin	70d0623b21	ci: retrigger apply for pending Keel enrollment (~58 stacks) Bulk enrollment commit `8f4b1956` had its CI pipeline #689 killed before terragrunt apply ran. The enrollment label + V2 lifecycle changes are in master but never reached the cluster. Appending a one-line marker to each pending stack's main.tf so Woodpecker's diff-detection picks them up and applies them serially. Idempotent — re-applying a stack whose state already matches is a no-op. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 13:42:57 +00:00
Viktor Barzin	8f4b19565c	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 12:41:05 +00:00
Viktor Barzin	01bc16d592	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-11 23:54:22 +00:00
Viktor Barzin	5d0e17b5ba	k8s-version-upgrade: detection script refresh apt before madison + DRY_RUN_OVERRIDE Test 2 dry-run revealed kubeadm plan reports v1.34.7 as latest while apt-cache madison (without prior apt-get update) was reporting v1.34.5 — so the CronJob would have dispatched the agent against a stale target. Now do `sudo apt-get update -qq` for just the kubernetes repo before querying madison. Also add a DRY_RUN_OVERRIDE env precedence so future test invocations can override DRY_RUN without an apply cycle — but Job spec env is immutable post-create, so this is only useful for CronJob spec edits (suspend, then add env, then resume). Documented in the runbook.	2026-05-10 19:33:11 +00:00
Viktor Barzin	988bfde45c	k8s-version-upgrade: trigger etcd snapshot via existing backup-etcd Job; broaden agent RBAC Stage 2 now reuses the existing default/backup-etcd CronJob (NFS-backed PV pointing at 192.168.1.127:/srv/nfs/etcd-backup) instead of trying to ssh into master and run etcdctl against a non-existent /mnt/main mount. The agent triggers a one-shot Job from cronjob/backup-etcd, waits up to 10 min, then parses the backup-manage container log for "Backup done" line + byte count. Test 2 (dry-run) surfaced 5 real cluster blockers — agent loop works end-to-end at the planning level. Expanded the claude-agent ServiceAccount's privileges via a sibling ClusterRole (claude-agent-upgrade-ops): - patch namespaces/k8s-upgrade (in-flight annotation) - create batch/jobs (trigger etcd snapshot Job) - patch nodes (cordon/uncordon) - create pods/eviction (drain) - delete pods (drain fallback)	2026-05-10 19:16:12 +00:00
Viktor Barzin	a58d777059	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-10 19:07:42 +00:00

11 commits
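
Commit 01bc16d592 notes that deterministic Job names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent. A minimal sketch of that naming idea follows. The function name `upgrade_job_name` is hypothetical, and the repository's actual implementation (described as envsubst over job-template.yaml) may differ, for example in how it sanitizes the version string.

```python
from typing import Optional


def upgrade_job_name(phase: str, target_version: str, node: Optional[str] = None) -> str:
    """Build a Job name following the pattern quoted in the commit message:
    k8s-upgrade-<phase>-<target_version>[-<node>].

    Hypothetical helper for illustration only. No sanitization of the
    version string is done here, since the source does not describe any.
    """
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        # The node suffix is optional (phases such as preflight have no node).
        parts.append(node)
    return "-".join(parts)


if __name__ == "__main__":
    # The same inputs always yield the same name, so re-creating a failed
    # Job targets the same object instead of spawning a duplicate.
    print(upgrade_job_name("preflight", "v1.34.7"))
    print(upgrade_job_name("worker", "v1.34.7", "k8s-node4"))
```

Because the name depends only on the phase, target version, and node, applying the manifest for a step a second time addresses the same Job rather than creating a new one downstream.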