infra/.claude
Viktor Barzin e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00
..
agents k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline 2026-05-22 14:16:42 +00:00
commands [ci skip] update kubectl skill to use local kubeconfig 2026-02-07 13:42:35 +00:00
reference authentik: zero-endpoints alert + upgrade-validation checklist 2026-05-22 14:16:41 +00:00
scripts rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip] 2026-04-13 18:37:04 +00:00
skills [cluster-health] Expand to 42 checks, remove pod CronJob path 2026-04-19 15:13:03 +00:00
calendar-query.py sync regenerated providers.tf + upstream changes 2026-03-22 02:56:04 +02:00
CLAUDE.md anubis: fix 500 on multi-replica + roll out to 6 more public sites 2026-05-10 11:12:40 +00:00
home-assistant-sofia.py [ci skip] Add ha-sofia Home Assistant deployment to skills 2026-02-07 21:26:05 +00:00
home-assistant.py add claude [ci skip] 2026-02-06 20:10:02 +00:00
internet-mode-used_DO_NOT_REMOVE_MANUALLY_SECURITY_RISK add claude [ci skip] 2026-02-06 20:10:02 +00:00
pfsense.py [ci skip] Add pfSense firewall management skill 2026-02-14 12:42:10 +00:00
settings.json add claude files [ci skip] 2026-01-18 15:40:43 +00:00