infra/docs/architecture
Viktor Barzin e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00
..
agent-task-tracking.md Add agent task tracking documentation 2026-04-15 17:11:26 +00:00
authentication.md fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 2026-04-15 06:41:56 +00:00
automated-upgrades.md k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline 2026-05-22 14:16:42 +00:00
backup-dr.md backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile 2026-05-10 11:12:39 +00:00
chrome-service.md chrome-service: open NP for Traefik → noVNC sidecar (port 6080) 2026-05-07 23:29:34 +00:00
ci-cd.md [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
compute.md infra/compute: bump k8s-node1 RAM 32 -> 48 GiB 2026-05-22 14:16:41 +00:00
databases.md [redis] stabilise against node-crash flap cascade — RC1-RC5 fixes 2026-04-22 15:59:00 +00:00
dns.md phpipam-pfsense-import: every 5min → hourly 2026-04-26 22:48:43 +00:00
homepage.md add homepage auto-discovery documentation [ci skip] 2026-03-25 13:06:43 +02:00
incident-response.md [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP 2026-04-18 10:12:02 +00:00
llama-cpp.md infra/llama-cpp: benchmark report + -fa flag fix 2026-05-22 14:16:41 +00:00
mailserver.md monitoring: bring EmailRoundtripStale threshold docs in sync with for:20m 2026-04-21 22:39:46 +00:00
monitoring.md [forgejo] Phases 3+4+5: cutover, decommission, docs sweep 2026-05-07 23:29:34 +00:00
multi-tenancy.md add architecture documentation for all infrastructure subsystems [ci skip] 2026-03-24 00:55:25 +02:00
networking.md kms: deploy slack-notifier sidecar with Prometheus metrics + document public exposure 2026-05-10 11:12:39 +00:00
overview.md gpu: schedule off NFD label, not k8s-node1 hostname 2026-04-22 13:43:07 +00:00
secrets.md docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] 2026-04-06 13:21:05 +03:00
security.md [docs] Update anti-AI and rybbit docs after rewrite-body removal 2026-04-17 21:43:13 +00:00
storage.md mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB 2026-05-10 11:12:38 +00:00
vpn.md [docs] TrueNAS decommission cleanup — remove references from active docs 2026-04-19 16:55:43 +00:00