From c3f9a1f346ba3f10714fdcb44f594e89a9c073df Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Thu, 2 Apr 2026 11:46:27 +0300
Subject: [PATCH] add reboot-server skill for safe Proxmox host and k8s node reboots

---
 dot_claude/skills/reboot-server.md | 370 +++++++++++++++++++++++++++++
 1 file changed, 370 insertions(+)
 create mode 100644 dot_claude/skills/reboot-server.md

diff --git a/dot_claude/skills/reboot-server.md b/dot_claude/skills/reboot-server.md
new file mode 100644
index 0000000..26491f7
--- /dev/null
+++ b/dot_claude/skills/reboot-server.md
@@ -0,0 +1,370 @@
---
name: reboot-server
description: Safely reboot the Proxmox host or a single k8s node with graceful shutdown
  and post-boot validation. Use when the user asks to "reboot the server", "reboot
  proxmox", "reboot node", "restart the host", "restart k8s node", "power cycle".
  Supports full host reboot and single node reboot modes.
---

## Overview

This skill safely reboots either the entire Proxmox host (Dell R730) or a single Kubernetes node VM. It ensures a graceful shutdown of all services and validates full recovery afterwards.

## Connection Details

- **Proxmox host**: `ssh root@192.168.1.127`
- **kubectl**: `KUBECONFIG=/Users/viktorbarzin/code/config kubectl`
- **VM commands**: `ssh root@192.168.1.127 'qm <command>'`

## VM Inventory & Boot Order

| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|-------|------|------|---------------|------------------|-------|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
| 2 | 9000 | TrueNAS | 60s | 300s | NFS/iSCSI storage — needs network from pfSense |
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
| 5 | 201 | k8s-node1 | 45s | 300s | GPU node (Tesla T4) |
| 5 | 202 | k8s-node2 | 45s | 300s | Worker |
| 5 | 203 | k8s-node3 | 45s | 300s | Worker |
| 5 | 204 | k8s-node4 | 45s | 300s | Worker |
| 6 | 103 | home-assistant | 0s | 120s | HA Sofia — no ordering dependency |
| 6 | 300 | Windows10 | 0s | 120s | Windows VM — no ordering dependency |

Shutdown order is the **reverse** of boot order (Proxmox handles this automatically).

## Node-to-VMID Mapping

```
k8s-master = 200
k8s-node1  = 201 (GPU node)
k8s-node2  = 202
k8s-node3  = 203
k8s-node4  = 204
```

## Mode Selection

Ask the user which mode they want if the request is unclear:
1. **Full host reboot** — reboot the entire Proxmox R730 host
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon

---

## Full Host Reboot Procedure

### Phase 1: Pre-flight Checks

Run ALL of these checks before proceeding. Fix issues inline where possible.

#### 1.1 Verify Proxmox SSH access
```bash
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
```

#### 1.2 Verify cluster health
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | head -20
```

#### 1.3 Check and configure VM boot ordering
For each VM, check whether boot ordering is set; a note on the `--startup` syntax follows.
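For reference, the `startup` option of `qm set` encodes all three columns of the inventory table (standard Proxmox syntax; the concrete numbers used in this skill are this homelab's policy, not Proxmox defaults):

```bash
# qm set <VMID> --startup order=<N>[,up=<secs>][,down=<secs>]
#   order — position in the boot sequence (lower numbers boot first)
#   up    — seconds to wait after this VM starts before starting the next one
#   down  — how long Proxmox waits for a graceful shutdown before giving up
ssh root@192.168.1.127 'qm set 101 --startup order=1,down=120'  # e.g. pfSense: boot first, 120s shutdown window
```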
Check the current configuration:

```bash
# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
```

If any VM reports `no startup config` (or a wrong order), apply the boot order below — the command is idempotent, so it is safe to re-run:
```bash
ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
qm set 9000 --startup order=2,up=60,down=300
qm set 102 --startup order=3,up=60,down=120
qm set 220 --startup order=3,up=60,down=120
qm set 200 --startup order=4,up=45,down=300
qm set 201 --startup order=5,up=45,down=300
qm set 202 --startup order=5,up=45,down=300
qm set 203 --startup order=5,up=45,down=300
qm set 204 --startup order=5,up=45,down=300
qm set 103 --startup order=6,down=120
qm set 300 --startup order=6,down=120
'
```

#### 1.4 Check kubelet shutdownGracePeriod on all k8s nodes
```bash
for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- grep -c shutdownGracePeriod /var/lib/kubelet/config.yaml 2>/dev/null" || echo "NOT SET"
done
```

If not set on a node, patch it (the heredoc bodies are the standard kubelet graceful-node-shutdown settings; verify the values against your kubelet version):
```bash
VMID=<VMID>  # node to patch
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
# Remove any existing shutdown config to avoid duplicates
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml

# Add graceful shutdown config (300s matches the VM down=300 timeout)
cat >> /var/lib/kubelet/config.yaml <<EOF
shutdownGracePeriod: 300s
shutdownGracePeriodCriticalPods: 120s
EOF

# Let systemd-logind delay shutdown long enough for kubelet to evict pods
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
[Login]
InhibitDelayMaxSec=300
EOF

# Give the kubelet unit itself the same stop timeout
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
[Service]
TimeoutStopSec=300
EOF

systemctl daemon-reload
systemctl restart systemd-logind kubelet
'"
```

#### 1.5 Check unattended-upgrades is disabled

Confirm unattended-upgrades is not active on any k8s node:
```bash
for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades 2>/dev/null" || echo "not found (good)"
done
```

If active on any node, disable it:
```bash
VMID=<VMID>  # node to fix
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"
```

#### 1.6 Report pre-flight status
Summarize: all checks passed, or list what was fixed. Ask the user for final confirmation before rebooting.

### Phase 2: Reboot

**IMPORTANT**: Get explicit user confirmation before proceeding.

```bash
# Reboot the Proxmox host
ssh root@192.168.1.127 'reboot'
```

Then wait for SSH to drop (confirms the reboot started):
```bash
# Poll until SSH drops (host is rebooting)
for i in $(seq 1 30); do
  ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
  sleep 2
done
```

Then poll until Proxmox is back (timeout: 10 minutes):
```bash
# Poll every 30s until SSH is back
for i in $(seq 1 20); do
  if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
    echo "Proxmox host is back online"
    break
  fi
  echo "Waiting for Proxmox host... (attempt $i/20)"
  sleep 30
done
```

### Phase 3: Post-boot Validation (30 min timeout)

Run these checks in a loop, reporting progress. Time out after 30 minutes total.

#### 3.1 Check all VMs are running
```bash
ssh root@192.168.1.127 'qm list' | grep -v stopped
```

If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:
```bash
ssh root@192.168.1.127 'qm start <VMID>'
```

#### 3.2 Check k8s API is reachable
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl cluster-info 2>/dev/null
```

#### 3.3 Check all nodes are Ready
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
```

All 5 nodes (k8s-master, k8s-node1-4) should show `Ready`; a polling sketch follows.
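A minimal polling sketch for this step (uses the full 30-minute Phase 3 budget; note the `Ready` match is exact, so a cordoned node showing `Ready,SchedulingDisabled` would not count):

```bash
# Poll every 30s (up to 30 min) until all 5 nodes report Ready
for i in $(seq 1 60); do
  READY=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes --no-headers 2>/dev/null \
    | awk '$2 == "Ready"' | wc -l)
  if [ "$READY" -eq 5 ]; then
    echo "All 5 nodes Ready"
    break
  fi
  echo "$READY/5 nodes Ready... (attempt $i/60)"
  sleep 30
done
```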
#### 3.4 Check critical infrastructure pods

Check these in order (reflects the boot dependency chain):

```bash
# Wrap with env so $KC expands to a runnable command (a bare KUBECONFIG=...
# prefix inside a variable would be executed as a command name, not an assignment)
KC="env KUBECONFIG=/Users/viktorbarzin/code/config kubectl"

# Tier 0: Core infrastructure
$KC get pods -n metallb-system -l app=metallb
$KC get pods -n kube-system -l k8s-app=kube-dns  # CoreDNS
$KC get pods -n technitium                       # DNS

# Tier 1: Storage
$KC get pods -n democratic-csi                   # iSCSI CSI
$KC get pods -n nfs-csi                          # NFS CSI

# Tier 2: Ingress + tunnel
$KC get pods -n traefik
$KC get pods -n cloudflared

# Tier 3: Security
$KC get pods -n kyverno

# Tier 4: Data layer
$KC get pods -n redis
$KC get pods -n dbaas                            # MySQL + PostgreSQL

# Tier 5: Secrets + auth
$KC get pods -n vault
$KC get pods -n authentik
$KC get pods -n external-secrets
```

#### 3.5 Check Vault is unsealed
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
```

Should return `false`. The auto-unseal sidecar polls every 10s — if Vault is still sealed after 5 minutes, investigate.

#### 3.6 Check PVC status
```bash
# PVCs don't support status.phase field selectors, so filter client-side
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pvc -A --no-headers 2>/dev/null | grep -v Bound | head -20
```

No PVCs should be in Pending state (that indicates CSI driver issues).

#### 3.7 Check for stuck pods
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
```

Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods (auto-unseal takes a minute).

### Phase 4: Final Validation

- If all checks pass: report "Cluster fully recovered in X minutes"
- If issues remain after 30 minutes: invoke the `/cluster-health` skill for diagnosis and remediation
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered

---

## Single Node Reboot Procedure

### Phase 1: Pre-flight

#### 1.1 Identify target
Map the user's input to a node name and VMID:
- `k8s-master` / `master` / `200` → k8s-master (VMID 200)
- `k8s-node1` / `node1` / `201` → k8s-node1 (VMID 201, GPU node)
- `k8s-node2` / `node2` / `202` → k8s-node2 (VMID 202)
- `k8s-node3` / `node3` / `203` → k8s-node3 (VMID 203)
- `k8s-node4` / `node4` / `204` → k8s-node4 (VMID 204)

#### 1.2 Check node status
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name>
```

#### 1.3 Check what's running on the node
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -30
```

Warn the user if the node runs:
- GPU workloads (node1) — Immich ML, Frigate, Ollama
- Database pods — MySQL InnoDB, PostgreSQL CNPG (long recovery)
- Vault pods — may need re-unseal

#### 1.4 Check PDBs won't block drain
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A
```

Verify that draining this node won't violate any PDB (e.g., Traefik has minAvailable=2 — if 2 of its 3 pods are on other nodes, the drain is safe); see the sketch below.
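A quick way to spot blockers (a sketch; `disruptionsAllowed` is the live signal the eviction API checks, so any row at 0 that covers pods on the target node will stall the drain):

```bash
# List each PDB with its current disruption budget
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,MIN:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed'
```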
### Phase 2: Drain + Reboot

```bash
# Drain the node (300s timeout for graceful eviction)
KUBECONFIG=/Users/viktorbarzin/code/config kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# Shut down the VM gracefully (300s timeout for kubelet + systemd shutdown)
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"

# Wait for the VM to stop
for i in $(seq 1 60); do
  STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
  if echo "$STATUS" | grep -q stopped; then
    echo "VM stopped"
    break
  fi
  echo "Waiting for VM to stop... ($i/60)"
  sleep 5
done

# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"
```

### Phase 3: Recovery

```bash
# Wait for the node to become Ready (timeout: 5 min)
for i in $(seq 1 30); do
  STATUS=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
  if [ "$STATUS" = "Ready" ]; then
    echo "Node is Ready"
    break
  fi
  echo "Waiting for node to become Ready... ($i/30)"
  sleep 10
done

# Uncordon the node
KUBECONFIG=/Users/viktorbarzin/code/config kubectl uncordon <node-name>

# Verify DaemonSet pods are scheduled
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -20

# Report status
echo "Node rebooted and uncordoned successfully"
```

---

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| VM won't start | Proxmox host disk full | `ssh root@192.168.1.127 'df -h'` — check the thin pool |
| Node stays NotReady | kubelet/containerd not starting | `qm guest exec <VMID> -- systemctl status kubelet` |
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for the TrueNAS ZFS pool import, or check `qm guest exec 9000 -- zpool status` |
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar logs: `kubectl logs -n vault vault-0 -c vault-unseal` |
| MySQL slow to recover | InnoDB init containers (~20 min each) | Normal — monitor progress with `kubectl logs -n dbaas <pod>` |
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with `/cluster-health` |
| containerd won't start | Disk full or corrupted images | `qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50` |
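For convenience, the single-node flow above can be condensed into one helper (a sketch, not a required part of the skill; `NODE` and `VMID` are supplied by the caller, and the waits are unbounded, so prefer the step-by-step procedure when recovery is uncertain):

```bash
#!/usr/bin/env bash
# reboot-node.sh — drain, reboot, and uncordon a single k8s node VM (sketch)
set -euo pipefail
NODE="$1"; VMID="$2"   # e.g. ./reboot-node.sh k8s-node2 202
export KUBECONFIG=/Users/viktorbarzin/code/config

kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300"

# Wait for the VM to stop, then start it again
until ssh root@192.168.1.127 "qm status $VMID" | grep -q stopped; do sleep 5; done
ssh root@192.168.1.127 "qm start $VMID"

# Wait for kubelet to report Ready, then allow scheduling again
until [ "$(kubectl get node "$NODE" --no-headers 2>/dev/null | awk '{print $2}')" = "Ready" ]; do
  sleep 10
done
kubectl uncordon "$NODE"
echo "$NODE rebooted and uncordoned"
```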