name: reboot-server
description: Safely reboot the Proxmox host or a single k8s node with graceful shutdown and post-boot validation. Use when the user asks to "reboot the server", "reboot proxmox", "reboot node", "restart the host", "restart k8s node", "power cycle". Supports full host reboot and single node reboot modes.
Overview
This skill safely reboots either the entire Proxmox host (Dell R730) or a single Kubernetes node VM. It ensures graceful shutdown of all services and validates full recovery afterwards.
Connection Details
- Proxmox host: ssh root@192.168.1.127
- kubectl: KUBECONFIG=/Users/viktorbarzin/code/config kubectl
- VM commands: ssh root@192.168.1.127 'qm <command>'
VM Inventory & Boot Order
| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|---|---|---|---|---|---|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
| 2 | 9000 | TrueNAS | 60s | 300s | NFS/iSCSI storage — needs network from pfSense |
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
| 5 | 201 | k8s-node1 | 45s | 300s | GPU node (Tesla T4) |
| 5 | 202 | k8s-node2 | 45s | 300s | Worker |
| 5 | 203 | k8s-node3 | 45s | 300s | Worker |
| 5 | 204 | k8s-node4 | 45s | 300s | Worker |
| 6 | 103 | home-assistant | 0s | 120s | HA Sofia — no ordering dependency |
| 6 | 300 | Windows10 | 0s | 120s | Windows VM — no ordering dependency |
Shutdown order is the reverse of boot order (Proxmox handles this automatically).
Node-to-VMID Mapping
k8s-master = 200
k8s-node1 = 201 (GPU node)
k8s-node2 = 202
k8s-node3 = 203
k8s-node4 = 204
Mode Selection
Ask the user which mode they want if unclear:
- Full host reboot — reboot the entire Proxmox R730 host
- Single node reboot — drain, reboot one k8s node VM, uncordon
Full Host Reboot Procedure
Phase 1: Pre-flight Checks
Run ALL of these checks before proceeding. Fix issues inline if possible.
1.1 Verify Proxmox SSH access
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
1.2 Verify cluster health
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | head -20
1.3 Check and configure VM boot ordering
For each VM, check if boot ordering is set. If any VM has order=-1, configure it:
# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
If not configured, apply boot order (this is idempotent):
ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
qm set 9000 --startup order=2,up=60,down=300
qm set 102 --startup order=3,up=60,down=120
qm set 220 --startup order=3,up=60,down=120
qm set 200 --startup order=4,up=45,down=300
qm set 201 --startup order=5,up=45,down=300
qm set 202 --startup order=5,up=45,down=300
qm set 203 --startup order=5,up=45,down=300
qm set 204 --startup order=5,up=45,down=300
qm set 103 --startup order=6,down=120
qm set 300 --startup order=6,down=120
'
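The inventory table and the qm set commands above carry the same data twice. As a minimal sketch, the commands can be generated from a single list so the two do not drift apart. The helper name startup_commands is hypothetical, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: print one `qm set` command per VM from a single list.
# Fields per line: VMID ORDER UP_DELAY DOWN_TIMEOUT (UP_DELAY 0 omits up=).
startup_commands() {
  while read -r vmid order up down; do
    [ -z "$vmid" ] && continue
    if [ "$up" -gt 0 ]; then
      echo "qm set $vmid --startup order=$order,up=$up,down=$down"
    else
      echo "qm set $vmid --startup order=$order,down=$down"
    fi
  done <<'EOF'
101 1 0 120
9000 2 60 300
102 3 60 120
220 3 60 120
200 4 45 300
201 5 45 300
202 5 45 300
203 5 45 300
204 5 45 300
103 6 0 120
300 6 0 120
EOF
}
```

Review the printed commands first. One way to apply them is startup_commands | ssh root@192.168.1.127 'bash -s'.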
1.4 Check kubelet shutdownGracePeriod on all k8s nodes
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
ssh root@192.168.1.127 "qm guest exec $VMID -- grep -c shutdownGracePeriod /var/lib/kubelet/config.yaml 2>/dev/null" || echo "NOT SET"
done
If not set on any node, patch it:
VMID=<target>
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
# Remove any existing shutdown config to avoid duplicates
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml
# Add graceful shutdown config
cat >> /var/lib/kubelet/config.yaml <<EOF
shutdownGracePeriod: \"240s\"
shutdownGracePeriodCriticalPods: \"60s\"
EOF
# Create systemd logind override for InhibitDelayMaxSec
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
[Login]
InhibitDelayMaxSec=300
EOF
systemctl restart systemd-logind
# Create kubelet service override for TimeoutStopSec
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
[Service]
TimeoutStopSec=300s
EOF
systemctl daemon-reload
systemctl restart kubelet
'"
Repeat for each node VMID (200-204) that needs patching.
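The timeouts in this patch only work if they nest: the critical-pod period fits inside shutdownGracePeriod, and both InhibitDelayMaxSec and the Proxmox shutdown timeout are at least as long as shutdownGracePeriod. The sketch below checks that relationship and prints how much time regular pods get. The function name check_shutdown_budget is hypothetical. The values 240, 60, and 300 come from the patch above and the inventory table.

```shell
# Hypothetical helper: check that the shutdown timeouts nest correctly.
# Arguments (seconds): shutdownGracePeriod, shutdownGracePeriodCriticalPods,
# InhibitDelayMaxSec, Proxmox shutdown timeout for the VM.
check_shutdown_budget() {
  grace=$1 critical=$2 inhibit=$3 vm_timeout=$4
  if [ "$critical" -gt "$grace" ]; then
    echo "critical-pod period exceeds shutdownGracePeriod"; return 1
  fi
  if [ "$inhibit" -lt "$grace" ]; then
    echo "InhibitDelayMaxSec is shorter than shutdownGracePeriod"; return 1
  fi
  if [ "$vm_timeout" -lt "$grace" ]; then
    echo "VM shutdown timeout is shorter than shutdownGracePeriod"; return 1
  fi
  # Regular pods get whatever remains after the critical-pod window.
  echo "regular pods: $((grace - critical))s, critical pods: ${critical}s"
}
```

With the values used here, check_shutdown_budget 240 60 300 300 reports 180s for regular pods and 60s for critical pods.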
1.5 Check unattended-upgrades on k8s nodes
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades 2>/dev/null" || echo "not found (good)"
done
If active on any node, disable it:
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"
1.6 Report pre-flight status
Summarize: all checks passed, or list what was fixed. Ask user for final confirmation before rebooting.
Phase 2: Reboot
IMPORTANT: Get explicit user confirmation before proceeding.
# Reboot the Proxmox host
ssh root@192.168.1.127 'reboot'
Then wait for SSH to drop (confirms reboot started):
# Poll until SSH drops (host is rebooting)
for i in $(seq 1 30); do
ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
sleep 2
done
Then poll until Proxmox is back (timeout: 10 minutes):
# Poll every 30s until SSH is back
for i in $(seq 1 20); do
if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
echo "Proxmox host is back online"
break
fi
echo "Waiting for Proxmox host... (attempt $i/20)"
sleep 30
done
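Both loops above fall through silently when they run out of attempts, so a timeout looks the same as success to whatever runs next. As a sketch of one alternative, the hypothetical helper poll_until wraps the retry logic and returns a non-zero status on timeout.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping INTERVAL
# seconds between failed attempts. Returns 0 on first success, 1 on timeout.
poll_until() {
  attempts=$1 interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    # No need to sleep after the final attempt.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

For example, poll_until 20 30 ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 uptime || echo "Proxmox host did not return in time" waits roughly 10 minutes and reports the failure explicitly.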
Phase 3: Post-boot Validation (30 min timeout)
Run these checks in a loop, reporting progress. Timeout after 30 minutes total.
3.1 Check all VMs are running
ssh root@192.168.1.127 'qm list' | grep -v stopped
If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:
ssh root@192.168.1.127 'qm start <VMID>'
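To avoid scanning the list by eye, the critical VMIDs can be filtered out of the qm list output. This is a sketch that assumes the usual column layout (VMID, NAME, STATUS, ...). The helper name critical_not_running is hypothetical.

```shell
# Hypothetical helper: read `qm list` output on stdin and print the VMIDs of
# critical VMs (101, 9000, 200-204, 220) whose STATUS column is not "running".
critical_not_running() {
  awk '
    BEGIN {
      split("101 9000 200 201 202 203 204 220", ids, " ")
      for (k in ids) critical[ids[k]] = 1
    }
    ($1 in critical) && $3 != "running" { print $1 }
  '
}
```

Usage: ssh root@192.168.1.127 'qm list' | critical_not_running. Empty output means every critical VM that appears in the list is running.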
3.2 Check k8s API is reachable
KUBECONFIG=/Users/viktorbarzin/code/config kubectl cluster-info 2>/dev/null
3.3 Check all nodes are Ready
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
All 5 nodes (k8s-master, k8s-node1-4) should show Ready.
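When this check runs inside a validation loop, it helps to get a pass/fail result instead of reading the table. A minimal sketch follows; the helper name not_ready_nodes is hypothetical. It treats a cordoned node ("Ready,SchedulingDisabled") as Ready and "NotReady" as not Ready.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin,
# print the names of nodes whose STATUS is not Ready, and exit non-zero if
# any were found.
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1; bad = 1 } END { exit bad }'
}
```

Usage: KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes --no-headers | not_ready_nodes && echo "all nodes Ready".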
3.4 Check critical infrastructure pods
Check these in order (reflects boot dependency chain):
# Use a wrapper function: a VAR=value prefix stored in a variable is not parsed as an assignment when expanded
kc() { KUBECONFIG=/Users/viktorbarzin/code/config kubectl "$@"; }
# Tier 0: Core infrastructure
kc get pods -n metallb-system -l app=metallb
kc get pods -n kube-system -l k8s-app=kube-dns # CoreDNS
kc get pods -n technitium # DNS
# Tier 1: Storage
kc get pods -n democratic-csi # iSCSI-CSI
kc get pods -n nfs-csi # NFS-CSI
# Tier 2: Ingress + tunnel
kc get pods -n traefik
kc get pods -n cloudflared
# Tier 3: Security
kc get pods -n kyverno
# Tier 4: Data layer
kc get pods -n redis
kc get pods -n dbaas # MySQL + PostgreSQL
# Tier 5: Secrets + auth
kc get pods -n vault
kc get pods -n authentik
kc get pods -n external-secrets
3.5 Check Vault is unsealed
KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
Should return false. The auto-unseal sidecar polls every 10s — if still sealed after 5 minutes, investigate.
3.6 Check PVC status
# PVCs do not support a status.phase field selector, so filter on the STATUS column instead
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"' | head -20
No PVCs should be in Pending state (indicates CSI driver issues).
3.7 Check for stuck pods
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods (auto-unseal takes a minute).
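The known slow starters can be filtered out of the listing so that only unexpected pods remain. This sketch filters by namespace, which is an assumption: it also hides every other pod in dbaas (including PostgreSQL) and vault. The helper name drop_slow_starters is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and drop
# rows from the namespaces that host the known slow starters (dbaas, vault).
# The header row is kept because its first column is "NAMESPACE".
drop_slow_starters() {
  awk '$1 != "dbaas" && $1 != "vault"'
}
```

Append | drop_slow_starters to the command above once the remaining stuck pods are the only ones of interest.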
Phase 4: Final Validation
- If all checks pass: report "Cluster fully recovered in X minutes"
- If issues remain after 30 minutes: invoke the /cluster-health skill for diagnosis and remediation
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered
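The "X minutes" figure in the report can be derived from two epoch timestamps. This is a small sketch, assuming the start time was captured with date +%s just before the reboot command. The helper name elapsed_minutes is hypothetical.

```shell
# Hypothetical helper: whole minutes between two epoch timestamps (seconds).
# Arguments: start, end. Partial minutes are truncated.
elapsed_minutes() {
  echo $(( ($2 - $1) / 60 ))
}
```

Usage: set START=$(date +%s) before Phase 2, then print "Cluster fully recovered in $(elapsed_minutes "$START" "$(date +%s)") minutes" once validation passes.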
Single Node Reboot Procedure
Phase 1: Pre-flight
1.1 Identify target
Map the user's input to a node name and VMID:
- k8s-master / master / 200 → k8s-master (VMID 200)
- k8s-node1 / node1 / 201 → k8s-node1 (VMID 201, GPU node)
- k8s-node2 / node2 / 202 → k8s-node2 (VMID 202)
- k8s-node3 / node3 / 203 → k8s-node3 (VMID 203)
- k8s-node4 / node4 / 204 → k8s-node4 (VMID 204)
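The mapping can be expressed as a small lookup so that later commands use a validated node name and VMID. A minimal sketch follows; the function name resolve_node is hypothetical.

```shell
# Hypothetical helper: map user input to "<node-name> <VMID>".
# Unknown input prints an error to stderr and returns 1.
resolve_node() {
  case "$1" in
    k8s-master|master|200) echo "k8s-master 200" ;;
    k8s-node1|node1|201)   echo "k8s-node1 201" ;;
    k8s-node2|node2|202)   echo "k8s-node2 202" ;;
    k8s-node3|node3|203)   echo "k8s-node3 203" ;;
    k8s-node4|node4|204)   echo "k8s-node4 204" ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}
```

For example, resolve_node node1 prints k8s-node1 201, and an unrecognized name fails before anything is drained.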
1.2 Check node status
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name>
1.3 Check what's running on the node
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -30
Warn the user if the node runs:
- GPU workloads (node1) — Immich ML, Frigate, Ollama
- Database pods — MySQL InnoDB, PostgreSQL CNPG (long recovery)
- Vault pods — may need re-unseal
1.4 Check PDBs won't block drain
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A
Verify that draining this node won't violate any PDB (e.g., Traefik minAvailable=2 — if 2 of 3 Traefik pods are on other nodes, drain is safe).
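The Traefik example reduces to a single comparison: the replicas left on other nodes must be at least minAvailable. The sketch below encodes that rule in a hypothetical helper named drain_is_safe. It is a conservative check, since it ignores replacement pods that may become ready on other nodes during the drain.

```shell
# Hypothetical helper: conservative PDB check before a drain.
# Arguments: total healthy replicas, replicas on the target node, minAvailable.
# Succeeds when the replicas on other nodes alone satisfy minAvailable.
drain_is_safe() {
  total=$1 on_node=$2 min_available=$3
  [ $((total - on_node)) -ge "$min_available" ]
}
```

With 3 Traefik replicas and minAvailable=2, drain_is_safe 3 1 2 succeeds, while drain_is_safe 3 2 2 fails because only one replica would remain elsewhere.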
Phase 2: Drain + Reboot
# Drain the node (300s timeout for graceful eviction)
KUBECONFIG=/Users/viktorbarzin/code/config kubectl drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=300s
# Shutdown the VM gracefully (300s timeout for kubelet + systemd shutdown)
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"
# Wait for VM to stop
for i in $(seq 1 60); do
STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
if echo "$STATUS" | grep -q stopped; then
echo "VM stopped"
break
fi
echo "Waiting for VM to stop... ($i/60)"
sleep 5
done
# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"
Phase 3: Recovery
# Wait for node to become Ready (timeout: 5 min)
# A drained node reports "Ready,SchedulingDisabled", so match the Ready prefix instead of the exact string
for i in $(seq 1 30); do
STATUS=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
if echo "$STATUS" | grep -q '^Ready'; then
echo "Node is Ready"
break
fi
echo "Waiting for node to become Ready... ($i/30)"
sleep 10
done
# Uncordon the node
KUBECONFIG=/Users/viktorbarzin/code/config kubectl uncordon <node-name>
# Verify DaemonSet pods are scheduled
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -20
# Report status
echo "Node <node-name> rebooted and uncordoned successfully"
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| VM won't start | Proxmox host disk full | ssh root@192.168.1.127 'df -h' — check thin pool |
| Node stays NotReady | kubelet/containerd not starting | qm guest exec <VMID> -- systemctl status kubelet |
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for TrueNAS ZFS pool import, or qm guest exec 9000 -- zpool status |
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar logs: kubectl logs -n vault vault-0 -c vault-unseal |
| MySQL slow to recover | InnoDB init containers (~20 min each) | Normal — monitor progress with kubectl logs -n dbaas |
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with /cluster-health |
| containerd won't start | Disk full or corrupted images | qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50 |