add reboot-server skill for safe Proxmox host and k8s node reboots

---
name: reboot-server
description: Safely reboot the Proxmox host or a single k8s node with graceful shutdown
and post-boot validation. Use when the user asks to "reboot the server", "reboot
proxmox", "reboot node", "restart the host", "restart k8s node", "power cycle".
Supports full host reboot and single node reboot modes.
---
## Overview
This skill safely reboots either the entire Proxmox host (Dell R730) or a single Kubernetes node VM. It ensures graceful shutdown of all services and validates full recovery afterwards.
## Connection Details
- **Proxmox host**: `ssh root@192.168.1.127`
- **kubectl**: `KUBECONFIG=/Users/viktorbarzin/code/config kubectl`
- **VM commands**: `ssh root@192.168.1.127 'qm <command>'`
## VM Inventory & Boot Order
| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|-------|------|------|---------------|------------------|-------|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
| 2 | 9000 | TrueNAS | 60s | 300s | NFS/iSCSI storage — needs network from pfSense |
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
| 5 | 201 | k8s-node1 | 45s | 300s | GPU node (Tesla T4) |
| 5 | 202 | k8s-node2 | 45s | 300s | Worker |
| 5 | 203 | k8s-node3 | 45s | 300s | Worker |
| 5 | 204 | k8s-node4 | 45s | 300s | Worker |
| 6 | 103 | home-assistant | 0s | 120s | HA Sofia — no ordering dependency |
| 6 | 300 | Windows10 | 0s | 120s | Windows VM — no ordering dependency |
Shutdown order is the **reverse** of boot order (Proxmox handles this automatically).
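To sanity-check what is actually configured, you can list each VM's `startup` line sorted by boot order (a sketch; it reuses the VMID inventory above, and unconfigured VMs sort first):
```bash
# Sketch: print each VM's startup config, sorted by its order= value
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do
  printf "%-5s %s\n" "$VMID" "$(qm config $VMID 2>/dev/null | grep ^startup: || echo "startup: none")"
done' | sort -t= -k2 -n
```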
## Node-to-VMID Mapping
```
k8s-master = 200
k8s-node1 = 201 (GPU node)
k8s-node2 = 202
k8s-node3 = 203
k8s-node4 = 204
```
## Mode Selection
Ask the user which mode they want if unclear:
1. **Full host reboot** — reboot the entire Proxmox R730 host
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
---
## Full Host Reboot Procedure
### Phase 1: Pre-flight Checks
Run ALL of these checks before proceeding. Fix issues inline if possible.
#### 1.1 Verify Proxmox SSH access
```bash
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
```
#### 1.2 Verify cluster health
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | head -20
```
#### 1.3 Check and configure VM boot ordering
For each VM, check whether boot ordering is set. If any VM reports `no startup config`, configure it:
```bash
# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
```
If not configured, apply boot order (this is idempotent):
```bash
ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
qm set 9000 --startup order=2,up=60,down=300
qm set 102 --startup order=3,up=60,down=120
qm set 220 --startup order=3,up=60,down=120
qm set 200 --startup order=4,up=45,down=300
qm set 201 --startup order=5,up=45,down=300
qm set 202 --startup order=5,up=45,down=300
qm set 203 --startup order=5,up=45,down=300
qm set 204 --startup order=5,up=45,down=300
qm set 103 --startup order=6,down=120
qm set 300 --startup order=6,down=120
'
```
#### 1.4 Check kubelet shutdownGracePeriod on all k8s nodes
```bash
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
ssh root@192.168.1.127 "qm guest exec $VMID -- grep -c shutdownGracePeriod /var/lib/kubelet/config.yaml 2>/dev/null" || echo "NOT SET"
done
```
If any node is missing it, patch that node:
```bash
VMID=<target>
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
# Remove any existing shutdown config to avoid duplicates
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml
# Add graceful shutdown config
cat >> /var/lib/kubelet/config.yaml <<EOF
shutdownGracePeriod: \"240s\"
shutdownGracePeriodCriticalPods: \"60s\"
EOF
# Create systemd logind override for InhibitDelayMaxSec
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
[Login]
InhibitDelayMaxSec=300
EOF
systemctl restart systemd-logind
# Create kubelet service override for TimeoutStopSec
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
[Service]
TimeoutStopSec=300s
EOF
systemctl daemon-reload
systemctl restart kubelet
'"
```
Repeat for each node VMID (200-204) that needs patching.
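If several nodes need it, a loop can apply the patch in one pass. This is only a sketch: it assumes you have saved the inline script from 1.4 as `/root/patch-kubelet-shutdown.sh` on the Proxmox host (a hypothetical path).
```bash
# Sketch: apply the kubelet shutdown patch (section 1.4) to every node VM
# Assumes /root/patch-kubelet-shutdown.sh on the Proxmox host holds the script body
for VMID in 200 201 202 203 204; do
  echo "=== patching VMID $VMID ==="
  ssh root@192.168.1.127 'qm guest exec '"$VMID"' -- bash -c "$(cat /root/patch-kubelet-shutdown.sh)"'
done
```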
#### 1.5 Check unattended-upgrades on k8s nodes
```bash
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades 2>/dev/null" || echo "not found (good)"
done
```
If it is active on a node, set `VMID` to that node and disable it:
```bash
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"
```
#### 1.6 Report pre-flight status
Summarize: all checks passed, or list what was fixed. Ask user for final confirmation before rebooting.
### Phase 2: Reboot
**IMPORTANT**: Get explicit user confirmation before proceeding.
```bash
# Reboot the Proxmox host
ssh root@192.168.1.127 'reboot'
```
Then wait for SSH to drop (confirms reboot started):
```bash
# Poll until SSH drops (host is rebooting)
for i in $(seq 1 30); do
ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
sleep 2
done
```
Then poll until Proxmox is back (timeout: 10 minutes):
```bash
# Poll every 30s until SSH is back
for i in $(seq 1 20); do
if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
echo "Proxmox host is back online"
break
fi
echo "Waiting for Proxmox host... (attempt $i/20)"
sleep 30
done
```
### Phase 3: Post-boot Validation (30 min timeout)
Run these checks in a loop, reporting progress. Timeout after 30 minutes total.
#### 3.1 Check all VMs are running
```bash
ssh root@192.168.1.127 'qm list' | grep -v stopped
```
If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:
```bash
ssh root@192.168.1.127 'qm start <VMID>'
```
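A small loop can automate this across all critical VMs (a sketch; it keys off `qm status` output):
```bash
# Sketch: start any critical VM that is not running
ssh root@192.168.1.127 'for VMID in 101 9000 200 201 202 203 204 220; do
  qm status $VMID | grep -q running || { echo "starting VMID $VMID"; qm start $VMID; }
done'
```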
#### 3.2 Check k8s API is reachable
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl cluster-info 2>/dev/null
```
#### 3.3 Check all nodes are Ready
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
```
All 5 nodes (k8s-master, k8s-node1-4) should show `Ready`.
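A quick count confirms this at a glance (a sketch; expects the output `5`):
```bash
# Sketch: count nodes whose STATUS column is exactly "Ready"
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l
```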
#### 3.4 Check critical infrastructure pods
Check these in order (reflects boot dependency chain):
```bash
KC="KUBECONFIG=/Users/viktorbarzin/code/config kubectl"
# Tier 0: Core infrastructure
$KC get pods -n metallb-system -l app=metallb
$KC get pods -n kube-system -l k8s-app=kube-dns # CoreDNS
$KC get pods -n technitium # DNS
# Tier 1: Storage
$KC get pods -n democratic-csi # iSCSI-CSI
$KC get pods -n nfs-csi # NFS-CSI
# Tier 2: Ingress + tunnel
$KC get pods -n traefik
$KC get pods -n cloudflared
# Tier 3: Security
$KC get pods -n kyverno
# Tier 4: Data layer
$KC get pods -n redis
$KC get pods -n dbaas # MySQL + PostgreSQL
# Tier 5: Secrets + auth
$KC get pods -n vault
$KC get pods -n authentik
$KC get pods -n external-secrets
```
#### 3.5 Check Vault is unsealed
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
```
Should return `false`. The auto-unseal sidecar polls every 10s — if still sealed after 5 minutes, investigate.
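Rather than checking once, you can poll until the unseal completes (a sketch; gives up after 5 minutes to match the guidance above):
```bash
# Sketch: poll Vault's seal status every 10s, up to 5 minutes
for i in $(seq 1 30); do
  SEALED=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq -r '.sealed')
  if [ "$SEALED" = "false" ]; then
    echo "Vault is unsealed"
    break
  fi
  echo "Vault still sealed... ($i/30)"
  sleep 10
done
```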
#### 3.6 Check PVC status
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pvc -A --field-selector 'status.phase!=Bound' 2>/dev/null | head -20
```
The command should return nothing; a PVC stuck in Pending indicates CSI driver issues.
#### 3.7 Check for stuck pods
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
```
Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods (auto-unseal takes a minute).
### Phase 4: Final Validation
- If all checks pass: report "Cluster fully recovered in X minutes"
- If issues remain after 30 minutes: invoke the `/cluster-health` skill for diagnosis and remediation
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered
---
## Single Node Reboot Procedure
### Phase 1: Pre-flight
#### 1.1 Identify target
Map the user's input to a node name and VMID:
- `k8s-master` / `master` / `200` → k8s-master (VMID 200)
- `k8s-node1` / `node1` / `201` → k8s-node1 (VMID 201, GPU node)
- `k8s-node2` / `node2` / `202` → k8s-node2 (VMID 202)
- `k8s-node3` / `node3` / `203` → k8s-node3 (VMID 203)
- `k8s-node4` / `node4` / `204` → k8s-node4 (VMID 204)
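If you script this step, a small lookup helper (hypothetical; it just mirrors the mapping above) avoids typos:
```bash
# Hypothetical helper: resolve a user-supplied alias to its VMID
vmid_for() {
  case "$1" in
    k8s-master|master|200) echo 200 ;;
    k8s-node1|node1|201)   echo 201 ;;  # GPU node
    k8s-node2|node2|202)   echo 202 ;;
    k8s-node3|node3|203)   echo 203 ;;
    k8s-node4|node4|204)   echo 204 ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}
```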
#### 1.2 Check node status
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name>
```
#### 1.3 Check what's running on the node
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -30
```
Warn the user if the node runs:
- GPU workloads (node1) — Immich ML, Frigate, Ollama
- Database pods — MySQL InnoDB, PostgreSQL CNPG (long recovery)
- Vault pods — may need re-unseal
#### 1.4 Check PDBs won't block drain
```bash
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A
```
Verify that draining this node won't violate any PDB (e.g., Traefik minAvailable=2 — if 2 of 3 Traefik pods are on other nodes, drain is safe).
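To flag blockers mechanically, you can list PDBs that currently allow zero disruptions (a sketch using `jq`; any hit deserves a closer look before draining):
```bash
# Sketch: PDBs with no disruptions currently allowed (candidates to block a drain)
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
```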
### Phase 2: Drain + Reboot
```bash
# Drain the node (300s timeout for graceful eviction)
KUBECONFIG=/Users/viktorbarzin/code/config kubectl drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--timeout=300s
# Shutdown the VM gracefully (300s timeout for kubelet + systemd shutdown)
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"
# Wait for VM to stop
for i in $(seq 1 60); do
STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
if echo "$STATUS" | grep -q stopped; then
echo "VM stopped"
break
fi
echo "Waiting for VM to stop... ($i/60)"
sleep 5
done
# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"
```
### Phase 3: Recovery
```bash
# Wait for node to become Ready (timeout: 5 min)
for i in $(seq 1 30); do
# Node is still cordoned here, so the STATUS column reads "Ready,SchedulingDisabled"
STATUS=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
if [[ "$STATUS" == Ready* ]]; then
echo "Node is Ready"
break
fi
echo "Waiting for node to become Ready... ($i/30)"
sleep 10
done
# Uncordon the node
KUBECONFIG=/Users/viktorbarzin/code/config kubectl uncordon <node-name>
# Verify DaemonSet pods are scheduled
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -20
# Report status
echo "Node <node-name> rebooted and uncordoned successfully"
```
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| VM won't start | Proxmox host disk full | `ssh root@192.168.1.127 'df -h'` — check thin pool |
| Node stays NotReady | kubelet/containerd not starting | `qm guest exec <VMID> -- systemctl status kubelet` |
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for TrueNAS ZFS pool import; check with `qm guest exec 9000 -- zpool status` |
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar logs: `kubectl logs -n vault vault-0 -c vault-unseal` |
| MySQL slow to recover | InnoDB init containers (~20 min each) | Normal — monitor progress with `kubectl logs -n dbaas <pod>` |
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with `/cluster-health` |
| containerd won't start | Disk full or corrupted images | `qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50` |