dot_files/dot_claude/skills/reboot-server.md at c3f9a1f346ba3f10714fdcb44f594e89a9c073df

viktor/dot_files

Viktor Barzin c3f9a1f346

add reboot-server skill for safe Proxmox host and k8s node reboots

2026-04-02 11:46:27 +03:00


name: reboot-server
description: Safely reboot the Proxmox host or a single k8s node with graceful shutdown and post-boot validation. Use when the user asks to "reboot the server", "reboot proxmox", "reboot node", "restart the host", "restart k8s node", "power cycle". Supports full host reboot and single node reboot modes.

Overview

This skill safely reboots either the entire Proxmox host (Dell R730) or a single Kubernetes node VM. It ensures graceful shutdown of all services and validates full recovery afterwards.

Connection Details

Proxmox host: ssh root@192.168.1.127
kubectl: KUBECONFIG=/Users/viktorbarzin/code/config kubectl
VM commands: ssh root@192.168.1.127 'qm <command>'

VM Inventory & Boot Order

Order	VMID	Name	Startup Delay	Shutdown Timeout	Notes
1	101	pfSense	0s	120s	Gateway/DHCP/DNS — must boot first
2	9000	TrueNAS	60s	300s	NFS/iSCSI storage — needs network from pfSense
3	220	docker-registry	60s	120s	Pull-through cache (fallback: upstream)
3	102	devvm	60s	120s	Dev VM — needs pfSense for network
4	200	k8s-master	45s	300s	Control plane — must be up before workers
5	201	k8s-node1	45s	300s	GPU node (Tesla T4)
5	202	k8s-node2	45s	300s	Worker
5	203	k8s-node3	45s	300s	Worker
5	204	k8s-node4	45s	300s	Worker
6	103	home-assistant	0s	120s	HA Sofia — no ordering dependency
6	300	Windows10	0s	120s	Windows VM — no ordering dependency

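Shutdown order is the reverse of boot order (Proxmox handles this automatically). Proxmox applies each VM's `up` delay after starting that VM and before starting the next one, so the Startup Delay column is the wait that follows a VM, not one that precedes it.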

Node-to-VMID Mapping

k8s-master = 200
k8s-node1 = 201 (GPU node)
k8s-node2 = 202
k8s-node3 = 203
k8s-node4 = 204

Mode Selection

Ask the user which mode they want if unclear:

Full host reboot — reboot the entire Proxmox R730 host
Single node reboot — drain, reboot one k8s node VM, uncordon

Full Host Reboot Procedure

Phase 1: Pre-flight Checks

Run ALL of these checks before proceeding. Fix issues inline if possible.

1.1 Verify Proxmox SSH access

ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'

1.2 Verify cluster health

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | head -20

1.3 Check and configure VM boot ordering

For each VM, check whether boot ordering is set. If any VM reports "no startup config", configure it:

# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'

If not configured, apply boot order (this is idempotent):

ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
qm set 9000 --startup order=2,up=60,down=300
qm set 102 --startup order=3,up=60,down=120
qm set 220 --startup order=3,up=60,down=120
qm set 200 --startup order=4,up=45,down=300
qm set 201 --startup order=5,up=45,down=300
qm set 202 --startup order=5,up=45,down=300
qm set 203 --startup order=5,up=45,down=300
qm set 204 --startup order=5,up=45,down=300
qm set 103 --startup order=6,down=120
qm set 300 --startup order=6,down=120
'

1.4 Check kubelet shutdownGracePeriod on all k8s nodes

for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  # Print the values rather than a match count: a value of 0s means graceful shutdown is disabled
  ssh root@192.168.1.127 "qm guest exec $VMID -- grep shutdownGracePeriod /var/lib/kubelet/config.yaml 2>/dev/null" || echo "NOT SET"
done

If it is not set (or is set to 0s) on any node, patch it:

VMID=<target>
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
# Remove any existing shutdown config to avoid duplicates
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml

# Add graceful shutdown config
cat >> /var/lib/kubelet/config.yaml <<EOF
shutdownGracePeriod: \"240s\"
shutdownGracePeriodCriticalPods: \"60s\"
EOF

# Create systemd logind override for InhibitDelayMaxSec
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
[Login]
InhibitDelayMaxSec=300
EOF
systemctl restart systemd-logind

# Create kubelet service override for TimeoutStopSec
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
[Service]
TimeoutStopSec=300s
EOF
systemctl daemon-reload
systemctl restart kubelet
'"

Repeat for each node VMID (200-204) that needs patching.

1.5 Check unattended-upgrades on k8s nodes

for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades 2>/dev/null" || echo "not found (good)"
done

If active on any node, disable it:

ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"

1.6 Report pre-flight status

Summarize: all checks passed, or list what was fixed. Ask the user for final confirmation before rebooting.
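
One way to keep track of the pre-flight results for this summary is a pair of small helpers. This is a minimal sketch: the names `record` and `summary` and the status labels are hypothetical and not part of the skill.

```shell
# Hypothetical pre-flight bookkeeping; names and labels are illustrative only.
PREFLIGHT_FAILED=0
PREFLIGHT_REPORT=""

# record <status> <description>
# status is one of: ok, fixed, fail
record() {
  PREFLIGHT_REPORT="${PREFLIGHT_REPORT}[$1] $2
"
  if [ "$1" = "fail" ]; then
    PREFLIGHT_FAILED=1
  fi
}

# Print one line per recorded check; return non-zero if any check failed.
summary() {
  printf '%s' "$PREFLIGHT_REPORT"
  return "$PREFLIGHT_FAILED"
}
```

Each check in 1.1 through 1.5 would call `record` once, and the reboot would only be offered to the user when `summary` returns zero.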

Phase 2: Reboot

IMPORTANT: Get explicit user confirmation before proceeding.

# Reboot the Proxmox host
ssh root@192.168.1.127 'reboot'

Then wait for SSH to drop (confirms reboot started):

# Poll until SSH drops (host is rebooting)
for i in $(seq 1 30); do
  ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
  sleep 2
done

Then poll until Proxmox is back (timeout: 10 minutes):

# Poll every 30s until SSH is back
for i in $(seq 1 20); do
  if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
    echo "Proxmox host is back online"
    break
  fi
  echo "Waiting for Proxmox host... (attempt $i/20)"
  sleep 30
done

Phase 3: Post-boot Validation (30 min timeout)

Run these checks in a loop, reporting progress. Timeout after 30 minutes total.
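
The "loop with an overall timeout" pattern can be sketched as a generic retry helper. The function name `wait_until` and its argument order are assumptions made for illustration; the skill does not define such a helper.

```shell
# wait_until <timeout_seconds> <interval_seconds> <command...>
# Re-runs the command until it succeeds or the timeout elapses.
# Hypothetical helper, not part of the original skill.
wait_until() {
  local timeout_s=$1 interval_s=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}
```

For example, `wait_until 1800 30 kubectl --kubeconfig /Users/viktorbarzin/code/config cluster-info` would retry the API check every 30 seconds for up to 30 minutes.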

3.1 Check all VMs are running

ssh root@192.168.1.127 'qm list' | grep -v stopped

If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:

ssh root@192.168.1.127 'qm start <VMID>'
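
To find which critical VMs need a manual start, the `qm list` output can be filtered. A minimal sketch, assuming the usual column layout (VMID, NAME, STATUS, ...); the function name `critical_not_running` is hypothetical.

```shell
# Reads `qm list` output on stdin and prints the critical VMIDs
# whose STATUS column is anything other than "running".
# Assumes columns: VMID NAME STATUS ... (header line is ignored
# because "VMID" is not in the critical list).
critical_not_running() {
  awk -v critical="101 9000 200 201 202 203 204 220" '
    BEGIN {
      n = split(critical, ids, " ")
      for (i = 1; i <= n; i++) want[ids[i]] = 1
    }
    ($1 in want) && $3 != "running" { print $1 }
  '
}
```

Usage could look like `ssh root@192.168.1.127 'qm list' | critical_not_running`. A critical VM that is missing from the list entirely is not reported by this sketch.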

3.2 Check k8s API is reachable

KUBECONFIG=/Users/viktorbarzin/code/config kubectl cluster-info 2>/dev/null

3.3 Check all nodes are Ready

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes

All 5 nodes (k8s-master, k8s-node1-4) should show Ready.
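
The "all 5 nodes Ready" condition can be turned into a scriptable check. This is a sketch under the assumption that the second column of `kubectl get nodes --no-headers` is the status; the function name `count_ready` is hypothetical.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints
# the number of nodes whose status starts with "Ready".
# A cordoned node shows "Ready,SchedulingDisabled", so only the first
# comma-separated field of the STATUS column is compared.
count_ready() {
  awk '{ split($2, s, ","); if (s[1] == "Ready") n++ } END { print n + 0 }'
}
```

Piping the command from 3.3 with `--no-headers` into `count_ready` and comparing the result to 5 gives a pass/fail condition for the validation loop.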

3.4 Check critical infrastructure pods

Check these in order (reflects boot dependency chain):

# Wrap kubectl in a function: a variable holding "KUBECONFIG=... kubectl" is not parsed as an assignment when expanded
kc() { KUBECONFIG=/Users/viktorbarzin/code/config kubectl "$@"; }

# Tier 0: Core infrastructure
kc get pods -n metallb-system -l app=metallb
kc get pods -n kube-system -l k8s-app=kube-dns  # CoreDNS
kc get pods -n technitium  # DNS

# Tier 1: Storage
kc get pods -n democratic-csi  # iSCSI-CSI
kc get pods -n nfs-csi  # NFS-CSI

# Tier 2: Ingress + tunnel
kc get pods -n traefik
kc get pods -n cloudflared

# Tier 3: Security
kc get pods -n kyverno

# Tier 4: Data layer
kc get pods -n redis
kc get pods -n dbaas  # MySQL + PostgreSQL

# Tier 5: Secrets + auth
kc get pods -n vault
kc get pods -n authentik
kc get pods -n external-secrets

3.5 Check Vault is unsealed

KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'

Should return false. The auto-unseal sidecar polls every 10s — if still sealed after 5 minutes, investigate.

3.6 Check PVC status

# PVCs do not support a status.phase field selector, so filter on the STATUS column instead
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"' | head -20

No PVCs should be in Pending state (indicates CSI driver issues).

3.7 Check for stuck pods

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30

Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods (auto-unseal takes a minute).

Phase 4: Final Validation

If all checks pass: report "Cluster fully recovered in X minutes"
If issues remain after 30 minutes: invoke the /cluster-health skill for diagnosis and remediation
Store a memory summarizing the reboot: timestamp, duration, any issues encountered
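
The "X minutes" figure in the recovery report can be computed from two epoch timestamps. A minimal sketch; the function name `elapsed_minutes` and the `START` variable are hypothetical, and rounding up is a choice made here, not something the skill specifies.

```shell
# elapsed_minutes <start_epoch> <end_epoch>
# Prints the elapsed time in whole minutes, rounded up.
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Intended use (START captured with `date +%s` just before the reboot):
#   echo "Cluster fully recovered in $(elapsed_minutes "$START" "$(date +%s)") minutes"
```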

Single Node Reboot Procedure

Phase 1: Pre-flight

1.1 Identify target

Map the user's input to a node name and VMID:

k8s-master / master / 200 → k8s-master (VMID 200)
k8s-node1 / node1 / 201 → k8s-node1 (VMID 201, GPU node)
k8s-node2 / node2 / 202 → k8s-node2 (VMID 202)
k8s-node3 / node3 / 203 → k8s-node3 (VMID 203)
k8s-node4 / node4 / 204 → k8s-node4 (VMID 204)
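
The mapping above can be expressed as a small lookup function. This is a sketch: the name `resolve_node` and the "name VMID" output format are assumptions made for illustration.

```shell
# resolve_node <user-input>
# Prints "<node-name> <VMID>" for a recognized alias, or fails with
# a message on stderr. Hypothetical helper mirroring the mapping above.
resolve_node() {
  case "$1" in
    k8s-master|master|200) echo "k8s-master 200" ;;
    k8s-node1|node1|201)   echo "k8s-node1 201" ;;  # GPU node
    k8s-node2|node2|202)   echo "k8s-node2 202" ;;
    k8s-node3|node3|203)   echo "k8s-node3 203" ;;
    k8s-node4|node4|204)   echo "k8s-node4 204" ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}
```

Failing on unrecognized input, rather than guessing, keeps a typo from draining the wrong node.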

1.2 Check node status

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name>

1.3 Check what's running on the node

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -30

Warn the user if the node runs:

GPU workloads (node1) — Immich ML, Frigate, Ollama
Database pods — MySQL InnoDB, PostgreSQL CNPG (long recovery)
Vault pods — may need re-unseal

1.4 Check PDBs won't block drain

KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A

Verify that draining this node won't violate any PDB (e.g., Traefik minAvailable=2 — if 2 of 3 Traefik pods are on other nodes, drain is safe).
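
A quick first pass over the PDB list is to flag any budget that currently allows zero disruptions. A minimal sketch, assuming the `kubectl get pdb -A --no-headers` columns are NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE; the function name `blocking_pdbs` is hypothetical.

```shell
# Reads `kubectl get pdb -A --no-headers` output on stdin and prints
# "<namespace>/<name>" for every PDB whose ALLOWED DISRUPTIONS
# (fifth column) is 0.
blocking_pdbs() {
  awk '$5 == 0 { print $1 "/" $2 }'
}
```

A PDB flagged here only blocks the drain if one of its pods actually runs on the target node, so the output is a list of candidates to inspect rather than a verdict.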

Phase 2: Drain + Reboot

# Drain the node (300s timeout for graceful eviction)
KUBECONFIG=/Users/viktorbarzin/code/config kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# Shutdown the VM gracefully (300s timeout for kubelet + systemd shutdown)
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"

# Wait for VM to stop
for i in $(seq 1 60); do
  STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
  if echo "$STATUS" | grep -q stopped; then
    echo "VM stopped"
    break
  fi
  echo "Waiting for VM to stop... ($i/60)"
  sleep 5
done

# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"

Phase 3: Recovery

# Wait for node to become Ready (timeout: 5 min)
# A drained node reports "Ready,SchedulingDisabled" until it is uncordoned, so match on the "Ready" prefix
for i in $(seq 1 30); do
  STATUS=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
  case "$STATUS" in
    Ready*)
      echo "Node is Ready"
      break
      ;;
  esac
  echo "Waiting for node to become Ready... ($i/30)"
  sleep 10
done

# Uncordon the node
KUBECONFIG=/Users/viktorbarzin/code/config kubectl uncordon <node-name>

# Verify DaemonSet pods are scheduled
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -20

# Report status
echo "Node <node-name> rebooted and uncordoned successfully"

Troubleshooting

Symptom	Cause	Fix
VM won't start	Proxmox host disk full	`ssh root@192.168.1.127 'df -h'` — check thin pool
Node stays NotReady	kubelet/containerd not starting	`qm guest exec <VMID> -- systemctl status kubelet`
NFS PVCs stuck Pending	TrueNAS not fully booted	Wait for TrueNAS ZFS pool import, or `qm guest exec 9000 -- zpool status`
Vault stays sealed	Auto-unseal sidecar not running	Check sidecar logs: `kubectl logs -n vault vault-0 -c vault-unseal`
MySQL slow to recover	InnoDB init containers (~20 min each)	Normal — monitor progress with `kubectl logs -n dbaas`
Pods CrashLoopBackOff	Dependencies not ready yet	Usually self-heals — wait 5 min, then check with `/cluster-health`
containerd won't start	Disk full or corrupted images	`qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50`
