update reboot-server skill: add rolling reboot, MySQL auto-recovery, proxmox-lvm validation

Viktor Barzin 2026-04-05 20:40:36 +03:00
parent 5e4112827d
commit f11cf2870a

---
name: reboot-server
description: Safely reboot the Proxmox host, a single k8s node, or perform a rolling
reboot of all nodes with graceful shutdown, MySQL auto-recovery, and post-boot validation.
Use when the user asks to "reboot the server", "reboot proxmox", "reboot node",
"restart the host", "rolling reboot", "reboot all nodes", "power cycle".
---
## Overview
This skill safely reboots infrastructure with three modes:
1. **Full host reboot** — reboot the entire Proxmox R730 host (all VMs)
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
3. **Rolling reboot** — cycle all 5 k8s nodes sequentially (workers first, master last)
## Connection Details
- **kubectl**: `KUBECONFIG=/Users/viktorbarzin/code/config kubectl`
- **VM commands**: `ssh root@192.168.1.127 'qm <command>'`
Shorthand used below: `KC="env KUBECONFIG=/Users/viktorbarzin/code/config kubectl"` (the `env` prefix makes `$KC ...` runnable in plain bash, where an expanded `KUBECONFIG=...` word would otherwise be treated as a command name, not an assignment).
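For example:
```bash
KC="env KUBECONFIG=/Users/viktorbarzin/code/config kubectl"
$KC get nodes   # expands to: env KUBECONFIG=... kubectl get nodes
```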
## VM Inventory & Boot Order
| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|-------|------|------|---------------|------------------|-------|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
| 2 | 9000 | TrueNAS | 60s | 300s | NFS storage — needs network from pfSense |
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
Shutdown order is the **reverse** of boot order (Proxmox handles this automatically).
```
k8s-master = 200
k8s-node1 = 201 (GPU node — Immich ML, Frigate, Ollama)
k8s-node2 = 202
k8s-node3 = 203
k8s-node4 = 204
```

## Reboot Modes
Ask the user which mode they want if unclear:
1. **Full host reboot** — reboot the entire Proxmox R730 host
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
3. **Rolling reboot** — cycle all 5 k8s nodes sequentially (for kernel updates, maintenance)
---
## Shared Pre-flight Checks
These checks apply to ALL modes. Run them before proceeding, fixing issues inline where possible.
### PF-1: Verify Proxmox SSH access
```bash
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
```
### PF-2: Verify cluster health
```bash
$KC get nodes
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -20
```
### PF-3: Check and configure VM boot ordering
```bash
# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
```
If not configured, apply boot order (idempotent):
```bash
ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
# ... (matching qm set lines for the remaining VMs follow the boot-order table above)
qm set 300 --startup order=6,down=120
'
```
### PF-4: Check kubelet shutdownGracePeriod on all k8s nodes
```bash
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- grep shutdownGracePeriod /var/lib/kubelet/config.yaml"
done
```
If not set on any node, patch it:
```bash
VMID=<target>
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
# Remove any existing shutdown config to avoid duplicates
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml
# Add graceful shutdown config
cat >> /var/lib/kubelet/config.yaml <<EOF
shutdownGracePeriod: \"240s\"
shutdownGracePeriodCriticalPods: \"60s\"
EOF
# Create systemd logind override for InhibitDelayMaxSec
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
[Login]
InhibitDelayMaxSec=300
EOF
systemctl restart systemd-logind
# Create kubelet service override for TimeoutStopSec
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
[Service]
TimeoutStopSec=300
EOF
systemctl daemon-reload
systemctl restart kubelet
'"
```
Repeat for each node VMID (200-204) that needs patching.
### PF-5: Check unattended-upgrades on k8s nodes
```bash
for VMID in 200 201 202 203 204; do
echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades"
done
```
If active on any node, disable it:
```bash
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"
```
### PF-6: Report pre-flight status
Summarize: all checks passed, or list what was fixed. Ask user for **final confirmation** before proceeding.
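A minimal summary sketch (assumes PF-1 through PF-5 ran above; the VMID list matches PF-3):
```bash
echo "--- Pre-flight summary ---"
ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null \
  && echo "PF-1 Proxmox SSH: OK" || echo "PF-1 Proxmox SSH: FAIL"
NOT_READY=$($KC get nodes --no-headers 2>/dev/null | awk '$2 != "Ready"' | wc -l)
[ "$NOT_READY" -eq 0 ] && echo "PF-2 nodes: OK" || echo "PF-2 nodes: $NOT_READY not Ready"
MISSING=$(ssh root@192.168.1.127 'for V in 101 102 103 200 201 202 203 204 220 300 9000; do
  qm config $V 2>/dev/null | grep -q ^startup || echo -n "$V "; done')
[ -z "$MISSING" ] && echo "PF-3 boot order: OK" || echo "PF-3 boot order missing on: $MISSING"
```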
---
## Full Host Reboot Procedure
### Phase 1: Pre-flight
Run [Shared Pre-flight Checks](#shared-pre-flight-checks) above.
### Phase 2: Reboot
**Get explicit user confirmation before proceeding.**
```bash
# Reboot the Proxmox host
ssh root@192.168.1.127 'reboot'
```
Wait for SSH to drop (confirms the reboot started):
```bash
# Poll until SSH drops (host is rebooting)
for i in $(seq 1 30); do
ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
sleep 2
done
```
Poll until Proxmox is back (timeout: 10 minutes):
```bash
# Poll every 30s until SSH is back
for i in $(seq 1 20); do
if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
echo "Proxmox host is back online"
    break
  fi
  sleep 30
done
```
### Phase 3: Post-boot Validation (30 min timeout)
Run these checks in a loop, reporting progress.
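A driver sketch with a hard 30-minute deadline (the individual checks 3.1-3.10 below remain the source of truth when something looks off):
```bash
DEADLINE=$(( $(date +%s) + 1800 ))  # 30 minutes
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  NOT_READY=$($KC get nodes --no-headers 2>/dev/null | awk '$2 != "Ready"' | wc -l)
  BAD_PODS=$($KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' \
    --no-headers 2>/dev/null | grep -v Completed | wc -l)
  echo "$(date +%T) nodes not Ready: $NOT_READY, non-running pods: $BAD_PODS"
  [ "$NOT_READY" -eq 0 ] && [ "$BAD_PODS" -eq 0 ] && { echo "Cluster looks healthy"; break; }
  sleep 60
done
```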
#### 3.1 Check all VMs are running
```bash
ssh root@192.168.1.127 'qm list' | grep -v stopped
```
If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:
```bash
ssh root@192.168.1.127 'qm start <VMID>'
```
#### 3.2 Check k8s API is reachable
```bash
$KC cluster-info 2>/dev/null
```
#### 3.3 Check all nodes are Ready
```bash
$KC get nodes
```
All 5 nodes (k8s-master, k8s-node1-4) should show `Ready`.
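To block until that holds (optional sketch):
```bash
# Wait for exactly 5 Ready nodes
until [ "$($KC get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l)" -eq 5 ]; do
  echo "Waiting for all 5 nodes to be Ready..."; sleep 15
done
```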
#### 3.4 Check critical infrastructure pods (tiered)
Check tiers in order; the ordering reflects the boot dependency chain:
```bash
KC="KUBECONFIG=/Users/viktorbarzin/code/config kubectl"
# Tier 0: Core infrastructure
# Tier 0: Core networking
$KC get pods -n metallb-system -l app=metallb
$KC get pods -n kube-system -l k8s-app=kube-dns # CoreDNS
$KC get pods -n technitium # DNS
# Tier 1: Storage
$KC get pods -n proxmox-csi
$KC get pods -n nfs-csi
# Tier 2: Ingress + tunnel
$KC get pods -n traefik
$KC get pods -n kyverno
# Tier 4: Data layer
$KC get pods -n redis
$KC get pods -n dbaas # MySQL + PostgreSQL
# Tier 5: Secrets + auth
$KC get pods -n vault
$KC get pods -n authentik
$KC get pods -n external-secrets
```
#### 3.5 Vault Verification
```bash
# Check sealed status
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
# Check all 3 Vault pods are Running
$KC get pods -n vault -l app.kubernetes.io/name=vault
# If unsealed, check Raft peers
$KC exec -n vault vault-0 -- vault operator raft list-peers 2>/dev/null
```
The sealed check should return `false`; the auto-unseal sidecar polls every 10s.
If sealed after 5 minutes:
```bash
# Check auto-unseal sidecar logs
$KC logs -n vault vault-0 -c auto-unseal --tail=20
```
If Raft has missing peers after unseal:
```bash
$KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200
$KC exec -n vault vault-2 -- vault operator raft join http://vault-0.vault-internal:8200
```
#### 3.6 Proxmox-LVM PVC Validation
```bash
# Check all PVCs — none should be Pending (except newly created)
$KC get pvc -A --field-selector 'status.phase!=Bound' 2>/dev/null | head -20
# Check proxmox-lvm PVCs specifically
$KC get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName=="proxmox-lvm") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
```
If any proxmox-lvm PVCs are stuck:
```bash
# Check proxmox-csi-plugin pods
$KC get pods -n proxmox-csi
# Check node LVM thin pool
ssh root@192.168.1.127 "for VMID in 200 201 202 203 204; do echo '=== VMID '\$VMID' ==='; qm guest exec \$VMID -- lvs --noheadings -o lv_name,lv_size,data_percent pve/data 2>/dev/null; done"
```
#### 3.7 MySQL InnoDB Cluster Recovery
**This is the most critical post-reboot step.** MySQL InnoDB Cluster cannot auto-recover from a complete outage.
```bash
# Get root password
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# Check cluster status
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(JSON.stringify(dba.getCluster().status()))" 2>/dev/null
```
**If status shows 0 ONLINE members (complete outage):**
⚠️ **ASK USER FOR CONFIRMATION** before running rebootClusterFromCompleteOutage. Explain:
- This will designate mysql-cluster-0 as the new primary
- If data diverged between members, this picks cluster-0's data
- Partial outages should NOT use this command (use rejoinInstance instead)
If confirmed:
```bash
# Reboot cluster from complete outage
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "dba.rebootClusterFromCompleteOutage()"
# Wait up to 10 minutes for all 3 members to come ONLINE
for i in $(seq 1 60); do
ONLINE=$($KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.topology)" 2>/dev/null | grep -c '"status": "ONLINE"')
if [ "$ONLINE" = "3" ]; then
echo "All 3 MySQL members ONLINE"
break
fi
echo "MySQL members ONLINE: $ONLINE/3 (attempt $i/60)"
sleep 10
done
```
**Fix operator/router authentication (always needed after rebootClusterFromCompleteOutage):**
```bash
# Get expected passwords from K8s secrets
ADMIN_PWD=$($KC get secret -n dbaas mysql-cluster-privsecret -o jsonpath='{.data.clusterAdminPassword}' | base64 -d 2>/dev/null)
# Reset operator admin user password
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -e "
ALTER USER IF EXISTS 'mysqladmin'@'%' IDENTIFIED BY '${ADMIN_PWD}';
"
# Recreate router user with full privileges (the reboot drops it)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "
var c = dba.getCluster();
c.setupRouterAccount('mysqlrouter@\"%\"', {update: true});
"
# Restart operator and router pods to pick up new credentials
$KC delete pod -n dbaas -l app.kubernetes.io/name=mysql-operator
$KC delete pod -n dbaas -l app.kubernetes.io/component=router
```
**Verify MySQL is fully operational:**
```bash
# Check cluster status
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)"
# Check router is accepting connections
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -h mysql.dbaas.svc.cluster.local -e "SELECT 1"
```
#### 3.8 Redis Validation
```bash
# Check Redis pods
$KC get pods -n redis
# Check HAProxy is routing to master
HAPROXY_POD=$($KC get pods -n redis -l app=redis-haproxy -o jsonpath='{.items[0].metadata.name}')
$KC exec -n redis $HAPROXY_POD -- cat /tmp/haproxy.cfg 2>/dev/null | grep "server redis" || echo "HAProxy config not found — check haproxy pod"
# Verify writes work through master
$KC exec -n redis redis-node-1 -- redis-cli SET reboot-test ok 2>/dev/null && \
$KC exec -n redis redis-node-1 -- redis-cli DEL reboot-test 2>/dev/null && \
echo "Redis write test passed" || echo "Redis write FAILED — check master routing"
```
#### 3.9 ESO Sync Verification
```bash
# Force ESO to re-sync all secrets (Vault may have rotated passwords)
$KC annotate externalsecrets.external-secrets.io -A --all force-sync=$(date +%s) --overwrite 2>/dev/null
# Wait 60s for sync
sleep 60
# Check for failed syncs
FAILED=$($KC get externalsecrets -A --no-headers 2>/dev/null | grep -v SecretSynced | grep -v "SYNCED" || true)
if [ -n "$FAILED" ]; then
echo "WARNING: Some ExternalSecrets failed to sync:"
echo "$FAILED"
else
echo "All ExternalSecrets synced successfully"
fi
```
#### 3.10 Check for stuck pods
```bash
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
```
### Phase 4: Final Validation
- If all checks pass: report "Cluster fully recovered in X minutes"
- If issues remain after 30 minutes: invoke the `/cluster-health` skill for diagnosis
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered, lessons learned
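How the summary is stored depends on the runtime; as a fallback, a plain log entry works (path and fields are illustrative):
```bash
cat >> "$HOME/.reboot-server-history.log" <<EOF
$(date -u +%FT%TZ) mode=full-host duration_min=${DURATION_MIN:-unknown} issues="${ISSUES:-none}"
EOF
```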
---
## Single Node Reboot Procedure

### Phase 1: Pre-flight
#### 1.1 Identify target
Map the user's input to a node name and VMID (a case-statement sketch follows the list):
- `k8s-master` / `master` / `200` → k8s-master (VMID 200)
- `k8s-node1` / `node1` / `201` → k8s-node1 (VMID 201, GPU)
- `k8s-node2` / `node2` / `202` → k8s-node2 (VMID 202)
- `k8s-node3` / `node3` / `203` → k8s-node3 (VMID 203)
- `k8s-node4` / `node4` / `204` → k8s-node4 (VMID 204)
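A sketch of the mapping (`$INPUT` is the user's raw answer):
```bash
# Normalize user input to NODE + VMID (names per the inventory table)
case "$INPUT" in
  k8s-master|master|200) NODE=k8s-master; VMID=200 ;;
  k8s-node1|node1|201)   NODE=k8s-node1;  VMID=201 ;;  # GPU node
  k8s-node2|node2|202)   NODE=k8s-node2;  VMID=202 ;;
  k8s-node3|node3|203)   NODE=k8s-node3;  VMID=203 ;;
  k8s-node4|node4|204)   NODE=k8s-node4;  VMID=204 ;;
  *) echo "Unknown node: $INPUT" >&2; exit 1 ;;
esac
```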
#### 1.2 Check node status
```bash
$KC get node <node-name>
```
#### 1.3 Check what's running on the node
```bash
$KC get pods -A --field-selector spec.nodeName=<node-name> | head -30
```
Warn the user if the node runs any of the following (a quick check sketch follows the list):
- GPU workloads (node1) — Immich ML, Frigate, Ollama
- Database pods — MySQL InnoDB, PostgreSQL CNPG (check if it's primary)
- Vault pods — may need re-unseal
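A grep sketch for these (the GPU app namespaces `immich`, `frigate`, `ollama` are assumptions; adjust to the real ones):
```bash
$KC get pods -A --field-selector spec.nodeName=<node-name> --no-headers \
  | awk '{print $1}' | sort -u \
  | grep -E 'dbaas|vault|immich|frigate|ollama' \
  && echo "WARNING: sensitive workloads on this node; warn the user" \
  || echo "No sensitive workloads detected"
```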
#### 1.4 Check PDBs won't block drain
```bash
$KC get pdb -A
```
Verify that draining this node won't violate any PDB (e.g., Traefik minAvailable=2 — if 2 of 3 Traefik pods are on other nodes, drain is safe).
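To surface budgets that are already exhausted (a sketch; field 5 of `get pdb -A` output is ALLOWED DISRUPTIONS):
```bash
# PDBs with zero allowed disruptions; evicting their pods will hang the drain
$KC get pdb -A --no-headers | awk '$5 == 0 {print $1 "/" $2 ": 0 disruptions allowed"}'
```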
### Phase 2: Drain + Reboot
```bash
# Drain the node (300s timeout for graceful eviction)
$KC drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s
# Shutdown the VM gracefully (300s timeout for kubelet + systemd shutdown)
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"
# Wait for VM to stop
for i in $(seq 1 60); do
  STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
  if echo "$STATUS" | grep -q stopped; then echo "VM <VMID> stopped"; break; fi
  sleep 5
done

# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"
```

### Phase 3: Validate + Uncordon
```bash
# Wait for node to become Ready (timeout: 5 min)
for i in $(seq 1 30); do
STATUS=$($KC get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
if [ "$STATUS" = "Ready" ]; then
echo "Node is Ready"
break
  fi
  echo "Waiting... ($i/30)"
  sleep 10
done
# Uncordon the node
$KC uncordon <node-name>
# Verify DaemonSet pods are scheduled
$KC get pods -A --field-selector spec.nodeName=<node-name> | head -20
# Report status
echo "Node <node-name> rebooted and uncordoned successfully"
```
---
## Rolling Reboot Procedure
Cycles all 5 k8s nodes sequentially: **workers first (node2→3→4→1), master last**.
### Phase 1: Pre-flight
Run [Shared Pre-flight Checks](#shared-pre-flight-checks).
Additionally:
```bash
# Snapshot current pod distribution
$KC get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn
echo "---"
# Check PDBs that could block drains
$KC get pdb -A
echo "---"
# Check MySQL cluster health before starting
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null
```
### Phase 2: Rolling Cycle
Process nodes in this order: **k8s-node2 → k8s-node3 → k8s-node4 → k8s-node1 (GPU) → k8s-master**
For each node, execute this sequence:
```bash
NODE=<node-name>
VMID=<vmid>
echo "========================================="
echo "Rolling reboot: $NODE (VMID $VMID)"
echo "========================================="
# Step 1: Drain
echo "Draining $NODE..."
$KC drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=300s
# Step 2: Graceful shutdown
echo "Shutting down VM $VMID..."
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300"
# Step 3: Wait for VM to stop
for i in $(seq 1 60); do
STATUS=$(ssh root@192.168.1.127 "qm status $VMID" 2>/dev/null)
if echo "$STATUS" | grep -q stopped; then echo "VM $VMID stopped"; break; fi
sleep 5
done
# Step 4: Start VM
echo "Starting VM $VMID..."
ssh root@192.168.1.127 "qm start $VMID"
# Step 5: Wait for node Ready
for i in $(seq 1 30); do
STATUS=$($KC get node $NODE --no-headers 2>/dev/null | awk '{print $2}')
if [ "$STATUS" = "Ready" ]; then echo "$NODE is Ready"; break; fi
echo "Waiting... ($i/30)"
sleep 10
done
# Step 6: Uncordon
$KC uncordon $NODE
echo "$NODE uncordoned"
# Step 7: Verify pods (incl. DaemonSets) are scheduled back
echo "$($KC get pods -A --field-selector spec.nodeName=$NODE --no-headers | wc -l) pods scheduled on $NODE"
# Step 8: Cool-down (2 min)
echo "Cooling down for 2 minutes before next node..."
sleep 120
```
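To drive the whole cycle, wrap the sequence above in a loop. A sketch, assuming Steps 1-8 are factored into a hypothetical `reboot_one_node <node> <vmid>` function:
```bash
# Workers in order, aborting on first failure; master last (after Phase 3 below)
for PAIR in k8s-node2:202 k8s-node3:203 k8s-node4:204 k8s-node1:201; do
  reboot_one_node "${PAIR%%:*}" "${PAIR##*:}" \
    || { echo "Rolling reboot aborted at $PAIR"; exit 1; }
done
# Run the Phase 3 mid-cycle health check here, then:
reboot_one_node k8s-master 200
```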
### Phase 3: Mid-cycle Health Check (after all workers, before master)
After k8s-node1 (last worker) is back and uncordoned, check critical services before cycling the master:
```bash
echo "=== Mid-cycle health check ==="
# All 4 workers should be Ready
$KC get nodes
# Check MySQL cluster — if any member is not ONLINE, fix before cycling master
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null
# Check Vault — if sealed, fix before cycling master
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
# Check no pods stuck
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -10
echo "=== Proceeding to master node ==="
```
If MySQL shows issues at this point, run [MySQL InnoDB Cluster Recovery](#37-mysql-innodb-cluster-recovery) before proceeding to the master.
### Phase 4: Post-rolling Validation
After the master is back and uncordoned, run the full validation suite from [Phase 3 of Full Host Reboot](#phase-3-post-boot-validation-30-min-timeout), including:
- All nodes Ready (3.3)
- Critical pods tiered check (3.4)
- Vault verification (3.5)
- Proxmox-LVM PVC validation (3.6)
- MySQL InnoDB recovery if needed (3.7) — **with user confirmation**
- Redis validation (3.8)
- ESO sync (3.9)
- Stuck pods check (3.10)
---
## Troubleshooting
| Symptom | Cause | Fix |
|---------|-------|-----|
| VM won't start | Proxmox host disk full | `ssh root@192.168.1.127 'df -h'` — check thin pool usage with `lvs pve/data` |
| Node stays NotReady | kubelet/containerd not starting | `qm guest exec <VMID> -- systemctl status kubelet` and `systemctl status containerd` |
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for ZFS pool import: `qm guest exec 9000 -- zpool status` |
| Proxmox-LVM PVCs stuck | proxmox-csi-plugin not running | `$KC get pods -n proxmox-csi` — check CSI node plugin on affected node. Check LVM thin pool: `qm guest exec <VMID> -- lvs` |
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar: `$KC logs -n vault vault-0 -c auto-unseal --tail=20`. Check unseal key secret exists: `$KC get secret -n vault vault-unseal-key` |
| Vault Raft peer missing | Pod restarted on different node | `$KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200` |
| MySQL 0 ONLINE members | Complete outage — operator can't recover | See [MySQL InnoDB Cluster Recovery](#37-mysql-innodb-cluster-recovery) — requires user confirmation |
| MySQL router auth failure | Reboot recreated internal users | Reset passwords from K8s secrets + setupRouterAccount — see section 3.7 |
| Redis READONLY errors | HAProxy routing to replica | Check HAProxy pod routing config, verify `redis` service selector points to `app=redis-haproxy` |
| ESO secrets not syncing | Vault sealed or token expired | Unseal Vault first, then force-sync: `$KC annotate externalsecrets -A --all force-sync=$(date +%s) --overwrite` |
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with `/cluster-health` |
| containerd won't start | Disk full or corrupted images | `qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50`. If blob corruption, see containerd cleanup procedure in memory. |
| Drain stuck on PDB | PDB minAvailable can't be met | Check PDB: `$KC get pdb -A`. May need to scale up another replica first or `--disable-eviction` |