update reboot-server skill: add rolling reboot, MySQL auto-recovery, proxmox-lvm validation
parent
5e4112827d
commit
f11cf2870a
1 changed files with 320 additions and 96 deletions
|
|
@ -1,13 +1,16 @@
|
||||||
name: reboot-server
|
name: reboot-server
|
||||||
description: Safely reboot the Proxmox host or a single k8s node with graceful shutdown
|
description: Safely reboot the Proxmox host, a single k8s node, or perform a rolling
|
||||||
and post-boot validation. Use when the user asks to "reboot the server", "reboot
|
reboot of all nodes with graceful shutdown, MySQL auto-recovery, and post-boot validation.
|
||||||
proxmox", "reboot node", "restart the host", "restart k8s node", "power cycle".
|
Use when the user asks to "reboot the server", "reboot proxmox", "reboot node",
|
||||||
Supports full host reboot and single node reboot modes.
|
"restart the host", "rolling reboot", "reboot all nodes", "power cycle".
|
||||||
---
|
---
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
This skill safely reboots either the entire Proxmox host (Dell R730) or a single Kubernetes node VM. It ensures graceful shutdown of all services and validates full recovery afterwards.
|
This skill safely reboots infrastructure with three modes:
|
||||||
|
1. **Full host reboot** — reboot the entire Proxmox R730 host (all VMs)
|
||||||
|
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
|
||||||
|
3. **Rolling reboot** — cycle all 5 k8s nodes sequentially (workers first, master last)
|
||||||
|
|
||||||
## Connection Details
|
## Connection Details
|
||||||
|
|
||||||
|
|
@ -15,12 +18,14 @@ This skill safely reboots either the entire Proxmox host (Dell R730) or a single
|
||||||
- **kubectl**: `KUBECONFIG=/Users/viktorbarzin/code/config kubectl`
|
- **kubectl**: `KUBECONFIG=/Users/viktorbarzin/code/config kubectl`
|
||||||
- **VM commands**: `ssh root@192.168.1.127 'qm <command>'`
|
- **VM commands**: `ssh root@192.168.1.127 'qm <command>'`
|
||||||
|
|
||||||
|
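Shorthand used below: `KC="env KUBECONFIG=/Users/viktorbarzin/code/config kubectl"` (the `env` prefix is required because bash does not treat `VAR=value` as an assignment once it comes out of a variable expansion)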
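One caveat with a string shorthand like `KC`: bash does not re-parse a leading `VAR=value` as an assignment after variable expansion, and zsh does not word-split `$KC` at all. A shell function sidesteps both. The sketch below is one possible form; the function name `kc` and the `KUBECTL_BIN` override are illustrative and not part of the skill.

```bash
# Hypothetical wrapper: runs kubectl with the kubeconfig from Connection Details.
# KUBECTL_BIN is an assumed override hook (useful for dry runs); it defaults to kubectl.
kc() {
  KUBECONFIG=/Users/viktorbarzin/code/config "${KUBECTL_BIN:-kubectl}" "$@"
}
```

With this in place, `kc get nodes` would stand in for `$KC get nodes` in the commands that follow.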
|
||||||
|
|
||||||
## VM Inventory & Boot Order
|
## VM Inventory & Boot Order
|
||||||
|
|
||||||
| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|
| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|
||||||
|-------|------|------|---------------|------------------|-------|
|
|-------|------|------|---------------|------------------|-------|
|
||||||
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS — must boot first |
|
||||||
| 2 | 9000 | TrueNAS | 60s | 300s | NFS/iSCSI storage — needs network from pfSense |
|
| 2 | 9000 | TrueNAS | 60s | 300s | NFS storage — needs network from pfSense |
|
||||||
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
|
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
|
||||||
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
|
| 3 | 102 | devvm | 60s | 120s | Dev VM — needs pfSense for network |
|
||||||
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
|
| 4 | 200 | k8s-master | 45s | 300s | Control plane — must be up before workers |
|
||||||
|
|
@ -37,7 +42,7 @@ Shutdown order is the **reverse** of boot order (Proxmox handles this automatica
|
||||||
|
|
||||||
```
|
```
|
||||||
k8s-master = 200
|
k8s-master = 200
|
||||||
k8s-node1 = 201 (GPU node)
|
k8s-node1 = 201 (GPU node — Immich ML, Frigate, Ollama)
|
||||||
k8s-node2 = 202
|
k8s-node2 = 202
|
||||||
k8s-node3 = 203
|
k8s-node3 = 203
|
||||||
k8s-node4 = 204
|
k8s-node4 = 204
|
||||||
|
|
@ -48,35 +53,32 @@ k8s-node4 = 204
|
||||||
Ask the user which mode they want if unclear:
|
Ask the user which mode they want if unclear:
|
||||||
1. **Full host reboot** — reboot the entire Proxmox R730 host
|
1. **Full host reboot** — reboot the entire Proxmox R730 host
|
||||||
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
|
2. **Single node reboot** — drain, reboot one k8s node VM, uncordon
|
||||||
|
3. **Rolling reboot** — cycle all 5 k8s nodes sequentially (for kernel updates, maintenance)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Full Host Reboot Procedure
|
## Shared Pre-flight Checks
|
||||||
|
|
||||||
### Phase 1: Pre-flight Checks
|
These checks apply to ALL modes. Run them before proceeding.
|
||||||
|
|
||||||
Run ALL of these checks before proceeding. Fix issues inline if possible.
|
### PF-1: Verify Proxmox SSH access
|
||||||
|
|
||||||
#### 1.1 Verify Proxmox SSH access
|
|
||||||
```bash
|
```bash
|
||||||
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
|
ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 1.2 Verify cluster health
|
### PF-2: Verify cluster health
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
|
$KC get nodes
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | head -20
|
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -20
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 1.3 Check and configure VM boot ordering
|
### PF-3: Check and configure VM boot ordering
|
||||||
For each VM, check if boot ordering is set. If any VM has `order=-1`, configure it:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Check current boot order for all VMs
|
# Check current boot order for all VMs
|
||||||
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
|
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'
|
||||||
```
|
```
|
||||||
|
|
||||||
If not configured, apply boot order (this is idempotent):
|
If not configured, apply boot order (idempotent):
|
||||||
```bash
|
```bash
|
||||||
ssh root@192.168.1.127 '
|
ssh root@192.168.1.127 '
|
||||||
qm set 101 --startup order=1,down=120
|
qm set 101 --startup order=1,down=120
|
||||||
|
|
@ -93,7 +95,7 @@ qm set 300 --startup order=6,down=120
|
||||||
'
|
'
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 1.4 Check kubelet shutdownGracePeriod on all k8s nodes
|
### PF-4: Check kubelet shutdownGracePeriod on all k8s nodes
|
||||||
```bash
|
```bash
|
||||||
for VMID in 200 201 202 203 204; do
|
for VMID in 200 201 202 203 204; do
|
||||||
echo "=== VMID $VMID ==="
|
echo "=== VMID $VMID ==="
|
||||||
|
|
@ -105,16 +107,13 @@ If not set on any node, patch it:
|
||||||
```bash
|
```bash
|
||||||
VMID=<target>
|
VMID=<target>
|
||||||
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
|
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
|
||||||
# Remove any existing shutdown config to avoid duplicates
|
|
||||||
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml
|
sed -i \"/shutdownGracePeriod/d; /shutdownGracePeriodCriticalPods/d\" /var/lib/kubelet/config.yaml
|
||||||
|
|
||||||
# Add graceful shutdown config
|
|
||||||
cat >> /var/lib/kubelet/config.yaml <<EOF
|
cat >> /var/lib/kubelet/config.yaml <<EOF
|
||||||
shutdownGracePeriod: \"240s\"
|
shutdownGracePeriod: \"240s\"
|
||||||
shutdownGracePeriodCriticalPods: \"60s\"
|
shutdownGracePeriodCriticalPods: \"60s\"
|
||||||
EOF
|
EOF
|
||||||
|
|
||||||
# Create systemd logind override for InhibitDelayMaxSec
|
|
||||||
mkdir -p /etc/systemd/logind.conf.d
|
mkdir -p /etc/systemd/logind.conf.d
|
||||||
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
|
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<EOF
|
||||||
[Login]
|
[Login]
|
||||||
|
|
@ -122,7 +121,6 @@ InhibitDelayMaxSec=300
|
||||||
EOF
|
EOF
|
||||||
systemctl restart systemd-logind
|
systemctl restart systemd-logind
|
||||||
|
|
||||||
# Create kubelet service override for TimeoutStopSec
|
|
||||||
mkdir -p /etc/systemd/system/kubelet.service.d
|
mkdir -p /etc/systemd/system/kubelet.service.d
|
||||||
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
|
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<EOF
|
||||||
[Service]
|
[Service]
|
||||||
|
|
@ -133,9 +131,7 @@ systemctl restart kubelet
|
||||||
'"
|
'"
|
||||||
```
|
```
|
||||||
|
|
||||||
Repeat for each node VMID (200-204) that needs patching.
|
### PF-5: Check unattended-upgrades on k8s nodes
|
||||||
|
|
||||||
#### 1.5 Check unattended-upgrades on k8s nodes
|
|
||||||
```bash
|
```bash
|
||||||
for VMID in 200 201 202 203 204; do
|
for VMID in 200 201 202 203 204; do
|
||||||
echo "=== VMID $VMID ==="
|
echo "=== VMID $VMID ==="
|
||||||
|
|
@ -143,35 +139,36 @@ for VMID in 200 201 202 203 204; do
|
||||||
done
|
done
|
||||||
```
|
```
|
||||||
|
|
||||||
If active on any node, disable it:
|
If active, disable: `ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"`
|
||||||
```bash
|
|
||||||
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 1.6 Report pre-flight status
|
### PF-6: Report pre-flight status
|
||||||
Summarize: all checks passed, or list what was fixed. Ask user for final confirmation before rebooting.
|
Summarize: all checks passed, or list what was fixed. Ask user for **final confirmation** before proceeding.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Full Host Reboot Procedure
|
||||||
|
|
||||||
|
### Phase 1: Pre-flight
|
||||||
|
Run [Shared Pre-flight Checks](#shared-pre-flight-checks) above.
|
||||||
|
|
||||||
### Phase 2: Reboot
|
### Phase 2: Reboot
|
||||||
|
|
||||||
**IMPORTANT**: Get explicit user confirmation before proceeding.
|
**Get explicit user confirmation before proceeding.**
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Reboot the Proxmox host
|
|
||||||
ssh root@192.168.1.127 'reboot'
|
ssh root@192.168.1.127 'reboot'
|
||||||
```
|
```
|
||||||
|
|
||||||
Then wait for SSH to drop (confirms reboot started):
|
Wait for SSH to drop:
|
||||||
```bash
|
```bash
|
||||||
# Poll until SSH drops (host is rebooting)
|
|
||||||
for i in $(seq 1 30); do
|
for i in $(seq 1 30); do
|
||||||
ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
|
ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
|
||||||
sleep 2
|
sleep 2
|
||||||
done
|
done
|
||||||
```
|
```
|
||||||
|
|
||||||
Then poll until Proxmox is back (timeout: 10 minutes):
|
Poll until Proxmox is back (timeout: 10 minutes):
|
||||||
```bash
|
```bash
|
||||||
# Poll every 30s until SSH is back
|
|
||||||
for i in $(seq 1 20); do
|
for i in $(seq 1 20); do
|
||||||
if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
|
if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
|
||||||
echo "Proxmox host is back online"
|
echo "Proxmox host is back online"
|
||||||
|
|
@ -184,45 +181,36 @@ done
|
||||||
|
|
||||||
### Phase 3: Post-boot Validation (30 min timeout)
|
### Phase 3: Post-boot Validation (30 min timeout)
|
||||||
|
|
||||||
Run these checks in a loop, reporting progress. Timeout after 30 minutes total.
|
Run these checks in a loop, reporting progress.
|
||||||
|
|
||||||
#### 3.1 Check all VMs are running
|
#### 3.1 Check all VMs are running
|
||||||
```bash
|
```bash
|
||||||
ssh root@192.168.1.127 'qm list' | grep -v stopped
|
ssh root@192.168.1.127 'qm list' | grep -v stopped
|
||||||
```
|
```
|
||||||
|
If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it from the Proxmox host: `ssh root@192.168.1.127 'qm start <VMID>'`
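Scanning `qm list` by eye is easy to get wrong when several VMs are still booting. The sketch below filters the listing down to critical VMIDs that are not reported as running. It assumes the first column of `qm list` is the VMID and the third is the status; the function name is illustrative.

```bash
# Critical VMIDs as listed in this step.
CRITICAL_VMIDS="101 9000 200 201 202 203 204 220"

# Reads `qm list` output on stdin and prints each critical VMID that is
# either absent from the listing or not in the "running" state.
# Assumption: column 1 is the VMID and column 3 is the status.
critical_not_running() {
  awk -v ids="$CRITICAL_VMIDS" '
    BEGIN { n = split(ids, want, " "); for (i = 1; i <= n; i++) critical[want[i]] = 1 }
    ($1 in critical) && $3 == "running" { seen[$1] = 1 }
    END { for (i = 1; i <= n; i++) if (!(want[i] in seen)) print want[i] }
  '
}
```

A possible invocation is `ssh root@192.168.1.127 'qm list' | critical_not_running`; empty output means every critical VM is running.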
|
||||||
If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes, start it:
|
|
||||||
```bash
|
|
||||||
ssh root@192.168.1.127 'qm start <VMID>'
|
|
||||||
```
|
|
||||||
|
|
||||||
#### 3.2 Check k8s API is reachable
|
#### 3.2 Check k8s API is reachable
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl cluster-info 2>/dev/null
|
$KC cluster-info 2>/dev/null
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 3.3 Check all nodes are Ready
|
#### 3.3 Check all nodes are Ready
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get nodes
|
$KC get nodes
|
||||||
```
|
```
|
||||||
|
All 5 nodes should show `Ready`.
|
||||||
|
|
||||||
All 5 nodes (k8s-master, k8s-node1-4) should show `Ready`.
|
#### 3.4 Check critical infrastructure pods (tiered)
|
||||||
|
|
||||||
#### 3.4 Check critical infrastructure pods
|
|
||||||
|
|
||||||
Check these in order (reflects boot dependency chain):
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
KC="KUBECONFIG=/Users/viktorbarzin/code/config kubectl"
|
# Tier 0: Core networking
|
||||||
|
|
||||||
# Tier 0: Core infrastructure
|
|
||||||
$KC get pods -n metallb-system -l app=metallb
|
$KC get pods -n metallb-system -l app=metallb
|
||||||
$KC get pods -n kube-system -l k8s-app=kube-dns # CoreDNS
|
$KC get pods -n kube-system -l k8s-app=kube-dns
|
||||||
$KC get pods -n technitium # DNS
|
$KC get pods -n technitium
|
||||||
|
|
||||||
# Tier 1: Storage
|
# Tier 1: Storage
|
||||||
$KC get pods -n democratic-csi # iSCSI-CSI
|
$KC get pods -n proxmox-csi
|
||||||
$KC get pods -n nfs-csi # NFS-CSI
|
$KC get pods -n nfs-csi
|
||||||
|
|
||||||
# Tier 2: Ingress + tunnel
|
# Tier 2: Ingress + tunnel
|
||||||
$KC get pods -n traefik
|
$KC get pods -n traefik
|
||||||
|
|
@ -233,7 +221,7 @@ $KC get pods -n kyverno
|
||||||
|
|
||||||
# Tier 4: Data layer
|
# Tier 4: Data layer
|
||||||
$KC get pods -n redis
|
$KC get pods -n redis
|
||||||
$KC get pods -n dbaas # MySQL + PostgreSQL
|
$KC get pods -n dbaas
|
||||||
|
|
||||||
# Tier 5: Secrets + auth
|
# Tier 5: Secrets + auth
|
||||||
$KC get pods -n vault
|
$KC get pods -n vault
|
||||||
|
|
@ -241,32 +229,156 @@ $KC get pods -n authentik
|
||||||
$KC get pods -n external-secrets
|
$KC get pods -n external-secrets
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 3.5 Check Vault is unsealed
|
#### 3.5 Vault Verification
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
|
# Check sealed status
|
||||||
|
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
|
||||||
|
|
||||||
|
# Check all 3 Vault pods are Running
|
||||||
|
$KC get pods -n vault -l app.kubernetes.io/name=vault
|
||||||
|
|
||||||
|
# If unsealed, check Raft peers
|
||||||
|
$KC exec -n vault vault-0 -- vault operator raft list-peers 2>/dev/null
|
||||||
```
|
```
|
||||||
|
|
||||||
Should return `false`. The auto-unseal sidecar polls every 10s — if still sealed after 5 minutes, investigate.
|
If sealed after 5 minutes:
|
||||||
|
|
||||||
#### 3.6 Check PVC status
|
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pvc -A --field-selector 'status.phase!=Bound' 2>/dev/null | head -20
|
# Check auto-unseal sidecar logs
|
||||||
|
$KC logs -n vault vault-0 -c auto-unseal --tail=20
|
||||||
```
|
```
|
||||||
|
|
||||||
No PVCs should be in Pending state (indicates CSI driver issues).
|
If Raft has missing peers after unseal:
|
||||||
|
|
||||||
#### 3.7 Check for stuck pods
|
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
|
$KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200
|
||||||
|
$KC exec -n vault vault-2 -- vault operator raft join http://vault-0.vault-internal:8200
|
||||||
```
|
```
|
||||||
|
|
||||||
Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods (auto-unseal takes a minute).
|
#### 3.6 Proxmox-LVM PVC Validation
|
||||||
|
```bash
|
||||||
|
# Check all PVCs — none should be Pending (except newly created)
|
||||||
|
$KC get pvc -A --field-selector 'status.phase!=Bound' 2>/dev/null | head -20
|
||||||
|
|
||||||
|
# Check proxmox-lvm PVCs specifically
|
||||||
|
$KC get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName=="proxmox-lvm") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
|
||||||
|
```
|
||||||
|
|
||||||
|
If any proxmox-lvm PVCs are stuck:
|
||||||
|
```bash
|
||||||
|
# Check proxmox-csi-plugin pods
|
||||||
|
$KC get pods -n proxmox-csi
|
||||||
|
# Check node LVM thin pool
|
||||||
|
ssh root@192.168.1.127 "for VMID in 200 201 202 203 204; do echo '=== VMID '\$VMID' ==='; qm guest exec \$VMID -- lvs --noheadings -o lv_name,lv_size,data_percent pve/data 2>/dev/null; done"
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 3.7 MySQL InnoDB Cluster Recovery
|
||||||
|
|
||||||
|
**This is the most critical post-reboot step.** MySQL InnoDB Cluster cannot auto-recover from a complete outage.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Get root password
|
||||||
|
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||||
|
|
||||||
|
# Check cluster status
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(JSON.stringify(dba.getCluster().status()))" 2>/dev/null
|
||||||
|
```
|
||||||
|
|
||||||
|
**If status shows 0 ONLINE members (complete outage):**
|
||||||
|
|
||||||
|
⚠️ **ASK USER FOR CONFIRMATION** before running rebootClusterFromCompleteOutage. Explain:
|
||||||
|
- This will designate mysql-cluster-0 as the new primary
|
||||||
|
- If data diverged between members, this picks cluster-0's data
|
||||||
|
- Partial outages should NOT use this command (use rejoinInstance instead)
|
||||||
|
|
||||||
|
If confirmed:
|
||||||
|
```bash
|
||||||
|
# Reboot cluster from complete outage
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "dba.rebootClusterFromCompleteOutage()"
|
||||||
|
|
||||||
|
# Wait up to 10 minutes for all 3 members to come ONLINE
|
||||||
|
for i in $(seq 1 60); do
|
||||||
|
ONLINE=$($KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.topology)" 2>/dev/null | grep -c '"status": "ONLINE"')
|
||||||
|
if [ "$ONLINE" = "3" ]; then
|
||||||
|
echo "All 3 MySQL members ONLINE"
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
echo "MySQL members ONLINE: $ONLINE/3 (attempt $i/60)"
|
||||||
|
sleep 10
|
||||||
|
done
|
||||||
|
```
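The wait loop counts ONLINE members with `grep -c`, which counts matching lines rather than matches, so the result depends on each member's status landing on its own line. A sketch that counts occurrences instead, plus a small helper that maps the count to the recovery paths named in this section, could look like the following. The function names and the sample JSON shape in the comments are assumptions for illustration.

```bash
# Count members reported ONLINE in status JSON read from stdin.
# grep -o emits one line per match, so single-line JSON
# (e.g. from JSON.stringify) is counted correctly too.
count_online() {
  grep -o '"status":[[:space:]]*"ONLINE"' | wc -l | tr -d '[:space:]'
}

# Map the ONLINE count to a recovery path.
# Args: <online-count> <expected-members>
recovery_action() {
  online=$1
  expected=$2
  if [ "$online" -eq 0 ]; then
    echo "reboot-from-complete-outage"   # requires user confirmation first
  elif [ "$online" -lt "$expected" ]; then
    echo "rejoin-instance"               # partial outage: do not reboot the cluster
  else
    echo "none"
  fi
}
```

Piping the cluster status output into `count_online` and passing the result to `recovery_action "$n" 3` keeps the "complete outage only" rule explicit before anything destructive runs.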
|
||||||
|
|
||||||
|
**Fix operator/router authentication (always needed after rebootClusterFromCompleteOutage):**
|
||||||
|
```bash
|
||||||
|
# Get expected passwords from K8s secrets
|
||||||
|
ADMIN_PWD=$($KC get secret -n dbaas mysql-cluster-privsecret -o jsonpath='{.data.clusterAdminPassword}' | base64 -d 2>/dev/null)
|
||||||
|
|
||||||
|
# Reset operator admin user password
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -e "
|
||||||
|
ALTER USER IF EXISTS 'mysqladmin'@'%' IDENTIFIED BY '${ADMIN_PWD}';
|
||||||
|
"
|
||||||
|
|
||||||
|
# Recreate router user with full privileges (the reboot drops it)
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "
|
||||||
|
var c = dba.getCluster();
|
||||||
|
c.setupRouterAccount('mysqlrouter@\"%\"', {update: true});
|
||||||
|
"
|
||||||
|
|
||||||
|
# Restart operator and router pods to pick up new credentials
|
||||||
|
$KC delete pod -n dbaas -l app.kubernetes.io/name=mysql-operator
|
||||||
|
$KC delete pod -n dbaas -l app.kubernetes.io/component=router
|
||||||
|
```
|
||||||
|
|
||||||
|
**Verify MySQL is fully operational:**
|
||||||
|
```bash
|
||||||
|
# Check cluster status
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)"
|
||||||
|
|
||||||
|
# Check router is accepting connections
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -h mysql.dbaas.svc.cluster.local -e "SELECT 1"
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 3.8 Redis Validation
|
||||||
|
```bash
|
||||||
|
# Check Redis pods
|
||||||
|
$KC get pods -n redis
|
||||||
|
|
||||||
|
# Check HAProxy is routing to master
|
||||||
|
HAPROXY_POD=$($KC get pods -n redis -l app=redis-haproxy -o jsonpath='{.items[0].metadata.name}')
|
||||||
|
$KC exec -n redis $HAPROXY_POD -- cat /tmp/haproxy.cfg 2>/dev/null | grep "server redis" || echo "HAProxy config not found — check haproxy pod"
|
||||||
|
|
||||||
|
# Verify writes work through master
|
||||||
|
$KC exec -n redis redis-node-1 -- redis-cli SET reboot-test ok 2>/dev/null && \
|
||||||
|
$KC exec -n redis redis-node-1 -- redis-cli DEL reboot-test 2>/dev/null && \
|
||||||
|
echo "Redis write test passed" || echo "Redis write FAILED — check master routing"
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 3.9 ESO Sync Verification
|
||||||
|
```bash
|
||||||
|
# Force ESO to re-sync all secrets (Vault may have rotated passwords)
|
||||||
|
$KC annotate externalsecrets.external-secrets.io -A --all force-sync=$(date +%s) --overwrite 2>/dev/null
|
||||||
|
|
||||||
|
# Wait 60s for sync
|
||||||
|
sleep 60
|
||||||
|
|
||||||
|
# Check for failed syncs
|
||||||
|
FAILED=$($KC get externalsecrets -A --no-headers 2>/dev/null | grep -v SecretSynced | grep -v "SYNCED" || true)
|
||||||
|
if [ -n "$FAILED" ]; then
|
||||||
|
echo "WARNING: Some ExternalSecrets failed to sync:"
|
||||||
|
echo "$FAILED"
|
||||||
|
else
|
||||||
|
echo "All ExternalSecrets synced successfully"
|
||||||
|
fi
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 3.10 Check for stuck pods
|
||||||
|
```bash
|
||||||
|
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30
|
||||||
|
```
|
||||||
|
|
||||||
### Phase 4: Final Validation
|
### Phase 4: Final Validation
|
||||||
|
|
||||||
- If all checks pass: report "Cluster fully recovered in X minutes"
|
- If all checks pass: report "Cluster fully recovered in X minutes"
|
||||||
- If issues remain after 30 minutes: invoke the `/cluster-health` skill for diagnosis and remediation
|
- If issues remain after 30 minutes: invoke the `/cluster-health` skill for diagnosis
|
||||||
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered
|
- Store a memory summarizing the reboot: timestamp, duration, any issues encountered, lessons learned
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -277,43 +389,38 @@ Known slow starters to ignore: MySQL InnoDB pods (up to 60 min init), Vault pods
|
||||||
#### 1.1 Identify target
|
#### 1.1 Identify target
|
||||||
Map the user's input to a node name and VMID:
|
Map the user's input to a node name and VMID:
|
||||||
- `k8s-master` / `master` / `200` → k8s-master (VMID 200)
|
- `k8s-master` / `master` / `200` → k8s-master (VMID 200)
|
||||||
- `k8s-node1` / `node1` / `201` → k8s-node1 (VMID 201, GPU node)
|
- `k8s-node1` / `node1` / `201` → k8s-node1 (VMID 201, GPU)
|
||||||
- `k8s-node2` / `node2` / `202` → k8s-node2 (VMID 202)
|
- `k8s-node2` / `node2` / `202` → k8s-node2 (VMID 202)
|
||||||
- `k8s-node3` / `node3` / `203` → k8s-node3 (VMID 203)
|
- `k8s-node3` / `node3` / `203` → k8s-node3 (VMID 203)
|
||||||
- `k8s-node4` / `node4` / `204` → k8s-node4 (VMID 204)
|
- `k8s-node4` / `node4` / `204` → k8s-node4 (VMID 204)
|
||||||
|
|
||||||
#### 1.2 Check node status
|
#### 1.2 Check node status
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name>
|
$KC get node <node-name>
|
||||||
```
|
```
|
||||||
|
|
||||||
#### 1.3 Check what's running on the node
|
#### 1.3 Check what's running on the node
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -30
|
$KC get pods -A --field-selector spec.nodeName=<node-name> | head -30
|
||||||
```
|
```
|
||||||
|
|
||||||
Warn the user if the node runs:
|
Warn the user if the node runs:
|
||||||
- GPU workloads (node1) — Immich ML, Frigate, Ollama
|
- GPU workloads (node1) — Immich ML, Frigate, Ollama
|
||||||
- Database pods — MySQL InnoDB, PostgreSQL CNPG (long recovery)
|
- Database pods — MySQL InnoDB, PostgreSQL CNPG (check if it's primary)
|
||||||
- Vault pods — may need re-unseal
|
- Vault pods — may need re-unseal
|
||||||
|
|
||||||
#### 1.4 Check PDBs won't block drain
|
#### 1.4 Check PDBs won't block drain
|
||||||
```bash
|
```bash
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pdb -A
|
$KC get pdb -A
|
||||||
```
|
```
|
||||||
|
|
||||||
Verify that draining this node won't violate any PDB (e.g., Traefik minAvailable=2 — if 2 of 3 Traefik pods are on other nodes, drain is safe).
|
|
||||||
|
|
||||||
### Phase 2: Drain + Reboot
|
### Phase 2: Drain + Reboot
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Drain the node (300s timeout for graceful eviction)
|
# Drain the node
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl drain <node-name> \
|
$KC drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s
|
||||||
--ignore-daemonsets \
|
|
||||||
--delete-emptydir-data \
|
|
||||||
--timeout=300s
|
|
||||||
|
|
||||||
# Shutdown the VM gracefully (300s timeout for kubelet + systemd shutdown)
|
# Shutdown the VM gracefully
|
||||||
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"
|
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"
|
||||||
|
|
||||||
# Wait for VM to stop
|
# Wait for VM to stop
|
||||||
|
|
@ -336,7 +443,7 @@ ssh root@192.168.1.127 "qm start <VMID>"
|
||||||
```bash
|
```bash
|
||||||
# Wait for node to become Ready (timeout: 5 min)
|
# Wait for node to become Ready (timeout: 5 min)
|
||||||
for i in $(seq 1 30); do
|
for i in $(seq 1 30); do
|
||||||
STATUS=$(KUBECONFIG=/Users/viktorbarzin/code/config kubectl get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
|
STATUS=$($KC get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
|
||||||
if [ "$STATUS" = "Ready" ]; then
|
if [ "$STATUS" = "Ready" ]; then
|
||||||
echo "Node is Ready"
|
echo "Node is Ready"
|
||||||
break
|
break
|
||||||
|
|
@ -346,25 +453,142 @@ for i in $(seq 1 30); do
|
||||||
done
|
done
|
||||||
|
|
||||||
# Uncordon the node
|
# Uncordon the node
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl uncordon <node-name>
|
$KC uncordon <node-name>
|
||||||
|
|
||||||
# Verify DaemonSet pods are scheduled
|
# Verify DaemonSet pods are scheduled
|
||||||
KUBECONFIG=/Users/viktorbarzin/code/config kubectl get pods -A --field-selector spec.nodeName=<node-name> | head -20
|
$KC get pods -A --field-selector spec.nodeName=<node-name> | head -20
|
||||||
|
|
||||||
# Report status
|
|
||||||
echo "Node <node-name> rebooted and uncordoned successfully"
|
echo "Node <node-name> rebooted and uncordoned successfully"
|
||||||
```
|
```
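The wait loops in this procedure break on success but fall through silently when the attempts run out, so a timeout looks the same as success to whatever runs next. A minimal retry helper that returns non-zero on timeout is sketched below; `wait_until` is a hypothetical name, not something the skill defines.

```bash
# Retry a command until it succeeds or the attempts are exhausted.
# Args: <attempts> <interval-seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
  attempts=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

For the 5 minute Ready wait this would be called as `wait_until 30 10 <check-command>` followed by `|| echo "timed out"`, where the check command is whatever test the loop currently performs inline.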
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Rolling Reboot Procedure
|
||||||
|
|
||||||
|
Cycles all 5 k8s nodes sequentially: **workers first (node2→3→4→1), master last**.
|
||||||
|
|
||||||
|
### Phase 1: Pre-flight
|
||||||
|
Run [Shared Pre-flight Checks](#shared-pre-flight-checks).
|
||||||
|
|
||||||
|
Additionally:
|
||||||
|
```bash
|
||||||
|
# Snapshot current pod distribution
|
||||||
|
$KC get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn
|
||||||
|
echo "---"
|
||||||
|
# Check PDBs that could block drains
|
||||||
|
$KC get pdb -A
|
||||||
|
echo "---"
|
||||||
|
# Check MySQL cluster health before starting
|
||||||
|
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null
|
||||||
|
```
|
||||||
|
|
||||||
|
### Phase 2: Rolling Cycle
|
||||||
|
|
||||||
|
Process nodes in this order: **k8s-node2 → k8s-node3 → k8s-node4 → k8s-node1 (GPU) → k8s-master**
|
||||||
|
|
||||||
|
For each node, execute this sequence:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
NODE=<node-name>
|
||||||
|
VMID=<vmid>
|
||||||
|
echo "========================================="
|
||||||
|
echo "Rolling reboot: $NODE (VMID $VMID)"
|
||||||
|
echo "========================================="
|
||||||
|
|
||||||
|
# Step 1: Drain
|
||||||
|
echo "Draining $NODE..."
|
||||||
|
$KC drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=300s
|
||||||
|
|
||||||
|
# Step 2: Graceful shutdown
|
||||||
|
echo "Shutting down VM $VMID..."
|
||||||
|
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300"
|
||||||
|
|
||||||
|
# Step 3: Wait for VM to stop
|
||||||
|
for i in $(seq 1 60); do
|
||||||
|
STATUS=$(ssh root@192.168.1.127 "qm status $VMID" 2>/dev/null)
|
||||||
|
if echo "$STATUS" | grep -q stopped; then echo "VM $VMID stopped"; break; fi
|
||||||
|
sleep 5
|
||||||
|
done
|
||||||
|
|
||||||
|
# Step 4: Start VM
|
||||||
|
echo "Starting VM $VMID..."
|
||||||
|
ssh root@192.168.1.127 "qm start $VMID"
|
||||||
|
|
||||||
|
# Step 5: Wait for node Ready
|
||||||
|
for i in $(seq 1 30); do
|
||||||
|
STATUS=$($KC get node $NODE --no-headers 2>/dev/null | awk '{print $2}')
|
||||||
|
if [ "$STATUS" = "Ready" ]; then echo "$NODE is Ready"; break; fi
|
||||||
|
echo "Waiting... ($i/30)"
|
||||||
|
sleep 10
|
||||||
|
done
|
||||||
|
|
||||||
|
# Step 6: Uncordon
|
||||||
|
$KC uncordon $NODE
|
||||||
|
echo "$NODE uncordoned"
|
||||||
|
|
||||||
|
# Step 7: Verify DaemonSets
|
||||||
|
$KC get pods -A --field-selector spec.nodeName=$NODE --no-headers | wc -l
|
||||||
|
echo "DaemonSet pods scheduled on $NODE"
|
||||||
|
|
||||||
|
# Step 8: Cool-down (2 min)
|
||||||
|
echo "Cooling down for 2 minutes before next node..."
|
||||||
|
sleep 120
|
||||||
|
```
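The per-node sequence above is written for one `NODE`/`VMID` pair and has to be repeated in the stated order. One way to make the order and the name-to-VMID mapping explicit is sketched below; the variable and function names are illustrative, and the mapping is copied from the VM inventory.

```bash
# Order from Phase 2: plain workers first, GPU worker next, master last.
ROLLING_ORDER="k8s-node2 k8s-node3 k8s-node4 k8s-node1 k8s-master"

# Translate a node name to its Proxmox VMID.
vmid_for_node() {
  case "$1" in
    k8s-master) echo 200 ;;
    k8s-node1)  echo 201 ;;
    k8s-node2)  echo 202 ;;
    k8s-node3)  echo 203 ;;
    k8s-node4)  echo 204 ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Print one "<node> <vmid>" line per node, in rolling order.
rolling_plan() {
  for node in $ROLLING_ORDER; do
    vmid=$(vmid_for_node "$node") || return 1
    echo "$node $vmid"
  done
}
```

Feeding `rolling_plan` into `while read -r NODE VMID; do ...; done` would drive the sequence above once per node, with the mid-cycle health check inserted before the final `k8s-master` line.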
|
||||||
|
|
||||||
|
### Phase 3: Mid-cycle Health Check (after all workers, before master)
|
||||||
|
|
||||||
|
After k8s-node1 (last worker) is back and uncordoned, check critical services before cycling the master:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo "=== Mid-cycle health check ==="
|
||||||
|
|
||||||
|
# All 4 workers should be Ready
|
||||||
|
$KC get nodes
|
||||||
|
|
||||||
|
# Check MySQL cluster — if any member is not ONLINE, fix before cycling master
|
||||||
|
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
|
||||||
|
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null
|
||||||
|
|
||||||
|
# Check Vault — if sealed, fix before cycling master
|
||||||
|
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'
|
||||||
|
|
||||||
|
# Check no pods stuck
|
||||||
|
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -10
|
||||||
|
|
||||||
|
echo "=== Proceeding to master node ==="
|
||||||
|
```
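The health check above prints status but leaves the go/no-go decision to the reader. A sketch of an explicit gate follows. It assumes the MySQL command prints `OK` when the cluster is fully healthy and that `jq '.sealed'` prints `false` for an unsealed Vault; treating any other MySQL status as a stop is a deliberately conservative assumption, and the function name is illustrative.

```bash
# Decide whether it is safe to cycle the master.
# Args: <mysql-replicaset-status> <vault-sealed-flag>
# Succeeds only when MySQL reports OK and Vault reports sealed=false.
ok_to_cycle_master() {
  mysql_status=$1
  vault_sealed=$2
  [ "$mysql_status" = "OK" ] && [ "$vault_sealed" = "false" ]
}
```

Capturing the two command outputs into variables and calling `ok_to_cycle_master "$MYSQL_STATUS" "$VAULT_SEALED" || exit 1` would stop the rolling cycle before the control plane is touched.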
|
||||||
|
|
||||||
|
If MySQL shows issues at this point, run [MySQL InnoDB Cluster Recovery](#37-mysql-innodb-cluster-recovery) before proceeding to the master.
|
||||||
|
|
||||||
|
### Phase 4: Post-rolling Validation
|
||||||
|
|
||||||
|
After the master is back and uncordoned, run the full validation suite from [Phase 3 of Full Host Reboot](#phase-3-post-boot-validation-30-min-timeout), including:
|
||||||
|
- All nodes Ready (3.3)
|
||||||
|
- Critical pods tiered check (3.4)
|
||||||
|
- Vault verification (3.5)
|
||||||
|
- Proxmox-LVM PVC validation (3.6)
|
||||||
|
- MySQL InnoDB recovery if needed (3.7) — **with user confirmation**
|
||||||
|
- Redis validation (3.8)
|
||||||
|
- ESO sync (3.9)
|
||||||
|
- Stuck pods check (3.10)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
| Symptom | Cause | Fix |
|
| Symptom | Cause | Fix |
|
||||||
|---------|-------|-----|
|
|---------|-------|-----|
|
||||||
| VM won't start | Proxmox host disk full | `ssh root@192.168.1.127 'df -h'` — check thin pool |
|
| VM won't start | Proxmox host disk full | `ssh root@192.168.1.127 'df -h'` — check thin pool usage with `lvs pve/data` |
|
||||||
| Node stays NotReady | kubelet/containerd not starting | `qm guest exec <VMID> -- systemctl status kubelet` |
|
| Node stays NotReady | kubelet/containerd not starting | `qm guest exec <VMID> -- systemctl status kubelet` and `systemctl status containerd` |
|
||||||
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for TrueNAS ZFS pool import, or `qm guest exec 9000 -- zpool status` |
|
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for ZFS pool import: `qm guest exec 9000 -- zpool status` |
|
||||||
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar logs: `kubectl logs -n vault vault-0 -c vault-unseal` |
|
| Proxmox-LVM PVCs stuck | proxmox-csi-plugin not running | `$KC get pods -n proxmox-csi` — check CSI node plugin on affected node. Check LVM thin pool: `qm guest exec <VMID> -- lvs` |
|
||||||
| MySQL slow to recover | InnoDB init containers (~20 min each) | Normal — monitor progress with `kubectl logs -n dbaas` |
|
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar: `$KC logs -n vault vault-0 -c auto-unseal --tail=20`. Check unseal key secret exists: `$KC get secret -n vault vault-unseal-key` |
|
||||||
|
| Vault Raft peer missing | Pod restarted on different node | `$KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200` |
|
||||||
|
| MySQL 0 ONLINE members | Complete outage — operator can't recover | See [MySQL InnoDB Cluster Recovery](#37-mysql-innodb-cluster-recovery) — requires user confirmation |
|
||||||
|
| MySQL router auth failure | Reboot recreated internal users | Reset passwords from K8s secrets + setupRouterAccount — see section 3.7 |
|
||||||
|
| Redis READONLY errors | HAProxy routing to replica | Check HAProxy pod routing config, verify `redis` service selector points to `app=redis-haproxy` |
|
||||||
|
| ESO secrets not syncing | Vault sealed or token expired | Unseal Vault first, then force-sync: `$KC annotate externalsecrets -A --all force-sync=$(date +%s) --overwrite` |
|
||||||
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with `/cluster-health` |
|
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals — wait 5 min, then check with `/cluster-health` |
|
||||||
| containerd won't start | Disk full or corrupted images | `qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50` |
|
| containerd won't start | Disk full or corrupted images | `qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50`. If blobs are corrupted, see the containerd cleanup procedure in memory. |
|
||||||
|
| Drain stuck on PDB | PDB minAvailable can't be met | Check PDB: `$KC get pdb -A`. Scale up another replica first, or as a last resort drain with `--disable-eviction`, which bypasses PDB checks |
|
||||||
|
|
|
||||||