dot_files/dot_claude/skills/reboot-server.md

name: reboot-server
description: Safely reboot the Proxmox host, a single k8s node, or perform a rolling reboot of all nodes with graceful shutdown, MySQL auto-recovery, and post-boot validation. Use when the user asks to "reboot the server", "reboot proxmox", "reboot node", "restart the host", "rolling reboot", "reboot all nodes", "power cycle".

Overview

This skill safely reboots infrastructure with three modes:

  1. Full host reboot — reboot the entire Proxmox R730 host (all VMs)
  2. Single node reboot — drain, reboot one k8s node VM, uncordon
  3. Rolling reboot — cycle all 5 k8s nodes sequentially (workers first, master last)

Connection Details

  • Proxmox host: ssh root@192.168.1.127
  • kubectl: KUBECONFIG=/Users/viktorbarzin/code/config kubectl
  • VM commands: ssh root@192.168.1.127 'qm <command>'

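Shorthand used below: KC="env KUBECONFIG=/Users/viktorbarzin/code/config kubectl" (the env prefix is needed so that $KC expands to a runnable command; a bare VAR=value prefix stored in a variable is not treated as an assignment)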

VM Inventory & Boot Order

| Order | VMID | Name | Startup Delay | Shutdown Timeout | Notes |
|---|---|---|---|---|---|
| 1 | 101 | pfSense | 0s | 120s | Gateway/DHCP/DNS; must boot first |
| 2 | 9000 | TrueNAS | 60s | 300s | NFS storage; needs network from pfSense |
| 3 | 220 | docker-registry | 60s | 120s | Pull-through cache (fallback: upstream) |
| 3 | 102 | devvm | 60s | 120s | Dev VM; needs pfSense for network |
| 4 | 200 | k8s-master | 45s | 420s | Control plane; must be up before workers |
| 5 | 201 | k8s-node1 | 45s | 420s | GPU node (Tesla T4) |
| 5 | 202 | k8s-node2 | 45s | 420s | Worker |
| 5 | 203 | k8s-node3 | 45s | 420s | Worker |
| 5 | 204 | k8s-node4 | 45s | 420s | Worker |
| 6 | 103 | home-assistant | 0s | 120s | HA Sofia; no ordering dependency |
| 6 | 300 | Windows10 | 0s | 120s | Windows VM; no ordering dependency |

Shutdown order is the reverse of boot order (Proxmox handles this automatically).
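To make the reversal concrete, the sketch below previews a shutdown sequence by sorting the boot-order values from the inventory in descending order. The inline inventory is copied from the table above. The `shutdown_order` function name is an assumption made for illustration. Proxmox performs this ordering itself, so the snippet is only a way to preview it.

```shell
# "order vmid name" triples copied from the VM inventory table.
INVENTORY='1 101 pfSense
2 9000 TrueNAS
3 220 docker-registry
3 102 devvm
4 200 k8s-master
5 201 k8s-node1
5 202 k8s-node2
5 203 k8s-node3
5 204 k8s-node4
6 103 home-assistant
6 300 Windows10'

# Hypothetical helper: print the order groups from last to first,
# listing the VMs that share a group on one line.
shutdown_order() {
  printf '%s\n' "$INVENTORY" | sort -k1,1nr -k2,2n | awk '
    $1 != prev { if (NR > 1) printf "\n"; printf "order %s:", $1; prev = $1 }
    { printf " %s(%s)", $3, $2 }
    END { printf "\n" }'
}

shutdown_order
```

The first line printed is the group that stops first (order 6) and the last line is pfSense, which stays up until everything that depends on it has stopped.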

Node-to-VMID Mapping

k8s-master = 200
k8s-node1 = 201 (GPU node — Immich ML, Frigate, Ollama)
k8s-node2 = 202
k8s-node3 = 203
k8s-node4 = 204

Mode Selection

Ask the user which mode they want if unclear:

  1. Full host reboot — reboot the entire Proxmox R730 host
  2. Single node reboot — drain, reboot one k8s node VM, uncordon
  3. Rolling reboot — cycle all 5 k8s nodes sequentially (for kernel updates, maintenance)

Shared Pre-flight Checks

These checks apply to ALL modes. Run them before proceeding.

PF-1: Verify Proxmox SSH access

ssh -o ConnectTimeout=5 root@192.168.1.127 'hostname && uptime'

PF-2: Verify cluster health

$KC get nodes
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -20

PF-3: Check and configure VM boot ordering

# Check current boot order for all VMs
ssh root@192.168.1.127 'for VMID in 101 102 103 200 201 202 203 204 220 300 9000; do echo "VMID $VMID: $(qm config $VMID 2>/dev/null | grep ^startup || echo "no startup config")"; done'

If not configured, apply boot order (idempotent):

ssh root@192.168.1.127 '
qm set 101 --startup order=1,down=120
qm set 9000 --startup order=2,up=60,down=300
qm set 102 --startup order=3,up=60,down=120
qm set 220 --startup order=3,up=60,down=120
qm set 200 --startup order=4,up=45,down=420
qm set 201 --startup order=5,up=45,down=420
qm set 202 --startup order=5,up=45,down=420
qm set 203 --startup order=5,up=45,down=420
qm set 204 --startup order=5,up=45,down=420
qm set 103 --startup order=6,down=120
qm set 300 --startup order=6,down=120
'

PF-4: Check kubelet priority-based shutdown on all k8s nodes

Kubelet uses shutdownGracePeriodByPodPriority for ordered pod shutdown (lowest priority stopped first):

| Priority | Tier | Grace | Stopped |
|---|---|---|---|
| 0 | unclassified | 20s | 1st |
| 200000 | tier-4-aux | 20s | 2nd |
| 400000 | tier-3-edge | 30s | 3rd |
| 600000 | tier-2-gpu | 30s | 4th |
| 800000 | tier-1-cluster (DBs) | 90s | 5th |
| 1000000 | tier-0-core | 30s | 6th |
| 1200000 | gpu-workload | 30s | 7th |
| 2000000000 | system-cluster-critical | 30s | 8th |
| 2000001000 | system-node-critical | 30s | 9th (last) |

Total: 310s of kubelet grace periods, inside the 420s VM shutdown timeout and the 480s logind InhibitDelayMaxSec.
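The arithmetic behind that total can be checked with a short sketch. The grace periods are the nine values from the priority table. The comparison assumes the intent is for the kubelet budget to fit inside the VM shutdown timeout, and for that timeout to stay below InhibitDelayMaxSec. That ordering is inferred from the three numbers, not stated in the source.

```shell
# Grace periods in seconds, one per priority tier, in table order.
GRACE_PERIODS="20 20 30 30 90 30 30 30 30"
VM_SHUTDOWN_TIMEOUT=420   # Proxmox "down=" value used for the k8s VMs
INHIBIT_DELAY_MAX=480     # logind InhibitDelayMaxSec

# Sum the per-tier grace periods to get the worst-case kubelet shutdown time.
total=0
for seconds in $GRACE_PERIODS; do
  total=$((total + seconds))
done
echo "kubelet total: ${total}s"

# Assumed invariant: kubelet total < VM timeout < InhibitDelayMaxSec.
if [ "$total" -lt "$VM_SHUTDOWN_TIMEOUT" ] && [ "$VM_SHUTDOWN_TIMEOUT" -lt "$INHIBIT_DELAY_MAX" ]; then
  echo "timeouts nest correctly"
else
  echo "timeouts are misordered"
fi
```

If a tier's grace period is raised later, rerunning the sum shows whether the VM timeout and InhibitDelayMaxSec need to grow with it.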

for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- grep -c shutdownGracePeriodByPodPriority /var/lib/kubelet/config.yaml 2>/dev/null" || echo "NOT SET"
done

If not set on any node, patch it with the python3/yaml approach:

VMID=<target>
ssh root@192.168.1.127 "qm guest exec $VMID -- python3 -c \"
import yaml
with open('/var/lib/kubelet/config.yaml') as f:
    cfg = yaml.safe_load(f)
cfg.pop('shutdownGracePeriod', None)
cfg.pop('shutdownGracePeriodCriticalPods', None)
cfg.pop('shutdownGracePeriodByPodPriority', None)
cfg['shutdownGracePeriodByPodPriority'] = [
    {'priority': 0,          'shutdownGracePeriodSeconds': 20},
    {'priority': 200000,     'shutdownGracePeriodSeconds': 20},
    {'priority': 400000,     'shutdownGracePeriodSeconds': 30},
    {'priority': 600000,     'shutdownGracePeriodSeconds': 30},
    {'priority': 800000,     'shutdownGracePeriodSeconds': 90},
    {'priority': 1000000,    'shutdownGracePeriodSeconds': 30},
    {'priority': 1200000,    'shutdownGracePeriodSeconds': 30},
    {'priority': 2000000000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 2000001000, 'shutdownGracePeriodSeconds': 30},
]
with open('/var/lib/kubelet/config.yaml', 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False)
print('done')
\""

# Update systemd timeouts to match
ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c '
mkdir -p /etc/systemd/logind.conf.d
echo -e \"[Login]\nInhibitDelayMaxSec=480\" > /etc/systemd/logind.conf.d/kubelet-shutdown.conf
systemctl restart systemd-logind
mkdir -p /etc/systemd/system/kubelet.service.d
echo -e \"[Service]\nTimeoutStopSec=420s\" > /etc/systemd/system/kubelet.service.d/20-shutdown.conf
systemctl daemon-reload
systemctl restart kubelet
'"

PF-5: Check unattended-upgrades on k8s nodes

for VMID in 200 201 202 203 204; do
  echo "=== VMID $VMID ==="
  ssh root@192.168.1.127 "qm guest exec $VMID -- systemctl is-active unattended-upgrades 2>/dev/null" || echo "not found (good)"
done

If active, disable: ssh root@192.168.1.127 "qm guest exec $VMID -- bash -c 'systemctl disable --now unattended-upgrades; apt-get remove -y unattended-upgrades'"

PF-6: Report pre-flight status

Summarize: all checks passed, or list what was fixed. Ask user for final confirmation before proceeding.


Full Host Reboot Procedure

Phase 1: Pre-flight

Run Shared Pre-flight Checks above.

Phase 2: Reboot

Get explicit user confirmation before proceeding.

ssh root@192.168.1.127 'reboot'

Wait for SSH to drop:

for i in $(seq 1 30); do
  ssh -o ConnectTimeout=3 -o BatchMode=yes root@192.168.1.127 'true' 2>/dev/null || { echo "Host is rebooting"; break; }
  sleep 2
done

Poll until Proxmox is back (timeout: 10 minutes):

for i in $(seq 1 20); do
  if ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 'uptime' 2>/dev/null; then
    echo "Proxmox host is back online"
    break
  fi
  echo "Waiting for Proxmox host... (attempt $i/20)"
  sleep 30
done
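The polling loops in this runbook all share one shape: try a command, sleep, and give up after a fixed number of attempts. A minimal sketch of that pattern as a reusable function follows. The name `wait_for` and its argument order are assumptions for illustration and not part of the original skill.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <interval_seconds> <command> [args...]
wait_for() {
  attempts=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    # Skip the sleep after the final attempt so failure is reported promptly.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With this helper, the 10-minute Proxmox poll could be written as `wait_for 20 30 ssh -o ConnectTimeout=5 -o BatchMode=yes root@192.168.1.127 uptime`. The return code then tells the caller whether the host came back.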

Phase 3: Post-boot Validation (30 min timeout)

Run these checks in a loop, reporting progress.

3.1 Check all VMs are running

ssh root@192.168.1.127 'qm list' | grep -v stopped

If any critical VM (101, 9000, 200-204, 220) is not running after 5 minutes: qm start <VMID>
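One way to turn that rule into a check is to filter the `qm list` output for critical VMIDs that are not running. The sketch below assumes the usual `qm list` column layout, with VMID, NAME, and STATUS as the first three fields. The function name `critical_vms_down` is hypothetical.

```shell
# VMIDs this runbook treats as critical.
CRITICAL_VMIDS="101 9000 200 201 202 203 204 220"

# Hypothetical helper: read `qm list` output on stdin and print each critical
# VMID that is either listed with a STATUS other than "running" or missing
# from the listing entirely.
critical_vms_down() {
  awk -v critical="$CRITICAL_VMIDS" '
    BEGIN { n = split(critical, ids, " "); for (i = 1; i <= n; i++) want[ids[i]] = 1 }
    ($1 in want) { seen[$1] = 1; if ($3 != "running") print $1 }
    END { for (i = 1; i <= n; i++) if (!(ids[i] in seen)) print ids[i] }'
}
```

A possible invocation is `ssh root@192.168.1.127 'qm list' | critical_vms_down`. Each VMID it prints is a candidate for `qm start <VMID>` once the 5-minute grace window has passed.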

3.2 Check k8s API is reachable

$KC cluster-info 2>/dev/null

3.3 Check all nodes are Ready

$KC get nodes

All 5 nodes should show Ready.

3.4 Check critical infrastructure pods (tiered)

# Tier 0: Core networking
$KC get pods -n metallb-system -l app=metallb
$KC get pods -n kube-system -l k8s-app=kube-dns
$KC get pods -n technitium

# Tier 1: Storage
$KC get pods -n proxmox-csi
$KC get pods -n nfs-csi

# Tier 2: Ingress + tunnel
$KC get pods -n traefik
$KC get pods -n cloudflared

# Tier 3: Security
$KC get pods -n kyverno

# Tier 4: Data layer
$KC get pods -n redis
$KC get pods -n dbaas

# Tier 5: Secrets + auth
$KC get pods -n vault
$KC get pods -n authentik
$KC get pods -n external-secrets

3.5 Vault Verification

# Check sealed status
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'

# Check all 3 Vault pods are Running
$KC get pods -n vault -l app.kubernetes.io/name=vault

# If unsealed, check Raft peers
$KC exec -n vault vault-0 -- vault operator raft list-peers 2>/dev/null

If sealed after 5 minutes:

# Check auto-unseal sidecar logs
$KC logs -n vault vault-0 -c auto-unseal --tail=20

If Raft has missing peers after unseal:

$KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200
$KC exec -n vault vault-2 -- vault operator raft join http://vault-0.vault-internal:8200

3.6 Proxmox-LVM PVC Validation

Note: VolumeAttachments auto-detach in ~2 min (60s pod eviction + 15s attach-detach reconcile). If pods are stuck in ContainerCreating with Multi-Attach errors, wait 2-3 min before intervening. Only escalate if CSI controller pod is not running.

# Check all PVCs: none should be Pending (except newly created).
# PVCs do not support a status.phase field selector, so filter on the STATUS column instead.
$KC get pvc -A --no-headers 2>/dev/null | awk '$3 != "Bound"' | head -20

# Check proxmox-lvm PVCs specifically
$KC get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName=="proxmox-lvm") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'

If any proxmox-lvm PVCs are stuck:

# Check proxmox-csi-plugin pods
$KC get pods -n proxmox-csi
# Check node LVM thin pool
ssh root@192.168.1.127 "for VMID in 200 201 202 203 204; do echo '=== VMID '\$VMID' ==='; qm guest exec \$VMID -- lvs --noheadings -o lv_name,lv_size,data_percent pve/data 2>/dev/null; done"

3.7 MySQL InnoDB Cluster Recovery

This is the most critical post-reboot step. MySQL InnoDB Cluster cannot auto-recover from a complete outage.

# Get root password
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)

# Check cluster status
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(JSON.stringify(dba.getCluster().status()))" 2>/dev/null

If status shows 0 ONLINE members (complete outage):

⚠️ ASK USER FOR CONFIRMATION before running rebootClusterFromCompleteOutage. Explain:

  • This will designate mysql-cluster-0 as the new primary
  • If data diverged between members, this picks cluster-0's data
  • Partial outages should NOT use this command (use rejoinInstance instead)
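The decision described in these bullets depends only on how many members report ONLINE. A small sketch of that branching follows. The function name `mysql_recovery_action` and its message strings are assumptions for illustration. The member count would come from the status query shown earlier in this section.

```shell
# Hypothetical helper: map the ONLINE member count to the action this runbook
# describes. Arguments: <online_count> <expected_members>.
mysql_recovery_action() {
  online=$1
  expected=$2
  if [ "$online" -eq 0 ]; then
    echo "complete outage: ask the user, then rebootClusterFromCompleteOutage"
  elif [ "$online" -lt "$expected" ]; then
    echo "partial outage: use rejoinInstance for the missing members"
  else
    echo "healthy: no recovery needed"
  fi
}

mysql_recovery_action 0 3
mysql_recovery_action 2 3
mysql_recovery_action 3 3
```

Keeping the complete-outage case separate makes it harder to run the destructive command when only one or two members are down.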

If confirmed:

# Reboot cluster from complete outage
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "dba.rebootClusterFromCompleteOutage()"

# Wait up to 10 minutes for all 3 members to come ONLINE
for i in $(seq 1 60); do
  ONLINE=$($KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.topology)" 2>/dev/null | grep -c '"status": "ONLINE"')
  if [ "$ONLINE" = "3" ]; then
    echo "All 3 MySQL members ONLINE"
    break
  fi
  echo "MySQL members ONLINE: $ONLINE/3 (attempt $i/60)"
  sleep 10
done

Fix operator/router authentication (always needed after rebootClusterFromCompleteOutage):

# Get expected passwords from K8s secrets
ADMIN_PWD=$($KC get secret -n dbaas mysql-cluster-privsecret -o jsonpath='{.data.clusterAdminPassword}' | base64 -d 2>/dev/null)

# Reset operator admin user password
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -e "
ALTER USER IF EXISTS 'mysqladmin'@'%' IDENTIFIED BY '${ADMIN_PWD}';
"

# Recreate router user with full privileges (the reboot drops it)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "
var c = dba.getCluster();
c.setupRouterAccount('mysqlrouter@\"%\"', {update: true});
"

# Restart operator and router pods to pick up new credentials
$KC delete pod -n dbaas -l app.kubernetes.io/name=mysql-operator
$KC delete pod -n dbaas -l app.kubernetes.io/component=router

Verify MySQL is fully operational:

# Check cluster status
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)"

# Check router is accepting connections
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysql -u root -p"${ROOT_PWD}" -h mysql.dbaas.svc.cluster.local -e "SELECT 1"

3.8 Redis Validation

# Check Redis pods
$KC get pods -n redis

# Check HAProxy is routing to master
HAPROXY_POD=$($KC get pods -n redis -l app=redis-haproxy -o jsonpath='{.items[0].metadata.name}')
$KC exec -n redis $HAPROXY_POD -- cat /tmp/haproxy.cfg 2>/dev/null | grep "server redis" || echo "HAProxy config not found — check haproxy pod"

# Verify writes work through master
$KC exec -n redis redis-node-1 -- redis-cli SET reboot-test ok 2>/dev/null && \
$KC exec -n redis redis-node-1 -- redis-cli DEL reboot-test 2>/dev/null && \
echo "Redis write test passed" || echo "Redis write FAILED — check master routing"

3.9 ESO Sync Verification

# Force ESO to re-sync all secrets (Vault may have rotated passwords)
$KC annotate externalsecrets.external-secrets.io -A --all force-sync=$(date +%s) --overwrite 2>/dev/null

# Wait 60s for sync
sleep 60

# Check for failed syncs
FAILED=$($KC get externalsecrets -A --no-headers 2>/dev/null | grep -v SecretSynced | grep -v "SYNCED" || true)
if [ -n "$FAILED" ]; then
  echo "WARNING: Some ExternalSecrets failed to sync:"
  echo "$FAILED"
else
  echo "All ExternalSecrets synced successfully"
fi

3.10 Check for stuck pods

$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -30

Phase 4: Final Validation

  • If all checks pass: report "Cluster fully recovered in X minutes"
  • If issues remain after 30 minutes: invoke the /cluster-health skill for diagnosis
  • Store a memory summarizing the reboot: timestamp, duration, any issues encountered, lessons learned

Single Node Reboot Procedure

Phase 1: Pre-flight

1.1 Identify target

Map the user's input to a node name and VMID:

  • k8s-master / master / 200 → k8s-master (VMID 200)
  • k8s-node1 / node1 / 201 → k8s-node1 (VMID 201, GPU)
  • k8s-node2 / node2 / 202 → k8s-node2 (VMID 202)
  • k8s-node3 / node3 / 203 → k8s-node3 (VMID 203)
  • k8s-node4 / node4 / 204 → k8s-node4 (VMID 204)
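The mapping above can be sketched as a `case` statement that accepts any of the listed spellings and returns the canonical node name and VMID. The function name `resolve_node` is hypothetical. The aliases and VMIDs are the ones listed above.

```shell
# Hypothetical helper: normalise user input to "<node-name> <VMID>".
resolve_node() {
  case "$1" in
    k8s-master|master|200) echo "k8s-master 200" ;;
    k8s-node1|node1|201)   echo "k8s-node1 201" ;;
    k8s-node2|node2|202)   echo "k8s-node2 202" ;;
    k8s-node3|node3|203)   echo "k8s-node3 203" ;;
    k8s-node4|node4|204)   echo "k8s-node4 204" ;;
    *) echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Split the result into the two variables the later steps need.
set -- $(resolve_node node1)
NODE=$1
VMID=$2
echo "$NODE -> VMID $VMID"
```

Unknown input returns a non-zero status, so a caller can stop and ask the user instead of guessing a target.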

1.2 Check node status

$KC get node <node-name>

1.3 Check what's running on the node

$KC get pods -A --field-selector spec.nodeName=<node-name> | head -30

Warn the user if the node runs:

  • GPU workloads (node1) — Immich ML, Frigate, Ollama
  • Database pods — MySQL InnoDB, PostgreSQL CNPG (check if it's primary)
  • Vault pods — may need re-unseal

1.4 Check PDBs won't block drain

$KC get pdb -A

Phase 2: Drain + Reboot

# Drain the node
$KC drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Shutdown the VM gracefully
ssh root@192.168.1.127 "qm shutdown <VMID> --timeout 300"

# Wait for VM to stop
for i in $(seq 1 60); do
  STATUS=$(ssh root@192.168.1.127 "qm status <VMID>" 2>/dev/null)
  if echo "$STATUS" | grep -q stopped; then
    echo "VM stopped"
    break
  fi
  echo "Waiting for VM to stop... ($i/60)"
  sleep 5
done

# Start the VM
ssh root@192.168.1.127 "qm start <VMID>"

Phase 3: Recovery

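# Wait for node to become Ready (timeout: 5 min)
# A drained node is still cordoned, so STATUS reads "Ready,SchedulingDisabled";
# strip everything after the first comma before comparing.
for i in $(seq 1 30); do
  STATUS=$($KC get node <node-name> --no-headers 2>/dev/null | awk '{print $2}')
  if [ "${STATUS%%,*}" = "Ready" ]; then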
    echo "Node is Ready"
    break
  fi
  echo "Waiting for node to become Ready... ($i/30)"
  sleep 10
done

# Uncordon the node
$KC uncordon <node-name>

# Verify DaemonSet pods are scheduled
$KC get pods -A --field-selector spec.nodeName=<node-name> | head -20

echo "Node <node-name> rebooted and uncordoned successfully"
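Between the drain and the uncordon, the node is cordoned, and `kubectl get node` reports its status as `Ready,SchedulingDisabled` instead of plain `Ready`. The sketch below shows a readiness test that tolerates that suffix. The function name `node_is_ready` is an assumption for illustration.

```shell
# Hypothetical helper: succeed when the STATUS column of `kubectl get node`
# is "Ready" with or without a suffix such as ",SchedulingDisabled".
node_is_ready() {
  case "$1" in
    Ready|Ready,*) return 0 ;;
    *) return 1 ;;
  esac
}

for status in "Ready" "Ready,SchedulingDisabled" "NotReady,SchedulingDisabled"; do
  if node_is_ready "$status"; then
    echo "$status -> ready"
  else
    echo "$status -> not ready"
  fi
done
```

The `Ready,*` pattern is used in place of a bare `Ready*` prefix match so that an unexpected status beginning with the same letters is not mistaken for a healthy node.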

Rolling Reboot Procedure

Cycles all 5 k8s nodes sequentially: workers first (node2→3→4→1), master last.

Phase 1: Pre-flight

Run Shared Pre-flight Checks.

Additionally:

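# Snapshot current pod distribution
# (custom-columns avoids miscounting when RESTARTS prints as "3 (5h ago)" and shifts the wide columns)
$KC get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName | sort | uniq -c | sort -rn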
echo "---"
# Check PDBs that could block drains
$KC get pdb -A
echo "---"
# Check MySQL cluster health before starting
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null

Phase 2: Rolling Cycle

Process nodes in this order: k8s-node2 → k8s-node3 → k8s-node4 → k8s-node1 (GPU) → k8s-master

For each node, execute this sequence:

NODE=<node-name>
VMID=<vmid>
echo "========================================="
echo "Rolling reboot: $NODE (VMID $VMID)"
echo "========================================="

# Step 1: Drain
echo "Draining $NODE..."
$KC drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=300s

# Step 2: Graceful shutdown
echo "Shutting down VM $VMID..."
ssh root@192.168.1.127 "qm shutdown $VMID --timeout 300"

# Step 3: Wait for VM to stop
for i in $(seq 1 60); do
  STATUS=$(ssh root@192.168.1.127 "qm status $VMID" 2>/dev/null)
  if echo "$STATUS" | grep -q stopped; then echo "VM $VMID stopped"; break; fi
  sleep 5
done

# Step 4: Start VM
echo "Starting VM $VMID..."
ssh root@192.168.1.127 "qm start $VMID"

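# Step 5: Wait for node Ready
# The node is still cordoned here, so STATUS reads "Ready,SchedulingDisabled";
# strip everything after the first comma before comparing.
for i in $(seq 1 30); do
  STATUS=$($KC get node $NODE --no-headers 2>/dev/null | awk '{print $2}')
  if [ "${STATUS%%,*}" = "Ready" ]; then echo "$NODE is Ready"; break; fi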
  echo "Waiting... ($i/30)"
  sleep 10
done

# Step 6: Uncordon
$KC uncordon $NODE
echo "$NODE uncordoned"

# Step 7: Verify DaemonSets
$KC get pods -A --field-selector spec.nodeName=$NODE --no-headers | wc -l
echo "DaemonSet pods scheduled on $NODE"

# Step 8: Cool-down (2 min)
echo "Cooling down for 2 minutes before next node..."
sleep 120
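The per-node sequence above could be driven by a loop over the documented node order. The sketch below is a dry run only. `reboot_node` is a hypothetical placeholder that prints what it would do, and stopping at the first failure is an assumed design choice, not something the original states.

```shell
# Node order from Phase 2, written as "<node-name>:<VMID>" pairs.
ROLLING_ORDER="k8s-node2:202 k8s-node3:203 k8s-node4:204 k8s-node1:201 k8s-master:200"

# Hypothetical placeholder: a real run would perform steps 1 to 8 here.
# It only prints its intent, so the loop is safe to dry-run.
reboot_node() {
  echo "would drain, reboot and uncordon $1 (VMID $2)"
}

for entry in $ROLLING_ORDER; do
  NODE=${entry%%:*}   # text before the colon
  VMID=${entry##*:}   # text after the colon
  # Assumed policy: stop at the first failure so a broken node is never
  # followed by another reboot.
  reboot_node "$NODE" "$VMID" || { echo "aborting at $NODE" >&2; break; }
done
```

A real driver would also pause after k8s-node1 to run the mid-cycle health check before touching k8s-master.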

Phase 3: Mid-cycle Health Check (after all workers, before master)

After k8s-node1 (last worker) is back and uncordoned, check critical services before cycling the master:

echo "=== Mid-cycle health check ==="

# All 4 workers should be Ready
$KC get nodes

# Check MySQL cluster — if any member is not ONLINE, fix before cycling master
ROOT_PWD=$($KC get secret cluster-secret -n dbaas -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
$KC exec -n dbaas mysql-cluster-0 -c mysql -- mysqlsh --js --uri "root:${ROOT_PWD}@localhost:3306" -e "print(dba.getCluster().status().defaultReplicaSet.status)" 2>/dev/null

# Check Vault — if sealed, fix before cycling master
$KC exec -n vault vault-0 -- vault status -format=json 2>/dev/null | jq '.sealed'

# Check no pods stuck
$KC get pods -A --field-selector 'status.phase!=Running,status.phase!=Succeeded' | grep -v Completed | head -10

echo "=== Proceeding to master node ==="

If MySQL shows issues at this point, run MySQL InnoDB Cluster Recovery before proceeding to the master.

Phase 4: Post-rolling Validation

After the master is back and uncordoned, run the full validation suite from Phase 3 of Full Host Reboot, including:

  • All nodes Ready (3.3)
  • Critical pods tiered check (3.4)
  • Vault verification (3.5)
  • Proxmox-LVM PVC validation (3.6)
  • MySQL InnoDB recovery if needed (3.7) — with user confirmation
  • Redis validation (3.8)
  • ESO sync (3.9)
  • Stuck pods check (3.10)

Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| VM won't start | Proxmox host disk full | ssh root@192.168.1.127 'df -h'; check thin pool usage with lvs pve/data |
| Node stays NotReady | kubelet/containerd not starting | qm guest exec <VMID> -- systemctl status kubelet and systemctl status containerd |
| NFS PVCs stuck Pending | TrueNAS not fully booted | Wait for ZFS pool import: qm guest exec 9000 -- zpool status |
| Proxmox-LVM PVCs stuck ContainerCreating | VolumeAttachments auto-detaching (60s pod eviction + 15s reconcile) | Wait ~2 min; this auto-heals. If stuck after 3 min: check $KC get pods -n proxmox-csi (CSI controller must be running). Stale VolumeAttachments: $KC get volumeattachments -o json \| jq '.items[] \| select(.spec.nodeName=="<node>")' |
| Stale Error/Unknown pods | Pods from shutdown not GC'd | Force-delete: $KC get pods -A --field-selector status.phase=Failed --no-headers \| awk '{print "-n",$1,$2}' \| xargs -L1 $KC delete pod --force --grace-period=0 |
| Vault stays sealed | Auto-unseal sidecar not running | Check sidecar: $KC logs -n vault vault-0 -c auto-unseal --tail=20. Check unseal key secret exists: $KC get secret -n vault vault-unseal-key |
| Vault Raft peer missing | Pod restarted on different node | $KC exec -n vault vault-1 -- vault operator raft join http://vault-0.vault-internal:8200 |
| MySQL 0 ONLINE members | Complete outage; operator can't recover | See MySQL InnoDB Cluster Recovery (section 3.7); requires user confirmation |
| MySQL router auth failure | Reboot recreated internal users | Reset passwords from K8s secrets + setupRouterAccount; see section 3.7 |
| Redis READONLY errors | HAProxy routing to replica | Check HAProxy pod routing config, verify redis service selector points to app=redis-haproxy |
| ESO secrets not syncing | Vault sealed or token expired | Unseal Vault first, then force-sync: $KC annotate externalsecrets -A --all force-sync=$(date +%s) --overwrite |
| Pods CrashLoopBackOff | Dependencies not ready yet | Usually self-heals; wait 5 min, then check with /cluster-health |
| containerd won't start | Disk full or corrupted images | qm guest exec <VMID> -- journalctl -u containerd --no-pager -n 50. If blob corruption, see containerd cleanup procedure in memory. |
| Drain stuck on PDB | PDB minAvailable can't be met | Check PDB: $KC get pdb -A. May need to scale up another replica first or --disable-eviction |