[ci skip] Add extend-vm-storage script and skills

- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
Viktor Barzin 2026-02-13 22:08:46 +00:00
parent 04dd438b01
commit 08ea489fe0
4 changed files with 591 additions and 0 deletions


@@ -435,6 +435,12 @@ Skills are specialized workflows for common tasks. Located in `.claude/skills/`.
- **When to use**: User provides GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"
**extend-vm-storage** (`.claude/skills/extend-vm-storage.md`)
- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- **When to use**: A k8s node needs more disk space
- **Example**: "Extend storage on k8s-node2 by 64G"
---
## Service-Specific Notes


@@ -0,0 +1,77 @@
# Extend VM Storage Skill
**Purpose**: Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
**When to use**: User wants to increase disk space on a k8s node VM, or a node is running low on disk.
## Workflow
### 1. Identify the Node
Ask the user which node needs more storage and how much to add.
Valid nodes: `k8s-master`, `k8s-node1`, `k8s-node2`, `k8s-node3`, `k8s-node4`
### 2. Run the Script
```bash
./scripts/extend_vm_storage.sh <node-name> <size-increment>
```
**Example**:
```bash
./scripts/extend_vm_storage.sh k8s-node2 +64G
```
### 3. What the Script Does
1. Validates inputs (node name and size format)
2. Resolves node IP via kubectl
3. Prompts for confirmation
4. Drains the node (evicts pods)
5. Shuts down the VM in Proxmox
6. Resizes the disk (`scsi0`) by the given increment
7. Starts the VM and waits for SSH
8. Expands the filesystem inside the guest (auto-detects LVM vs direct partition)
9. Uncordons the node
10. Shows verification output (`df -h` and node status)
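The validation and size arithmetic in the early steps come down to a few lines of plain bash. A condensed sketch (illustrative values; the real script reads the current size from `qm config`):

```bash
#!/usr/bin/env bash
# Sketch of the script's increment validation and new-size arithmetic.
SIZE_INCREMENT="+64G"
CURRENT_SIZE="64G"   # as parsed from the VM's scsi0 config line

# The increment must match +<number>G
if [[ ! "$SIZE_INCREMENT" =~ ^\+([0-9]+)G$ ]]; then
    echo "Invalid increment: $SIZE_INCREMENT" >&2
    exit 1
fi
INCREMENT_NUM="${BASH_REMATCH[1]}"
CURRENT_SIZE_NUM="${CURRENT_SIZE%G}"
NEW_SIZE_NUM=$((CURRENT_SIZE_NUM + INCREMENT_NUM))
echo "${CURRENT_SIZE_NUM}G -> ${NEW_SIZE_NUM}G"   # 64G -> 128G
```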
### 4. Update Terraform (if needed)
Check whether the VM disk size is managed by Terraform; if it is, update the VM definition in `main.tf` or `modules/create-vm/` so that a future `terraform apply` doesn't revert the change:
```bash
grep -A5 "disk" main.tf | grep -i size
```
If managed, update the size value to match the new total.
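For example, assuming a telmate/proxmox-style `disk` block (the exact layout in `main.tf` may differ), the declared size can be extracted portably with `sed`, avoiding GNU-only `grep -P`:

```bash
# Hypothetical Terraform disk block; the real main.tf layout is an assumption here.
cat > /tmp/demo_main.tf <<'EOF'
resource "proxmox_vm_qemu" "k8s_node2" {
  disk {
    size = "64G"
  }
}
EOF
# Extract the declared size (portable across BSD and GNU sed)
sed -n 's/.*size *= *"\([0-9]*G\)".*/\1/p' /tmp/demo_main.tf
```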
### 5. Verification
After the script completes, verify:
```bash
kubectl --kubeconfig $(pwd)/config get nodes
ssh wizard@<node-ip> "df -h /"
```
## Recovery
If the script fails mid-way:
1. Check VM status: `ssh root@192.168.1.127 "qm status <vmid>"`
2. Start VM if stopped: `ssh root@192.168.1.127 "qm start <vmid>"`
3. Uncordon node: `kubectl --kubeconfig $(pwd)/config uncordon <node-name>`
## Constants
| Setting | Value |
|---------|-------|
| Proxmox host | `root@192.168.1.127` |
| VM SSH user | `wizard` |
| Disk name | `scsi0` |
| Shutdown timeout | 300s |
| SSH wait timeout | 300s |
## Questions to Ask User
1. Which node needs more storage?
2. How much storage to add? (e.g., +64G)


@@ -0,0 +1,136 @@
---
name: proxmox-vm-disk-expansion-pitfalls
description: |
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
with "invalid option -- P", (3) kubectl drain times out with pods stuck
terminating, (4) filesystem shows old size after qm resize. Covers
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
tuning, and recovery from partial failures.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---
# Proxmox VM Disk Expansion Pitfalls
## Problem
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
incompatibilities, and Kubernetes drain timeouts.
## Context / Trigger Conditions
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
- Ubuntu 24.04 cloud-init images (the default k8s node template)
- Kubernetes nodes with many pods or stateful workloads
- Using `scripts/extend_vm_storage.sh` or similar automation
## Issues and Solutions
### 1. `growpart: command not found` on Ubuntu 24.04
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
with "command not found". `resize2fs` then reports "Nothing to do!" because the
partition table hasn't been updated.
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
by default. The `growpart` tool (which updates the partition table to use new
disk space) is in this package.
**Fix**:
```bash
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
**Prevention**: Check for `growpart` before attempting partition expansion:
```bash
if ! command -v growpart &>/dev/null; then
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
fi
```
### 2. `grep -P` (PCRE) not available on macOS
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
```bash
# BAD (GNU grep only):
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
# GOOD (portable):
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
```
**General rule**: In scripts that run on macOS, avoid `grep -P`, `sed -i ''`
vs `sed -i` differences, and `date` flag differences. Use `sed` with basic
regex or bash built-in `[[ =~ ]]` for pattern matching.
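As a concrete alternative, bash's built-in `[[ =~ ]]` (POSIX ERE, identical behavior in macOS and Linux bash) can replace the `grep -oP` call with no external commands at all:

```bash
# Portable parsing with bash's built-in regex matching; no grep or sed needed.
LINE="scsi0: local-lvm:vm-202-disk-0,iothread=1,size=64G"
if [[ "$LINE" =~ size=([0-9]+G) ]]; then
    CURRENT_SIZE="${BASH_REMATCH[1]}"
fi
echo "$CURRENT_SIZE"   # 64G
```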
### 3. `kubectl drain` timeout with stuck pods
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
for multiple pods. Pods are evicted but don't terminate in time.
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
pods are draining simultaneously.
**Fix**: Use the `--force` flag with a longer timeout, or simply retry:
```bash
# First attempt with standard timeout
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# If it fails, force with longer timeout (pods already evicting)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
```
**Note**: After a failed drain, the node is already cordoned. A second drain
attempt only needs to wait for already-evicting pods to finish.
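The two-attempt pattern above can be generalized into a small retry helper (a sketch, not part of the shipped script):

```bash
# Run a command up to N times, returning as soon as it succeeds.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
    done
    return 1
}

# Usage sketch (flags as in the drain commands above):
# retry 2 kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=300s
```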
### 4. Recovery from partial failure
If the script fails mid-way (after drain but before uncordon):
```bash
# Check VM status
ssh root@192.168.1.127 "qm status <vmid>"
# Start VM if stopped
ssh root@192.168.1.127 "qm start <vmid>"
# Uncordon node
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
```
## Verification
After successful expansion:
```bash
# On the VM
df -h /
# Should show new size (128G disk → ~126G usable for ext4)
# On the cluster
kubectl get node <name>
# Should show Ready status
```
## Notes
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
the script handling both paths
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
this is not an error
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
- SSH host keys may change if VMs are recreated or network changes — use
`-o StrictHostKeyChecking=no` in automated scripts
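The NOCHANGE behavior can be handled explicitly with a small wrapper (a sketch; `GROWPART_CMD` is an illustrative override for testing, not something the script defines):

```bash
# growpart exits 0 when the partition grew, 1 on NOCHANGE, 2 on real failure.
grow_partition() {
    local out rc
    out=$("${GROWPART_CMD:-growpart}" "$@" 2>&1); rc=$?
    if [[ $rc -eq 1 ]]; then
        echo "no change: $out"
        return 0            # already at max size; not an error
    elif [[ $rc -ne 0 ]]; then
        echo "growpart failed (rc=$rc): $out" >&2
        return "$rc"
    fi
    echo "$out"
}
```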
See also: `extend-vm-storage.md` (the operational skill for running the script)

scripts/extend_vm_storage.sh (new executable file, 372 additions)

@@ -0,0 +1,372 @@
#!/usr/bin/env bash
# Extend disk storage on a Kubernetes node VM.
# Drains the node, shuts down the VM, resizes the disk in Proxmox,
# boots the VM, expands the filesystem, and uncordons the node.
#
# Usage: ./scripts/extend_vm_storage.sh <node-name> <size-increment>
# Example: ./scripts/extend_vm_storage.sh k8s-node2 +64G
# --- Constants ---
PROXMOX_HOST="root@192.168.1.127"
VM_SSH_USER="wizard"
KUBECTL="kubectl --kubeconfig $(pwd)/config"
SHUTDOWN_TIMEOUT=300
SSH_WAIT_TIMEOUT=300
POLL_INTERVAL=5
# --- Colors ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
info() { echo -e "${BLUE}[INFO]${NC} $*"; }
ok() { echo -e "${GREEN}[OK]${NC} $*"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
error() { echo -e "${RED}[ERROR]${NC} $*"; }
# --- Node-to-VMID mapping ---
# Note: associative arrays require bash 4+. macOS ships bash 3.2, so on macOS
# run this script with a newer bash (e.g. from Homebrew).
declare -A NODE_VMID=(
[k8s-master]=200
[k8s-node1]=201
[k8s-node2]=202
[k8s-node3]=203
[k8s-node4]=204
)
# --- Cleanup trap ---
DRAINED_NODE=""
cleanup() {
if [[ -n "$DRAINED_NODE" ]]; then
echo ""
error "Script exited unexpectedly!"
warn "The node '$DRAINED_NODE' may still be cordoned/drained."
warn "Recovery steps:"
warn " 1. Check VM status: ssh $PROXMOX_HOST 'qm status ${NODE_VMID[$DRAINED_NODE]}'"
warn " 2. Start VM if stopped: ssh $PROXMOX_HOST 'qm start ${NODE_VMID[$DRAINED_NODE]}'"
warn " 3. Uncordon node: $KUBECTL uncordon $DRAINED_NODE"
fi
}
trap cleanup EXIT
# --- Input validation ---
usage() {
echo "Usage: $0 <node-name> <size-increment>"
echo ""
echo "Arguments:"
echo " node-name One of: ${!NODE_VMID[*]}"
echo " size-increment Disk size increase, e.g. +64G, +128G"
echo ""
echo "Example:"
echo " $0 k8s-node2 +64G"
exit 1
}
if [[ $# -ne 2 ]]; then
usage
fi
NODE_NAME="$1"
SIZE_INCREMENT="$2"
if [[ -z "${NODE_VMID[$NODE_NAME]+x}" ]]; then
error "Unknown node: '$NODE_NAME'"
echo "Valid nodes: ${!NODE_VMID[*]}"
exit 1
fi
if [[ ! "$SIZE_INCREMENT" =~ ^\+[0-9]+G$ ]]; then
error "Invalid size increment: '$SIZE_INCREMENT'"
echo "Must match pattern +<number>G, e.g. +64G"
exit 1
fi
VMID="${NODE_VMID[$NODE_NAME]}"
# --- Resolve node IP via kubectl ---
info "Resolving IP for node '$NODE_NAME'..."
NODE_IP=$($KUBECTL get node "$NODE_NAME" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' 2>/dev/null)
if [[ -z "$NODE_IP" ]]; then
error "Could not resolve IP for node '$NODE_NAME'. Is the cluster reachable?"
exit 1
fi
ok "Node IP: $NODE_IP"
# --- Query current disk size ---
info "Querying current disk size for VM $VMID..."
SCSI0_LINE=$(ssh "$PROXMOX_HOST" "qm config $VMID" 2>/dev/null | grep '^scsi0:')
if [[ -z "$SCSI0_LINE" ]]; then
error "Could not read scsi0 config for VM $VMID."
exit 1
fi
# Extract size value, e.g. "size=64G" from the config line
CURRENT_SIZE=$(echo "$SCSI0_LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
if [[ -z "$CURRENT_SIZE" ]]; then
error "Could not parse current disk size from: $SCSI0_LINE"
exit 1
fi
CURRENT_SIZE_NUM=${CURRENT_SIZE%G}
INCREMENT_NUM=${SIZE_INCREMENT//[+G]/}
NEW_SIZE_NUM=$((CURRENT_SIZE_NUM + INCREMENT_NUM))
ok "Current disk size: ${CURRENT_SIZE_NUM}G → New size: ${NEW_SIZE_NUM}G (${SIZE_INCREMENT})"
if [[ $NEW_SIZE_NUM -le $CURRENT_SIZE_NUM ]]; then
error "New size (${NEW_SIZE_NUM}G) must be greater than current size (${CURRENT_SIZE_NUM}G)."
exit 1
fi
# --- Confirmation ---
echo ""
echo "========================================="
echo " Extend VM Storage"
echo "========================================="
echo " Node: $NODE_NAME"
echo " VMID: $VMID"
echo " Node IP: $NODE_IP"
echo " Current: ${CURRENT_SIZE_NUM}G"
echo " Increment: $SIZE_INCREMENT"
echo " New size: ${NEW_SIZE_NUM}G"
echo " Proxmox: $PROXMOX_HOST"
echo "========================================="
echo ""
echo "This will:"
echo " 1. Drain the node (evict pods)"
echo " 2. Shut down the VM"
echo " 3. Resize disk (scsi0) from ${CURRENT_SIZE_NUM}G to ${NEW_SIZE_NUM}G"
echo " 4. Start the VM"
echo " 5. Expand the filesystem inside the guest"
echo " 6. Uncordon the node"
echo ""
read -rp "Proceed? [y/N] " confirm
if [[ ! "$confirm" =~ ^[yY]$ ]]; then
echo "Aborted."
exit 0
fi
# --- Step 1: Drain node ---
info "Step 1/7: Draining node '$NODE_NAME'..."
DRAINED_NODE="$NODE_NAME"
if ! $KUBECTL drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --timeout=120s; then
error "Failed to drain node '$NODE_NAME'."
exit 1
fi
ok "Node drained."
# --- Step 2: Shutdown VM ---
info "Step 2/7: Shutting down VM $VMID..."
if ! ssh "$PROXMOX_HOST" "qm shutdown $VMID"; then
error "Failed to send shutdown command to VM $VMID."
exit 1
fi
info "Waiting for VM to stop (timeout: ${SHUTDOWN_TIMEOUT}s)..."
elapsed=0
while true; do
status=$(ssh "$PROXMOX_HOST" "qm status $VMID" 2>/dev/null)
if [[ "$status" == *"stopped"* ]]; then
break
fi
if [[ $elapsed -ge $SHUTDOWN_TIMEOUT ]]; then
error "VM $VMID did not stop within ${SHUTDOWN_TIMEOUT}s. Current status: $status"
exit 1
fi
sleep "$POLL_INTERVAL"
elapsed=$((elapsed + POLL_INTERVAL))
done
ok "VM stopped."
# --- Step 3: Resize disk ---
info "Step 3/7: Resizing disk scsi0 by $SIZE_INCREMENT..."
if ! ssh "$PROXMOX_HOST" "qm resize $VMID scsi0 $SIZE_INCREMENT"; then
error "Failed to resize disk on VM $VMID."
exit 1
fi
ok "Disk resized."
# --- Step 4: Start VM ---
info "Step 4/7: Starting VM $VMID..."
if ! ssh "$PROXMOX_HOST" "qm start $VMID"; then
error "Failed to start VM $VMID."
exit 1
fi
info "Waiting for SSH to become available at $NODE_IP (timeout: ${SSH_WAIT_TIMEOUT}s)..."
elapsed=0
while true; do
if ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" "true" 2>/dev/null; then
break
fi
if [[ $elapsed -ge $SSH_WAIT_TIMEOUT ]]; then
error "SSH not reachable on $NODE_IP within ${SSH_WAIT_TIMEOUT}s."
exit 1
fi
sleep "$POLL_INTERVAL"
elapsed=$((elapsed + POLL_INTERVAL))
done
ok "VM is up and SSH is reachable."
info "Waiting 10s for system stabilization..."
sleep 10
# --- Step 5: Expand filesystem ---
info "Step 5/7: Expanding filesystem inside the guest..."
ssh -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" 'bash -s' <<'REMOTE_SCRIPT'
set -o pipefail
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
info() { echo -e "${BLUE}[INFO]${NC} $*"; }
ok() { echo -e "${GREEN}[OK]${NC} $*"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
error() { echo -e "${RED}[ERROR]${NC} $*"; }
ROOT_DEV=$(findmnt -n -o SOURCE /)
ROOT_FSTYPE=$(findmnt -n -o FSTYPE /)
info "Root device: $ROOT_DEV"
info "Root filesystem: $ROOT_FSTYPE"
# Ensure growpart is available
if ! command -v growpart &>/dev/null; then
info "Installing growpart (cloud-guest-utils)..."
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
fi
resize_fs() {
local dev="$1"
local fstype="$2"
if [[ "$fstype" == "ext4" || "$fstype" == "ext3" || "$fstype" == "ext2" ]]; then
info "Running resize2fs on $dev..."
if ! sudo resize2fs "$dev"; then
error "resize2fs failed on $dev"
return 1
fi
elif [[ "$fstype" == "xfs" ]]; then
info "Running xfs_growfs on /..."
if ! sudo xfs_growfs /; then
error "xfs_growfs failed"
return 1
fi
else
error "Unsupported filesystem type: $fstype"
return 1
fi
return 0
}
# Check if root is on LVM (device-mapper)
if [[ "$ROOT_DEV" == /dev/mapper/* || "$ROOT_DEV" == /dev/dm-* ]]; then
info "LVM layout detected."
# Find the PV device
PV_DEV=$(sudo pvs --noheadings -o pv_name | head -1 | tr -d ' ')
if [[ -z "$PV_DEV" ]]; then
error "Could not determine PV device."
exit 1
fi
info "PV device: $PV_DEV"
# Parse disk and partition number (handles /dev/sdaX and /dev/nvmeXnXpX)
if [[ "$PV_DEV" =~ ^(/dev/nvme[0-9]+n[0-9]+)p([0-9]+)$ ]]; then
DISK="${BASH_REMATCH[1]}"
PARTNUM="${BASH_REMATCH[2]}"
elif [[ "$PV_DEV" =~ ^(/dev/[a-z]+)([0-9]+)$ ]]; then
DISK="${BASH_REMATCH[1]}"
PARTNUM="${BASH_REMATCH[2]}"
else
error "Could not parse disk/partition from PV: $PV_DEV"
exit 1
fi
info "Disk: $DISK, Partition: $PARTNUM"
# Grow partition (growpart exits 1 for NOCHANGE, 2 for real failure)
info "Growing $DISK partition $PARTNUM..."
sudo growpart "$DISK" "$PARTNUM"
rc=$?
if [[ $rc -eq 1 ]]; then
warn "growpart: NOCHANGE (partition already at max size)"
elif [[ $rc -ne 0 ]]; then
error "growpart failed (exit code $rc)"
exit 1
fi
# Resize PV
info "Resizing PV $PV_DEV..."
if ! sudo pvresize "$PV_DEV"; then
error "pvresize failed on $PV_DEV"
exit 1
fi
# Resolve LV path if using /dev/dm-*
if [[ "$ROOT_DEV" == /dev/dm-* ]]; then
LV_PATH=$(sudo lvs --noheadings -o lv_path | head -1 | tr -d ' ')
else
LV_PATH="$ROOT_DEV"
fi
info "LV path: $LV_PATH"
# Extend LV
info "Extending LV $LV_PATH to use all free space..."
if ! sudo lvextend -l +100%FREE "$LV_PATH"; then
warn "lvextend reported no change (LV may already use all space)."
fi
# Resize filesystem
resize_fs "$LV_PATH" "$ROOT_FSTYPE" || exit 1
else
info "Direct partition layout detected."
# Parse disk and partition number
if [[ "$ROOT_DEV" =~ ^(/dev/nvme[0-9]+n[0-9]+)p([0-9]+)$ ]]; then
DISK="${BASH_REMATCH[1]}"
PARTNUM="${BASH_REMATCH[2]}"
elif [[ "$ROOT_DEV" =~ ^(/dev/[a-z]+)([0-9]+)$ ]]; then
DISK="${BASH_REMATCH[1]}"
PARTNUM="${BASH_REMATCH[2]}"
else
error "Could not parse disk/partition from: $ROOT_DEV"
exit 1
fi
info "Disk: $DISK, Partition: $PARTNUM"
# Grow partition (growpart exits 1 for NOCHANGE, 2 for real failure)
info "Growing $DISK partition $PARTNUM..."
sudo growpart "$DISK" "$PARTNUM"
rc=$?
if [[ $rc -eq 1 ]]; then
warn "growpart: NOCHANGE (partition already at max size)"
elif [[ $rc -ne 0 ]]; then
error "growpart failed (exit code $rc)"
exit 1
fi
# Resize filesystem
resize_fs "$ROOT_DEV" "$ROOT_FSTYPE" || exit 1
fi
ok "Filesystem expansion complete."
df -h /
REMOTE_SCRIPT
if [[ $? -ne 0 ]]; then
error "Filesystem expansion failed on the guest."
exit 1
fi
ok "Filesystem expanded."
# --- Step 6: Uncordon node ---
info "Step 6/7: Uncordoning node '$NODE_NAME'..."
if ! $KUBECTL uncordon "$NODE_NAME"; then
error "Failed to uncordon node '$NODE_NAME'."
exit 1
fi
DRAINED_NODE=""
ok "Node uncordoned."
# --- Step 7: Verify ---
info "Step 7/7: Verification"
echo ""
info "Disk usage on $NODE_NAME:"
ssh -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" "df -h /"
echo ""
info "Node status:"
$KUBECTL get node "$NODE_NAME"
echo ""
ok "Storage extension complete for $NODE_NAME."