From 08ea489fe0f6474a1b2a4298066945208a417991 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Fri, 13 Feb 2026 22:08:46 +0000 Subject: [PATCH] [ci skip] Add extend-vm-storage script and skills MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon) - Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts) - Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G) --- .claude/CLAUDE.md | 6 + .claude/skills/extend-vm-storage.md | 77 ++++ .../SKILL.md | 136 +++++++ scripts/extend_vm_storage.sh | 372 ++++++++++++++++++ 4 files changed, 591 insertions(+) create mode 100644 .claude/skills/extend-vm-storage.md create mode 100644 .claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md create mode 100755 scripts/extend_vm_storage.sh diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 20cb1253..b54995c8 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -435,6 +435,12 @@ Skills are specialized workflows for common tasks. Located in `.claude/skills/`. - **When to use**: User provides GitHub URL or wants to deploy a new service - **Example**: "Deploy [GitHub repo] to the cluster" +**extend-vm-storage** (`.claude/skills/extend-vm-storage.md`) +- Extend disk storage on K8s node VMs (Proxmox-hosted) +- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon +- **When to use**: A k8s node needs more disk space +- **Example**: "Extend storage on k8s-node2 by 64G" + --- ## Service-Specific Notes diff --git a/.claude/skills/extend-vm-storage.md b/.claude/skills/extend-vm-storage.md new file mode 100644 index 00000000..a994acbb --- /dev/null +++ b/.claude/skills/extend-vm-storage.md @@ -0,0 +1,77 @@ +# Extend VM Storage Skill + +**Purpose**: Extend disk storage on a Kubernetes node VM (Proxmox-hosted). 
**When to use**: User wants to increase disk space on a k8s node VM, or a node is running low on disk.

## Workflow

### 1. Identify the Node

Ask the user which node needs more storage and how much to add.

Valid nodes: `k8s-master`, `k8s-node1`, `k8s-node2`, `k8s-node3`, `k8s-node4`

### 2. Run the Script

```bash
./scripts/extend_vm_storage.sh <node-name> <size-increment>
```

**Example**:
```bash
./scripts/extend_vm_storage.sh k8s-node2 +64G
```

### 3. What the Script Does

1. Validates inputs (node name and size format)
2. Resolves the node IP via kubectl
3. Prompts for confirmation
4. Drains the node (evicts pods)
5. Shuts down the VM in Proxmox
6. Resizes the disk (`scsi0`) by the given increment
7. Starts the VM and waits for SSH
8. Expands the filesystem inside the guest (auto-detects LVM vs direct partition)
9. Uncordons the node
10. Shows verification output (`df -h` and node status)

### 4. Update Terraform (if needed)

If you want Terraform to reflect the new disk size, update the VM definition in `main.tf` or `modules/create-vm/` so that a future `terraform apply` doesn't revert the change. Check whether the VM disk size is managed by Terraform:

```bash
grep -A5 "disk" main.tf | grep -i size
```

If it is managed, update the size value to match the new total.

### 5. Verification

After the script completes, verify:
```bash
kubectl --kubeconfig $(pwd)/config get nodes
ssh wizard@<node-ip> "df -h /"
```

## Recovery

If the script fails mid-way:
1. Check VM status: `ssh root@192.168.1.127 "qm status <vmid>"`
2. Start VM if stopped: `ssh root@192.168.1.127 "qm start <vmid>"`
3. Uncordon node: `kubectl --kubeconfig $(pwd)/config uncordon <node-name>`

## Constants

| Setting | Value |
|---------|-------|
| Proxmox host | `root@192.168.1.127` |
| VM SSH user | `wizard` |
| Disk name | `scsi0` |
| Shutdown timeout | 300s |
| SSH wait timeout | 300s |

## Questions to Ask User

1. Which node needs more storage?
2. How much storage to add?
(e.g., +64G) diff --git a/.claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md b/.claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md new file mode 100644 index 00000000..89f89f22 --- /dev/null +++ b/.claude/skills/proxmox-vm-disk-expansion-pitfalls/SKILL.md @@ -0,0 +1,136 @@ +--- +name: proxmox-vm-disk-expansion-pitfalls +description: | + Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04 + cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails + with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS + with "invalid option -- P", (3) kubectl drain times out with pods stuck + terminating, (4) filesystem shows old size after qm resize. Covers + cloud-guest-utils installation, macOS-portable regex parsing, drain timeout + tuning, and recovery from partial failures. +author: Claude Code +version: 1.0.0 +date: 2026-02-13 +--- + +# Proxmox VM Disk Expansion Pitfalls + +## Problem + +Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as +Kubernetes nodes) fails at multiple points due to missing tools, cross-platform +incompatibilities, and Kubernetes drain timeouts. + +## Context / Trigger Conditions + +- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs +- Ubuntu 24.04 cloud-init images (the default k8s node template) +- Kubernetes nodes with many pods or stateful workloads +- Using `scripts/extend_vm_storage.sh` or similar automation + +## Issues and Solutions + +### 1. `growpart: command not found` on Ubuntu 24.04 + +**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails +with "command not found". `resize2fs` then reports "Nothing to do!" because the +partition table hasn't been updated. + +**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils` +by default. The `growpart` tool (which updates the partition table to use new +disk space) is in this package. 
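The presence check implied by this root cause is easy to script around. The helper below is an illustrative sketch (the `ensure_tool` name and its messages are not from the actual script); it only reports what it would install:

```bash
# Illustrative guard: report whether a tool is on PATH and, if not, which
# package would provide it. Echoes instead of installing anything.
ensure_tool() {
    local tool="$1" pkg="$2"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: present"
    else
        echo "$tool: missing (install $pkg)"
    fi
}

ensure_tool sed sed                      # sed ships with every base system
ensure_tool growpart cloud-guest-utils   # absent on stock Ubuntu 24.04 cloud images
```

On an unpatched cloud-init node the second call reports `growpart` as missing, matching the root cause above.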
**Fix**:
```bash
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

**Prevention**: Check for `growpart` before attempting partition expansion:
```bash
if ! command -v growpart &>/dev/null; then
    sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
fi
```

### 2. `grep -P` (PCRE) not available on macOS

**Symptom**: A script running on macOS fails with `grep: invalid option -- P`.

**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.

**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
```bash
# BAD (GNU grep only):
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')

# GOOD (portable):
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
```

**General rule**: In scripts that run on macOS, avoid `grep -P`, `sed -i ''`
vs `sed -i` differences, and `date` flag differences. Use `sed` with basic
regex or bash's built-in `[[ =~ ]]` for pattern matching.

### 3. `kubectl drain` timeout with stuck pods

**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
for multiple pods. Pods are evicted but don't terminate in time.

**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
pods are draining simultaneously.

**Fix**: Use the `--force` flag and a longer timeout, or retry:
```bash
# First attempt with standard timeout
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s

# If it fails, force with longer timeout (pods already evicting)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
```

**Note**: After a failed drain, the node is already cordoned.
A second drain attempt only needs to wait for already-evicting pods to finish.

### 4. Recovery from partial failure

If the script fails mid-way (after drain but before uncordon):

```bash
# Check VM status
ssh root@192.168.1.127 "qm status <vmid>"

# Start VM if stopped
ssh root@192.168.1.127 "qm start <vmid>"

# Uncordon node
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
```

## Verification

After successful expansion:
```bash
# On the VM
df -h /
# Should show the new size (128G disk → ~126G usable for ext4)

# On the cluster
kubectl get node <node-name>
# Should show Ready status
```

## Notes

- The k8s node VMs use a direct partition layout (`/dev/sda1`), not LVM, despite
  the script handling both paths
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
  this is not an error
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
- SSH host keys may change if VMs are recreated or the network changes — use
  `-o StrictHostKeyChecking=no` in automated scripts

See also: `extend-vm-storage.md` (the operational skill for running the script)

diff --git a/scripts/extend_vm_storage.sh b/scripts/extend_vm_storage.sh
new file mode 100755
index 00000000..4cc9b609
--- /dev/null
+++ b/scripts/extend_vm_storage.sh
@@ -0,0 +1,372 @@
#!/usr/bin/env bash

# Extend disk storage on a Kubernetes node VM.
# Drains the node, shuts down the VM, resizes the disk in Proxmox,
# boots the VM, expands the filesystem, and uncordons the node.
+# +# Usage: ./scripts/extend_vm_storage.sh +# Example: ./scripts/extend_vm_storage.sh k8s-node2 +64G + +# --- Constants --- +PROXMOX_HOST="root@192.168.1.127" +VM_SSH_USER="wizard" +KUBECTL="kubectl --kubeconfig $(pwd)/config" +SHUTDOWN_TIMEOUT=300 +SSH_WAIT_TIMEOUT=300 +POLL_INTERVAL=5 + +# --- Colors --- +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[0;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +info() { echo -e "${BLUE}[INFO]${NC} $*"; } +ok() { echo -e "${GREEN}[OK]${NC} $*"; } +warn() { echo -e "${YELLOW}[WARN]${NC} $*"; } +error() { echo -e "${RED}[ERROR]${NC} $*"; } + +# --- Node-to-VMID mapping --- +declare -A NODE_VMID=( + [k8s-master]=200 + [k8s-node1]=201 + [k8s-node2]=202 + [k8s-node3]=203 + [k8s-node4]=204 +) + +# --- Cleanup trap --- +DRAINED_NODE="" +cleanup() { + if [[ -n "$DRAINED_NODE" ]]; then + echo "" + error "Script exited unexpectedly!" + warn "The node '$DRAINED_NODE' may still be cordoned/drained." + warn "Recovery steps:" + warn " 1. Check VM status: ssh $PROXMOX_HOST 'qm status ${NODE_VMID[$DRAINED_NODE]}'" + warn " 2. Start VM if stopped: ssh $PROXMOX_HOST 'qm start ${NODE_VMID[$DRAINED_NODE]}'" + warn " 3. Uncordon node: $KUBECTL uncordon $DRAINED_NODE" + fi +} +trap cleanup EXIT + +# --- Input validation --- +usage() { + echo "Usage: $0 " + echo "" + echo "Arguments:" + echo " node-name One of: ${!NODE_VMID[*]}" + echo " size-increment Disk size increase, e.g. +64G, +128G" + echo "" + echo "Example:" + echo " $0 k8s-node2 +64G" + exit 1 +} + +if [[ $# -ne 2 ]]; then + usage +fi + +NODE_NAME="$1" +SIZE_INCREMENT="$2" + +if [[ -z "${NODE_VMID[$NODE_NAME]+x}" ]]; then + error "Unknown node: '$NODE_NAME'" + echo "Valid nodes: ${!NODE_VMID[*]}" + exit 1 +fi + +if [[ ! "$SIZE_INCREMENT" =~ ^\+[0-9]+G$ ]]; then + error "Invalid size increment: '$SIZE_INCREMENT'" + echo "Must match pattern +G, e.g. +64G" + exit 1 +fi + +VMID="${NODE_VMID[$NODE_NAME]}" + +# --- Resolve node IP via kubectl --- +info "Resolving IP for node '$NODE_NAME'..." 
+NODE_IP=$($KUBECTL get node "$NODE_NAME" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}' 2>/dev/null) +if [[ -z "$NODE_IP" ]]; then + error "Could not resolve IP for node '$NODE_NAME'. Is the cluster reachable?" + exit 1 +fi +ok "Node IP: $NODE_IP" + +# --- Query current disk size --- +info "Querying current disk size for VM $VMID..." +SCSI0_LINE=$(ssh "$PROXMOX_HOST" "qm config $VMID" 2>/dev/null | grep '^scsi0:') +if [[ -z "$SCSI0_LINE" ]]; then + error "Could not read scsi0 config for VM $VMID." + exit 1 +fi +# Extract size value, e.g. "size=64G" from the config line +CURRENT_SIZE=$(echo "$SCSI0_LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p') +if [[ -z "$CURRENT_SIZE" ]]; then + error "Could not parse current disk size from: $SCSI0_LINE" + exit 1 +fi +CURRENT_SIZE_NUM=${CURRENT_SIZE%G} +INCREMENT_NUM=${SIZE_INCREMENT//[+G]/} +NEW_SIZE_NUM=$((CURRENT_SIZE_NUM + INCREMENT_NUM)) +ok "Current disk size: ${CURRENT_SIZE_NUM}G → New size: ${NEW_SIZE_NUM}G (${SIZE_INCREMENT})" + +if [[ $NEW_SIZE_NUM -le $CURRENT_SIZE_NUM ]]; then + error "New size (${NEW_SIZE_NUM}G) must be greater than current size (${CURRENT_SIZE_NUM}G)." + exit 1 +fi + +# --- Confirmation --- +echo "" +echo "=========================================" +echo " Extend VM Storage" +echo "=========================================" +echo " Node: $NODE_NAME" +echo " VMID: $VMID" +echo " Node IP: $NODE_IP" +echo " Current: ${CURRENT_SIZE_NUM}G" +echo " Increment: $SIZE_INCREMENT" +echo " New size: ${NEW_SIZE_NUM}G" +echo " Proxmox: $PROXMOX_HOST" +echo "=========================================" +echo "" +echo "This will:" +echo " 1. Drain the node (evict pods)" +echo " 2. Shut down the VM" +echo " 3. Resize disk (scsi0) from ${CURRENT_SIZE_NUM}G to ${NEW_SIZE_NUM}G" +echo " 4. Start the VM" +echo " 5. Expand the filesystem inside the guest" +echo " 6. Uncordon the node" +echo "" +read -rp "Proceed? [y/N] " confirm +if [[ ! "$confirm" =~ ^[yY]$ ]]; then + echo "Aborted." 
+ exit 0 +fi + +# --- Step 1: Drain node --- +info "Step 1/7: Draining node '$NODE_NAME'..." +DRAINED_NODE="$NODE_NAME" +if ! $KUBECTL drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --timeout=120s; then + error "Failed to drain node '$NODE_NAME'." + exit 1 +fi +ok "Node drained." + +# --- Step 2: Shutdown VM --- +info "Step 2/7: Shutting down VM $VMID..." +if ! ssh "$PROXMOX_HOST" "qm shutdown $VMID"; then + error "Failed to send shutdown command to VM $VMID." + exit 1 +fi + +info "Waiting for VM to stop (timeout: ${SHUTDOWN_TIMEOUT}s)..." +elapsed=0 +while true; do + status=$(ssh "$PROXMOX_HOST" "qm status $VMID" 2>/dev/null) + if [[ "$status" == *"stopped"* ]]; then + break + fi + if [[ $elapsed -ge $SHUTDOWN_TIMEOUT ]]; then + error "VM $VMID did not stop within ${SHUTDOWN_TIMEOUT}s. Current status: $status" + exit 1 + fi + sleep "$POLL_INTERVAL" + elapsed=$((elapsed + POLL_INTERVAL)) +done +ok "VM stopped." + +# --- Step 3: Resize disk --- +info "Step 3/7: Resizing disk scsi0 by $SIZE_INCREMENT..." +if ! ssh "$PROXMOX_HOST" "qm resize $VMID scsi0 $SIZE_INCREMENT"; then + error "Failed to resize disk on VM $VMID." + exit 1 +fi +ok "Disk resized." + +# --- Step 4: Start VM --- +info "Step 4/7: Starting VM $VMID..." +if ! ssh "$PROXMOX_HOST" "qm start $VMID"; then + error "Failed to start VM $VMID." + exit 1 +fi + +info "Waiting for SSH to become available at $NODE_IP (timeout: ${SSH_WAIT_TIMEOUT}s)..." +elapsed=0 +while true; do + if ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" "true" 2>/dev/null; then + break + fi + if [[ $elapsed -ge $SSH_WAIT_TIMEOUT ]]; then + error "SSH not reachable on $NODE_IP within ${SSH_WAIT_TIMEOUT}s." + exit 1 + fi + sleep "$POLL_INTERVAL" + elapsed=$((elapsed + POLL_INTERVAL)) +done +ok "VM is up and SSH is reachable." + +info "Waiting 10s for system stabilization..." +sleep 10 + +# --- Step 5: Expand filesystem --- +info "Step 5/7: Expanding filesystem inside the guest..." 
+ssh -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" 'bash -s' <<'REMOTE_SCRIPT' +set -o pipefail + +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[0;33m' +BLUE='\033[0;34m' +NC='\033[0m' + +info() { echo -e "${BLUE}[INFO]${NC} $*"; } +ok() { echo -e "${GREEN}[OK]${NC} $*"; } +warn() { echo -e "${YELLOW}[WARN]${NC} $*"; } +error() { echo -e "${RED}[ERROR]${NC} $*"; } + +ROOT_DEV=$(findmnt -n -o SOURCE /) +ROOT_FSTYPE=$(findmnt -n -o FSTYPE /) +info "Root device: $ROOT_DEV" +info "Root filesystem: $ROOT_FSTYPE" + +# Ensure growpart is available +if ! command -v growpart &>/dev/null; then + info "Installing growpart (cloud-guest-utils)..." + sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils +fi + +resize_fs() { + local dev="$1" + local fstype="$2" + if [[ "$fstype" == "ext4" || "$fstype" == "ext3" || "$fstype" == "ext2" ]]; then + info "Running resize2fs on $dev..." + if ! sudo resize2fs "$dev"; then + error "resize2fs failed on $dev" + return 1 + fi + elif [[ "$fstype" == "xfs" ]]; then + info "Running xfs_growfs on /..." + if ! sudo xfs_growfs /; then + error "xfs_growfs failed" + return 1 + fi + else + error "Unsupported filesystem type: $fstype" + return 1 + fi + return 0 +} + +# Check if root is on LVM (device-mapper) +if [[ "$ROOT_DEV" == /dev/mapper/* || "$ROOT_DEV" == /dev/dm-* ]]; then + info "LVM layout detected." + + # Find the PV device + PV_DEV=$(sudo pvs --noheadings -o pv_name | head -1 | tr -d ' ') + if [[ -z "$PV_DEV" ]]; then + error "Could not determine PV device." 
+ exit 1 + fi + info "PV device: $PV_DEV" + + # Parse disk and partition number (handles /dev/sdaX and /dev/nvmeXnXpX) + if [[ "$PV_DEV" =~ ^(/dev/nvme[0-9]+n[0-9]+)p([0-9]+)$ ]]; then + DISK="${BASH_REMATCH[1]}" + PARTNUM="${BASH_REMATCH[2]}" + elif [[ "$PV_DEV" =~ ^(/dev/[a-z]+)([0-9]+)$ ]]; then + DISK="${BASH_REMATCH[1]}" + PARTNUM="${BASH_REMATCH[2]}" + else + error "Could not parse disk/partition from PV: $PV_DEV" + exit 1 + fi + info "Disk: $DISK, Partition: $PARTNUM" + + # Grow partition + info "Growing partition $DISK partition $PARTNUM..." + sudo growpart "$DISK" "$PARTNUM" || echo "(growpart: partition may already be at max size)" + + # Resize PV + info "Resizing PV $PV_DEV..." + if ! sudo pvresize "$PV_DEV"; then + error "pvresize failed on $PV_DEV" + exit 1 + fi + + # Resolve LV path if using /dev/dm-* + if [[ "$ROOT_DEV" == /dev/dm-* ]]; then + LV_PATH=$(sudo lvs --noheadings -o lv_path | head -1 | tr -d ' ') + else + LV_PATH="$ROOT_DEV" + fi + info "LV path: $LV_PATH" + + # Extend LV + info "Extending LV $LV_PATH to use all free space..." + if ! sudo lvextend -l +100%FREE "$LV_PATH"; then + warn "lvextend reported no change (LV may already use all space)." + fi + + # Resize filesystem + resize_fs "$LV_PATH" "$ROOT_FSTYPE" + if [[ $? -ne 0 ]]; then + exit 1 + fi +else + info "Direct partition layout detected." + + # Parse disk and partition number + if [[ "$ROOT_DEV" =~ ^(/dev/nvme[0-9]+n[0-9]+)p([0-9]+)$ ]]; then + DISK="${BASH_REMATCH[1]}" + PARTNUM="${BASH_REMATCH[2]}" + elif [[ "$ROOT_DEV" =~ ^(/dev/[a-z]+)([0-9]+)$ ]]; then + DISK="${BASH_REMATCH[1]}" + PARTNUM="${BASH_REMATCH[2]}" + else + error "Could not parse disk/partition from: $ROOT_DEV" + exit 1 + fi + info "Disk: $DISK, Partition: $PARTNUM" + + # Grow partition + info "Growing partition $DISK partition $PARTNUM..." + sudo growpart "$DISK" "$PARTNUM" || echo "(growpart: partition may already be at max size)" + + # Resize filesystem + resize_fs "$ROOT_DEV" "$ROOT_FSTYPE" + if [[ $? 
-ne 0 ]]; then + exit 1 + fi +fi + +ok "Filesystem expansion complete." +df -h / +REMOTE_SCRIPT + +if [[ $? -ne 0 ]]; then + error "Filesystem expansion failed on the guest." + exit 1 +fi +ok "Filesystem expanded." + +# --- Step 6: Uncordon node --- +info "Step 6/7: Uncordoning node '$NODE_NAME'..." +if ! $KUBECTL uncordon "$NODE_NAME"; then + error "Failed to uncordon node '$NODE_NAME'." + exit 1 +fi +DRAINED_NODE="" +ok "Node uncordoned." + +# --- Step 7: Verify --- +info "Step 7/7: Verification" +echo "" +info "Disk usage on $NODE_NAME:" +ssh -o StrictHostKeyChecking=no "$VM_SSH_USER@$NODE_IP" "df -h /" +echo "" +info "Node status:" +$KUBECTL get node "$NODE_NAME" +echo "" +ok "Storage extension complete for $NODE_NAME."
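The portable size parsing and arithmetic the script relies on can be sanity-checked in isolation. A self-contained sketch, mirroring the script's logic (the sample `scsi0` config line below is illustrative, not taken from a real VM):

```bash
# Parse "size=<N>G" from a qm config scsi0 line with portable sed (no grep -P),
# then compute the new total for a "+<N>G" increment.
SCSI0_LINE='scsi0: local-lvm:vm-202-disk-0,iothread=1,size=64G'
SIZE_INCREMENT='+64G'

CURRENT_SIZE=$(echo "$SCSI0_LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
CURRENT_SIZE_NUM=${CURRENT_SIZE%G}          # strip trailing G -> 64
INCREMENT_NUM=${SIZE_INCREMENT//[+G]/}      # strip + and G   -> 64
NEW_SIZE_NUM=$((CURRENT_SIZE_NUM + INCREMENT_NUM))

echo "${CURRENT_SIZE_NUM}G -> ${NEW_SIZE_NUM}G"   # prints: 64G -> 128G
```

Because the parsing uses basic-regex `sed`, the same snippet behaves identically under BSD and GNU userlands, which is the point of pitfall 2 above.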