[ci skip] Add extend-vm-storage script and skills

- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
This commit is contained in:
Viktor Barzin 2026-02-13 22:08:46 +00:00
parent ecffe93c22
commit 9df9ab1654
No known key found for this signature in database
GPG key ID: 0EB088298288D958
4 changed files with 591 additions and 0 deletions

View file

@ -435,6 +435,12 @@ Skills are specialized workflows for common tasks. Located in `.claude/skills/`.
- **When to use**: User provides GitHub URL or wants to deploy a new service
- **Example**: "Deploy [GitHub repo] to the cluster"
**extend-vm-storage** (`.claude/skills/extend-vm-storage.md`)
- Extend disk storage on K8s node VMs (Proxmox-hosted)
- Automates: drain → shutdown → resize → boot → expand filesystem → uncordon
- **When to use**: A k8s node needs more disk space
- **Example**: "Extend storage on k8s-node2 by 64G"
---
## Service-Specific Notes

View file

@ -0,0 +1,77 @@
# Extend VM Storage Skill
**Purpose**: Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
**When to use**: User wants to increase disk space on a k8s node VM, or a node is running low on disk.
## Workflow
### 1. Identify the Node
Ask the user which node needs more storage and how much to add.
Valid nodes: `k8s-master`, `k8s-node1`, `k8s-node2`, `k8s-node3`, `k8s-node4`
### 2. Run the Script
```bash
./scripts/extend_vm_storage.sh <node-name> <size-increment>
```
**Example**:
```bash
./scripts/extend_vm_storage.sh k8s-node2 +64G
```
### 3. What the Script Does
1. Validates inputs (node name and size format)
2. Resolves node IP via kubectl
3. Prompts for confirmation
4. Drains the node (evicts pods)
5. Shuts down the VM in Proxmox
6. Resizes the disk (`scsi0`) by the given increment
7. Starts the VM and waits for SSH
8. Expands the filesystem inside the guest (auto-detects LVM vs direct partition)
9. Uncordons the node
10. Shows verification output (`df -h` and node status)
### 4. Update Terraform (if needed)
If you want Terraform to reflect the new disk size, update the VM definition in `main.tf` or `modules/create-vm/` so that a future `terraform apply` doesn't revert the change. Check if the VM disk size is managed by Terraform:
```bash
grep -A5 "disk" main.tf | grep -i size
```
If managed, update the size value to match the new total.
### 5. Verification
After the script completes, verify:
```bash
kubectl --kubeconfig $(pwd)/config get nodes
ssh wizard@<node-ip> "df -h /"
```
## Recovery
If the script fails mid-way:
1. Check VM status: `ssh root@192.168.1.127 "qm status <vmid>"`
2. Start VM if stopped: `ssh root@192.168.1.127 "qm start <vmid>"`
3. Uncordon node: `kubectl --kubeconfig $(pwd)/config uncordon <node-name>`
## Constants
| Setting | Value |
|---------|-------|
| Proxmox host | `root@192.168.1.127` |
| VM SSH user | `wizard` |
| Disk name | `scsi0` |
| Shutdown timeout | 300s |
| SSH wait timeout | 300s |
## Questions to Ask User
1. Which node needs more storage?
2. How much storage to add? (e.g., +64G)

View file

@ -0,0 +1,136 @@
---
name: proxmox-vm-disk-expansion-pitfalls
description: |
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
with "invalid option -- P", (3) kubectl drain times out with pods stuck
terminating, (4) filesystem shows old size after qm resize. Covers
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
tuning, and recovery from partial failures.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---
# Proxmox VM Disk Expansion Pitfalls
## Problem
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
incompatibilities, and Kubernetes drain timeouts.
## Context / Trigger Conditions
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
- Ubuntu 24.04 cloud-init images (the default k8s node template)
- Kubernetes nodes with many pods or stateful workloads
- Using `scripts/extend_vm_storage.sh` or similar automation
## Issues and Solutions
### 1. `growpart: command not found` on Ubuntu 24.04
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
with "command not found". `resize2fs` then reports "Nothing to do!" because the
partition table hasn't been updated.
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
by default. The `growpart` tool (which updates the partition table to use new
disk space) is in this package.
**Fix**:
```bash
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
**Prevention**: Check for `growpart` before attempting partition expansion:
```bash
if ! command -v growpart &>/dev/null; then
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
fi
```
### 2. `grep -P` (PCRE) not available on macOS
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
```bash
# BAD (GNU grep only):
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
# GOOD (portable):
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
```
**General rule**: In scripts that run on macOS, avoid `grep -P`, `sed -i ''`
vs `sed -i` differences, and `date` flag differences. Use `sed` with basic
regex or bash built-in `[[ =~ ]]` for pattern matching.
### 3. `kubectl drain` timeout with stuck pods
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
for multiple pods. Pods are evicted but don't terminate in time.
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
pods are draining simultaneously.
**Fix**: Use `--force` flag and a longer timeout, or retry:
```bash
# First attempt with standard timeout
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
# If it fails, force with longer timeout (pods already evicting)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
```
**Note**: After a failed drain, the node is already cordoned. A second drain
attempt only needs to wait for already-evicting pods to finish.
### 4. Recovery from partial failure
If the script fails mid-way (after drain but before uncordon):
```bash
# Check VM status
ssh root@192.168.1.127 "qm status <vmid>"
# Start VM if stopped
ssh root@192.168.1.127 "qm start <vmid>"
# Uncordon node
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
```
## Verification
After successful expansion:
```bash
# On the VM
df -h /
# Should show new size (128G disk → ~126G usable for ext4)
# On the cluster
kubectl get node <name>
# Should show Ready status
```
## Notes
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
the script handling both paths
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
this is not an error
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
- SSH host keys may change if VMs are recreated or network changes — use
`-o StrictHostKeyChecking=no` in automated scripts
See also: `extend-vm-storage.md` (the operational skill for running the script)