## Context

After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day of disk writes, this methodology should be reusable for periodic health reviews.

## This change:

Adds `/disk-wear` skill that combines three data sources:

- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping

Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC. Includes baselines, red flag detection, and annualized wear projections.

Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track block device writes per container), so per-app attribution uses the PVE host's dm-device level metrics mapped through Prometheus and kubectl.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7.8 KiB
| name | description | author | version | date |
|---|---|---|---|---|
| disk-wear | Analyze disk write patterns on the PVE host to assess wear and identify top writers by VM, k8s app, and PVC. Use when: (1) User asks about disk wear, disk writes, or storage health, (2) User says "what's wearing the disk", "disk analysis", "I/O analysis", (3) User wants to check write rates by VM, k8s namespace, or PVC, (4) Periodic quarterly disk health review. Combines PVE host I/O stats (SSH), Prometheus metrics (PromQL), and k8s PVC-to-pod mapping for a full breakdown. | Claude Code | 1.0.0 | 2026-04-17 |
Disk Wear Analysis
Infrastructure
| Resource | Address | Notes |
|---|---|---|
| PVE host | root@192.168.1.127 (SSH) | Dell R730, PERC H730 RAID |
| Prometheus | prometheus-server.monitoring.svc:80 | Query via alertmanager pod (wget) |
| SSD | Slot 4, Samsung 850 EVO 1TB | Rated 150 TBW |
| HDD sdc | RAID1 (2x 11.7TB SAS 7200RPM) | Main data disk, enterprise rated ~550 TB/yr |
| HDD sda | 1.2TB SAS 10K RPM | Backup only |
Step 1: Physical Disk Overview + SSD Health
ssh root@192.168.1.127 'echo "=== UPTIME ===" && uptime && echo "" && \
echo "=== PHYSICAL DISK CUMULATIVE (since boot) ===" && iostat -d -k sda sdb sdc 2>/dev/null && echo "" && \
echo "=== SSD SMART (Samsung 850 EVO, slot 4) ===" && \
smartctl -d sat+megaraid,4 -A /dev/sda 2>/dev/null | grep -iE "power_on|reallocat|written|wear|pending|uncorrect"'
Interpret SSD health:
- `Wear_Leveling_Count`: 100 = new, 0 = dead. Calculate `(100 - value)%` wear used.
- `Total_LBAs_Written`: multiply by 512 bytes, then divide by 10^12, for total TB written. Compare against the 150 TBW rating.
- Estimate remaining life: `(150 TBW - current TBW) / annual write rate`.
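The three rules above can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: the function name `ssd_health` and the sample readings are hypothetical, and it assumes the drive reports 512-byte LBAs and that `wear_leveling_count` is the normalized VALUE column from `smartctl -A`.

```python
# Sketch: turn raw SMART attributes into wear and remaining-life estimates.
RATED_TBW = 150.0   # Samsung 850 EVO 1TB rating, from the Infrastructure table
LBA_BYTES = 512     # assumption: the drive reports 512-byte LBAs


def ssd_health(wear_leveling_count, total_lbas_written, annual_tb_written):
    """Return (wear_used_pct, tb_written, years_remaining)."""
    wear_used_pct = 100 - wear_leveling_count
    # TBW ratings use decimal terabytes (10^12 bytes).
    tb_written = total_lbas_written * LBA_BYTES / 1e12
    if annual_tb_written <= 0:
        years_remaining = float("inf")
    else:
        years_remaining = max(RATED_TBW - tb_written, 0.0) / annual_tb_written
    return wear_used_pct, tb_written, years_remaining


if __name__ == "__main__":
    # Hypothetical readings, not measurements from this host.
    wear, tb, years = ssd_health(
        wear_leveling_count=80,
        total_lbas_written=97_656_250_000,  # 50 TB at 512 bytes per LBA
        annual_tb_written=10.0,
    )
    print(f"wear used: {wear}%  written: {tb:.1f} TB  remaining: {years:.1f} years")
```

The annual write rate would come from the 7-day trend in Step 3d, scaled to a year.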
Step 2: Real-Time Snapshot (30 seconds)
SSH to PVE host and take two reads of block device stats 30 seconds apart. This gives instantaneous write rates independent of Prometheus scrape intervals.
ssh root@192.168.1.127 'bash -s' << 'SCRIPT'
echo "=== 30-SECOND SNAPSHOT ($(date)) ==="
# Field 7 of /sys/block/<dev>/stat is sectors written (512 bytes each).
declare -A snap1
for dm in /sys/block/dm-*; do
    name=$(basename "$dm")
    snap1[$name]=$(awk '{print $7}' "$dm/stat" 2>/dev/null)
done
for d in sda sdb sdc; do
    snap1[$d]=$(awk '{print $7}' "/sys/block/$d/stat" 2>/dev/null)
done
sleep 30
printf "%-12s %13s %15s %s\n" "DEVICE" "kB/s" "GB/day" "NAME"
echo "-------------------------------------------------------------------"
results=""
for dm in /sys/block/dm-*; do
    name=$(basename "$dm")
    s2=$(awk '{print $7}' "$dm/stat" 2>/dev/null)
    s1=${snap1[$name]:-0}
    diff=$((s2 - s1))
    # Skip near-idle devices (100 sectors or fewer written in 30 s).
    if [ "$diff" -gt 100 ]; then
        kbps=$((diff / 2 / 30))  # 2 sectors per kB, 30 s window
        gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
        lvm=$(dmsetup info --columns --noheadings -o name "/dev/$name" 2>/dev/null)
        results="$results\n$name $kbps $gbday $lvm"
    fi
done
for d in sda sdb sdc; do
    s2=$(awk '{print $7}' "/sys/block/$d/stat" 2>/dev/null)
    s1=${snap1[$d]:-0}
    diff=$((s2 - s1))
    kbps=$((diff / 2 / 30))
    gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
    results="$results\n$d $kbps $gbday (physical)"
done
# Sort by kB/s descending; sed drops the blank line left by the leading \n.
echo -e "$results" | sed '/^$/d' | sort -k2 -rn | head -30 | while read -r dev kbps gbday name; do
    printf "%-12s %8s kB/s %8s GB/day %s\n" "$dev" "$kbps" "$gbday" "$name"
done
SCRIPT
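The unit conversion in the snapshot script is easy to misread, so here is the same arithmetic as a standalone sketch. The function name `write_rate` and the sample counter values are hypothetical. Unlike the shell script, which uses integer division, this version keeps fractions.

```python
SECTOR_BYTES = 512  # /sys/block/<dev>/stat counts 512-byte sectors


def write_rate(sectors_before, sectors_after, interval_s=30):
    """Convert two sectors-written readings into (kB/s, GB/day)."""
    delta = sectors_after - sectors_before
    kb_per_s = delta * SECTOR_BYTES / 1024 / interval_s
    # Same divisor as the script: 1048576 kB per GB.
    gb_per_day = kb_per_s * 86400 / 1048576
    return kb_per_s, gb_per_day


if __name__ == "__main__":
    # Hypothetical readings: 614400 sectors written during the 30 s window.
    kbps, gbday = write_rate(1_000_000, 1_614_400)
    print(f"{kbps:.0f} kB/s  {gbday:.2f} GB/day")
```

A 30-second window is short, so a single burst (a compaction, a backup) can dominate the result; the 1h Prometheus rates in Step 3 are the steadier figure.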
Step 3: Prometheus — Per-App Write Attribution
Query Prometheus from inside the cluster (alertmanager pod has wget).
3a. Top PVC Writers (1h rate)
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(20,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name=~"vm-9999-pvc-.*"})' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    gb_day = val * 86400 / 1073741824
    if gb_day > 0.05:
        lv = m.get('lv_name','?').replace('vm-9999-','')
        print(f'{gb_day:8.1f} GB/day {lv}')
"
Then enrich PVC UUIDs with names:
kubectl get pv -o custom-columns=NAME:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace | grep "pvc-<UUID>"
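When there are many PVCs, grepping one UUID at a time gets tedious. A rough alternative is to build a lookup from `kubectl get pv -o json` output, as sketched below. The helper names and the sample JSON are hypothetical; the only assumption taken from this document is that stripping the `vm-9999-` prefix from `lv_name` yields the PV name.

```python
import json

LV_PREFIX = "vm-9999-"  # prefix stripped in the 3a query script


def lv_to_pv(lv_name):
    """Turn an LV name such as vm-9999-pvc-<uuid> into the PV name pvc-<uuid>."""
    return lv_name[len(LV_PREFIX):] if lv_name.startswith(LV_PREFIX) else lv_name


def pv_index(pv_list_json):
    """Map PV name -> (namespace, claim name) from `kubectl get pv -o json` text."""
    index = {}
    for item in json.loads(pv_list_json).get("items", []):
        claim = item.get("spec", {}).get("claimRef") or {}
        index[item["metadata"]["name"]] = (
            claim.get("namespace", "?"),
            claim.get("name", "?"),
        )
    return index


if __name__ == "__main__":
    # Hypothetical sample, shaped like `kubectl get pv -o json` output.
    sample = json.dumps({"items": [
        {"metadata": {"name": "pvc-11111111-2222-3333-4444-555555555555"},
         "spec": {"claimRef": {"namespace": "dbaas", "name": "mysql-standalone"}}},
    ]})
    index = pv_index(sample)
    print(index[lv_to_pv("vm-9999-pvc-11111111-2222-3333-4444-555555555555")])
```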
3b. Top VM Writers (1h rate)
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(10,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name!~"vm-9999-.*|root|swap|data.*|nfs.*|backup.*|ssd.*"})' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    gb_day = val * 86400 / 1073741824
    print(f'{gb_day:8.1f} GB/day {m.get(\"lv_name\",\"?\")}')
"
Enrich VM IDs with names:
ssh root@192.168.1.127 'qm list' 2>/dev/null
3c. Aggregate PVC Writes by K8s Namespace
After collecting the top PVC writers from 3a, map each PVC UUID to its namespace using `kubectl get pv`, then sum by namespace. Present as a table:
| Namespace | GB/day | Top PVC |
|---|---|---|
| dbaas | ... | mysql-standalone, pg-cluster |
| monitoring | ... | prometheus-data |
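The sum-by-namespace step can be sketched as follows. This is an illustration only: `by_namespace` is a hypothetical helper, and the rates and PV names in the example are invented, not measurements. It takes (PV name, GB/day) pairs from 3a plus a PV-to-claim lookup built from `kubectl get pv`.

```python
from collections import defaultdict


def by_namespace(pvc_rates, pv_to_claim):
    """Sum GB/day per namespace and record each namespace's top PVC.

    pvc_rates:   iterable of (pv_name, gb_per_day)
    pv_to_claim: dict of pv_name -> (namespace, claim_name)
    Returns rows of (namespace, total_gb_day, top_claim), largest total first.
    """
    totals = defaultdict(float)
    top = {}
    for pv_name, gb_day in pvc_rates:
        ns, claim = pv_to_claim.get(pv_name, ("unknown", pv_name))
        totals[ns] += gb_day
        if ns not in top or gb_day > top[ns][1]:
            top[ns] = (claim, gb_day)
    rows = [(ns, total, top[ns][0]) for ns, total in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical inputs for illustration.
    rates = [("pvc-aaa", 8.0), ("pvc-bbb", 4.0), ("pvc-ccc", 3.5)]
    claims = {
        "pvc-aaa": ("dbaas", "mysql-standalone"),
        "pvc-bbb": ("dbaas", "pg-cluster"),
        "pvc-ccc": ("monitoring", "prometheus-data"),
    }
    for ns, total, claim in by_namespace(rates, claims):
        print(f"| {ns} | {total:.1f} | {claim} |")
```

PVs missing from the lookup are grouped under `unknown` rather than dropped, so unattributed writes stay visible in the total.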
3d. Historical Trend (7-day total)
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(10,increase(node_disk_written_bytes_total{instance=~"pve.*",device=~"sda|sdb|sdc"}[7d]))' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    tb = val / 1099511627776
    print(f'{tb:8.2f} TB/7d device={m.get(\"device\",\"?\")}')
"
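To compare the 7-day total against annual thresholds, it has to be scaled to a year. The sketch below shows one way to do that; `annualize` is a hypothetical helper and the input is an invented figure. It uses decimal units (10^12 bytes per TB) because TBW ratings are decimal, whereas the query script above divides by 2^40, so that script's figures come out roughly 9% smaller.

```python
def annualize(bytes_written_7d):
    """Scale a 7-day byte total to (GB/day, TB/year) in decimal units."""
    gb_per_day = bytes_written_7d / 7 / 1e9
    tb_per_year = bytes_written_7d / 1e12 * 365 / 7
    return gb_per_day, tb_per_year


if __name__ == "__main__":
    # Hypothetical 7-day total of 3.5 TB.
    gb_day, tb_year = annualize(3.5e12)
    print(f"{gb_day:.1f} GB/day  {tb_year:.1f} TB/yr")
```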
Step 4: Interpretation
Baselines
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| sdc (HDD RAID1) annualized | <200 TB/yr | 200-400 TB/yr | >400 TB/yr |
| sdb (SSD) wear used | <50% | 50-80% | >80% |
| Single PVC write rate | <20 GB/day | 20-50 GB/day | >50 GB/day |
| Single VM write rate | <50 GB/day | 50-100 GB/day | >100 GB/day |
| NFS volume total | <20 GB/day | 20-50 GB/day | >50 GB/day |
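The baseline table can be applied mechanically. Here is a minimal sketch that encodes the thresholds above; the metric keys and the `classify` function are hypothetical names, and values exactly on a range boundary (for example 50 GB/day for a PVC) are treated as WARNING, following the table's `>` for Critical.

```python
# (warning_from, critical_above) pairs copied from the Baselines table.
THRESHOLDS = {
    "sdc_tb_per_year": (200, 400),
    "ssd_wear_pct": (50, 80),
    "pvc_gb_per_day": (20, 50),
    "vm_gb_per_day": (50, 100),
    "nfs_gb_per_day": (20, 50),
}


def classify(metric, value):
    """Return HEALTHY, WARNING, or CRITICAL for one measured value."""
    warning_from, critical_above = THRESHOLDS[metric]
    if value > critical_above:
        return "CRITICAL"
    if value >= warning_from:
        return "WARNING"
    return "HEALTHY"


if __name__ == "__main__":
    # Hypothetical measurements.
    print(classify("pvc_gb_per_day", 12.0))
    print(classify("sdc_tb_per_year", 250.0))
    print(classify("ssd_wear_pct", 85.0))
```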
Known Write Sources (expected baseline, April 2026)
| Source | Expected GB/day | Notes |
|---|---|---|
| MySQL standalone | 5-10 | uptimekuma heartbeats + phpipam. skip-log-bin, no GR |
| PostgreSQL cluster | 5-15 | Technitium DNS query logs (90-day retention) + app DBs |
| k8s-master etcd | 30-50 | etcd WAL + snapshot compaction |
| k8s-node VMs | 10-30 each | containerd layers, kubelet journals, ephemeral storage |
| Prometheus | 3-5 | TSDB compaction |
| home-assistant | 10-15 | Recorder database (SQLite/MariaDB) |
| NFS volume | 5-10 | Minimal after TrueNAS deprecation |
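A measured rate can also be checked against the expected range for a known source, which catches regressions that stay under the generic thresholds. The sketch below copies the ranges from the table above; the dictionary keys and the `compare_to_expected` name are hypothetical, and the observed value is invented.

```python
# (low, high) expected GB/day, copied from the Known Write Sources table.
EXPECTED_GB_DAY = {
    "MySQL standalone": (5, 10),
    "PostgreSQL cluster": (5, 15),
    "k8s-master etcd": (30, 50),
    "k8s-node VM": (10, 30),
    "Prometheus": (3, 5),
    "home-assistant": (10, 15),
    "NFS volume": (5, 10),
}


def compare_to_expected(source, observed_gb_day):
    """Return 'below', 'within', or 'above' relative to the expected range."""
    low, high = EXPECTED_GB_DAY[source]
    if observed_gb_day > high:
        return "above"
    if observed_gb_day < low:
        return "below"
    return "within"


if __name__ == "__main__":
    # Hypothetical observation: MySQL writing 40 GB/day.
    print(compare_to_expected("MySQL standalone", 40.0))
```

An "above" result for MySQL, for example, would be a prompt to check whether binary logging has been re-enabled.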
Red Flags (investigate immediately)
- Any single PVC >50 GB/day
- MySQL `log_bin` = ON (should be OFF; `skip-log-bin` in standalone config)
- Technitium MySQL or SQLite query log plugins re-installed (should be uninstalled)
- NFS writes >30 GB/day (media ingestion or backup churn)
- SSD wear >80% or projected life <2 years
- k8s node VM writes >100 GB/day (something writing heavily to ephemeral storage)
Step 5: Report Format
Present findings as three tables:
1. Physical Disks
| Disk | Type | 7d Total | Rate GB/day | Annualized | Status |
|---|---|---|---|---|---|
2. Top Writers (VMs + PVCs combined, sorted by rate)
| Rank | Name | Type | GB/day | Status | Notes |
|---|---|---|---|---|---|
3. By K8s Namespace
| Namespace | PVC Writes GB/day | Top Contributor |
|---|---|---|
End with:
- Annualized wear projections
- Comparison with previous run (if user provides one)
- Action items for any WARNING/CRITICAL findings
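As an illustration of the Top Writers layout, the sketch below ranks a mixed list of VMs and PVCs by write rate and emits the table rows. The function name `top_writers_table` and the sample entries are hypothetical, and the status strings are assumed to have been assigned already from the Step 4 baselines.

```python
def top_writers_table(writers):
    """Render writers (dicts with name, type, gb_day, status, notes) as a ranked table."""
    lines = [
        "| Rank | Name | Type | GB/day | Status | Notes |",
        "|---|---|---|---|---|---|",
    ]
    ranked = sorted(writers, key=lambda w: w["gb_day"], reverse=True)
    for rank, w in enumerate(ranked, start=1):
        lines.append(
            f"| {rank} | {w['name']} | {w['type']} | {w['gb_day']:.1f} "
            f"| {w['status']} | {w.get('notes', '')} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical entries for illustration.
    print(top_writers_table([
        {"name": "mysql-standalone", "type": "PVC", "gb_day": 8.0, "status": "HEALTHY"},
        {"name": "k8s-master", "type": "VM", "gb_day": 42.0, "status": "HEALTHY",
         "notes": "etcd WAL"},
    ]))
```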