---
name: disk-wear
description: |
  Analyze disk write patterns on the PVE host to assess wear and identify
  top writers by VM, k8s app, and PVC. Use when:
  (1) User asks about disk wear, disk writes, or storage health,
  (2) User says "what's wearing the disk", "disk analysis", "I/O analysis",
  (3) User wants to check write rates by VM, k8s namespace, or PVC,
  (4) Periodic quarterly disk health review.
  Combines PVE host I/O stats (SSH), Prometheus metrics (PromQL), and
  k8s PVC-to-pod mapping for a full breakdown.
author: Claude Code
version: 1.0.0
date: 2026-04-17
---

# Disk Wear Analysis

## Infrastructure

| Resource | Address | Notes |
|----------|---------|-------|
| PVE host | `root@192.168.1.127` (SSH) | Dell R730, PERC H730 RAID |
| Prometheus | `prometheus-server.monitoring.svc:80` | Query via alertmanager pod (wget) |
| SSD sdb | Slot 4, Samsung 850 EVO 1TB | Rated 150 TBW |
| HDD sdc | RAID1 (2x 11.7TB SAS 7200RPM) | Main data disk, enterprise rated ~550 TB/yr |
| HDD sda | 1.2TB SAS 10K RPM | Backup only |

## Step 1: Physical Disk Overview + SSD Health

```bash
ssh root@192.168.1.127 'echo "=== UPTIME ===" && uptime && echo "" && \
echo "=== PHYSICAL DISK CUMULATIVE (since boot) ===" && iostat -d -k sda sdb sdc 2>/dev/null && echo "" && \
echo "=== SSD SMART (Samsung 850 EVO, slot 4) ===" && \
smartctl -d sat+megaraid,4 -A /dev/sda 2>/dev/null | grep -iE "power_on|reallocat|written|wear|pending|uncorrect"'
```

**Interpret SSD health:**
- `Wear_Leveling_Count`: read the normalized `VALUE` column. 100 = new, 0 = worn out. Wear used is `(100 - value)%`.
- `Total_LBAs_Written`: multiply the raw value by 512 (bytes per LBA) to get bytes, then divide by 10^12 for TB written. Compare against the 150 TBW rating.
- Estimate remaining life in years: `(150 TBW - current TBW) / annual write rate in TB/yr`.
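
The arithmetic in the bullets above can be sketched in Python as follows. This is a minimal illustration, not part of the original workflow: the function names are hypothetical, and the sample readings in the `__main__` block are made-up values rather than data from the real host.

```python
LBA_BYTES = 512      # Total_LBAs_Written is counted in 512-byte units
RATED_TBW = 150.0    # 850 EVO 1TB rating from the Infrastructure table


def tb_written(total_lbas_written: int) -> float:
    """Convert the Total_LBAs_Written raw value to decimal terabytes."""
    return total_lbas_written * LBA_BYTES / 1e12


def wear_used_pct(wear_leveling_value: int) -> int:
    """Normalized Wear_Leveling_Count (100 = new) -> percent of wear used."""
    return 100 - wear_leveling_value


def years_remaining(current_tbw: float, annual_tb: float,
                    rated_tbw: float = RATED_TBW) -> float:
    """Years until the rated TBW is reached at the given annual write rate."""
    if annual_tb <= 0:
        return float("inf")
    return max(rated_tbw - current_tbw, 0.0) / annual_tb


if __name__ == "__main__":
    # Hypothetical readings for illustration only.
    tbw = tb_written(100_000_000_000)
    print(f"written: {tbw:.1f} TB, wear used: {wear_used_pct(83)}%")
    print(f"years remaining at 10 TB/yr: {years_remaining(tbw, 10.0):.1f}")
```

The annual write rate can be taken from the 7-day total in Step 3d, scaled to a year.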

## Step 2: Real-Time Snapshot (30 seconds)

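SSH to the PVE host and read block device stats twice, 30 seconds apart. Field 7 of `/sys/block/<dev>/stat` is the count of sectors written (512-byte units), so the delta between the two reads gives the average write rate over the window, independent of Prometheus scrape intervals.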

```bash
ssh root@192.168.1.127 'bash -s' << 'SCRIPT'
echo "=== 30-SECOND SNAPSHOT ($(date)) ==="
declare -A snap1
for dm in /sys/block/dm-*; do
  name=$(basename $dm)
  snap1[$name]=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
done
for d in sda sdb sdc; do
  snap1[$d]=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
done

sleep 30

printf "%-12s %10s %10s  %s\n" "DEVICE" "kB/s" "GB/day" "NAME"
echo "-------------------------------------------------------------------"
results=""
for dm in /sys/block/dm-*; do
  name=$(basename $dm)
  s2=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
  s1=${snap1[$name]:-0}
  diff=$((s2 - s1))
  if [ "$diff" -gt 100 ]; then
    kbps=$((diff / 2 / 30))
    gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
    lvm=$(dmsetup info --columns --noheadings -o name /dev/$name 2>/dev/null)
    results="$results\n$name $kbps $gbday $lvm"
  fi
done
for d in sda sdb sdc; do
  s2=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
  s1=${snap1[$d]:-0}
  diff=$((s2 - s1))
  kbps=$((diff / 2 / 30))
  gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
  results="$results\n$d $kbps $gbday (physical)"
done
# Drop the empty first line left by the leading "\n", then sort by kB/s descending.
echo -e "$results" | grep -v '^$' | sort -k2,2 -rn | head -30 | while read -r dev kbps gbday name; do
  printf "%-12s %10s %10s  %s\n" "$dev" "$kbps" "$gbday" "$name"
done
SCRIPT
```

## Step 3: Prometheus — Per-App Write Attribution

Query Prometheus from inside the cluster (the alertmanager pod has wget). The JSON response is piped to `python3` on the machine running `kubectl` for formatting.

### 3a. Top PVC Writers (1h rate)

```bash
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(20,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name=~"vm-9999-pvc-.*"})' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    gb_day = val * 86400 / 1073741824
    if gb_day > 0.05:
        lv = m.get('lv_name','?').replace('vm-9999-','')
        print(f'{gb_day:8.1f} GB/day  {lv}')
"
```

Then enrich PVC UUIDs with names:
```bash
kubectl get pv -o custom-columns=NAME:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace | grep "pvc-<UUID>"
```

### 3b. Top VM Writers (1h rate)

```bash
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(10,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name!~"vm-9999-.*|root|swap|data.*|nfs.*|backup.*|ssd.*"})' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    gb_day = val * 86400 / 1073741824
    print(f'{gb_day:8.1f} GB/day  {m.get(\"lv_name\",\"?\")}')
"
```

Enrich VM IDs with names:
```bash
ssh root@192.168.1.127 'qm list' 2>/dev/null
```

### 3c. Aggregate PVC Writes by K8s Namespace

After collecting the top PVC writers from 3a, map each PVC UUID to its namespace using `kubectl get pv`, then sum by namespace. Present as a table:

| Namespace | GB/day | Top PVC |
|-----------|--------|---------|
| dbaas | ... | mysql-standalone, pg-cluster |
| monitoring | ... | prometheus-data |
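
One way to do the mapping and summation is sketched below. It assumes that the LV name with the `vm-9999-` prefix stripped equals the PV name (`pvc-<UUID>`), and that the input comes from `kubectl get pv -o json`. The function names `pv_index` and `by_namespace` are hypothetical helpers, not part of any existing tool.

```python
import json
from collections import defaultdict


def pv_index(pv_list_json: str) -> dict:
    """Map PV name -> (namespace, claim name) from `kubectl get pv -o json`."""
    index = {}
    for item in json.loads(pv_list_json).get("items", []):
        ref = item.get("spec", {}).get("claimRef") or {}
        index[item["metadata"]["name"]] = (
            ref.get("namespace", "?"),
            ref.get("name", "?"),
        )
    return index


def by_namespace(writers: dict, index: dict) -> list:
    """Sum GB/day per namespace; return (namespace, total, top PVC) rows.

    `writers` maps PV name -> GB/day, as printed by the 3a query.
    PVs missing from the index are grouped under "unknown".
    """
    totals = defaultdict(float)
    top = {}
    for pv_name, gb_day in writers.items():
        ns, claim = index.get(pv_name, ("unknown", pv_name))
        totals[ns] += gb_day
        if ns not in top or gb_day > top[ns][0]:
            top[ns] = (gb_day, claim)
    rows = [(ns, round(total, 1), top[ns][1]) for ns, total in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Each returned row corresponds to one line of the namespace table: namespace, total GB/day, and the largest single contributor.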

### 3d. Historical Trend (7-day total)

```bash
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
  --post-data='query=topk(10,increase(node_disk_written_bytes_total{instance=~"pve.*",device=~"sda|sdb|sdc"}[7d]))' \
  2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
for r in d['data']['result']:
    m = r['metric']
    val = float(r['value'][1])
    tb = val / 1099511627776
    print(f'{tb:8.2f} TB/7d  device={m.get(\"device\",\"?\")}')
"
```
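
To turn a 7-day byte total into the daily and annualized figures used in the report, a small conversion like the following may help. It is a sketch that uses the same binary divisors as the scripts above (so "TB" here is 2^40 bytes). The function name is hypothetical, and it assumes the 7-day window is representative of the whole year.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def annualize_7d(bytes_7d: float) -> dict:
    """Convert a 7-day byte total into TB/7d, GB/day and TB/yr."""
    per_day = bytes_7d / 7
    return {
        "tb_7d": bytes_7d / TIB,
        "gb_day": per_day / GIB,
        "tb_year": per_day * 365 / TIB,
    }


if __name__ == "__main__":
    # Hypothetical total: 7 TiB written over the 7-day window.
    print(annualize_7d(7 * TIB))
```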

## Step 4: Interpretation

### Baselines

| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| sdc (HDD RAID1) annualized | <200 TB/yr | 200-400 TB/yr | >400 TB/yr |
| sdb (SSD) wear used | <50% | 50-80% | >80% |
| Single PVC write rate | <20 GB/day | 20-50 GB/day | >50 GB/day |
| Single VM write rate | <50 GB/day | 50-100 GB/day | >100 GB/day |
| NFS volume total | <20 GB/day | 20-50 GB/day | >50 GB/day |
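
The thresholds above can be applied mechanically. The sketch below copies them from the table. The metric keys and the `status` function are hypothetical names, and treating the boundary values (for example exactly 20 or 50 GB/day) as WARNING is an assumption, since the table does not say which side they fall on.

```python
# (warning floor, critical floor) per metric, copied from the Baselines table.
BASELINES = {
    "sdc_tb_year": (200, 400),
    "ssd_wear_pct": (50, 80),
    "pvc_gb_day": (20, 50),
    "vm_gb_day": (50, 100),
    "nfs_gb_day": (20, 50),
}


def status(metric: str, value: float) -> str:
    """Classify a measured value as HEALTHY, WARNING or CRITICAL."""
    warn, crit = BASELINES[metric]
    if value < warn:
        return "HEALTHY"
    if value <= crit:
        return "WARNING"
    return "CRITICAL"


if __name__ == "__main__":
    # Hypothetical measurements for illustration only.
    for metric, value in [("pvc_gb_day", 12.0), ("vm_gb_day", 75.0), ("ssd_wear_pct", 85)]:
        print(f"{metric:14s} {value:8.1f}  {status(metric, value)}")
```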

### Known Write Sources (expected baseline, April 2026)

| Source | Expected GB/day | Notes |
|--------|----------------|-------|
| MySQL standalone | 5-10 | uptimekuma heartbeats + phpipam. `skip-log-bin`, no GR |
| PostgreSQL cluster | 5-15 | Technitium DNS query logs (90-day retention) + app DBs |
| k8s-master etcd | 30-50 | etcd WAL + snapshot compaction |
| k8s-node VMs | 10-30 each | containerd layers, kubelet journals, ephemeral storage |
| Prometheus | 3-5 | TSDB compaction |
| home-assistant | 10-15 | Recorder database (SQLite/MariaDB) |
| NFS volume | 5-10 | Minimal after TrueNAS deprecation |

### Red Flags (investigate immediately)

- Any single PVC >50 GB/day
- MySQL `log_bin` = ON (should be OFF — `skip-log-bin` in standalone config)
- Technitium MySQL or SQLite query log plugins re-installed (should be uninstalled)
- NFS writes >30 GB/day (media ingestion or backup churn)
- SSD wear >80% or projected life <2 years
- k8s node VM writes >100 GB/day (something writing heavily to ephemeral storage)

## Step 5: Report Format

Present findings as three tables:

**1. Physical Disks**
| Disk | Type | 7d Total | Rate GB/day | Annualized | Status |
|------|------|----------|-------------|------------|--------|

**2. Top Writers (VMs + PVCs combined, sorted by rate)**
| Rank | Name | Type | GB/day | Status | Notes |
|------|------|------|--------|--------|-------|

**3. By K8s Namespace**
| Namespace | PVC Writes GB/day | Top Contributor |
|-----------|-------------------|-----------------|

End with:
- Annualized wear projections
- Comparison with previous run (if user provides one)
- Action items for any WARNING/CRITICAL findings