scripts: hook apply-mbps-caps into the PVE host as a systemd timer
The qm-set I/O caps were previously applied only by manual one-shot runs of
apply-mbps-caps.sh, so any config drift (a manual `qm set`, a config restored
from /mnt/backup/pve-config like we did on 2026-05-26, a fresh VM clone) would
leave the affected VM uncapped until someone remembered to re-run the script.

Adds apply-mbps-caps.service (Type=oneshot) + apply-mbps-caps.timer firing:
- OnBootSec=5min: catches PVE host reboots & restored configs
- OnCalendar=hourly: catches manual qm-set drift / fresh clones
- Persistent=true: runs missed schedule after PVE downtime
- RandomizedDelaySec=2min

Same install pattern as the other PVE operational scripts (nfs-mirror,
daily-backup, offsite-sync-backup, lvm-pvc-snapshot; memory id=609 + id=542).
Source in this repo, deployed to /usr/local/bin + /etc/systemd/system/ on the
PVE host.

Script hardening: kept `set -uo pipefail` but dropped `-e` so one missing VM
doesn't abort the rest; each VM is gated on `qm status` existence; added a
fast-path "already at target" no-op log line for quiet hourly runs.

Installed on PVE (192.168.1.127) and smoke-tested: caps re-applied
successfully on all 8 VMs, next run 12:00 EEST.
Journal: `journalctl -u apply-mbps-caps -f` on the PVE host.
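The "dropped `-e`" hardening described above can be illustrated with a minimal sketch. It is not the deployed script: the `apply_one` body here is a stand-in that fails for one made-up VM id (999), and `run_all` is a hypothetical wrapper name. It only shows how a loop can record a failure and still process the remaining entries.

```shell
#!/usr/bin/env bash
set -uo pipefail   # no -e: a failing step must not end the whole run

# Hypothetical stand-in for the per-VM work; fails for one made-up id.
apply_one() {
  local vmid="$1"
  if [[ "$vmid" == "999" ]]; then
    echo "vmid $vmid: FAILED"
    return 1
  fi
  echo "vmid $vmid: ok"
}

# Run every id, remember whether any of them failed.
run_all() {
  local rc=0 vmid
  for vmid in "$@"; do
    apply_one "$vmid" || rc=1   # record the failure, keep looping
  done
  return "$rc"
}

run_all 201 999 202
echo "exit status: $?"
```

With `set -e` in place, a plain `apply_one "$vmid"` call would end the run at the first failing VM. Using `|| rc=1` keeps the loop going, and the non-zero status at the end still marks the systemd unit run as failed.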
parent 232409e798
commit 56a338f80b
3 changed files with 79 additions and 22 deletions
12
scripts/apply-mbps-caps.service
Normal file
@@ -0,0 +1,12 @@
+[Unit]
+Description=Apply per-VM I/O caps via qm set (idempotent)
+Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/apply-mbps-caps.sh
+After=pve-cluster.service
+Wants=pve-cluster.service
+
+[Service]
+Type=oneshot
+ExecStart=/usr/local/bin/apply-mbps-caps.sh
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=apply-mbps-caps
scripts/apply-mbps-caps.sh
@@ -1,10 +1,19 @@
 #!/usr/bin/env bash
-# Apply per-VM I/O caps via qm set on the PVE host.
-# - Reads each VM's current boot-disk options
-# - Appends mbps_rd=<N>,mbps_wr=<N>
-# - Re-applies via qm set (live, no reboot needed)
-# - Verifies with qm config | grep mbps
-set -euo pipefail
+# Apply per-VM I/O caps via `qm set` on the PVE host.
+#
+# - Reads each target VM's current boot-disk options.
+# - Appends/normalises `mbps_rd=<N>,mbps_wr=<N>`.
+# - Re-applies via `qm set` (live, no reboot needed).
+# - Idempotent: re-running with no drift is a no-op at the storage
+# level (proxmox config rewrite is cheap).
+# - Continues on per-VM failures so one missing/stopped VM doesn't
+# skip the rest — designed to be safe under the systemd timer.
+#
+# Backed by `apply-mbps-caps.{service,timer}` (hourly + 5min-after-boot).
+# Why these values: see beads code-9v2j + memory id=2726 (alloy IO storm)
+# + memory id=1575 (VMs intentionally out of TF).
+
+set -uo pipefail # NOT -e — keep going if a single VM step fails.
 
 # vmid:disk_slot:mbps_rd:mbps_wr (Linux VMs only — skipping 101 pfsense BSD, 300 Windows)
 TARGETS=(
@@ -14,34 +23,52 @@ TARGETS=(
   "201:scsi1:150:120" # k8s-node1 (GPU + many CSI disks; boots from scsi1)
   "202:scsi0:150:120" # k8s-node2
   "203:scsi0:150:120" # k8s-node3
-  "204:scsi0:150:120" # k8s-node4 (currently doing write recovery)
+  "204:scsi0:150:120" # k8s-node4
   "220:scsi0:40:40" # docker-registry
 )
 
-for spec in "${TARGETS[@]}"; do
+apply_one() {
+  local spec="$1"
+  local vmid slot rd wr
   IFS=: read -r vmid slot rd wr <<<"$spec"
-  printf '\n=== VMID %s slot=%s rd=%s MB/s wr=%s MB/s ===\n' "$vmid" "$slot" "$rd" "$wr"
-
-  current=$(qm config "$vmid" | awk -v s="$slot:" '$1==s {sub(/^[^ ]+ /, ""); print; exit}')
-  if [[ -z "$current" ]]; then
-    echo " ERROR: could not read $slot for vmid $vmid — skipping"
-    continue
+  # Skip non-existent VMs cleanly (e.g. node decommissioned, never rebuilt).
+  if ! qm status "$vmid" >/dev/null 2>&1; then
+    echo "vmid $vmid: not present on this host — skipping"
+    return 0
   fi
 
+  local current cleaned newvalue
+  current=$(qm config "$vmid" | awk -v s="$slot:" '$1==s {sub(/^[^ ]+ /, ""); print; exit}')
+  if [[ -z "$current" ]]; then
+    echo "vmid $vmid: no $slot line in config — skipping"
+    return 0
+  fi
+
   # Strip any existing mbps_rd / mbps_wr from the current string (idempotent)
   cleaned=$(echo "$current" | sed -E 's/,mbps_rd=[0-9]+//g; s/,mbps_wr=[0-9]+//g')
   newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}"
 
+  # Skip the qm-set call entirely when state already matches — keeps
+  # journal noise low under the hourly timer.
+  if [[ "$current" == "$newvalue" ]]; then
+    echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op"
+    return 0
+  fi
+
+  echo "vmid $vmid: updating $slot"
   echo " before: $current"
   echo " after: $newvalue"
+  if qm set "$vmid" "--$slot" "$newvalue"; then
+    echo " ok"
+  else
+    echo " FAILED: qm set returned non-zero"
+    return 1
+  fi
+}
 
-  qm set "$vmid" "--$slot" "$newvalue"
-  echo " verify: $(qm config "$vmid" | awk -v s="$slot:" '$1==s {print; exit}')"
-done
-
-echo
-echo "=== Final verification — mbps on all targets ==="
+rc=0
 for spec in "${TARGETS[@]}"; do
-  IFS=: read -r vmid slot _ _ <<<"$spec"
-  echo "vmid $vmid: $(qm config "$vmid" | awk -v s="$slot:" '$1==s {print; exit}')"
+  apply_one "$spec" || rc=1
 done
+
+exit "$rc"
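The extract, strip and append steps in the script diff above can be tried without a PVE host. The sketch below is an illustration under assumptions, not part of the commit: `sample_config` is a made-up stand-in for `qm config <vmid>` output, and `extract_slot` and `normalize_caps` are hypothetical helper names wrapping the same awk and sed expressions the script uses.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Made-up stand-in for `qm config <vmid>` output; real output has more keys.
sample_config() {
  cat <<'EOF'
boot: order=scsi0
memory: 8192
scsi0: local-lvm:vm-202-disk-0,iothread=1,size=64G
scsi1: local-lvm:vm-202-disk-1,size=200G
EOF
}

# Same awk filter as the script: match the "<slot>:" key, drop it, keep the value.
extract_slot() {
  local slot="$1"
  awk -v s="$slot:" '$1==s {sub(/^[^ ]+ /, ""); print; exit}'
}

# Strip any existing caps, then append the target caps (hypothetical helper name).
normalize_caps() {
  local current="$1" rd="$2" wr="$3" cleaned
  cleaned=$(printf '%s\n' "$current" | sed -E 's/,mbps_rd=[0-9]+//g; s/,mbps_wr=[0-9]+//g')
  printf '%s,mbps_rd=%s,mbps_wr=%s\n' "$cleaned" "$rd" "$wr"
}

current=$(sample_config | extract_slot scsi0)
first=$(normalize_caps "$current" 150 120)
second=$(normalize_caps "$first" 150 120)

echo "before: $current"
echo "after:  $first"
if [[ "$first" == "$second" ]]; then
  echo "second pass is a no-op"
fi
```

The fast-path in the script is an exact string comparison, so it only reports a no-op when the stored disk line already ends with the two caps in this order. If the caps sit elsewhere in the line, the normalised string differs from the current one and `qm set` is called.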
18
scripts/apply-mbps-caps.timer
Normal file
@@ -0,0 +1,18 @@
+[Unit]
+Description=Re-apply per-VM I/O caps periodically + after PVE boot
+
+[Timer]
+# After every PVE host reboot — caps survive in /etc/pve/qemu-server/<vmid>.conf
+# normally, but a config restore from backup can drop them (see 2026-05-26
+# incident where we restored 202.conf + 203.conf from /mnt/backup/pve-config/).
+OnBootSec=5min
+
+# Hourly during normal operation — catches manual `qm set` drift or fresh
+# VM clones that haven't had caps applied yet.
+OnCalendar=hourly
+
+Persistent=true
+RandomizedDelaySec=2min
+
+[Install]
+WantedBy=timers.target