apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]

The raw string compare never matched, because `qm config` prints option
keys in its own canonical order. The hourly timer therefore re-issued
'qm set' against every running capped VM, live-rewriting QEMU throttle
state via QMP 24x/day. This is implicated in today's devvm freeze
(2026-06-11, 15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.
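
A minimal sketch of the mismatch, reusing the normalized() helper this
commit adds. The two disk specs are hypothetical values chosen for
illustration, not captured `qm config` output; they hold the same option
set in a different key order.

```shell
#!/usr/bin/env bash
# Same helper as in the diff: split on commas, sort bytewise, rejoin.
normalized() {
  tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, -
}

# Hypothetical specs (assumed for illustration): same options, different order.
current="local-lvm:vm-220-disk-0,mbps_rd=40,mbps_wr=40,size=32G"
newvalue="local-lvm:vm-220-disk-0,size=32G,mbps_rd=40,mbps_wr=40"

# Old check: raw string equality, sensitive to key order.
if [[ "$current" == "$newvalue" ]]; then raw_cmp=match; else raw_cmp=differ; fi

# New check: equality of the sorted option lists.
if [[ "$(normalized "$current")" == "$(normalized "$newvalue")" ]]; then
  set_cmp=match
else
  set_cmp=differ
fi

echo "raw=$raw_cmp set=$set_cmp"   # prints: raw=differ set=match
```

Under the old check the specs above count as different, so the script
would fall through to 'qm set' on every run. Under the sorted compare
they count as equal and the run is a no-op.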

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor Barzin 2026-06-11 18:00:08 +00:00
parent 2e0cebff87
commit c3a63fcd38
4 changed files with 136 additions and 4 deletions


@@ -27,6 +27,12 @@ TARGETS=(
 "220:scsi0:40:40" # docker-registry
 )
+# Sort a disk spec's comma-separated options so two specs with the same
+# option set but different key order compare equal.
+normalized() {
+tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, -
+}
 apply_one() {
 local spec="$1"
 local vmid slot rd wr
@@ -49,8 +55,13 @@ apply_one() {
 newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}"
 # Skip the qm-set call entirely when state already matches — keeps
-# journal noise low under the hourly timer.
-if [[ "$current" == "$newvalue" ]]; then
+# journal noise low under the hourly timer. Compare option SETS, not raw
+# strings: `qm config` prints keys in its own canonical order, so a raw
+# compare never matched and every hourly run re-issued `qm set`, which
+# live-rewrites the running VM's QEMU throttle state via QMP (implicated
+# in the 2026-06-11 devvm I/O stall — see
+# docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md).
+if [[ "$(normalized "$current")" == "$(normalized "$newvalue")" ]]; then
 echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op"
 return 0
 fi