infra/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md
Viktor Barzin c3a63fcd38 apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip]
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.

Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.

Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:00:08 +00:00

116 lines
6.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) effectively
dead 15:2116:48 UTC (18:2119:48 EEST): all ssh/tmux and t3 sessions for
wizard/emo/anca lost, every in-flight agent killed.
- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert
exists (follow-up below).
- Recovery was manual: kill of the wedged QEMU process + `qm start` (the
kill left no autopsy — see "What we could not prove").
## Timeline (UTC; host journal runs EEST = UTC+3)
- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0
throttle via `qm set` (as it had done every hour for weeks — see Root
cause #4).
- **15:1815:20** — guest healthy by every metric: CPU 716% of 16 vCPUs,
load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 28%
utilized. Heavy claude/bwrap sandbox activity (normal workload).
- **15:19:08** — last journal line the guest ever writes (mid normal
traffic, zero kernel distress — not even a hung-task warning).
- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows
`diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even
journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP
the whole time** — the process and its main loop are alive; only the
block path is dead.
- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus:
processes progressively piling up behind dead storage (dirty-page
writeback stuck → direct reclaim spins). Classic starvation cascade, not
a panic (a panic halts or spins flat from t=0).
- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band
(root shell; no PVE task, no snoopy line — shell-builtin `kill`).
- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124
(wedged boot ran 6.8.0-117).
## Ruled out (evidence, not vibes)
- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus)
and per-minute host RRD.
- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel
I/O errors on the host all day; `sdc` quiet through the window.
- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash
log; 113 of 114 monitored targets stayed up. Only the devvm died.
- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at
0 while netout ACKs trickle; and the guest kernel logged nothing.
## Root cause
**Class pinned, exact line unprovable** (see below): the devvm's disk I/O
stalled *inside the QEMU process* — below the guest kernel (all guest I/O
froze simultaneously with nothing logged) and above host storage (host
clean, neighbors fine, QEMU main loop responsive). Contributing stack,
unique to this VM:
1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy
default for OSes without virtio drivers). The devvm was the **only VM
on the host** running its disk through this path; every healthy
neighbor uses `virtio-scsi-pci`. The LSI model is documented as
hang-prone under intensive I/O.
2. **No `iothread`** — all disk emulation ran on QEMU's single main event
loop, sharing it with timers and QMP.
3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose
queued I/O completes only when its re-arm timer fires.
4. **Hourly live throttle rewrites**`apply-mbps-caps.sh`'s idempotency
check compared raw config strings, but `qm config` prints keys in its
own canonical order, so the check **never matched** and the script
re-issued `qm set` (→ live QMP `block_set_io_throttle` against the
running QEMU) every hour, 24×/day, for weeks — each poke a chance to
race the throttle machinery while queued I/O is in flight. The wedge
came 20 min after the 15:01 poke.
## What we could not prove
Whether the stuck queue was the LSI device model, the throttle-group
timer, or their interaction. The discriminating evidence (QMP
`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47
and was destroyed by the recovery kill. If a wedge recurs **autopsy before
shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`,
`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the
kvm process pin it to the line.
## Fixes
| Status | Fix |
|---|---|
| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. |
| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. |
## Open follow-ups (discussed 2026-06-11, not yet built)
- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the
90-min detection gap.
- Freeze forensics: netconsole → pve listener, serial console,
`kernel.panic=60`, and a capture-before-kill runbook (above) so any
recurrence is pinned, not mourned.
- The recurring *crawl* class (agent storms → swap-thrash; journald
watchdog-killed 3× on 2026-06-10) is a separate failure mode —
ssh/tmux sessions remain memory-uncontained by explicit decision
(swap-only, 2026-06-10).
## Lessons
- **A VM can die of QEMU-userspace causes that no guest or host kernel log
will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the
only witness — `diskwrite=0` with a live QMP socket is the signature.
- **"Idempotent" reconcilers must prove idempotency against the system's
canonical output format**, not against the string they themselves
constructed. A compare that never matches turns a safety net into a
24×/day fault injector — and its own journal said `updating scsi0`
every hour, in plain sight, for weeks.
- The May-26 mbps caps fixed the sdc-saturation freeze class and
introduced this one's trigger surface. Layered mitigations fail in
layers — audit what a fix *adds*, not only what it removes.
- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every
cross-machine correlation in this incident initially looked 3h off.