docs(compute): mark all Linux VMs as hand-managed; document apply-mbps-caps timer

Reflects the 2026-05-26 decision (commit 44c3770a) to keep Linux VMs
out of Terraform — telmate/proxmox v3.0.2 mangles dynamically-attached
disks (id=539) and doesn't refresh mbps_*_concurrent back from live
state. What stays in TF: the cloud-init templates. Per-VM I/O caps
are now driven by the apply-mbps-caps systemd timer (commit 56a338f8).

Replaces the stale note about iSCSI mangling — that rationale is
obsolete (iSCSI gone since 2026-04-11) and the new scope is
intentional, not provisional.
Viktor Barzin 2026-05-26 08:38:00 +00:00
parent 5cc91e67bf
commit c0618ae1ae

@@ -79,13 +79,33 @@ graph TB
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
> **node1 RAM (2026-05-10)**: bumped from 32 → 48 GiB out-of-band via
> `qm set 201 --memory 49152` because VMID 201 is intentionally not
> managed by Terraform yet (telmate/proxmox provider bug with iSCSI
> PVCs — see `infra/stacks/infra/main.tf` line 442). Driver: GPU
> multi-tenancy (frigate + ytdlp + llama-swap + immich-ml) was
> hitting 94% memory-request saturation on the old size. Adopt this
> VM into TF (`module "k8s-node1"`) once we've migrated to bpg/proxmox.
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
> provider rewrites every disk slot on update — even ones covered by
> `lifecycle.ignore_changes` — and it doesn't refresh per-disk
> `mbps_*_concurrent` fields back from live state. We hit both bugs
> in production (id=539 iSCSI mangling 2026-04-02, and the 2026-05-26
> import attempt that corrupted k8s-node2 + k8s-node3 .conf files;
> recovered via `/mnt/backup/pve-config/etc-pve/nodes/pve/qemu-server/`
> nightly backups). What stays in TF: the cloud-init templates
> (`k8s-node-template`, `non-k8s-node-template`,
> `docker-registry-template` in `stacks/infra/main.tf`) — a fresh VM
> still clones the right template and runs the same bootstrap.
>
> Per-VM I/O caps (defense against sdc saturation by a single noisy
> guest) are applied by `apply-mbps-caps.{sh,service,timer}` on the
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
>
> Re-adoption into TF (via the `bpg/proxmox` provider, which models
> dynamic disks correctly) is possible but not scheduled: the
> cloud-init templates above already capture the
> bootstrap-reproducibility goal.
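>
> A minimal sketch of a timer unit matching the schedule described
> above. Only `OnBootSec=5min` and `OnCalendar=hourly` come from this
> note; the `Description`, the explicit `Unit=` line, and the
> `[Install]` section are assumptions, and the real unit in
> `infra/scripts/` may differ.
>
> ```ini
> # apply-mbps-caps.timer (illustrative sketch, not the deployed file)
> [Unit]
> # Hypothetical description text.
> Description=Re-apply per-VM disk I/O caps
>
> [Timer]
> # First run five minutes after the PVE host boots.
> OnBootSec=5min
> # Then once per hour, on the hour.
> OnCalendar=hourly
> # Assumed: a timer activates the service with the same base name by
> # default, so this line is shown only for clarity.
> Unit=apply-mbps-caps.service
>
> [Install]
> # Assumed install target, the usual one for timers.
> WantedBy=timers.target
> ```
>
> With both triggers set, a host reboot gets its caps re-applied
> within five minutes, and any later drift is corrected at the next
> hourly run.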
### GPU Passthrough