- Add new "Data Routing" flowchart up front showing which paths go
where (sda mirror vs Synology-direct vs not-backed-up).
- Overall Backup Flow: split Layer 2 into 2a (nfs-mirror daily 02:00)
and 2b (daily-backup 05:00); show nfs-mirror as an explicit
component; clarify Step 2 is immich-only direct + nfs-ssd.
- Weekly Backup Timeline → Daily Backup Timeline: actual schedule
(00:00 LVM, 00:15 PG, 00:45 MySQL, 02:00 nfs-mirror, 05:00 daily-
backup, 06:00 offsite-sync, 12:00 second LVM); explicit inotify
feeding Step 2.
- Physical Disk Layout: current capacity numbers + dual sdc→sda and
sdc→Synology arrows (immich-only) reflecting the two-leg design.
- Restore Decision Tree: refreshed age tiers (< 12h LVM, 12h-4w sda,
> 4w Synology) + dedicated branch for immich photos (which only
have an offsite copy).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the SSD section's pattern. If the LAST iteration of the
`while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs/}"; done`
body sees a file that was deleted between inotify capture and now
(e.g. an immich encoded-video temp file that got cleaned up), the
while loop returns 1, pipefail propagates, set -e kills the script
silently before reaching the rsync. No log line, just disappears.
Pre-existing bug; only exposed today after pruning the bypass regex
to immich-only — when the regex was broader, the last match in the
sorted dedup'd inotify log happened to be a live file often enough
that the bug stayed dormant. Validated by full e2e run:
1120 nfs/immich files + 2285 nfs-ssd files shipped successfully.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Handoff artifact for next session. v7 is the converged staged plan
(Stage A hardened Ubuntu → B DR primitives → C 6-week soak →
D-optional Talos). User decision pending: pick v4 (full Talos, 117-178h)
vs v7 (staged, 30-37h to decision point) vs hybrid.
Full context in ~/.claude/plans/distributed-humming-sonnet.md.
Re-enables Keel after the 2026-05-26 emergency stop, with a safer default.
Switch Kyverno-injected default from `force + match-tag=true` (proven
unreliable — it rewrote tag strings cluster-wide despite the design intent)
to `patch`, which is semver-parser-bounded:
- Only patch bumps within current major.minor (1.2.3 → 1.2.4, never
1.3.x or 2.x — the parser does the math, not string compare).
- Non-semver tags (`:latest`, `:v4`, `:2`, SHA, `:nightly`) are
IGNORED entirely. No tag rewriting under any code path.
- 151 stale `force` annotations migrated to `patch` cluster-wide
during this apply (anchor `+()` dropped, then re-added).
Live state after this commit:
0 workloads on `force`, 209 on `patch`, 22 on `never`.
Keel deployment back to 1/1 on `:0.21.1`.
Note: 22 workloads with `keel.sh/policy=never` LABEL had their annotation
mutated to `patch` during the migration despite Kyverno's
matchLabels-based exclude rule — appears to be a quirk of
`mutateExistingOnPolicyUpdate` not honoring `selector` excludes. Repatched
all 22 back to `annotation=never` via `kubectl annotate --overwrite`, then
restored the `+(keel.sh/policy)` anchor in the policy so future Kyverno
reconciles preserve them.
Also fixes CI build-cli workflow which was blocked by
`deny-privileged-containers` since wave 1 enforce flip on 2026-05-18:
woodpecker namespace added to the shared security_policy_exclude_namespaces
list (CI pipeline pods `wp-*` run privileged docker builds, legitimate use).
The `default` workflow (terragrunt apply) was already passing — only the
parallel `build-cli` workflow (which builds the infra-cli docker image) was
failing, but it took the overall pipeline status down with it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Steady-state delta runs in 10-20 min and the weekly cadence left a
real RPO gap: app data under /srv/nfs/<svc>/ that isn't a PVC
(captured by daily-backup) or a *-backup CronJob (captured daily by
the CronJob writing to /srv/nfs/<svc>-backup/) was on a 7-day worst
case for off-disk durability. Affected paths include nextcloud shared
files, audiobookshelf library, mailserver Maildir, calibre, servarr
metadata, real-estate-crawler scraped data, openclaw agent state.
Daily cadence drops their RPO to ~24h at negligible cost.
Slot: 02:00, 3h ahead of daily-backup (05:00) so the manifest is
populated before offsite-sync reads it at 06:00.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Post-2026-05-26 unclean node2 reboot left redis-v2-2's incremental AOF
truncated at offset 84799139. With aof-load-corrupt-tail-max-size at its
default 0, redis refuses to load any corruption and crashloops. Setting
1024 lets it truncate the corrupted tail and continue, which is the
right call for a non-source-of-truth cache fronted by sentinel.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).
Changes:
* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
0/0. Keep off until either match-tag is root-caused or every enrolled
workload migrates to a content-addressed (SHA) pin.
* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
deployment label (matches Kyverno's exclude rule so the inject-keel-
annotations ClusterPolicy stops mutating) AND the annotation (so Keel
itself respects). Removed keel.sh/policy from lifecycle.ignore_changes
so TF owns it as `never` and can't drift back to `force`.
* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
on both seed-config and workbench containers (was :latest, Keel rolled
to :0.1.0).
* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
by Keel from the prior live :3.2.1).
* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
100% and blocked every new pod create with FailedCreate. Raising the cap
unblocked the four affected DaemonSets in one shot.
* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
face-detection burst behaviour.
* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
(matches the 21 other stacks that already declare it).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously /srv/nfs/{ollama,audiblez,ebook2audiobook,*-backup} took
the sdc → Synology direct leg. They now ride sdc → sda → Synology
pve-backup/ via nfs-mirror like every other NFS subtree, so sda
becomes the single canonical mirror and Synology only has to ingest
one feed for the bulk of cluster state.
frigate + temp dropped from BOTH legs (no backup anywhere) per
explicit user ask — frigate is a 14d camera ring, temp is scratch.
prometheus/loki/alertmanager dropped as no-op (orphan dirs that
no longer exist on /srv/nfs).
Also: nfs-mirror's manifest collection switched from find -newer
(mtime) to find -cnewer (ctime) — rsync -t preserves source mtime
on dest, so freshly-written files looked "older than \$STAMP" and
the 2026-05-26 full mirror run captured only 2 of 800k transferred
files. Hit during this session, recovered via .force-full-sync.
Operational result post-rollout:
- sda 87% → 70% (anca-elements 423G deleted, +260G new dirs)
- /Viki/nfs/ on Synology: was 24 stale dirs (~430G), now immich only
- Synology free: ~300G → ~430G+ once btrfs reclaim catches up
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier today Keel's hourly poll caught vaultwarden's deployment in a
window where the `keel.sh/match-tag` annotation wasn't set, fell into
'watch repository tags' mode, and rewrote 1.35.7 -> 1.21.0. Vaultwarden
1.21.0 doesn't have the API endpoints the modern Bitwarden clients call
(/identity/accounts/prelogin/password, /api/devices/knowndevice,
/api/config), so the Chrome extension started 404-ing on login.
Same race shape as the 2026-05-17 authentik/pgbouncer incident. The
fundamental issue: `policy: force` on a semver-pinned tag is unsafe
because Keel happily rewrites the tag string if it can't find a stable
'current tag' to digest-watch.
Fix: switch to `:latest` (the mutable tag vaultwarden publishes for the
newest stable release). Keel now digest-watches `:latest` (safe mode)
and rolls forward on each upstream release. Matches cluster convention
(128 other Keel-managed workloads use the same `:latest` + force +
match-tag pattern).
Also added imagePullPolicy=Always (required with :latest so the kubelet
revalidates the manifest on each rollout instead of using a cached
layer), and extended the lifecycle.ignore_changes to cover the
match-tag annotation and kubernetes.io/change-cause (Keel rewrites
this on every rollout).
Current `:latest` digest -> vaultwarden 1.36.0 (released 2026-05-03).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Goal: re-clone the worker template, boot, and have it appear as `kubectl
get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker
NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five
distinct ways on a clean boot.
Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today):
1. `indent(6, containerd_config_update_command)` indented the bodies of
`cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so
[plugins.*] TOML sections landed in /etc/containerd/config.toml at
col 6 — containerd refused to parse them. Source is now a normal
.sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`)
base64-embedded into `write_files`; YAML whitespace never touches
the heredoc bodies.
2. The same script tried to `cat >> /etc/containerd/config.toml`
`[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd
v2.2.4's `config default` ALREADY emits. Result: `toml: table …
already exists`. Patched with sed-in-place overrides instead.
3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from
the containerd setup script — BEFORE `kubeadm join` writes that
file. Sed aborted with "No such file or directory", `set -e` killed
the script, post-script cloud-init steps kept going (cloud-init
doesn't stop on runcmd failure). Split into a dedicated
`k8s-node-post-join-tune.sh` invoked AFTER kubeadm join.
4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE
kubeadm join. kubelet defaults to failSwapOn=true → exited 1
immediately. Replaced the swap setup with `swapoff -a` (node4
already runs this way and the cluster is fine).
5. Without `hostname:` in the shared user-data snippet, Proxmox's
auto-generated meta-data does NOT include local-hostname when
`cicustom user=…` is set — so cloud-init falls back to the cloud
image's default `ubuntu` and `kubeadm join` registers the wrong
node name. `provision-k8s-worker` now writes a per-node
`<NAME>-meta.yaml` snippet and passes both via
`cicustom user=…,meta=…`.
Other improvements rolled in while fixing the above:
- `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`,
added today) instead of `var.ssh_public_key`. The last
`terragrunt apply` was run with that var empty, leaving the snippet's
`ssh_authorized_keys` with a single blank entry; the wizard user
was effectively locked out of every fresh node.
- `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf`
with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it,
systemd-resolved only consulted Technitium (link-level), which
returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from
the Forgejo registry then failed DNS until I patched it manually
on node5.
- k8s apt repo bumped v1.32 → v1.34 (matches cluster).
- The containerd setup script now creates hosts.toml for forgejo,
quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4
had these added by hand post-bootstrap; now they're baked in.
- `config_path` sed matches both `""` (containerd v1) and `''`
(containerd v2.x). Without the v2 match, the certs.d mirror dir was
silently ignored.
- `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI
topology labels (region/zone, max-volume-attachments=28) apply on
next `tg apply`.
- `stacks/infra/main.tf` shed the 160-line inline containerd setup
heredoc — that whole thing now lives in the module as a .sh file.
Known unsolved gaps (deferred):
- iscsid restart hangs ~90s on first boot before SIGKILL releases it
(systemd-resolved restart kicks iscsid via dependency). Adds wall-
clock time but doesn't block the join.
- `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi`
afterward, so the CSI topology labels need a manual apply after
the node joins. Solving cleanly needs the CSI map to derive from
`kubectl get nodes` instead of a static local — separate work.
- `var.containerd_config_update_command` is now ignored when
is_k8s_template=true (replaced by the bundled .sh file). Variable
kept with a deprecation note to avoid breaking other call sites.
E2E proof: k8s-node6 (VMID 206) boots hands-off from
`provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as
`kubectl get nodes …Ready` ~7 min later (most of which is the apt
package_upgrade — separate optimization).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902
op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate)
have no priorityClassName field at all — replace requires the path to
exist, so the patch fails with "doc is missing key: /spec/priorityClassName"
and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier)
gets a chance to add the field.
Result: GPU pods never got priorityClassName set, sat at priority=0, and
could not preempt lower-tier pods on the GPU node. Observed today on
frigate post-node4-recovery — pod stayed Pending with "Preemption is
not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000)
occupied node1's memory budget.
Fix: op=add for all three paths. add works whether or not the key is
present, so the policy is robust to the upstream pod shape.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Default CNPG affinity was `preferred` (soft). During the 2026-05-26
node4 outage, all 3 pg-cluster pods drifted onto k8s-node1 — losing
that node would have taken the whole PG cluster down (no quorum) AND
the 9.2 GiB pg-cluster footprint was the dominant reason frigate
couldn't fit on the GPU node.
With 3 instances + 4 worker nodes, `required` is safe under 1-node
drain (3 distinct nodes always available, even excluding the drained
one).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reflects the 2026-05-26 decision (commit 44c3770a) to keep Linux VMs
out of Terraform — telmate/proxmox v3.0.2 mangles dynamically-attached
disks (id=539) and doesn't refresh mbps_*_concurrent back from live
state. What stays in TF: the cloud-init templates. Per-VM I/O caps
now driven by the apply-mbps-caps systemd timer (commit 56a338f8).
Replaces the stale note about iSCSI mangling — that rationale is
obsolete (iSCSI gone since 2026-04-11) and the new scope is
intentional, not provisional.
Moves the containerd_config_update_command interpolation out of the
runcmd list and into a write_files block delivering
/usr/local/bin/k8s-node-containerd-setup.sh. runcmd then just calls
the script.
Why: the heredoc in stacks/infra/main.tf has mixed-indent inner shell
heredocs (CONTAINERD_GC, KUBELET_PATCH bodies at col 0, surrounding
text at col 2). When inserted into a `runcmd: - $${var}` item — even
wrapped in a `- |` literal block — YAML's block-indent rule
terminates the block early on the col-0 lines. The result is a silent
cloud-init parse failure on every new k8s node (observed 2026-05-26
during node4 rebuild — node booted into the minimal default config,
no kubeadm join, no containerd tuning, no kubelet shutdown grace).
write_files writes the multi-line content into a YAML literal block
where the script body is just opaque text — the block's content
indent is set by the `content: |` block's own indentation (col 6)
and any indent >= 6 is valid content. Any further indent inside the
script (like the col-0 `[plugins...]` heredoc lines now at col 6 via
indent(6, ...)) is preserved cleanly.
Verified: `yaml.safe_load()` on the rendered snippet now reports
`runcmd=36 write_files=1` (was throwing ParserError before).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shlinkio/shlink-web-client:0.1.1 listens on port 80 (nginx default),
not 8080 like the prior :latest images. Keel auto-bumped the tag
on 2026-05-23; liveness/readiness probes have been failing ever since
because they still hit :8080. Pod was stuck restarting, the
DeploymentReplicasMismatch alert fired.
Aligns containerPort + both probes + service target_port with the image.
The qm-set I/O caps were previously only applied by manual one-shot
runs of apply-mbps-caps.sh, so any config drift (manual `qm set`,
config restored from /mnt/backup/pve-config like we did on 2026-05-26,
fresh VM clone) would leave the affected VM uncapped until someone
remembered to re-run the script.
Adds apply-mbps-caps.service (Type=oneshot) + apply-mbps-caps.timer
firing:
- OnBootSec=5min — catches PVE host reboots & restored configs
- OnCalendar=hourly — catches manual qm-set drift / fresh clones
- Persistent=true — runs missed schedule after PVE downtime
- RandomizedDelaySec=2min
Same install pattern as the other PVE operational scripts (nfs-mirror,
daily-backup, offsite-sync-backup, lvm-pvc-snapshot — memory id=609 +
id=542). Source in this repo, deployed to /usr/local/bin + /etc/
systemd/system/ on the PVE host.
Script hardening: kept `set -uo pipefail` but dropped `-e` so one
missing VM doesn't abort the rest; each VM is gated on `qm status`
existence; added a fast-path "already at target" no-op log line for
quiet hourly runs.
Installed on PVE (192.168.1.127) and smoke-tested: all 8 VMs caps
re-applied successfully, next run 12:00 EEST. Journal: `journalctl
-u apply-mbps-caps -f` on the PVE host.
Idempotent qm-set script for the per-VM I/O caps on the PVE host's sdc
thin pool (2026-05-26 session, beads code-9v2j). Caps protect each
Linux VM's share of sdc so a runaway workload (e.g. the 2026-05-23/26
alloy IO storm — memory id=2726) cannot saturate the disk for everyone.
Was sitting in /tmp on PVE — moving the source under version control
and installing to /usr/local/bin/ alongside the other PVE operational
scripts (nfs-mirror, daily-backup, offsite-sync-backup; pattern from
memory id=609). Survives PVE host reboots; safe to re-run on any node
rebuild to restore the caps.
VMIDs covered (Linux only — pfSense 101 and Windows10 300 skipped):
102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60
201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120
204 k8s-node4 150/120 220 docker-registry 40/40
The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached
disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent
fields back from live state — every plan after a qm-set cap is applied
proposes to "fix" mbps 0 → N and the apply errors with the spurious
"the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes
does NOT block either failure mode.
Decision: stop trying to manage Linux VMs in this stack. The cloud-init
bootstrap stays in TF (via k8s-node-template, non-k8s-node-template,
docker-registry-template above), so a fresh node still clones the right
template and runs the same bootstrap. VM lifecycle stays in the Proxmox
UI. I/O caps are managed via qm-set on the PVE host (idempotent script
at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j).
Removed from TF state + HCL:
- module "k8s-master" (vmid 200)
- module "k8s-node2" (vmid 202) — pre-existing drift, never in state
- module "docker-registry-vm" (vmid 220) — was in state, hit refresh bug
Already hand-managed (never in HCL):
- 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough),
203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10.
Live I/O caps (qm set, all verified):
102=60/60 103=40/40 200=100/60 201=150/120 202=150/120
203=150/120 204=150/120 220=40/40
Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox
provider migration — telmate can't represent these VMs at all).
Closes: code-75ds
The previous indent(6, containerd_config_update_command) attempt didn't
fix the YAML parse error — the heredoc in stacks/infra/main.tf has
mixed indentation (most lines at col 2, inner shell heredoc bodies
like CONTAINERD_GC and KUBELET_PATCH at col 0). Any uniform-prefix
function (indent / replace / join) preserves the relative offset, so
the column-0 lines always end up below the block's first-line indent
and YAML terminates the literal block early.
The cleanest fix is a refactor: move the containerd setup snippet out
of the inline heredoc into a cloud-init `write_files` block (script
file delivered to the VM, then `bash /path/to/script.sh` in runcmd).
That bypasses the multi-line YAML interpolation entirely.
Reverting to the previous (also-broken) interpolation pattern with a
big WARNING comment instead. New k8s nodes still need manual backfill
after first boot — node4 was backfilled today; see memory id=2767/2772
for the backfill steps. Tracked separately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs found while rebuilding k8s-node4 (2026-05-26):
1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
interpolated a multi-line heredoc as bare list-item content. The
trailing lines lost their list-item prefix, breaking cloud-config
parsing. Cloud-init silently fell back to the minimal default
(hostname + package_upgrade only) — kubeadm join, containerd config,
kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
in `cloud-init status`.
Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.
2. **containerd v2 single-quote mismatch**: `containerd config default`
in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double).
The sed pattern matched only double quotes → silent no-op on fresh
containerd 2.x nodes → registry-mirror hosts.toml ignored → all image
pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop.
Fix: match any value with `config_path = .*`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
required_providers. Other stacks just don't instantiate a provider
block — harmless. Replaces the same-name override trick the infra
stack used to do, which stopped working under Terragrunt v0.77
("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
writes proxmox_provider.tf with the provider config; credentials
read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
= uncapped), wired into scsi0/scsi1 disk{} blocks as
mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.
WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
provider mangled proxmox-csi PVC slots on apply for vmid 202 and
203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
loss, K8s CSI reconciled PVC attachments within minutes. Removed
the 7 imports from state via `terraform state rm` and re-encrypted.
Tracked in beads code-xzbl: blocked on bpg/proxmox provider
migration (telmate has the same dynamic-disk defect that bit us on
iSCSI back in 2026-04-02; see memory id=539).
LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC):
102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60
201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120
204 k8s-node4 150/120 220 docker-registry 40/40
(pfSense 101 BSD + Windows10 300 intentionally out of scope.)
PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
ever imported into TF state — confirmed against the SOPS-encrypted
state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
This commit leaves both declarations in place but does NOT import
them; that's part of the code-xzbl follow-up.
Closes: code-s9xr
The proxmox-csi-plugin hardcodes a 29-disks-per-VM ceiling in
pkg/csi/utils.go:394 (lun < 30 loop). This is the actual block-
storage scaling bottleneck — NOT QEMU, NOT Proxmox, NOT the kernel.
Adds a "Per-VM SCSI-LUN cap" section to docs/architecture/storage.md
explaining:
- the source-level hardcode and how to recognise it (FailedAttachVolume
"no free lun found")
- why switching scsihw to virtio-scsi-single buys ZERO additional
capacity (perf-only)
- levers in leverage-per-effort order (migrate non-DB to NFS,
add a worker, fork+patch the plugin)
- the Wave 1 NFS migration (2026-05-26) that took 5 services off
block and skipped two more on pre-flight (plotting-book SQLite+WAL,
stirling-pdf H2 .mv.db)
Discovered during the Wave 1 work — see remote memory ids 2788+ for
full context and 2798+ for the related postiz state-drift discovery.
Wave 1 LUN-cap relief. The PVC stores 5 small JSON state files
(health_state, schedule, scraped_links, sessions, streams) and a
lost+found — total 30KB, no DB, regenerable from upstream APIs.
Standard scale-to-0 → rsync → swap pattern (deployment was at
replicas=1). Pod came back up on k8s-node4 (now Ready again).
Net: -1 SCSI LUN on k8s-node1 (was the previous host).
Vaultwarden + 18 pods got stuck for 7h on 2026-05-26 when k8s-node4 went
down: surviving workloads piled onto node1 and hit the
csi.proxmox.sinextra.dev/max-volume-attachments=28 cap. The Proxmox VM also
had 5 stale scsi entries (PVCs long-migrated to other nodes but never
removed from VM config), which bypassed the K8s scheduler safety until the
plugin returned 'no free lun found' at attach time.
Three new alerts on the kube_volumeattachment_info count per node:
- warning at 24/28 (>= 85%), 10m
- critical at 27/28 (1 slot left), 3m
- critical at 28/28 (cap reached), 1m
Also whitelisted kube_volumeattachment_info — the metric was being dropped
by the disk-write-reduction filter (id=559) and the alert queries returned
zero series until it's kept.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1 LUN-cap relief. OnlyOffice document server keeps only 2 WOPI
key files + a .private dir on the PVC (~24K) — the real DB lives in
its external Postgres + Redis stack, not on this PVC. Service is at
replicas=0 (IO-storm temp scaledown — TEMP-SCALEDOWN comment
preserved).
Migration trivia: scheduler tried to put the rsync helper on
k8s-node4 (PVC's last-known location) but node4 had just come back
online and its proxmox-csi/nfs-csi node pods were still in
ContainerCreating — failed. Retried pinned to k8s-node2 via
nodeSelector; rsync template updated to take an optional node arg.
Net: -1 SCSI LUN once onlyoffice is brought back up.
Wave 1 LUN-cap relief. Whisper PVC holds Piper TTS .onnx voice
model + a HuggingFace faster-whisper-small-int8 model cache —
read-mostly model artefacts, no DB, 303M total. Both whisper and
piper deployments are at replicas=0 (GPU-node memory pressure,
unrelated).
Switched access_modes to ReadWriteMany since both whisper + piper
deployments reference the same PVC; on proxmox-lvm RWO they could
only colocate on the same node when both come back.
Net: -1 SCSI LUN once these are brought back up.
Wave 1 LUN-cap relief. Reactive Resume stores user-uploaded PDFs +
3 .txt counters under uploads/ and statistics/ — no embedded DB,
112K of data. Service is at replicas=0 (browserless OOM scaledown,
unrelated to this work) so the migration was no-downtime.
Net: -1 SCSI LUN once resume is brought back up.
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved
non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it
than the rest of the workers (incl. node1 GPU node) currently have free.
If that node went down, its pods would not fit elsewhere and would stay
Pending — exactly what happened today (2026-05-26) with node4 NotReady:
4 kyverno pods + woodpecker PVCs + several deployments stuck Pending
because node2/node3 were at 99% memory-request saturation.
Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0))
over Ready workers. node1 included on the right because its taint is
PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.
Currently fires with a 33.96 GiB shortage. Remediation: right-size top
reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus
4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1 of the per-VM SCSI-LUN cap relief. The proxmox-csi-plugin
hardcodes a `lun < 30` loop (pkg/csi/utils.go:394) — cap is 29
attachable PVCs per K8s node VM, and k8s-node1 was sitting at 29
with 4 stuck `no free lun found` PVCs queued behind it.
Excalidraw stores per-user .excalidraw scene files (no SQLite,
no embedded DB) — confirmed safe on NFS. 1.5 MiB of data,
4 active scenes. Migration:
- Add nfs_volume module → apply
- Scale to 0, rsync helper, swap claim_name → apply
- Remove old proxmox-lvm PVC → apply
Net: -1 SCSI LUN on k8s-node2.
Refs: docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
(separate concern; this is for the upstream LUN-cap pressure).
The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto
the alloy container. The block under `controller:` was silently dropped, so
the container ran with `resources: {}` and inherited the Kyverno LimitRange
`tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The
cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache
thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the
Proxmox host and rippling out to all VMs + NFS.
Fix:
- Move resources to `alloy.resources` (correct chart key).
- Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99%
memory-request saturation cluster-wide; a 1Gi request blocks
scheduling on node2/node3.
- Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS
rolling update fits inside the helm timeout.
- Bump helm_release.alloy.timeout to 900s (default 300s was too short
with occasional runc-stuck-Terminating on k8s-master).
Verified: all 4 alloy pods now show 1Gi/512Mi at the container level;
helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well
under the new limit).
Memory ID 2726.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):
1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
for nvidia-operator-validator to shut down. The validator's
driver-validation init container was in an infinite poll loop checking
for /run/nvidia/validations/.driver-ctr-ready (which only appears after
a successful driver install). The validator pod had deletionTimestamp
set but its container remained in Terminating state indefinitely.
Fix: force-delete the stuck Terminating validator pod to break the
deadlock (kubectl delete --force --grace-period=0).
2. **Startup probe timeout**: Full driver install on this hardware
(apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
exactly exhausted the default 120×10s=20min startup probe window,
causing SIGKILL (exit 137) at exactly 21 minutes even when the install
was succeeding. Extended failureThreshold 120→300 (50min headroom).
Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.
Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The frontend already routes every m3u8 URL through `getProxyUrl` →
`/proxy?url=…` so CORS-restricted hosts work for users. The verifier
was the odd one out: it loaded m3u8 URLs directly into hls.js inside a
`data:` URL test page, which has Origin `null`. Hosts like
`oe1.ossfeed.store` (pitsport's playlist CDN) only set ACAO when the
request's Origin is `https://pushembdz.store`, so hls.js got an instant
`fatal_network_error` and every pitsport stream was marked dead even
though they play fine for real users.
Wrap the m3u8 URL the same way the verifier already wraps embed URLs:
`{PROXY_BASE}/proxy?url=<b64>`. Stays same-origin for hls.js, gets
ACAO:* from our proxy, and the rewritten variants are also proxy-wrapped
so subsequent fetches stay clean.
For sites whose CDN serves any IP without Origin tricks (stremio,
dd12), this is transparent — proxy just forwards.
Side effect: every verified m3u8 hits our proxy once during extraction.
Cheap (1 cluster-internal request + 1 upstream HEAD/GET) and only
during the 5/30-min extraction cycle.
Mid-flight stability changes from the 2026-05-24 Anca-elements import
that surfaced multiple latent issues under sustained load:
- `immich-postgresql` memory 3Gi → 5Gi. The original limit OOM-killed
PG once the bulk insert + vector embeddings drove buffer pressure
past 3 GiB. 5 GiB gives ~60% headroom over the observed steady
state during ongoing imports.
- `immich-server` startup probe `failure_threshold` 30 → 360 (5min →
1h). After any PG restart, immich-server reindexes `clip_index` +
`face_index` (147k + 185k rows at the time of incident) before
binding the API port. The old 5min budget was too tight, so each
PG bounce trapped immich-server in a startup crashloop until the
reindex was killed. 1h gives generous headroom.
- `kubernetes_job_v1.anca_elements_import.backoff_limit` 2 → 20 and
`--concurrent-tasks` 8 → 20 on the immich-go upload. Short
cluster blips (PG restart, KCM lease loss) were exhausting the
Job's 3-attempt budget. 20 attempts + 20 parallel hashers makes
dedup-on-resume ~2.5x faster and tolerates a much rougher cluster.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On 2026-05-24T15:35:37Z Keel's force-policy rewrote the image tag from
`11.0.14 → 1.18` (codeberg.org/forgejo/forgejo). v1.18 is a Gitea-era
Forgejo (Forgejo forked from Gitea at 1.18 and used pre-Forgejo
versioning early on); the DB had already been migrated to schema 305
by 11.0.14, and 1.18 only knows up to migration 231 → pod refused to
start ("Your database (migration version: 305) is for a newer Gitea,
you can not use the newer database for this old Gitea release (231)").
Exact replay of the 2026-05-16 force-policy tag-rewriting bug
(memory id=1933).
Changes:
- Pin image to explicit `:11.0.14` (latest 11.x, published 2026-05-12)
- Add `keel.sh/policy: "never"` deploy annotation — overrides the
Kyverno-stamped `force` policy via the chart's `+()` anchor semantics
(memory id=1972). Keel will no longer touch this workload.
- Drop KEEL_IGNORE_IMAGE from `lifecycle.ignore_changes` (TF owns the
image now). Restore it if you flip Keel back to `force`.
- Add the KEEL_LIFECYCLE_V1 trio (`kubernetes.io/change-cause`,
`deployment.kubernetes.io/revision`, `keel.sh/update-time` on the
pod template) so future TF applies don't fight K8s rollout metadata.
Verified: new pod on v11.0.14 came up Running 1/1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- aceztrims: scrape /f11/ (the actual stream page), not /f1/ (the
cross-sport schedule). Drop the dead /iframe1?s= + onclick m3u8
regexes (site moved to `getElementById('iframe').src = '...'` ~20
channels ago). Strip HTML comments first so the ~20 legacy buttons
kept inside <!-- ... --> stop showing up as false positives.
Also pick up the default inline <iframe id='iframe' src='...'>.
Local run: 11 channels (was 0).
- pitsport: decode the RSC payload before regex-matching in
_parse_live_events (raw HTML had it escape-encoded, so the homepage
card path was silently 0). Add the new /live-now route (canonical
what's-live-right-now list). Add "f1" to MOTORSPORT_CATEGORIES — the
site labels Formula 1 events as just "F1". Refresh the stale
serveplay.site docstring (host rotates; pushembdz's api/stream link
is authoritative).
Local run: 7 m3u8 streams covering Canadian GP (EN1/EN2/MULTI/ITA/ESP)
+ NASCAR Coke 600 (was 0).
- ppv: always emit the parent embed alongside substreams (was dropping
it whenever substreams existed). Prefer source_tag in substream titles
so users see "Sky Sport 1 NZ" / "Apple TV (F1TV)" instead of generic
#1/#2 suffixes.
Diagnosed against the live cluster (curated + 7 other extractors
returning 0 cached streams, only 2 dead hmembeds curated 24/7 channels
visible to users). Each fix verified with the extractor run against
live sites this turn.