The previous indent(6, containerd_config_update_command) attempt didn't
fix the YAML parse error — the heredoc in stacks/infra/main.tf has
mixed indentation (most lines at col 2, inner shell heredoc bodies
like CONTAINERD_GC and KUBELET_PATCH at col 0). Any uniform-prefix
function (indent / replace / join) preserves the relative offset, so
the column-0 lines always end up below the block's first-line indent
and YAML terminates the literal block early.
The cleanest fix is a refactor: move the containerd setup snippet out
of the inline heredoc into a cloud-init `write_files` block (script
file delivered to the VM, then `bash /path/to/script.sh` in runcmd).
That bypasses the multi-line YAML interpolation entirely.
Reverting to the previous (also-broken) interpolation pattern with a
big WARNING comment instead. New k8s nodes still need manual backfill
after first boot — node4 was backfilled today; see memory id=2767/2772
for the backfill steps. Tracked separately.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs found while rebuilding k8s-node4 (2026-05-26):
1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
interpolated a multi-line heredoc as bare list-item content. The
trailing lines lost their list-item prefix, breaking cloud-config
parsing. Cloud-init silently fell back to the minimal default
(hostname + package_upgrade only) — kubeadm join, containerd config,
kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
in `cloud-init status`.
Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.
2. **containerd v2 single-quote mismatch**: `containerd config default`
in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double).
The sed pattern matched only double quotes → silent no-op on fresh
containerd 2.x nodes → registry-mirror hosts.toml ignored → all image
pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop.
Fix: match any value with `config_path = .*`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:
- Allowed-Origins limited to security/updates pockets
- Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
hold on the cluster-critical components)
- Automatic-Reboot disabled — kured drives the actual reboots
- Compatible with the existing kured + sentinel-gate flow
kured side:
- rebootDelay 30s, concurrency 1
- Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
window from the post-mortem)
- prometheusUrl + alertFilterRegexp wired so any firing non-ignored
alert halts the rollout. Ignore-list excludes self-referential
alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
InfoInhibitor) that would otherwise deadlock kured.
Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
- Refine `KubeQuotaAlmostFull` to include the resourcequota label in
both the on-clause and the summary, so multi-quota namespaces
(authentik, beads-server, frigate) report the quota name correctly.
grafana.tf: terraform fmt whitespace only.
Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.