infra

Author	SHA1	Message	Date
Viktor Barzin	5cc91e67bf	cloud-init: refactor to write_files for multi-line containerd setup Moves the containerd_config_update_command interpolation out of the runcmd list and into a write_files block delivering /usr/local/bin/k8s-node-containerd-setup.sh. runcmd then just calls the script. Why: the heredoc in stacks/infra/main.tf has mixed-indent inner shell heredocs (CONTAINERD_GC, KUBELET_PATCH bodies at col 0, surrounding text at col 2). When inserted into a `runcmd: - $${var}` item, even wrapped in a `- |` literal block, YAML's block-indent rule terminates the block early on the col-0 lines. The result is a silent cloud-init parse failure on every new k8s node (observed 2026-05-26 during node4 rebuild: node booted into the minimal default config, no kubeadm join, no containerd tuning, no kubelet shutdown grace). write_files writes the multi-line content into a YAML literal block where the script body is just opaque text: the block's content indent is set by the `content: |` block's own indentation (col 6) and any indent >= 6 is valid content. Any further indent inside the script (like the col-0 `[plugins...]` heredoc lines now at col 6 via indent(6, ...)) is preserved cleanly. Verified: `yaml.safe_load()` on the rendered snippet now reports `runcmd=36 write_files=1` (was throwing ParserError before). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 08:30:53 +00:00
Viktor Barzin	146dc143c6	cloud-init: revert indent(6) wrap; document the YAML interpolation bug The previous indent(6, containerd_config_update_command) attempt didn't fix the YAML parse error — the heredoc in stacks/infra/main.tf has mixed indentation (most lines at col 2, inner shell heredoc bodies like CONTAINERD_GC and KUBELET_PATCH at col 0). Any uniform-prefix function (indent / replace / join) preserves the relative offset, so the column-0 lines always end up below the block's first-line indent and YAML terminates the literal block early. The cleanest fix is a refactor: move the containerd setup snippet out of the inline heredoc into a cloud-init `write_files` block (script file delivered to the VM, then `bash /path/to/script.sh` in runcmd). That bypasses the multi-line YAML interpolation entirely. Reverting to the previous (also-broken) interpolation pattern with a big WARNING comment instead. New k8s nodes still need manual backfill after first boot — node4 was backfilled today; see memory id=2767/2772 for the backfill steps. Tracked separately. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 07:11:20 +00:00
Viktor Barzin	9b75b2817b	cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes) Two bugs found while rebuilding k8s-node4 (2026-05-26): 1. runcmd YAML breakage: `- $${containerd_config_update_command}` interpolated a multi-line heredoc as bare list-item content. The trailing lines lost their list-item prefix, breaking cloud-config parsing. Cloud-init silently fell back to the minimal default (hostname + package_upgrade only): kubeadm join, containerd config, kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible in `cloud-init status`. Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`. 2. containerd v2 single-quote mismatch: `containerd config default` in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double). The sed pattern matched only double quotes → silent no-op on fresh containerd 2.x nodes → registry-mirror hosts.toml ignored → all image pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop. Fix: match any value with `config_path = .*`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 07:06:50 +00:00
Viktor Barzin	c0991f7f8f	infra: re-enable unattended-upgrades with kured prometheus-gating Reverses the March 2026 outage mitigation that disabled unattended-upgrades cluster-wide. Now re-enables it on the k8s template VM with: - Allowed-Origins limited to security/updates pockets - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark hold on the cluster-critical components) - Automatic-Reboot disabled; kured drives the actual reboots - Compatible with the existing kured + sentinel-gate flow kured side: - rebootDelay 30s, concurrency 1 - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak window from the post-mortem) - prometheusUrl + alertFilterRegexp wired so any firing non-ignored alert halts the rollout. Ignore-list excludes self-referential alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/InfoInhibitor) that would otherwise deadlock kured. Prometheus side (already partly landed in `6c4e0966`, the "Upgrade Gates" rule group): - Refine `KubeQuotaAlmostFull` to include the resourcequota label in both the on-clause and the summary, so multi-quota namespaces (authentik, beads-server, frigate) report the quota name correctly. grafana.tf: terraform fmt whitespace only. Together with the post-mortem 2026-03-22 (memory id=390) the loop is closed: unattended-upgrades runs again, kernel-class updates can land, but only when cluster health is green and the reboot window is open.	2026-05-10 17:07:32 +00:00
Viktor Barzin	6101fb99f9	Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip] - Prometheus: persist metric whitelist (keep rules) to Helm template, preventing regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w. - MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0, doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners. - etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency. - VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module. - Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress). - Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3. - Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 19:01:21 +00:00
Viktor Barzin	c2f9ca0d13	modules: improve create-vm with additional config options and cloud-init updates	2026-04-06 11:57:55 +03:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 */6 * * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	67d1ce453c	add /sentinel dir to cloud-init for kured reboot gating The kured sentinel gate DaemonSet requires /sentinel to exist on all nodes. Without it, kured pods get stuck in ContainerCreating with hostPath mount failure. Previously created manually; now provisioned automatically for new nodes.	2026-03-19 19:57:27 +00:00
Viktor Barzin	c034adab5f	mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief	2026-03-15 17:23:39 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	946b5b1745	[ci skip] add qemu-guest-agent to VM templates and enable agent by default	2026-03-01 01:58:46 +00:00
Viktor Barzin	3b7d295119	add nginx reverse proxy to serialize registry requests for the same path to avoid race conditions [ci skip]	2025-12-29 20:16:13 +00:00
Viktor Barzin	45e74bedc6	update vm creation templates [ci skip]	2025-12-14 09:50:15 +00:00
Viktor Barzin	b15246a2cb	add docker registry vm and allow multiple provisioning cmds in templates [ci skip]	2025-10-12 18:54:29 +00:00
Viktor Barzin	1968f353a2	add module to create a k8s worker [ci skip]	2025-10-11 20:40:34 +00:00
Viktor Barzin	51a94faff4	add template vm in proxmox [ci skip]	2025-10-11 17:07:47 +00:00
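
The indentation problem described in the 146dc143c6 and 5cc91e67bf messages can be illustrated with a small Python sketch. It is an illustration only: SNIPPET is a hypothetical stand-in for the real containerd_config_update_command heredoc (which is not shown in this log), and the helper names are made up for the example. A YAML literal block takes its content indent from its first non-empty line, so any later line with less indent ends the block. The sketch shows that a uniform prefix shifts every line by the same amount and therefore leaves the col-0 lines below the first line's indent.

```python
import textwrap

# Hypothetical stand-in for the mixed-indent heredoc: outer shell at col 2,
# inner heredoc body and terminator at col 0.
SNIPPET = (
    "  cat > /tmp/example.toml <<EOF\n"
    "[plugins]\n"
    "EOF\n"
    "  systemctl restart containerd\n"
)


def leading_spaces(line: str) -> int:
    """Number of leading space characters on a line."""
    return len(line) - len(line.lstrip(" "))


def lines_below_first_indent(text: str) -> list[str]:
    """Non-empty lines indented less than the first non-empty line.

    In a YAML literal block whose indent is auto-detected from the first
    line, each of these lines would end the block early.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    first = leading_spaces(lines[0])
    return [line for line in lines if leading_spaces(line) < first]


# Uniform prefix, comparable in effect to Terraform's indent(6, ...).
indented = textwrap.indent(SNIPPET, " " * 6)

print(lines_below_first_indent(SNIPPET))
print(lines_below_first_indent(indented))
```

Both calls report the same two offending lines, before and after the uniform indent, which matches the commit's observation that no uniform-prefix function can repair the relative offset.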

16 commits
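
The containerd quote mismatch from 9b75b2817b can be sketched in Python. This is not the repository's sed command, which is not shown in this log. MIRROR_DIR is a hypothetical target value, and the "strict" function only assumes what the old pattern looked like (a match on the double-quoted empty value). The second function mirrors the fix named in the commit, matching any value with `config_path = .*`.

```python
import re

# Hypothetical replacement value; the commit message does not state the real path.
MIRROR_DIR = "/etc/containerd/certs.d"

# Default lines as described in the commit: v1 writes "", v2 writes ''.
V1_DEFAULT = '      config_path = ""\n'
V2_DEFAULT = "      config_path = ''\n"


def set_config_path_strict(config: str) -> str:
    """Assumed old behaviour: only the double-quoted empty value matches."""
    return config.replace('config_path = ""', f'config_path = "{MIRROR_DIR}"')


def set_config_path_any(config: str) -> str:
    """Behaviour after the fix: replace whatever value follows `config_path = `."""
    return re.sub(r"config_path = .*", f'config_path = "{MIRROR_DIR}"', config)


for name, text in (("v1", V1_DEFAULT), ("v2", V2_DEFAULT)):
    strict_changed = set_config_path_strict(text) != text
    any_changed = set_config_path_any(text) != text
    print(name, "strict changed:", strict_changed, "any changed:", any_changed)
```

On the v2-style input the strict variant leaves the text unchanged, which corresponds to the silent no-op the commit describes. The `.*` variant rewrites both styles.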