Compare commits

43 commits

Author SHA1 Message Date
Viktor Barzin
1eee56d0ba redis: tolerate up to 1KB of AOF tail corruption on load
Post-2026-05-26 unclean node2 reboot left redis-v2-2's incremental AOF
truncated at offset 84799139. With aof-load-corrupt-tail-max-size at its
default 0, redis refuses to load any corruption and crashloops. Setting
1024 lets it truncate the corrupted tail and continue, which is the
right call for a non-source-of-truth cache fronted by sentinel.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:48:58 +00:00
Viktor Barzin
60b2b1cdfc cluster-health: emergency-stop Keel + roll back image downgrades + quota raises
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).
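
The downgrades listed above share one shape: the rewritten tag sorts below the tag that was live. A minimal Python sketch of a guard that would flag them, assuming purely numeric dotted tags; `parse_numeric_tag` and `is_downgrade` are hypothetical helper names and are not part of Keel or Kyverno:

```python
import re
from typing import Optional, Tuple


def parse_numeric_tag(tag: str) -> Optional[Tuple[int, ...]]:
    """Parse tags like '2', '3.2.1' or 'v1.18' into a tuple of ints.

    Returns None for anything else (e.g. 'latest' or a commit-hash tag),
    because such tags have no meaningful version ordering.
    """
    if not re.fullmatch(r"v?\d+(\.\d+)*", tag):
        return None
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def is_downgrade(current: str, proposed: str) -> Optional[bool]:
    """True if `proposed` sorts below `current`; None if either is unparseable."""
    cur, new = parse_numeric_tag(current), parse_numeric_tag(proposed)
    if cur is None or new is None:
        return None
    # Tuple comparison also treats '3.2' as lower than '3.2.1' (truncation).
    return new < cur
```

Returning None for hash-style tags such as `71b32438` keeps the check from guessing an order where none exists; a caller would have to decide separately how to treat those.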

Changes:

* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
  0/0. Keep off until either match-tag is root-caused or every enrolled
  workload migrates to a content-addressed (SHA) pin.

* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
  bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
  deployment label (matches Kyverno's exclude rule so the inject-keel-
  annotations ClusterPolicy stops mutating) AND the annotation (so Keel
  itself respects it). Removed keel.sh/policy from lifecycle.ignore_changes
  so TF owns it as `never` and can't drift back to `force`.

* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
  on both seed-config and workbench containers (was :latest, Keel rolled
  to :0.1.0).

* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
  by Keel from the prior live :3.2.1).

* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
  grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
  per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
  DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
  100% and blocked every new pod create with FailedCreate. Raising the cap
  unblocked the four affected DaemonSets in one shot.

* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
  32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
  face-detection burst behaviour.

* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
  updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
  (matches the 21 other stacks that already declare it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:48:50 +00:00
Viktor Barzin
41fb7c4a76 backup pipeline: prune sda-bypass list to immich-only
Previously /srv/nfs/{ollama,audiblez,ebook2audiobook,*-backup} took
the sdc → Synology direct leg. They now ride sdc → sda → Synology
pve-backup/ via nfs-mirror like every other NFS subtree, so sda
becomes the single canonical mirror and Synology only has to ingest
one feed for the bulk of cluster state.

frigate + temp dropped from BOTH legs (no backup anywhere) per
explicit user ask — frigate is a 14d camera ring, temp is scratch.
prometheus/loki/alertmanager dropped as no-op (orphan dirs that
no longer exist on /srv/nfs).

Also: nfs-mirror's manifest collection switched from find -newer
(mtime) to find -cnewer (ctime) — rsync -t preserves source mtime
on dest, so freshly-written files looked "older than $STAMP" and
the 2026-05-26 full mirror run captured only 2 of 800k transferred
files. Hit during this session, recovered via .force-full-sync.
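
The mtime/ctime difference can be reproduced in a few lines of Python. This is a sketch with made-up file names and ages, not the nfs-mirror script itself: `os.utime` stands in for what rsync -t does to the destination file, and the two comparisons mirror what `find -newer` and `find -cnewer` test against the stamp file's mtime.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    stamp = os.path.join(tmp, "stamp")
    mirrored = os.path.join(tmp, "mirrored-file")
    now = time.time()

    # Stamp file left by the previous run, last modified one minute ago.
    open(stamp, "w").close()
    os.utime(stamp, (now - 60, now - 60))

    # A file written to the mirror just now, with its mtime set back to the
    # source's (one hour old), as rsync -t does on the destination.
    with open(mirrored, "w") as fh:
        fh.write("payload")
    os.utime(mirrored, (now - 3600, now - 3600))

    stamp_mtime = os.stat(stamp).st_mtime
    info = os.stat(mirrored)

    seen_by_newer = info.st_mtime > stamp_mtime   # what find -newer tests
    seen_by_cnewer = info.st_ctime > stamp_mtime  # what find -cnewer tests

print("selected by mtime:", seen_by_newer)
print("selected by ctime:", seen_by_cnewer)
```

Setting the mtime backwards does not move the ctime backwards, which is why the ctime comparison still picks up the freshly written file.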

Operational result post-rollout:
- sda 87% → 70% (anca-elements 423G deleted, +260G new dirs)
- /Viki/nfs/ on Synology: was 24 stale dirs (~430G), now immich only
- Synology free: ~300G → ~430G+ once btrfs reclaim catches up

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 18:22:01 +00:00
Viktor Barzin
b3dcccfc41 vaultwarden: track :latest tag for Keel auto-upgrade (was 1.35.7)
Earlier today Keel's hourly poll caught vaultwarden's deployment in a
window where the `keel.sh/match-tag` annotation wasn't set, fell into
'watch repository tags' mode, and rewrote 1.35.7 -> 1.21.0. Vaultwarden
1.21.0 doesn't have the API endpoints the modern Bitwarden clients call
(/identity/accounts/prelogin/password, /api/devices/knowndevice,
/api/config), so the Chrome extension started 404-ing on login.

Same race shape as the 2026-05-17 authentik/pgbouncer incident. The
fundamental issue: `policy: force` on a semver-pinned tag is unsafe
because Keel happily rewrites the tag string if it can't find a stable
'current tag' to digest-watch.

Fix: switch to `:latest` (the mutable tag vaultwarden publishes for the
newest stable release). Keel now digest-watches `:latest` (safe mode)
and rolls forward on each upstream release. Matches cluster convention
(128 other Keel-managed workloads use the same `:latest` + force +
match-tag pattern).

Also added imagePullPolicy=Always (required with :latest so the kubelet
revalidates the manifest on each rollout instead of using a cached
image), and extended lifecycle.ignore_changes to cover the
match-tag annotation and kubernetes.io/change-cause (Keel rewrites
this on every rollout).

Current `:latest` digest -> vaultwarden 1.36.0 (released 2026-05-03).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:26:36 +00:00
Viktor Barzin
8ed427a7e4 cloud-init: hands-off k8s worker provisioning + 5 bug fixes
Goal: re-clone the worker template, boot, and have it appear as `kubectl
get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker
NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five
distinct ways on a clean boot.

Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today):

1. `indent(6, containerd_config_update_command)` indented the bodies of
   `cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so
   [plugins.*] TOML sections landed in /etc/containerd/config.toml at
   col 6 — containerd refused to parse them. Source is now a normal
   .sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`)
   base64-embedded into `write_files`; YAML whitespace never touches
   the heredoc bodies.

2. The same script tried to `cat >> /etc/containerd/config.toml`
   `[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd
   v2.2.4's `config default` ALREADY emits. Result: `toml: table …
   already exists`. Patched with sed-in-place overrides instead.

3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from
   the containerd setup script — BEFORE `kubeadm join` writes that
   file. Sed aborted with "No such file or directory", `set -e` killed
   the script, post-script cloud-init steps kept going (cloud-init
   doesn't stop on runcmd failure). Split into a dedicated
   `k8s-node-post-join-tune.sh` invoked AFTER kubeadm join.

4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE
   kubeadm join. kubelet defaults to failSwapOn=true → exited 1
   immediately. Replaced the swap setup with `swapoff -a` (node4
   already runs this way and the cluster is fine).

5. Without `hostname:` in the shared user-data snippet, Proxmox's
   auto-generated meta-data does NOT include local-hostname when
   `cicustom user=…` is set — so cloud-init falls back to the cloud
   image's default `ubuntu` and `kubeadm join` registers the wrong
   node name. `provision-k8s-worker` now writes a per-node
   `<NAME>-meta.yaml` snippet and passes both via
   `cicustom user=…,meta=…`.
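
Why base64 embedding sidesteps bug 1 can be shown with a short Python sketch. The script text below is a hypothetical stand-in (not the real k8s-node-containerd-setup.sh): indenting the encoded blob for YAML leaves the decoded script byte-identical, while the same indent applied to the raw text pushes the heredoc body off column 0.

```python
import base64
import textwrap

# Hypothetical stand-in for the setup script: the heredoc body sits at col 0.
script = textwrap.dedent("""\
    #!/bin/bash
    cat >> /tmp/example.toml <<'EXAMPLE'
    [section]
      key = "value"
    EXAMPLE
""")

encoded = base64.b64encode(script.encode()).decode()

# Indent the encoded blob the way a YAML template would. Only the token's
# surroundings change, so the indentation never reaches the script.
embedded = textwrap.indent(encoded, " " * 6)
decoded = base64.b64decode(embedded.strip()).decode()

# The same indent applied to the raw script shifts the heredoc body to col 6.
indented_raw = textwrap.indent(script, " " * 6)

print(decoded == script)
print(indented_raw)
```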

Other improvements rolled in while fixing the above:

- `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`,
  added today) instead of `var.ssh_public_key`. The last
  `terragrunt apply` was run with that var empty, leaving the snippet's
  `ssh_authorized_keys` with a single blank entry; the wizard user
  was effectively locked out of every fresh node.
- `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf`
  with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it,
  systemd-resolved only consulted Technitium (link-level), which
  returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from
  the Forgejo registry then failed DNS until I patched it manually
  on node5.
- k8s apt repo bumped v1.32 → v1.34 (matches cluster).
- The containerd setup script now creates hosts.toml for forgejo,
  quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4
  had these added by hand post-bootstrap; now they're baked in.
- `config_path` sed matches both `""` (containerd v1) and `''`
  (containerd v2.x). Without the v2 match, the certs.d mirror dir was
  silently ignored.
- `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI
  topology labels (region/zone, max-volume-attachments=28) apply on
  next `tg apply`.
- `stacks/infra/main.tf` shed the 160-line inline containerd setup
  heredoc — that whole thing now lives in the module as a .sh file.

Known unsolved gaps (deferred):

- iscsid restart hangs ~90s on first boot before SIGKILL releases it
  (systemd-resolved restart kicks iscsid via dependency). Adds wall-
  clock time but doesn't block the join.
- `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi`
  afterward, so the CSI topology labels need a manual apply after
  the node joins. Solving cleanly needs the CSI map to derive from
  `kubectl get nodes` instead of a static local — separate work.
- `var.containerd_config_update_command` is now ignored when
  is_k8s_template=true (replaced by the bundled .sh file). Variable
  kept with a deprecation note to avoid breaking other call sites.

E2E proof: k8s-node6 (VMID 206) boots hands-off from
`provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as
`kubectl get nodes …Ready` ~7 min later (most of which is the apt
package_upgrade — separate optimization).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 11:52:00 +00:00
Viktor Barzin
e4c0cbc3d0 state(infra): update encrypted state 2026-05-26 11:48:55 +00:00
Viktor Barzin
311eb60c9c state(infra): update encrypted state 2026-05-26 11:31:03 +00:00
Viktor Barzin
3fdce1f5cb state(infra): update encrypted state 2026-05-26 11:20:12 +00:00
Viktor Barzin
3d226184c1 state(infra): update encrypted state 2026-05-26 11:11:16 +00:00
Viktor Barzin
b7e252ec99 state(infra): update encrypted state 2026-05-26 11:03:57 +00:00
Viktor Barzin
bb9d8f1b38 kyverno: GPU priority mutate uses add (was replace) — fixes silent skip
The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902
op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate)
have no priorityClassName field at all — replace requires the path to
exist, so the patch fails with "doc is missing key: /spec/priorityClassName"
and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier)
gets a chance to add the field.

Result: GPU pods never got priorityClassName set, sat at priority=0, and
could not preempt lower-tier pods on the GPU node. Observed today on
frigate post-node4-recovery — pod stayed Pending with "Preemption is
not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000)
occupied node1's memory budget.

Fix: op=add for all three paths. add works whether or not the key is
present, so the policy is robust to the upstream pod shape.
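
The difference between the two operations can be sketched in Python. This is a simplified model of JSON Patch semantics for object members only (no arrays, no `~` escapes), not Kyverno's implementation; `apply_op` and the `example-priority` value are illustrative names.

```python
def apply_op(doc: dict, op: str, path: str, value) -> dict:
    """Apply a JSON Patch 'add' or 'replace' to an object member at `path`."""
    if op not in ("add", "replace"):
        raise ValueError(f"unsupported op: {op}")
    *parents, key = path.strip("/").split("/")
    target = doc
    for name in parents:
        target = target[name]
    if op == "replace" and key not in target:
        # RFC 6902: 'replace' requires the target location to exist.
        raise KeyError(f"doc is missing key: {path}")
    # 'add' on an object member creates it, or overwrites an existing one.
    target[key] = value
    return doc


pod = {"spec": {"containers": []}}  # no priorityClassName, like the pods above

try:
    apply_op(pod, "replace", "/spec/priorityClassName", "example-priority")
    replace_failed = False
except KeyError:
    replace_failed = True

apply_op(pod, "add", "/spec/priorityClassName", "example-priority")
```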

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:04:51 +00:00
Viktor Barzin
12b4f6f81a dbaas: require pod anti-affinity on pg-cluster (one PG per node)
Default CNPG affinity was `preferred` (soft). During the 2026-05-26
node4 outage, all 3 pg-cluster pods drifted onto k8s-node1 — losing
that node would have taken the whole PG cluster down (no quorum) AND
the 9.2 GiB pg-cluster footprint was the dominant reason frigate
couldn't fit on the GPU node.

With 3 instances + 4 worker nodes, `required` is safe under 1-node
drain (3 distinct nodes always available, even excluding the drained
one).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 09:00:37 +00:00
Viktor Barzin
400ee88967 state(dbaas): update encrypted state 2026-05-26 08:59:40 +00:00
Viktor Barzin
c0618ae1ae docs(compute): mark all Linux VMs as hand-managed; document apply-mbps-caps timer
Reflects the 2026-05-26 decision (commit 44c3770a) to keep Linux VMs
out of Terraform — telmate/proxmox v3.0.2 mangles dynamically-attached
disks (id=539) and doesn't refresh mbps_*_concurrent back from live
state. What stays in TF: the cloud-init templates. Per-VM I/O caps
now driven by the apply-mbps-caps systemd timer (commit 56a338f8).

Replaces the stale note about iSCSI mangling — that rationale is
obsolete (iSCSI gone since 2026-04-11) and the new scope is
intentional, not provisional.
2026-05-26 08:38:00 +00:00
Viktor Barzin
5cc91e67bf cloud-init: refactor to write_files for multi-line containerd setup
Moves the containerd_config_update_command interpolation out of the
runcmd list and into a write_files block delivering
/usr/local/bin/k8s-node-containerd-setup.sh. runcmd then just calls
the script.

Why: the heredoc in stacks/infra/main.tf has mixed-indent inner shell
heredocs (CONTAINERD_GC, KUBELET_PATCH bodies at col 0, surrounding
text at col 2). When inserted into a `runcmd: - $${var}` item — even
wrapped in a `- |` literal block — YAML's block-indent rule
terminates the block early on the col-0 lines. The result is a silent
cloud-init parse failure on every new k8s node (observed 2026-05-26
during node4 rebuild — node booted into the minimal default config,
no kubeadm join, no containerd tuning, no kubelet shutdown grace).

write_files writes the multi-line content into a YAML literal block
where the script body is just opaque text — the block's content
indent is set by the `content: |` block's own indentation (col 6)
and any indent >= 6 is valid content. Any further indent inside the
script (like the col-0 `[plugins...]` heredoc lines now at col 6 via
indent(6, ...)) is preserved cleanly.

Verified: `yaml.safe_load()` on the rendered snippet now reports
`runcmd=36 write_files=1` (was throwing ParserError before).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 08:30:53 +00:00
Viktor Barzin
3382d19d25 state(infra): update encrypted state 2026-05-26 08:30:53 +00:00
root
daa41a2eb1 Woodpecker CI deploy [CI SKIP] 2026-05-26 08:29:09 +00:00
Viktor Barzin
00bbbe0838 url/shlink-web: containerPort 8080 -> 80
shlinkio/shlink-web-client:0.1.1 listens on port 80 (nginx default),
not 8080 like the prior :latest images. Keel auto-bumped the tag
on 2026-05-23; liveness/readiness probes have been failing ever since
because they still hit :8080. Pod was stuck restarting, the
DeploymentReplicasMismatch alert fired.

Aligns containerPort + both probes + service target_port with the image.
2026-05-26 08:19:24 +00:00
Viktor Barzin
56a338f80b scripts: hook apply-mbps-caps into the PVE host as a systemd timer
The qm-set I/O caps were previously only applied by manual one-shot
runs of apply-mbps-caps.sh, so any config drift (manual `qm set`,
config restored from /mnt/backup/pve-config like we did on 2026-05-26,
fresh VM clone) would leave the affected VM uncapped until someone
remembered to re-run the script.

Adds apply-mbps-caps.service (Type=oneshot) + apply-mbps-caps.timer
firing:
  - OnBootSec=5min        — catches PVE host reboots & restored configs
  - OnCalendar=hourly     — catches manual qm-set drift / fresh clones
  - Persistent=true       — runs missed schedule after PVE downtime
  - RandomizedDelaySec=2min

Same install pattern as the other PVE operational scripts (nfs-mirror,
daily-backup, offsite-sync-backup, lvm-pvc-snapshot — memory id=609 +
id=542). Source in this repo, deployed to /usr/local/bin +
/etc/systemd/system/ on the PVE host.

Script hardening: kept `set -uo pipefail` but dropped `-e` so one
missing VM doesn't abort the rest; each VM is gated on `qm status`
existence; added a fast-path "already at target" no-op log line for
quiet hourly runs.

Installed on PVE (192.168.1.127) and smoke-tested: caps for all 8 VMs
re-applied successfully, next run 12:00 EEST. Journal: `journalctl
-u apply-mbps-caps -f` on the PVE host.
2026-05-26 08:12:15 +00:00
Viktor Barzin
232409e798 scripts: per-VM I/O cap script — apply-mbps-caps.sh
Idempotent qm-set script for the per-VM I/O caps on the PVE host's sdc
thin pool (2026-05-26 session, beads code-9v2j). Caps protect each
Linux VM's share of sdc so a runaway workload (e.g. the 2026-05-23/26
alloy IO storm — memory id=2726) cannot saturate the disk for everyone.

Was sitting in /tmp on PVE — moving the source under version control
and installing to /usr/local/bin/ alongside the other PVE operational
scripts (nfs-mirror, daily-backup, offsite-sync-backup; pattern from
memory id=609). Survives PVE host reboots; safe to re-run on any node
rebuild to restore the caps.

VMIDs covered (Linux only — pfSense 101 and Windows10 300 skipped):
  102 devvm 60/60   103 home-assistant 40/40   200 k8s-master 100/60
  201 k8s-node1 150/120   202 k8s-node2 150/120   203 k8s-node3 150/120
  204 k8s-node4 150/120   220 docker-registry 40/40
2026-05-26 08:06:15 +00:00
Viktor Barzin
44c3770a5c infra: pull all VMs out of Terraform — telmate provider can't represent them safely
The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached
disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent
fields back from live state — every plan after a qm-set cap is applied
proposes to "fix" mbps 0 → N and the apply errors with the spurious
"the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes
does NOT block either failure mode.

Decision: stop trying to manage Linux VMs in this stack. The cloud-init
bootstrap stays in TF (via k8s-node-template, non-k8s-node-template,
docker-registry-template above), so a fresh node still clones the right
template and runs the same bootstrap. VM lifecycle stays in the Proxmox
UI. I/O caps are managed via qm-set on the PVE host (idempotent script
at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j).

Removed from TF state + HCL:
  - module "k8s-master"          (vmid 200)
  - module "k8s-node2"            (vmid 202) — pre-existing drift, never in state
  - module "docker-registry-vm"   (vmid 220) — was in state, hit refresh bug

Already hand-managed (never in HCL):
  - 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough),
    203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10.

Live I/O caps (qm set, all verified):
  102=60/60  103=40/40  200=100/60  201=150/120  202=150/120
  203=150/120  204=150/120  220=40/40

Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox
provider migration — telmate can't represent these VMs at all).

Closes: code-75ds
2026-05-26 07:12:46 +00:00
Viktor Barzin
8d495ab5da state(infra): update encrypted state 2026-05-26 07:11:54 +00:00
Viktor Barzin
90c1b476a1 state(infra): update encrypted state 2026-05-26 07:11:46 +00:00
Viktor Barzin
146dc143c6 cloud-init: revert indent(6) wrap; document the YAML interpolation bug
The previous indent(6, containerd_config_update_command) attempt didn't
fix the YAML parse error — the heredoc in stacks/infra/main.tf has
mixed indentation (most lines at col 2, inner shell heredoc bodies
like CONTAINERD_GC and KUBELET_PATCH at col 0). Any uniform-prefix
function (indent / replace / join) preserves the relative offset, so
the column-0 lines always end up below the block's first-line indent
and YAML terminates the literal block early.
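
The relative-offset argument can be checked with a small Python sketch. The snippet is a made-up miniature of the shape described (commands at col 2, heredoc body at col 0), and `textwrap.indent` stands in for Terraform's `indent()`:

```python
import textwrap

# Hypothetical snippet: commands at col 2, an inner heredoc body at col 0.
snippet = (
    "  cat >> /tmp/example.toml <<'EXAMPLE'\n"
    "[section]\n"
    "EXAMPLE\n"
    "  systemctl restart example\n"
)


def columns(text: str) -> list:
    """Leading-space count of each non-empty line."""
    return [
        len(line) - len(line.lstrip(" "))
        for line in text.splitlines()
        if line.strip()
    ]


indented = textwrap.indent(snippet, " " * 6)
cols = columns(indented)

# A YAML literal block takes its indentation from its first non-empty line;
# any later line indented less than that ends the block.
block_indent = cols[0]
breaks_block = [c < block_indent for c in cols]

print(cols, breaks_block)
```

Whatever prefix width is chosen, the col-0 lines stay two columns short of the first line, so the same lines are flagged every time.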

The cleanest fix is a refactor: move the containerd setup snippet out
of the inline heredoc into a cloud-init `write_files` block (script
file delivered to the VM, then `bash /path/to/script.sh` in runcmd).
That bypasses the multi-line YAML interpolation entirely.

Reverting to the previous (also-broken) interpolation pattern with a
big WARNING comment instead. New k8s nodes still need manual backfill
after first boot — node4 was backfilled today; see memory id=2767/2772
for the backfill steps. Tracked separately.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 07:11:20 +00:00
Viktor Barzin
321c073ca0 state(infra): update encrypted state 2026-05-26 07:09:52 +00:00
Viktor Barzin
5b7b962d7c state(infra): update encrypted state 2026-05-26 07:09:33 +00:00
Viktor Barzin
6a83cee6ae state(infra): update encrypted state 2026-05-26 07:07:06 +00:00
Viktor Barzin
9b75b2817b cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes)
Two bugs found while rebuilding k8s-node4 (2026-05-26):

1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
   interpolated a multi-line heredoc as bare list-item content. The
   trailing lines lost their list-item prefix, breaking cloud-config
   parsing. Cloud-init silently fell back to the minimal default
   (hostname + package_upgrade only) — kubeadm join, containerd config,
   kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
   in `cloud-init status`.

   Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.

2. **containerd v2 single-quote mismatch**: `containerd config default`
   in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double).
   The sed pattern matched only double quotes → silent no-op on fresh
   containerd 2.x nodes → registry-mirror hosts.toml ignored → all image
   pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop.

   Fix: match any value with `config_path = .*`.
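
The quoting mismatch in item 2 is easy to reproduce. A Python sketch using `re` in place of sed; the `/etc/containerd/certs.d` target is an assumption for illustration, and `set_config_path` is a hypothetical helper rather than the script's actual command:

```python
import re

CERTS_DIR = "/etc/containerd/certs.d"  # assumed target directory

# The old approach: only matches the double-quoted empty value.
DOUBLE_QUOTED_ONLY = re.compile(r'config_path = ""')


def set_config_path(toml_text: str) -> str:
    """Rewrite any existing config_path value, whatever its quoting."""
    return re.sub(
        r"^([ \t]*)config_path = .*$",
        rf'\g<1>config_path = "{CERTS_DIR}"',
        toml_text,
        flags=re.MULTILINE,
    )


v1_line = '      config_path = ""\n'
v2_line = "      config_path = ''\n"

print(DOUBLE_QUOTED_ONLY.search(v2_line))  # None: the silent no-op
print(set_config_path(v1_line), set_config_path(v2_line))
```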

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 07:06:50 +00:00
Viktor Barzin
445feb118f infra: per-VM I/O caps + terragrunt v0.77 plumbing + state recovery
WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
  required_providers. Other stacks just don't instantiate a provider
  block — harmless. Replaces the same-name override trick the infra
  stack used to do, which stopped working under Terragrunt v0.77
  ("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
  writes proxmox_provider.tf with the provider config; credentials
  read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
  = uncapped), wired into scsi0/scsi1 disk{} blocks as
  mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
  extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
  plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
  mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.

WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
  k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
  k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
  provider mangled proxmox-csi PVC slots on apply for vmid 202 and
  203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
  the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
  the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
  etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
  loss, K8s CSI reconciled PVC attachments within minutes. Removed
  the 7 imports from state via `terraform state rm` and re-encrypted.
  Tracked in beads code-xzbl: blocked on bpg/proxmox provider
  migration (telmate has the same dynamic-disk defect that bit us on
  iSCSI back in 2026-04-02; see memory id=539).

LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC):
  102 devvm 60/60   103 home-assistant 40/40   200 k8s-master 100/60
  201 k8s-node1 150/120   202 k8s-node2 150/120   203 k8s-node3 150/120
  204 k8s-node4 150/120   220 docker-registry 40/40
  (pfSense 101 BSD + Windows10 300 intentionally out of scope.)

PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
  ever imported into TF state — confirmed against the SOPS-encrypted
  state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
  This commit leaves both declarations in place but does NOT import
  them; that's part of the code-xzbl follow-up.

Closes: code-s9xr
2026-05-26 06:46:47 +00:00
Viktor Barzin
07bd2e0017 onlyoffice: restore replicas 0 → 1 post IO-storm recovery
Cluster is fully stable (all 5 nodes Ready, vaultwarden recovered,
node4 rebuilt 2026-05-26). Removing the TEMP-SCALEDOWN guard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 03:08:17 +00:00
Viktor Barzin
6e9bffb1a3 storage docs: document the per-VM SCSI-LUN cap (proxmox-csi)
The proxmox-csi-plugin hardcodes a 29-disks-per-VM ceiling in
pkg/csi/utils.go:394 (lun < 30 loop). This is the actual block-
storage scaling bottleneck — NOT QEMU, NOT Proxmox, NOT the kernel.
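
A rough Python model of how a `lun < 30` scan produces the ceiling. This is not the plugin's code: the function name and the starting index of 1 are assumptions, chosen only so that the scan yields the 29 slots described above.

```python
from typing import Optional, Set

MAX_LUN = 30  # mirrors the `lun < 30` bound quoted above


def first_free_lun(used: Set[int], start: int = 1) -> Optional[int]:
    """Return the lowest unused LUN in [start, MAX_LUN), or None when full.

    `start=1` is an assumption; the plugin's actual loop may differ.
    """
    for lun in range(start, MAX_LUN):
        if lun not in used:
            return lun
    return None


print(first_free_lun({1, 2, 3}))
print(first_free_lun(set(range(1, MAX_LUN))))
```

Under this model, a None result corresponds to the "no free lun found" condition: nothing about the VM or the disk controller has to be exhausted for the scan to come up empty.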

Adds a "Per-VM SCSI-LUN cap" section to docs/architecture/storage.md
explaining:
  - the source-level hardcode and how to recognise it (FailedAttachVolume
    "no free lun found")
  - why switching scsihw to virtio-scsi-single buys ZERO additional
    capacity (perf-only)
  - levers in leverage-per-effort order (migrate non-DB to NFS,
    add a worker, fork+patch the plugin)
  - the Wave 1 NFS migration (2026-05-26) that took 5 services off
    block and skipped two more on pre-flight (plotting-book SQLite+WAL,
    stirling-pdf H2 .mv.db)

Discovered during the Wave 1 work — see remote memory ids 2788+ for
full context and 2798+ for the related postiz state-drift discovery.
2026-05-26 02:56:27 +00:00
Viktor Barzin
7ad0e578ae f1-stream: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. The PVC stores 5 small JSON state files
(health_state, schedule, scraped_links, sessions, streams) and a
lost+found — total 30KB, no DB, regenerable from upstream APIs.

Standard scale-to-0 → rsync → swap pattern (deployment was at
replicas=1). Pod came back up on k8s-node4 (now Ready again).

Net: -1 SCSI LUN on k8s-node1 (was the previous host).
2026-05-26 02:49:43 +00:00
Viktor Barzin
aded77d5ab monitoring: alerts for proxmox-csi LUN saturation per node
Vaultwarden + 18 pods got stuck for 7h on 2026-05-26 when k8s-node4 went
down: surviving workloads piled onto node1 and hit the
csi.proxmox.sinextra.dev/max-volume-attachments=28 cap. The Proxmox VM also
had 5 stale scsi entries (PVCs long-migrated to other nodes but never
removed from VM config), which bypassed the K8s scheduler safety until the
plugin returned 'no free lun found' at attach time.

Three new alerts on the kube_volumeattachment_info count per node:
- warning at 24/28 (>= 85%), 10m
- critical at 27/28 (1 slot left), 3m
- critical at 28/28 (cap reached), 1m
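
The three thresholds can be modelled as a small lookup. This is a Python illustration of the tiers listed above, not the alert rules themselves; `lun_alert` is a hypothetical name.

```python
from typing import Optional, Tuple

LUN_CAP = 28  # max-volume-attachments value quoted above

# (minimum attachment count, severity, "for" duration), highest first.
THRESHOLDS = [
    (28, "critical", "1m"),
    (27, "critical", "3m"),
    (24, "warning", "10m"),
]


def lun_alert(attachments: int) -> Optional[Tuple[str, str]]:
    """Return (severity, duration) of the highest tier reached, else None."""
    for minimum, severity, duration in THRESHOLDS:
        if attachments >= minimum:
            return severity, duration
    return None


for count in (23, 24, 27, 28):
    print(count, lun_alert(count))
```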

Also whitelisted kube_volumeattachment_info — the metric was being dropped
by the disk-write-reduction filter (id=559), so the alert queries would
return zero series until it was kept.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:45:13 +00:00
Viktor Barzin
a0b5cbc922 onlyoffice: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. OnlyOffice document server keeps only 2 WOPI
key files + a .private dir on the PVC (~24K) — the real DB lives in
its external Postgres + Redis stack, not on this PVC. Service is at
replicas=0 (IO-storm temp scaledown — TEMP-SCALEDOWN comment
preserved).

Migration trivia: scheduler tried to put the rsync helper on
k8s-node4 (PVC's last-known location) but node4 had just come back
online and its proxmox-csi/nfs-csi node pods were still in
ContainerCreating — failed. Retried pinned to k8s-node2 via
nodeSelector; rsync template updated to take an optional node arg.

Net: -1 SCSI LUN once onlyoffice is brought back up.
2026-05-26 02:43:47 +00:00
Viktor Barzin
681f6daf10 whisper: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. Whisper PVC holds Piper TTS .onnx voice
model + a HuggingFace faster-whisper-small-int8 model cache —
read-mostly model artefacts, no DB, 303M total. Both whisper and
piper deployments are at replicas=0 (GPU-node memory pressure,
unrelated).

Switched access_modes to ReadWriteMany since both whisper + piper
deployments reference the same PVC; on proxmox-lvm RWO they could
only colocate on the same node when both come back.

Net: -1 SCSI LUN once these are brought back up.
2026-05-26 02:38:34 +00:00
Viktor Barzin
a2b410f6c9 resume: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. Reactive Resume stores user-uploaded PDFs +
3 .txt counters under uploads/ and statistics/ — no embedded DB,
112K of data. Service is at replicas=0 (browserless OOM scaledown,
unrelated to this work) so the migration was no-downtime.

Net: -1 SCSI LUN once resume is brought back up.
2026-05-26 02:36:20 +00:00
Viktor Barzin
cdbb418f45 monitoring: alert when cluster can't tolerate losing a non-GPU worker
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved
non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it
than the rest of the workers (incl. node1 GPU node) currently have free.
If that node went down, its pods would not fit elsewhere and would stay
Pending — exactly what happened today (2026-05-26) with node4 NotReady:
4 kyverno pods + woodpecker PVCs + several deployments stuck Pending
because node2/node3 were at 99% memory-request saturation.

Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0))
over Ready workers. node1 included on the right because its taint is
PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.
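
The inequality can be sketched in Python with invented numbers. Assumptions: A is taken to be allocatable memory and R requested memory, both in GiB, and the right-hand side sums over the workers other than the one being lost, following the "rest of the workers" wording above. `node_loss_shortage` is a hypothetical name, and the real check is an alert expression, not this function.

```python
from typing import Dict, Set, Tuple

# name -> (allocatable GiB, requested GiB); all figures are made up.
Workers = Dict[str, Tuple[float, float]]


def node_loss_shortage(workers: Workers, non_gpu: Set[str]) -> float:
    """GiB of requests that would not fit elsewhere; > 0 means the alert fires."""
    worst = max(non_gpu, key=lambda name: workers[name][1])
    needed = workers[worst][1]
    free_elsewhere = sum(
        max(alloc - req, 0.0)  # clamp_min(A - R, 0)
        for name, (alloc, req) in workers.items()
        if name != worst
    )
    return needed - free_elsewhere


ready = {
    "node1": (48.0, 40.0),  # GPU node, counted on the right-hand side
    "node2": (32.0, 31.0),
    "node3": (32.0, 30.0),
    "node4": (32.0, 28.0),
}
shortage = node_loss_shortage(ready, {"node2", "node3", "node4"})
print(shortage)
```

The clamp matters for overcommitted nodes: a node whose requests exceed its allocatable memory contributes zero free space rather than a negative amount.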

Currently fires with a 33.96 GiB shortage. Remediation: right-size top
reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus
4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:34:13 +00:00
Viktor Barzin
467fa1631d excalidraw: migrate PVC from proxmox-lvm to NFS
Wave 1 of the per-VM SCSI-LUN cap relief. The proxmox-csi-plugin
hardcodes a `lun < 30` loop (pkg/csi/utils.go:394) — cap is 29
attachable PVCs per K8s node VM, and k8s-node1 was sitting at 29
with 4 stuck `no free lun found` PVCs queued behind it.

Excalidraw stores per-user .excalidraw scene files (no SQLite,
no embedded DB) — confirmed safe on NFS. 1.5 MiB of data,
4 active scenes. Migration:
  - Add nfs_volume module → apply
  - Scale to 0, rsync helper, swap claim_name → apply
  - Remove old proxmox-lvm PVC → apply
Net: -1 SCSI LUN on k8s-node2.

Refs: docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
(separate concern; this is for the upstream LUN-cap pressure).
2026-05-26 02:33:41 +00:00
Viktor Barzin
16b3969ceb alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm
The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto
the alloy container. The block under `controller:` was silently dropped, so
the container ran with `resources: {}` and inherited the Kyverno LimitRange
`tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The
cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache
thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the
Proxmox host and rippling out to all VMs + NFS.

Fix:
- Move resources to `alloy.resources` (correct chart key).
- Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99%
  memory-request saturation cluster-wide, so a 1Gi request would block
  scheduling on node2/node3.
- Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS
  rolling update fits inside the helm timeout.
- Bump helm_release.alloy.timeout to 900s (default 300s was too short
  with occasional runc-stuck-Terminating on k8s-master).

Verified: all 4 alloy pods now show 1Gi/512Mi at the container level;
helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well
under the new limit).

Memory ID 2726.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:08:35 +00:00
Viktor Barzin
b9ac942647 nvidia: fix driver install deadlock + extend startup probe
Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):

1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
   for nvidia-operator-validator to shut down. The validator's
   driver-validation init container was in an infinite poll loop checking
   for /run/nvidia/validations/.driver-ctr-ready (which only appears after
   a successful driver install). The validator pod had deletionTimestamp
   set but its container remained in Terminating state indefinitely.
   Fix: force-delete the stuck Terminating validator pod to break the
   deadlock (kubectl delete --force --grace-period=0).

2. **Startup probe timeout**: Full driver install on this hardware
   (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
   exactly exhausted the default startup probe window (60s initial delay +
   120×10s = 21min), causing SIGKILL (exit 137) at 21 minutes even when the
   install was succeeding. Extended failureThreshold 120→300 (51min window).

Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.

Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:53:44 +00:00
Viktor Barzin
da33919368 f1-stream: verifier — wrap m3u8 fetches through /proxy
The frontend already routes every m3u8 URL through `getProxyUrl` →
`/proxy?url=…` so CORS-restricted hosts work for users. The verifier
was the odd one out: it loaded m3u8 URLs directly into hls.js inside a
`data:` URL test page, which has Origin `null`. Hosts like
`oe1.ossfeed.store` (pitsport's playlist CDN) only set ACAO when the
request's Origin is `https://pushembdz.store`, so hls.js got an instant
`fatal_network_error` and every pitsport stream was marked dead even
though they play fine for real users.

Wrap the m3u8 URL the same way the verifier already wraps embed URLs:
`{PROXY_BASE}/proxy?url=<b64>`. Stays same-origin for hls.js, gets
ACAO:* from our proxy, and the rewritten variants are also proxy-wrapped
so subsequent fetches stay clean.
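
The wrapping can be sketched roughly as follows. This is an illustration under assumptions, not the verifier's code: the `PROXY_BASE` value, both function names and the playlist path are hypothetical, and URL-safe base64 is assumed because the message only says `<b64>`:

```python
import base64
from urllib.parse import parse_qs, quote, urlparse

# Hypothetical value; the real PROXY_BASE is defined by the verifier.
PROXY_BASE = "http://proxy.example.invalid"


def wrap_through_proxy(url, proxy_base=PROXY_BASE):
    """Return {proxy_base}/proxy?url=<b64> for an upstream URL."""
    # Assumption: URL-safe base64; the commit message only says "<b64>".
    b64 = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    # Percent-encode the "=" padding so it survives as a query value.
    return f"{proxy_base}/proxy?url={quote(b64, safe='')}"


def unwrap_proxy_url(proxied):
    """Recover the upstream URL from a wrapped one (the proxy's side)."""
    b64 = parse_qs(urlparse(proxied).query)["url"][0]
    return base64.urlsafe_b64decode(b64).decode("utf-8")


# Host taken from the commit message; the path is illustrative.
upstream = "https://oe1.ossfeed.store/live/index.m3u8"
wrapped = wrap_through_proxy(upstream)
print(wrapped)
print(unwrap_proxy_url(wrapped) == upstream)
```

Because the test page only ever talks to the proxy origin, hls.js never sends a cross-origin request to the CDN itself.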

For sites whose CDN serves any IP without Origin tricks (stremio,
dd12), this is transparent — proxy just forwards.

Side effect: every verified m3u8 hits our proxy once during extraction.
Cheap (1 cluster-internal request + 1 upstream HEAD/GET) and only
during the 5/30-min extraction cycle.
2026-05-24 22:26:56 +00:00
Viktor Barzin
7045559fee immich: harden against bulk-import load (memory + probe + Job retries)
Mid-flight stability changes from the 2026-05-24 Anca-elements import
that surfaced multiple latent issues under sustained load:

- `immich-postgresql` memory 3Gi → 5Gi. The original limit OOM-killed
  PG once the bulk insert + vector embeddings drove buffer pressure
  past 3 GiB. 5 GiB gives ~60% headroom over the observed steady
  state during ongoing imports.
- `immich-server` startup probe `failure_threshold` 30 → 360 (5min →
  1h). After any PG restart, immich-server reindexes `clip_index` +
  `face_index` (147k + 185k rows at the time of incident) before
  binding the API port. The old 5min budget was too tight, so each
  PG bounce trapped immich-server in a startup crashloop until the
  reindex was killed. 1h gives generous headroom.
- `kubernetes_job_v1.anca_elements_import.backoff_limit` 2 → 20 and
  `--concurrent-tasks` 8 → 20 on the immich-go upload. Short
  cluster blips (PG restart, KCM lease loss) were exhausting the
  Job's 3-attempt budget. 20 attempts + 20 parallel hashers makes
  dedup-on-resume ~2.5x faster and tolerates a much rougher cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 22:14:05 +00:00
root
445f30d955 Woodpecker CI deploy [CI SKIP] 2026-05-24 22:07:58 +00:00
84 changed files with 4011 additions and 2871 deletions


@@ -1,11 +1,28 @@
# Backup & Disaster Recovery Architecture
Last updated: 2026-05-24
Last updated: 2026-05-26
> **2026-05-24 session — what changed today** (deeper structural review pending — see the open backup-pipeline simplification audit):
> **2026-05-26 — bypass list pruned to a single path** (follow-up to the
> 2026-05-24 changes below):
> - `nfs-mirror` now copies ollama, audiblez, ebook2audiobook, and every
> `*-backup` CronJob output onto sda. Previously these went sdc → Synology
> DIRECT via Step 2; now they ride leg 1 like everything else.
> - **Bypass list (leg 2)** is now just `/srv/nfs/immich/` — too big for sda
> (1.5 T), no other choice.
> - **frigate and temp**: dropped from BOTH legs — intentionally not backed up.
> frigate is a 14-day camera ring, temp is scratch space. User explicit ask
> 2026-05-26.
> - **prometheus, loki, alertmanager**: live-orphan dirs that no longer
> exist on `/srv/nfs`. Dropped from the exclude/include lists as no-ops.
> - `/mnt/backup/anca-elements` (423 G) deleted — canonical copy lives in
> Immich since the 2026-05-24 ingest.
> - Aftermath: sda 87% → 46% used; Synology `/Viki/nfs/` shrinks to
> immich-only on next monthly `--delete` pass (or manual cleanup —
> see runbook).
>
> **2026-05-24 session — what changed**:
> - **anca-elements archive direction inverted** — Synology `/Backup/Anca/Elements` (770G) deleted; PVE `/srv/nfs/anca-elements` is now source of truth. `anca-elements-sync.sh` retired.
> - **`anca-elements-mirror.{sh,service,timer}` retired**, subsumed into the new **`nfs-mirror`** weekly job covering all critical NFS subtrees (anca-elements + ~80 services) → sda.
> - **`offsite-sync-backup` Step 2 filter inverted**: NFS-direct-to-Synology now only carries the sda-bypass paths (immich + frigate + prometheus + `*-backup` + …). Two-leg invariant: `nfs-mirror.sh EXCLUDES` = `offsite-sync-backup Step 2 INCLUDES`. Cross-referenced in both scripts.
> - **Synology `/Backup/Viki/nfs/<svc>/` orphan cleanup** — 84 dirs renamed in-place (btrfs metadata-only) to `/Backup/Viki/pve-backup/<svc>/` so daily-incremental Step 1 sees them as pre-existing and only ships deltas. No re-transfer.
> - **Synology snapshot retention 7d → 3d**, all 8 backlog snapshots deleted via `sudo synosharesnapshot delete Backup ...`. Reclaimed ~800G btrfs (98% → 83% used). DSM API was blocked by 2FA; `sudo` over the existing `Administrator` SSH key worked with the Vault-stored password.
> - **Manifest mechanism extended**: `nfs-mirror` now appends its transferred file list to `/mnt/backup/.changed-files` so daily Step 1 incremental picks it up (was previously only fed by `daily-backup`).
@@ -16,19 +33,19 @@ The homelab runs a 3-2-1 strategy with a **two-leg** path to Synology so every N
```
sdc /srv/nfs/<svc>/ ──nfs-mirror weekly──→ sda /mnt/backup/<svc>/ ──offsite-sync Step 1──→ Synology /Backup/Viki/pve-backup/<svc>/ [leg 1]
sdc /srv/nfs/<bypass>/ ──inotify (nfs-change-tracker)──→ offsite-sync Step 2 ──→ Synology /Backup/Viki/nfs/<bypass>/ [leg 2]
sdc /srv/nfs/immich/ ──inotify (nfs-change-tracker)──→ offsite-sync Step 2 ──→ Synology /Backup/Viki/nfs/immich/ [leg 2]
sdc PVCs (LVM thin) ──daily-backup~snapshot~rsync──→ sda /mnt/backup/{pvc-data,sqlite-backup,pfsense,pve-config}/ ──Step 1──→ Synology /Backup/Viki/pve-backup/
```
The **bypass list** (paths that take leg 2 — too big for sda, transient, or already-a-backup): `immich`, `frigate`, `prometheus`, `loki`, `temp`, `alertmanager`, `ollama`, `audiblez`, `ebook2audiobook`, `*-backup`. Anything NOT in this list rides leg 1 via `nfs-mirror`.
The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5 T). **Not backed up at all**: `/srv/nfs/frigate/` (camera ring buffer), `/srv/nfs/temp/` (scratch). Everything else rides leg 1 via `nfs-mirror`.
**3-2-1 Breakdown**:
- **Copy 1** (live): all PVC data + VM disks on Proxmox sdc thin pool (10.7TB RAID1 HDD); all NFS data at `/srv/nfs[-ssd]/`
- **Copy 2** (local backup): sda `/mnt/backup` (1.1TB RAID1 SAS) — at **~90% used** post-2026-05-24 (was ~10% in April)
- **Copy 3** (offsite): Synology NAS at 192.168.1.13 — at **~83% used / 934G free** post-2026-05-24 (was 98% / 121G before today's cleanup)
- `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs)
- `Synology/Backup/Viki/nfs/`: bypass-list NFS (immich, frigate, etc.)
- `Synology/Backup/Viki/nfs-ssd/`: bypass-list SSD NFS (immich-ML, ollama, llamacpp)
- **Copy 2** (local backup): sda `/mnt/backup` (1.1TB RAID1 SAS) — **46% used** post-2026-05-26 (was 87% before anca-elements cleanup; bypass-list pruning added ~260 G of *-backup + ollama + audiblez + ebook2audiobook)
- **Copy 3** (offsite): Synology NAS at 192.168.1.13
- `Synology/Backup/Viki/pve-backup/` — sda contents (PVC backups + nfs-mirror output: ~90 service dirs, now also includes ollama/audiblez/ebook2audiobook/*-backup)
- `Synology/Backup/Viki/nfs/`: immich only (post-2026-05-26)
- `Synology/Backup/Viki/nfs-ssd/`: full SSD NFS (immich-ML, ollama, llamacpp); SSD has no sda-mirror leg, so all three go direct
## Architecture Diagram
@@ -346,35 +363,33 @@ Two-step offsite sync:
#### Step 2: sda-bypass NFS to Synology nfs/ + nfs-ssd/ (inotify change-tracked, FILTERED)
**Role**: Only carries paths that **bypass sda** — i.e., paths the nfs-mirror script explicitly skips (immich, frigate, prometheus, *-backup, …). Paths that ARE on sda reach Synology via Step 1 and are explicitly excluded from Step 2 to prevent double-syncing. The Step 2 INCLUDE list MUST stay in sync with nfs-mirror's `EXCLUDES` — they are complementary.
**Role**: Carries the single path that bypasses sda — `/srv/nfs/immich/` (1.5 T, doesn't fit on sda). Plus the full `/srv/nfs-ssd/` (immich-ML + ollama + llamacpp; the SSD has no sda-mirror leg). Everything else under `/srv/nfs/` rides leg 1.
**Method**: `rsync --files-from /mnt/backup/.nfs-changes.log` with regex filter `^/srv/nfs/(immich|frigate|prometheus|loki|temp|alertmanager|ollama|audiblez|ebook2audiobook|[^/]+-backup)/`. The monthly full sync uses `--include='/<bypass-path>/***' … --exclude='*'` to limit to the same set. `nfs-ssd/` (all of immich-ML / ollama / llamacpp) is entirely bypass-list, so a plain `--delete` still applies.
**Method**: `rsync --files-from /mnt/backup/.nfs-changes.log` with regex filter `^/srv/nfs/immich/`. The monthly full sync uses `--include='/immich/***' --exclude='*'` for the HDD leg, and a plain `--delete` for the SSD leg.
**Change tracking**: `nfs-change-tracker.service` (systemd, inotifywait) on PVE host watches `/srv/nfs` and `/srv/nfs-ssd` continuously. Changed file paths are logged to `/mnt/backup/.nfs-changes.log`. Step 2 reads this log and transfers only changed files matching the bypass regex. Incremental syncs complete in seconds.
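
As a rough illustration of the filtering step, the sketch below selects HDD-leg paths from a list of change-log lines. It assumes the log holds one absolute path per line and uses the post-2026-05-26 regex `^/srv/nfs/immich/`; the function name is hypothetical, and how the real script handles `/srv/nfs-ssd` entries is not covered here:

```python
import re

# HDD-leg bypass regex as documented for Step 2 (post-2026-05-26).
BYPASS_RE = re.compile(r"^/srv/nfs/immich/")


def select_bypass_paths(changed_paths):
    """Return unique changed paths that belong to the bypass list, in order."""
    seen = set()
    selected = []
    for raw in changed_paths:
        path = raw.strip()
        # Skip blank lines, non-bypass paths and repeated entries.
        if not path or not BYPASS_RE.match(path) or path in seen:
            continue
        seen.add(path)
        selected.append(path)
    return selected


# Illustrative log content; these file names are made up.
log_lines = [
    "/srv/nfs/immich/upload/a.jpg\n",
    "/srv/nfs/frigate/recordings/x.mp4\n",
    "/srv/nfs/immich/upload/a.jpg\n",
    "/srv/nfs/immich-backup/dump.sql\n",
    "\n",
]
print(select_bypass_paths(log_lines))
```

The trailing slash in the regex matters: it keeps sibling directories such as a hypothetical `immich-backup` on leg 1.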
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` with the bypass-only include list for cleanup.
**Monthly full sync**: On 1st Sunday of month, runs `rsync --delete` with the immich-only include list. The `--delete` pass also reaps any stale Synology `/Viki/nfs/<dir>/` from the broader pre-2026-05-26 bypass list (ollama, audiblez, ebook2audiobook, *-backup, frigate, prometheus, loki, temp, alertmanager).
**`/srv/nfs/anca-elements/` history**: it had its own dedicated Synology exclusion line earlier on 2026-05-24 because the original Synology source (`/volume1/Backup/Anca/Elements`) was being preserved while the canonical copy moved to PVE. After the original was deleted (same day), anca-elements joined the broader "NOT bypassing sda" category and is covered by Step 1 via `nfs-mirror`.
**Layer 3a: NFS local mirror on sda (3-2-1 second copy)**: `/usr/local/bin/nfs-mirror` rsyncs the *critical* subset of `/srv/nfs/` → `/mnt/backup/<service>/` weekly (Mon 04:00). Single rsync invocation, single destination. The skip-list (in `nfs-mirror.sh` `EXCLUDES`) drops paths that don't justify a second local copy:
**Layer 3a: NFS local mirror on sda (3-2-1 second copy)**: `/usr/local/bin/nfs-mirror` rsyncs `/srv/nfs/` → `/mnt/backup/<service>/` weekly (Mon 04:00). Single rsync invocation, single destination. As of 2026-05-26 the skip-list (in `nfs-mirror.sh` `EXCLUDES`) is intentionally minimal:
- **immich** (1.2T) — too big for sda; Synology offsite is the only 2nd copy by design
- **frigate** (camera recordings, 14d auto-rotate)
- **prometheus**, **loki** (TSDB + logs — rebuildable / policy-driven retention)
- **ollama**, **llamacpp**, **audiblez**, **ebook2audiobook** (re-downloadable / regenerable)
- **temp**, **alertmanager** (transient state)
- **`*-backup`** (CronJob outputs — these ARE backups; backing up the backup is meta)
- **/srv/nfs-ssd** entirely (after the SSD skips above, residual is ~0)
- **immich** (1.5 T) — too big for sda; ships sdc → Synology direct (leg 2)
- **frigate** (camera ring buffer) — intentionally NOT backed up
- **temp** (scratch) — intentionally NOT backed up
- **anca-elements** (legacy) — now in Immich; `/mnt/backup/anca-elements` deleted 2026-05-26
- **/srv/nfs-ssd** entirely — its three dirs (immich-ML, ollama, llamacpp) all ship direct to Synology nfs-ssd/
Everything else under `/srv/nfs/` (anca-elements + ~30 critical service NFS subtrees: mysql, postgresql, nextcloud, health, real-estate-crawler, audiobookshelf, servarr, technitium, openclaw, ...) lands at `/mnt/backup/<svc>/`. Total mirror size ≈ 900 GB (mostly anca-elements at 770G).
Everything else under `/srv/nfs/` — mysql, postgresql, nextcloud, health, real-estate-crawler, audiobookshelf, servarr, technitium, openclaw, ollama (HDD), audiblez, ebook2audiobook, every `*-backup` CronJob output, … — lands at `/mnt/backup/<svc>/`. Mirror size ≈ 400 GB post-2026-05-26 (was ~900 GB with anca-elements).
Pushes `nfs_mirror_last_run_timestamp` + `nfs_mirror_last_status` + `nfs_mirror_bytes` to Pushgateway. Alerts: `NfsMirrorStale` (>16d), `NfsMirrorFailing` (status != 0). `rsync -rlt --delete -H --no-perms --no-owner --no-group`; idempotent. Nice=10, IOSchedulingClass=idle (won't compete with foreground IO).
> History: `anca-elements-mirror.{sh,service,timer}` was a precursor (2026-05-24 morning) dedicated to /srv/nfs/anca-elements only. Subsumed by `nfs-mirror` later the same day to consolidate ad-hoc copy scripts into one.
**Destination**:
- `Synology/Backup/Viki/nfs/`: mirrors `/srv/nfs`
- `Synology/Backup/Viki/nfs-ssd/` — mirrors `/srv/nfs-ssd`
- `Synology/Backup/Viki/nfs/`: immich only (post-2026-05-26)
- `Synology/Backup/Viki/nfs-ssd/` — mirrors `/srv/nfs-ssd` (immich-ML, ollama, llamacpp)
**Monitoring**: Pushes `offsite_backup_sync_last_success_timestamp` to Pushgateway. Alerts: `OffsiteBackupSyncStale` (>8d), `OffsiteBackupSyncFailing`.


@@ -79,13 +79,33 @@ graph TB
**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB)
> **node1 RAM (2026-05-10)**: bumped from 32 → 48 GiB out-of-band via
> `qm set 201 --memory 49152` because VMID 201 is intentionally not
> managed by Terraform yet (telmate/proxmox provider bug with iSCSI
> PVCs — see `infra/stacks/infra/main.tf` line 442). Driver: GPU
> multi-tenancy (frigate + ytdlp + llama-swap + immich-ml) was
> hitting 94% memory-request saturation on the old size. Adopt this
> VM into TF (`module "k8s-node1"`) once we've migrated to bpg/proxmox.
> **All Linux VMs are hand-managed in Proxmox, NOT in Terraform**
> (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2
> provider rewrites every disk slot on update — even ones covered by
> `lifecycle.ignore_changes` — and it doesn't refresh per-disk
> `mbps_*_concurrent` fields back from live state. We hit both bugs
> in production (id=539 iSCSI mangling 2026-04-02, and the 2026-05-26
> import attempt that corrupted k8s-node2 + k8s-node3 .conf files;
> recovered via `/mnt/backup/pve-config/etc-pve/nodes/pve/qemu-server/`
> nightly backups). What stays in TF: the cloud-init templates
> (`k8s-node-template`, `non-k8s-node-template`,
> `docker-registry-template` in `stacks/infra/main.tf`) — a fresh VM
> still clones the right template and runs the same bootstrap.
>
> Per-VM I/O caps (defense against sdc saturation by a single noisy
> guest) are applied by `apply-mbps-caps.{sh,service,timer}` on the
> PVE host (sources in `infra/scripts/`, install pattern per
> `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` +
> `OnCalendar=hourly`, so any drift (config restore, manual `qm
> set`, fresh clone) self-heals within the hour. Current caps:
> 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60,
> 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120,
> 204 k8s-node4 150/120, 220 docker-registry 40/40.
>
> Re-adoption into TF (via the `bpg/proxmox` provider, which models
> dynamic disks correctly) is possible but not scheduled — the
> cloud-init template above already captures the bootstrap-
> reproducibility goal.
### GPU Passthrough


@@ -158,6 +158,43 @@ SQLite uses `fsync()` to guarantee durability. NFS's soft mount + async semantic
> Democratic-csi has been removed along with TrueNAS decommissioning (2026-04). This section is kept for historical reference only.
### Per-VM SCSI-LUN cap (29 block PVCs per K8s node)
**The proxmox-csi-plugin hardcodes a per-VM LUN ceiling at 29.** The plugin
scans `scsi1..scsi29` for a free slot when attaching a PVC
(`pkg/csi/utils.go:394`: `for lun = 1; lun < 30; lun++`); when the loop exits
without a hit, ControllerPublishVolume returns
`Internal desc = no free lun found`. `CSINode.allocatable.count` is advertised
as `28` for every worker — derived from this plugin limit, NOT from Proxmox or
QEMU constraints.
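
The effect of the loop bound can be sketched in Python. This is a simplified model of the scan described above, not the plugin's Go code; the function name and the set-based bookkeeping are assumptions made for illustration:

```python
# Mirrors the documented bound `for lun = 1; lun < 30; lun++`.
CSI_LOOP_BOUND = 30


def find_free_lun(used_luns, bound=CSI_LOOP_BOUND):
    """Return the first free slot in scsi1..scsi(bound-1), or None if full."""
    for lun in range(1, bound):
        if lun not in used_luns:
            return lun
    # Corresponds to the "no free lun found" error path.
    return None


full_vm = set(range(1, 30))  # scsi1..scsi29 all attached

print(find_free_lun({1, 2, 4}))          # first gap in the used set
print(find_free_lun(full_vm))            # saturated VM
print(find_free_lun(full_vm, bound=31))  # with the bound raised to `< 31`
```

The scan starts at 1 because scsi0 holds the OS disk, which is why a bound of 30 yields 29 usable slots.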
What this means in practice:
- Each K8s node VM can hold at most 29 block PVCs simultaneously (scsi0 is the
OS disk).
- Switching `scsihw` from `virtio-scsi-pci` to `virtio-scsi-single` gains
per-disk iothread isolation but **zero additional capacity** — the cap lives
in the CSI plugin, not the QEMU device topology. Proxmox itself allows
`scsi0..scsi30` (31 slots, `$MAX_SCSI_DISKS = 31` in
`/usr/share/perl5/PVE/QemuServer/Drive.pm`).
- NFS PVCs (`nfs.csi.k8s.io`) are kernel NFS mounts and do not count against
the SCSI cap. Moving non-DB workloads (config-only, static content,
regenerable cache, pure upload buckets) to NFS is the simplest relief.
- Symptom when the cap is hit: pods stuck `ContainerCreating` with
`FailedAttachVolume … no free lun found` event, and the proxmox-csi
controller hot-loops `ControllerPublishVolume` against the saturated VM.
Levers (in order of leverage-per-effort):
1. **Migrate non-DB workloads off block** to NFS. Pre-flight every candidate
for embedded DBs (SQLite/LevelDB/RocksDB/H2/BoltDB) — they corrupt on NFS
due to lock semantics. Wave 1 (2026-05-26) moved 5 services
(excalidraw, resume, whisper, onlyoffice, f1-stream); pre-flight ruled out
two more (plotting-book → SQLite + WAL, stirling-pdf → H2).
2. **Add another K8s worker VM** — each new worker brings up to 29 fresh
slots; the most durable answer if PVC count keeps growing.
3. **Patch+fork `sergelogvinov/proxmox-csi-plugin`** to bump the loop bound
from `< 30` to `< 31` (matches Proxmox `MAX_SCSI_DISKS`). +1 slot per VM.
File upstream PR. Self-maintained image until merged.
## Configuration
### Key Files


@@ -130,8 +130,13 @@ to-working state pending an upstream fix or kernel rollback.
- [x] Pin gpu-operator chart to v25.10.1 in TF
- [x] Document situation in this post-mortem
- [ ] Roll back k8s-node1 host kernel to 6.8.0-117-generic + apt-mark
hold (needs user authorization for node reboot)
- [x] Roll back k8s-node1 host kernel to 6.8.0-117-generic (done by user;
kernel rollback succeeded and NFD now reports
`kernel-version.full=6.8.0-117-generic`, `os_release.VERSION_ID=24.04`)
- [x] Extend driver daemonset startup probe `failureThreshold` from 120 to 300
(50 min of probing) in TF `values.yaml`, 2026-05-25. On this hardware the
full install sequence (apt headers + gcc compilation + file copy) takes
~21min, which exactly exhausted the old 60s + 120×10s = 21min window.
- [ ] Add Prometheus alert `GPUNodeNoGPUResource` — fires when a node
labeled `nvidia.com/gpu.present=true` has `nvidia.com/gpu` capacity
of 0 for >10m
@@ -143,6 +148,47 @@ to-working state pending an upstream fix or kernel rollback.
`unattended-upgrades`; `do-release-upgrade` is a separate path
that should be gated too
## Follow-up Incident: Driver install hang (2026-05-25)
**Date**: 2026-05-25
**Status**: Resolved
After the kernel rollback to 6.8.0-117-generic succeeded, the driver pod
(`nvidia-driver-daemonset-529vg`) was still reported as "stuck at
Installing Linux kernel headers..." with no progress for 15-20 min.
**Actual root causes (two compounding issues)**:
1. **Deadlock between k8s-driver-manager and operator-validator**: The
`k8s-driver-manager` init container waits for `nvidia-operator-validator`
to shut down before it can begin the install sequence. The validator's
`driver-validation` init container was in an infinite retry loop polling
`/run/nvidia/validations/.driver-ctr-ready` (which the driver creates when
ready). Since the driver never finished, the validator never exited. The
validator pod had `deletionTimestamp` set but kubelet on node1 couldn't GC
it — the container received SIGTERM but remained in "Terminating" state
indefinitely, blocking the new driver from starting.
**Fix**: Force-deleted the stuck validator pod
(`kubectl delete pod -n nvidia nvidia-operator-validator-sff98 --force --grace-period=0`).
This broke the deadlock immediately.
2. **Startup probe timeout**: The full driver install sequence on this hardware
(6 vCPUs, 16Gi RAM) takes ~21 minutes:
- `apt-get install linux-headers-6.8.0-117-generic`: ~2 min
- `gcc/make -j16` kernel module build (nvidia, nvidia-uvm, nvidia-modeset,
nvidia-peermem): ~12 min
- nvidia-installer file copy + archive integrity check: ~7 min
The default startup probe allows exactly `60 + (120 × 10) = 1260s = 21min`.
This caused a SIGKILL (exit 137) at 21 minutes even when the install was
progressing normally.
**Fix**: Patched `driver.startupProbe.failureThreshold` from 120 to 300
in `stacks/nvidia/modules/nvidia/values.yaml` (a 51 min window: 60 + 300 × 10 = 3060s).
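
The probe arithmetic can be checked with a small sketch. It assumes the 60s term is the probe's initial delay and the 10s term its period, as the `60 + (120 × 10)` expression suggests; the function and variable names are illustrative:

```python
def startup_probe_window_s(initial_delay_s, period_s, failure_threshold):
    """Seconds a container gets before the startup probe gives up."""
    return initial_delay_s + period_s * failure_threshold


# Install phases from the breakdown above, in minutes.
install_s = (2 + 12 + 7) * 60

old_window_s = startup_probe_window_s(60, 10, 120)
new_window_s = startup_probe_window_s(60, 10, 300)

print(f"install: {install_s}s")
print(f"old window: {old_window_s}s, killed: {install_s >= old_window_s}")
print(f"new window: {new_window_s}s, killed: {install_s >= new_window_s}")
```

With the old threshold the install time and the window are both 1260s, so any jitter decides the outcome; the new threshold leaves 30 minutes of slack.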
**Key observation**: "Installing Linux kernel headers..." is NOT a hang — the
apt install just takes 2+ min and produces no log output during execution. The
log line appears before apt runs, so it looks frozen. Check `ps auxf` inside
the container to confirm apt/dpkg are actively running.
## Lessons
- **Operator-style charts that auto-detect host OS can silently break
@@ -158,3 +204,9 @@ to-working state pending an upstream fix or kernel rollback.
24.04 image on a 26.04 host), edit the NFD label — but only as a last
resort; the chart upgrade made clear the operator will eventually
reconcile this.
- **A k8s-driver-manager deadlock on a stuck Terminating validator pod looks
identical to an apt hang in the driver pod's logs**; `ps auxf` inside the
container is the key diagnostic. Force-deleting a stuck Terminating pod with no
finalizers is safe and immediately resolves the deadlock.
- **Driver startup probe must be sized for the full install wall-clock time**,
not just apt or just compilation. On slow hardware, 21 min is tight.


@@ -1,5 +1,8 @@
#cloud-config
hostname: terraform-vm
#cloud-config
# Hostname intentionally NOT set here — cloud-init reads it from
# Proxmox's auto-generated meta-data (which uses `qm set --name <X>`),
# so a single shared snippet works for every node.
manage_etc_hosts: true
users:
- name: wizard
sudo: ALL=(ALL) NOPASSWD:ALL
@@ -46,7 +49,7 @@ apt:
sources:
%{if is_k8s_template}
kubernetes:
source: "deb https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /"
source: "deb https://pkgs.k8s.io/core:/stable:/v1.34/deb/ /"
keyid: "DE15B14486CD377B9E876E1A234654DA9A296436"
filename: kubernetes.list
%{endif}
@@ -55,6 +58,26 @@ apt:
keyid: "9DC858229FC7DD38854AE2D88D81803C0EBFCD88"
filename: docker.list
%{if is_k8s_template}
# Setup script is base64-encoded by the module so YAML whitespace
# handling never touches the heredoc bodies inside it. Replaces an
# earlier `indent(6, …)` approach that put `[plugins.*]` TOML
# sections at col 6 inside `cat >> /etc/containerd/config.toml`
# heredocs — containerd refused to parse the result and the node5 v1
# boot failed there (2026-05-26). Source: modules/create-template-vm/k8s-node-containerd-setup.sh
write_files:
  - path: /usr/local/bin/k8s-node-containerd-setup.sh
    permissions: '0755'
    owner: root:root
    encoding: b64
    content: ${k8s_node_setup_script_b64}
  - path: /usr/local/bin/k8s-node-post-join-tune.sh
    permissions: '0755'
    owner: root:root
    encoding: b64
    content: ${k8s_node_post_join_script_b64}
%{endif}
runcmd:
# Enable weekly TRIM/discard to reclaim freed blocks in LVM thin pool
- systemctl enable --now fstrim.timer
@@ -67,6 +90,20 @@ runcmd:
- sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf
- systemctl restart systemd-journald
%{if is_k8s_template}
# systemd-resolved global DNS fallback. Without this, only the
# link-level DNS from Proxmox's `qm set --nameserver` (Technitium,
# 10.0.20.201) is consulted — and Technitium returns NXDOMAIN for
# forgejo.viktorbarzin.me, so kubelet image pulls from the Forgejo
# registry break. Public DNS upstream + Technitium fallback matches
# the pre-existing manual setup on k8s-node1..4.
- mkdir -p /etc/systemd/resolved.conf.d
- |
  cat > /etc/systemd/resolved.conf.d/global-dns.conf <<'EOF'
  [Resolve]
  DNS=8.8.8.8 1.1.1.1
  FallbackDNS=10.0.20.201
  EOF
- systemctl restart systemd-resolved
# Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight
# Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico,
# and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a
@@ -107,7 +144,12 @@ runcmd:
- apt-mark hold containerd containerd.io runc 2>/dev/null || true
- systemctl stop kubelet
- containerd config default | sudo tee /etc/containerd/config.toml
- ${containerd_config_update_command}
# The containerd/kubelet setup is delivered as /usr/local/bin/k8s-node-containerd-setup.sh
# via the write_files: block at the top of this file. We run it as a single
# bash invocation here so cloud-init only sees a one-line runcmd item.
# (Previous inline `- $${containerd_config_update_command}` broke YAML parsing
# because the heredoc contains mixed-indent inner shell heredocs.)
- bash /usr/local/bin/k8s-node-containerd-setup.sh
- systemctl restart containerd
- systemctl enable --now iscsid
# Harden iSCSI: increase recovery timeout (300s vs 120s default) and enable
@@ -124,17 +166,19 @@ runcmd:
- systemctl restart iscsid
# Create /sentinel directory for kured reboot gating (sentinel gate DaemonSet)
- mkdir -p /sentinel
# Create 4Gi swap file for worker node memory pressure relief (NOT for master — etcd is latency-critical)
- fallocate -l 4G /swapfile
- chmod 600 /swapfile
- mkswap /swapfile
- swapon /swapfile
- echo '/swapfile none swap sw 0 0' >> /etc/fstab
- sysctl -w vm.swappiness=10
- echo 'vm.swappiness=10' >> /etc/sysctl.d/99-swap.conf
# Disable swap — kubelet defaults to failSwapOn=true and won't start otherwise.
# (Previously this snippet created a 4G swapfile for "memory pressure relief"
# but never set failSwapOn=false / memorySwap.swapBehavior together, so the
# join consistently bricked kubelet — observed on node6 boot v3 2026-05-26.)
- swapoff -a
- sed -i '/ swap / s/^/#/' /etc/fstab
- ${k8s_join_command}
- systemctl enable kubelet
- systemctl start kubelet
# Kubelet tuning runs AFTER kubeadm join — that's when
# /var/lib/kubelet/config.yaml gets written. Restarts kubelet at the
# end to pick up the patched config.
- bash /usr/local/bin/k8s-node-post-join-tune.sh
%{ endif }
%{ for provision_cmd in provision_cmds ~}
- ${provision_cmd}


@@ -0,0 +1,146 @@
#!/usr/bin/env bash
#
# K8s node containerd + kubelet bootstrap. Runs once via cloud-init runcmd.
# Embedded into the cloud-init snippet base64-encoded by main.tf so YAML
# whitespace handling never touches the heredoc bodies — TOML / Python
# blocks below land in /etc/containerd/config.toml etc. with their leading
# whitespace intact.
#
# Layout (numbers match the section comments below):
#   1. /etc/containerd/config.toml: config_path
#   2. /etc/containerd/certs.d/*/hosts.toml: per-registry mirror configs
#   3. /etc/containerd/config.toml: parallel pulls + GC tuning
#   4. (kubelet tuning: not here, see k8s-node-post-join-tune.sh)
#   5. /etc/systemd/logind.conf.d + kubelet.service.d: graceful shutdown
#   6. (master-only) /etc/kubernetes/manifests: apiserver + controller flags
set -euo pipefail
# 1. config_path — match BOTH quote styles. containerd v1 writes `""`,
# containerd v2.x writes `''`. Without the v2 match, hosts.toml mirror
# config is silently ignored — observed 2026-05-26 on k8s-node4
# (containerd v2.2.4) and reproduced on k8s-node5 v1 boot.
sed -i "s|config_path = \"\"|config_path = \"/etc/containerd/certs.d\"|g" /etc/containerd/config.toml
sed -i "s|config_path = ''|config_path = \"/etc/containerd/certs.d\"|g" /etc/containerd/config.toml
# 2. Per-registry hosts.toml — pull-through caches on docker-registry VM
# (10.0.20.10) for high-traffic registries, Traefik LB (10.0.20.200) for
# forgejo. Low-traffic registries (registry.k8s.io, reg.kyverno.io) skip
# the cache and pull direct because past pull-through cache attempts
# truncated downloads and broke VPA certgen + Kyverno image pulls.
mkdir -p /etc/containerd/certs.d/docker.io
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'DOCKERIO'
server = "https://registry-1.docker.io"
[host."http://10.0.20.10:5000"]
capabilities = ["pull", "resolve"]
[host."https://registry-1.docker.io"]
capabilities = ["pull", "resolve"]
DOCKERIO
mkdir -p /etc/containerd/certs.d/ghcr.io
cat > /etc/containerd/certs.d/ghcr.io/hosts.toml <<'GHCR'
server = "https://ghcr.io"
[host."http://10.0.20.10:5010"]
capabilities = ["pull", "resolve"]
[host."https://ghcr.io"]
capabilities = ["pull", "resolve"]
GHCR
# Forgejo OCI registry: prefer in-cluster Traefik LB (10.0.20.200) to
# avoid hairpin NAT. Traefik serves the *.viktorbarzin.me wildcard so
# SNI verification succeeds. If the mirror is unreachable, fall back to
# public DNS resolution (needs the global DNS fallback set up below).
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
cat > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'FORGEJO'
server = "https://forgejo.viktorbarzin.me"
[host."https://10.0.20.200"]
capabilities = ["pull", "resolve"]
FORGEJO
# quay.io + registry.k8s.io: include mirror configs that match node4's
# layout (no real pull-through cache today, server line is the direct
# upstream). Keeping these present makes the per-node config uniform and
# lets us flip a cache on later by editing only the [host."..."] block.
mkdir -p /etc/containerd/certs.d/quay.io
cat > /etc/containerd/certs.d/quay.io/hosts.toml <<'QUAY'
server = "https://quay.io"
[host."http://10.0.20.10:5020"]
capabilities = ["pull", "resolve"]
QUAY
mkdir -p /etc/containerd/certs.d/registry.k8s.io
cat > /etc/containerd/certs.d/registry.k8s.io/hosts.toml <<'K8SREG'
server = "https://registry.k8s.io"
[host."http://10.0.20.10:5030"]
capabilities = ["pull", "resolve"]
K8SREG
# 3. containerd tuning: parallel pulls + selective GC overrides.
# containerd v2's `config default` ALREADY emits `[plugins.'io.containerd.gc.v1.scheduler']`,
# `[plugins.'io.containerd.runtime.v2.task']`, and `[plugins.'io.containerd.metadata.v1.bolt']`
# sections — declaring them again fails with `toml: table … already exists`
# (observed on node6 boot 2026-05-26). Patch values in place instead.
sed -i 's/.*max_concurrent_downloads = 3/max_concurrent_downloads = 20/g' /etc/containerd/config.toml
# pause_threshold: 0.02 caps scheduled GC pauses at roughly 2% of wall-clock time
sed -i "s/^[[:space:]]*pause_threshold = .*/ pause_threshold = 0.02/" /etc/containerd/config.toml
# schedule_delay: 0s/1ms → 30 min (longer cool-down between GC runs)
sed -i "s/^[[:space:]]*schedule_delay = .*/ schedule_delay = '1800s'/" /etc/containerd/config.toml
# exit_timeout: 0s → 5m (more aggressive container cleanup)
sed -i "s/^[[:space:]]*exit_timeout = .*/ exit_timeout = '5m'/" /etc/containerd/config.toml
# 4. (kubelet tuning intentionally NOT here — /var/lib/kubelet/config.yaml
# only exists AFTER kubeadm join. That work runs in
# k8s-node-post-join-tune.sh, invoked as a separate cloud-init runcmd
# step after the join completes.)
# 5. logind + kubelet systemd unit — total kubelet shutdown 310s, so
# logind InhibitDelay > that and kubelet TimeoutStopSec > that.
mkdir -p /etc/systemd/logind.conf.d
cat > /etc/systemd/logind.conf.d/kubelet-shutdown.conf <<'LOGIND_CONF'
[Login]
InhibitDelayMaxSec=480
LOGIND_CONF
systemctl restart systemd-logind
mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/20-shutdown.conf <<'KUBELET_SHUTDOWN'
[Service]
TimeoutStopSec=420s
KUBELET_SHUTDOWN
systemctl daemon-reload
# 6. (master-only) faster pod eviction + attach-detach reconcile.
if [ -f /etc/kubernetes/manifests/kube-controller-manager.yaml ]; then
python3 - <<'CM_PATCH'
import yaml
with open('/etc/kubernetes/manifests/kube-controller-manager.yaml') as f:
    m = yaml.safe_load(f)
args = m['spec']['containers'][0]['command']
for flag in ['--attach-detach-reconcile-sync-period=15s']:
    key = flag.split('=')[0]
    args = [a for a in args if not a.startswith(key)]
    args.append(flag)
m['spec']['containers'][0]['command'] = args
with open('/etc/kubernetes/manifests/kube-controller-manager.yaml', 'w') as f:
    yaml.dump(m, f, default_flow_style=False)
CM_PATCH
python3 - <<'AS_PATCH'
import yaml
with open('/etc/kubernetes/manifests/kube-apiserver.yaml') as f:
    m = yaml.safe_load(f)
args = m['spec']['containers'][0]['command']
for flag in ['--default-unreachable-toleration-seconds=60', '--default-not-ready-toleration-seconds=60']:
    key = flag.split('=')[0]
    args = [a for a in args if not a.startswith(key)]
    args.append(flag)
m['spec']['containers'][0]['command'] = args
with open('/etc/kubernetes/manifests/kube-apiserver.yaml', 'w') as f:
    yaml.dump(m, f, default_flow_style=False)
AS_PATCH
fi
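
The two manifest patches above share one pattern: drop any existing occurrence of a flag, then append the desired value, so a re-run leaves exactly one copy. A minimal sketch of that pattern follows, assuming the container command is a plain list of strings. `set_flags` is a hypothetical helper, not part of the script, and it matches on `key=` rather than on the bare key prefix so that a longer flag sharing the same leading characters is left alone.

```python
def set_flags(command, flags):
    """Return a copy of `command` with each flag in `flags` set exactly once.

    Hypothetical helper for illustration; `command` stands in for
    spec.containers[0].command from a static pod manifest.
    """
    result = list(command)
    for flag in flags:
        key = flag.split("=", 1)[0]
        # Drop the bare flag and any `key=value` form, but keep longer
        # flags that merely start with the same characters.
        result = [a for a in result if a != key and not a.startswith(key + "=")]
        result.append(flag)
    return result


# Made-up starting command; real manifests carry many more arguments.
command = [
    "kube-apiserver",
    "--default-not-ready-toleration-seconds=300",
    "--secure-port=6443",
]
wanted = [
    "--default-unreachable-toleration-seconds=60",
    "--default-not-ready-toleration-seconds=60",
]
once = set_flags(command, wanted)
twice = set_flags(once, wanted)
print(once)
```

Applying the helper a second time returns the same list, which is the property that makes the patch safe to repeat on every boot.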


@ -0,0 +1,78 @@
#!/usr/bin/env bash
#
# Runs AFTER `kubeadm join` has written /var/lib/kubelet/config.yaml.
# Patches kubelet config in place (parallel image pulls, eviction
# thresholds, priority-based shutdown grace, container log rotation).
# Master-only controller-manager / apiserver flags are NOT touched here.
#
# Embedded into the cloud-init snippet base64-encoded by main.tf so
# YAML whitespace doesn't touch the heredoc bodies inside.
set -euo pipefail
if [ ! -f /var/lib/kubelet/config.yaml ]; then
echo "post-join-tune: /var/lib/kubelet/config.yaml not found — was kubeadm join run?" >&2
exit 1
fi
# Parallel image pulls.
sed -i '/serializeImagePulls:/d' /var/lib/kubelet/config.yaml
sed -i '/maxParallelImagePulls:/d' /var/lib/kubelet/config.yaml
printf 'serializeImagePulls: false\nmaxParallelImagePulls: 50\n' >> /var/lib/kubelet/config.yaml
# Memory / disk eviction. Aggressive disk thresholds (15%/20%)
# prevent the 2026-03-13 containerd image-store corruption that took
# down k8s-node2.
sed -i '/systemReserved:/d; /kubeReserved:/d; /evictionHard:/,/^[^ ]/{ /evictionHard:/d; /^ /d }; /evictionSoft:/,/^[^ ]/{ /evictionSoft:/d; /^ /d }; /evictionSoftGracePeriod:/,/^[^ ]/{ /evictionSoftGracePeriod:/d; /^ /d }' /var/lib/kubelet/config.yaml
cat >> /var/lib/kubelet/config.yaml <<'KUBELET_PATCH'
systemReserved:
  memory: "512Mi"
  cpu: "200m"
kubeReserved:
  memory: "512Mi"
  cpu: "200m"
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "15%"
  imagefs.available: "20%"
evictionSoft:
  memory.available: "1Gi"
  nodefs.available: "20%"
  imagefs.available: "25%"
evictionSoftGracePeriod:
  memory.available: "30s"
  nodefs.available: "60s"
  imagefs.available: "30s"
memorySwap:
  swapBehavior: "LimitedSwap"
KUBELET_PATCH
# Container log rotation + priority-based shutdown grace.
sed -i '/^shutdownGracePeriod:/d; /^shutdownGracePeriodCriticalPods:/d' /var/lib/kubelet/config.yaml
python3 - <<'KUBELET_FINAL'
import yaml
with open('/var/lib/kubelet/config.yaml') as f:
    cfg = yaml.safe_load(f)
cfg.pop('shutdownGracePeriod', None)
cfg.pop('shutdownGracePeriodCriticalPods', None)
cfg.pop('shutdownGracePeriodByPodPriority', None)
cfg['containerLogMaxSize'] = '10Mi'
cfg['containerLogMaxFiles'] = 3
cfg['shutdownGracePeriodByPodPriority'] = [
    {'priority': 0, 'shutdownGracePeriodSeconds': 20},
    {'priority': 200000, 'shutdownGracePeriodSeconds': 20},
    {'priority': 400000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 600000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 800000, 'shutdownGracePeriodSeconds': 90},
    {'priority': 1000000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 1200000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 2000000000, 'shutdownGracePeriodSeconds': 30},
    {'priority': 2000001000, 'shutdownGracePeriodSeconds': 30},
]
with open('/var/lib/kubelet/config.yaml', 'w') as f:
    yaml.dump(cfg, f, default_flow_style=False)
KUBELET_FINAL
# Reload kubelet to pick up new config (it's already started by the
# preceding cloud-init runcmd line — restart, not start).
systemctl restart kubelet
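
The setup script quotes a total kubelet shutdown of 310s and sizes `TimeoutStopSec` (420s) and `InhibitDelayMaxSec` (480s) above it. The sketch below re-derives that figure from the priority list in this file. It assumes the kubelet works through the priority bands one after another, so the worst case is the plain sum; the constant and function names are illustrative only.

```python
# Values copied from the diff above; the names are illustrative only.
SHUTDOWN_GRACE_BY_PRIORITY = [
    (0, 20),
    (200000, 20),
    (400000, 30),
    (600000, 30),
    (800000, 90),
    (1000000, 30),
    (1200000, 30),
    (2000000000, 30),
    (2000001000, 30),
]
KUBELET_TIMEOUT_STOP_SEC = 420      # kubelet.service.d/20-shutdown.conf
LOGIND_INHIBIT_DELAY_MAX_SEC = 480  # logind.conf.d/kubelet-shutdown.conf


def total_grace_seconds(bands):
    """Sum the per-priority grace periods (assumed worst-case shutdown time)."""
    return sum(seconds for _priority, seconds in bands)


def budget_ok(bands, timeout_stop, inhibit_delay_max):
    """True when both systemd limits exceed the kubelet's total."""
    total = total_grace_seconds(bands)
    return total < timeout_stop and total < inhibit_delay_max


total = total_grace_seconds(SHUTDOWN_GRACE_BY_PRIORITY)
ok = budget_ok(
    SHUTDOWN_GRACE_BY_PRIORITY,
    KUBELET_TIMEOUT_STOP_SEC,
    LOGIND_INHIBIT_DELAY_MAX_SEC,
)
print(total, ok)
```

A check like this is worth re-running whenever a band is added or lengthened, since the two systemd limits live in a different script.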


@ -16,7 +16,7 @@ variable "k8s_join_command" {
variable "containerd_config_update_command" {
type = string
default = ""
- description = "Command to execute to update containerd config.toml; e.g add mirror"
+ description = "DEPRECATED: was inlined into write_files via indent(); the heredoc-TOML interaction broke containerd config parsing on node5 v1 boot 2026-05-26. The k8s setup script is now bundled inside the module at k8s-node-containerd-setup.sh — pass nothing here. Kept to avoid breaking stacks that still reference it; ignored when is_k8s_template=true."
}
variable "is_k8s_template" { type = bool }
variable "ssh_private_key" {
@ -79,23 +79,26 @@ resource "null_resource" "upload_cloud_init" {
provisioner "file" {
destination = "/var/lib/vz/snippets/${var.snippet_name}"
content = templatefile("${path.module}/cloud_init.yaml", {
- is_k8s_template = var.is_k8s_template,
- authorized_ssh_key = var.ssh_public_key,
- passwd = var.user_passwd,
- provision_cmds = var.provision_cmds,
- k8s_join_command = var.k8s_join_command,
- containerd_config_update_command = var.containerd_config_update_command
+ is_k8s_template = var.is_k8s_template,
+ authorized_ssh_key = var.ssh_public_key,
+ passwd = var.user_passwd,
+ provision_cmds = var.provision_cmds,
+ k8s_join_command = var.k8s_join_command,
+ k8s_node_setup_script_b64 = var.is_k8s_template ? base64encode(file("${path.module}/k8s-node-containerd-setup.sh")) : ""
+ k8s_node_post_join_script_b64 = var.is_k8s_template ? base64encode(file("${path.module}/k8s-node-post-join-tune.sh")) : ""
}
)
}
# Force recreate when the below changes
triggers = {
- file_hash = filesha256("${path.module}/cloud_init.yaml")
- provision_cmds = join(", ", var.provision_cmds)
- is_k8s_template = var.is_k8s_template,
- passwd = var.user_passwd,
- k8s_join_command = var.k8s_join_command,
- containerd_config_update_command = var.containerd_config_update_command
+ file_hash = filesha256("${path.module}/cloud_init.yaml")
+ setup_script_hash = var.is_k8s_template ? filesha256("${path.module}/k8s-node-containerd-setup.sh") : ""
+ post_join_script_hash = var.is_k8s_template ? filesha256("${path.module}/k8s-node-post-join-tune.sh") : ""
+ provision_cmds = join(", ", var.provision_cmds)
+ is_k8s_template = var.is_k8s_template,
+ passwd = var.user_passwd,
+ k8s_join_command = var.k8s_join_command,
+ ssh_public_key = var.ssh_public_key,
}
}
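
The module now does two things with each bundled script: it base64-encodes the contents for the cloud-init template, and it hashes the file so the `null_resource` is recreated whenever the script changes. The sketch below uses Python's standard library as a stand-in for Terraform's `base64encode(file(...))` and `filesha256(...)`. `script_payload` and the demo file are hypothetical, and the equivalence assumes a UTF-8 script.

```python
import base64
import hashlib
import os
import tempfile


def script_payload(path):
    """Return the base64 body and the SHA-256 hex digest of a script file."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "b64": base64.b64encode(data).decode("ascii"),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "demo.sh")  # hypothetical script, not a repo file
    with open(path, "wb") as f:
        f.write(b"#!/usr/bin/env bash\necho hello\n")
    before = script_payload(path)

    # Any edit to the script changes the digest, which is what would
    # flip the trigger and force the snippet to be re-uploaded.
    with open(path, "ab") as f:
        f.write(b"echo changed\n")
    after = script_payload(path)

print(before["sha256"] != after["sha256"])
```

Base64 keeps the script body opaque to the YAML template, so heredocs and indentation inside the script survive untouched.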


@ -135,6 +135,22 @@ variable "hostpci0" {
default = "" # e.g., "0000:06:00.0" for Tesla T4 passthrough
}
+ # ---------------------------------------------------------------------------
+ # Variables: Disk I/O throttling (MB/s; 0 = uncapped)
+ # ---------------------------------------------------------------------------
+ # Caps any single VM's share of the underlying disk so a runaway workload
+ # (e.g. the 2026-05-23/26 alloy IO storm, memory id=2726) cannot wedge the
+ # whole Proxmox host's sdc thin pool. Values inferred from PVE RRD p99/max
+ # observed in /nodes/pve/qemu/<vmid>/rrddata.
+ variable "mbps_rd" {
+ type = number
+ default = 0
+ }
+ variable "mbps_wr" {
+ type = number
+ default = 0
+ }
# ---------------------------------------------------------------------------
# Resource
# ---------------------------------------------------------------------------
@ -192,9 +208,11 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi0" ? [1] : []
content {
disk {
- storage = "local-lvm"
- size = var.vm_disk_size
- discard = true # Enable TRIM passthrough to LVM thin pool; reduces CoW overhead
+ storage = "local-lvm"
+ size = var.vm_disk_size
+ discard = true # Enable TRIM passthrough to LVM thin pool; reduces CoW overhead
+ mbps_r_concurrent = var.mbps_rd
+ mbps_wr_concurrent = var.mbps_wr
}
}
}
@ -202,9 +220,11 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
for_each = var.disk_slot == "scsi1" ? [1] : []
content {
disk {
- storage = "local-lvm"
- size = var.vm_disk_size
- discard = true
+ storage = "local-lvm"
+ size = var.vm_disk_size
+ discard = true
+ mbps_r_concurrent = var.mbps_rd
+ mbps_wr_concurrent = var.mbps_wr
}
}
}
@ -234,12 +254,39 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
lifecycle {
prevent_destroy = true
ignore_changes = [
- # democratic-csi dynamically attaches/detaches iSCSI disks
+ # proxmox-csi dynamically attaches/detaches PVC disks. K8s workers
+ # have up to ~30 slots in use simultaneously (k8s-node1: scsi1-29 +
+ # unused0-29). The k8s-master only uses scsi0 (boot) so most of
+ # these are no-ops for that VM but harmless.
disks[0].scsi[0].scsi1,
disks[0].scsi[0].scsi2,
disks[0].scsi[0].scsi3,
disks[0].scsi[0].scsi4,
disks[0].scsi[0].scsi5,
+ disks[0].scsi[0].scsi6,
+ disks[0].scsi[0].scsi7,
+ disks[0].scsi[0].scsi8,
+ disks[0].scsi[0].scsi9,
+ disks[0].scsi[0].scsi10,
+ disks[0].scsi[0].scsi11,
+ disks[0].scsi[0].scsi12,
+ disks[0].scsi[0].scsi13,
+ disks[0].scsi[0].scsi14,
+ disks[0].scsi[0].scsi15,
+ disks[0].scsi[0].scsi16,
+ disks[0].scsi[0].scsi17,
+ disks[0].scsi[0].scsi18,
+ disks[0].scsi[0].scsi19,
+ disks[0].scsi[0].scsi20,
+ disks[0].scsi[0].scsi21,
+ disks[0].scsi[0].scsi22,
+ disks[0].scsi[0].scsi23,
+ disks[0].scsi[0].scsi24,
+ disks[0].scsi[0].scsi25,
+ disks[0].scsi[0].scsi26,
+ disks[0].scsi[0].scsi27,
+ disks[0].scsi[0].scsi28,
+ disks[0].scsi[0].scsi29,
# cloud-init config may drift after first boot
cicustom,
ciupgrade,
@ -254,6 +301,13 @@ resource "proxmox_vm_qemu" "cloudinit-vm" {
# Provider defaults that differ from imported state
define_connection_info,
full_clone,
+ # scsihw varies per VM (virtio-scsi-pci / virtio-scsi-single / lsi)
+ # and changing it on a running VM is risky; leave whatever's live.
+ scsihw,
+ # qemu_os is a hint to qemu about the guest OS; some live VMs have
+ # "other" (unset originally) and the module's "l26" default would
+ # otherwise force an unnecessary write on apply.
+ qemu_os,
]
}
}


@ -0,0 +1,12 @@
[Unit]
Description=Apply per-VM I/O caps via qm set (idempotent)
Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/apply-mbps-caps.sh
After=pve-cluster.service
Wants=pve-cluster.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/apply-mbps-caps.sh
StandardOutput=journal
StandardError=journal
SyslogIdentifier=apply-mbps-caps

scripts/apply-mbps-caps.sh Executable file

@ -0,0 +1,74 @@
#!/usr/bin/env bash
# Apply per-VM I/O caps via `qm set` on the PVE host.
#
# - Reads each target VM's current boot-disk options.
# - Appends/normalises `mbps_rd=<N>,mbps_wr=<N>`.
# - Re-applies via `qm set` (live, no reboot needed).
# - Idempotent: re-running with no drift is a no-op at the storage
# level (proxmox config rewrite is cheap).
# - Continues on per-VM failures so one missing/stopped VM doesn't
# skip the rest — designed to be safe under the systemd timer.
#
# Backed by `apply-mbps-caps.{service,timer}` (hourly + 5min-after-boot).
# Why these values: see beads code-9v2j + memory id=2726 (alloy IO storm)
# + memory id=1575 (VMs intentionally out of TF).
set -uo pipefail # NOT -e — keep going if a single VM step fails.
# vmid:disk_slot:mbps_rd:mbps_wr (Linux VMs only — skipping 101 pfsense BSD, 300 Windows)
TARGETS=(
"102:scsi0:60:60" # devvm
"103:sata0:40:40" # home-assistant
"200:scsi0:100:60" # k8s-master (alloy storm origin — firmest clip)
"201:scsi1:150:120" # k8s-node1 (GPU + many CSI disks; boots from scsi1)
"202:scsi0:150:120" # k8s-node2
"203:scsi0:150:120" # k8s-node3
"204:scsi0:150:120" # k8s-node4
"220:scsi0:40:40" # docker-registry
)
apply_one() {
local spec="$1"
local vmid slot rd wr
IFS=: read -r vmid slot rd wr <<<"$spec"
# Skip non-existent VMs cleanly (e.g. node decommissioned, never rebuilt).
if ! qm status "$vmid" >/dev/null 2>&1; then
echo "vmid $vmid: not present on this host — skipping"
return 0
fi
local current cleaned newvalue
current=$(qm config "$vmid" | awk -v s="$slot:" '$1==s {sub(/^[^ ]+ /, ""); print; exit}')
if [[ -z "$current" ]]; then
echo "vmid $vmid: no $slot line in config — skipping"
return 0
fi
cleaned=$(echo "$current" | sed -E 's/,mbps_rd=[0-9]+//g; s/,mbps_wr=[0-9]+//g')
newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}"
# Skip the qm-set call entirely when state already matches — keeps
# journal noise low under the hourly timer.
if [[ "$current" == "$newvalue" ]]; then
echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op"
return 0
fi
echo "vmid $vmid: updating $slot"
echo " before: $current"
echo " after: $newvalue"
if qm set "$vmid" "--$slot" "$newvalue"; then
echo " ok"
else
echo " FAILED: qm set returned non-zero"
return 1
fi
}
rc=0
for spec in "${TARGETS[@]}"; do
apply_one "$spec" || rc=1
done
exit "$rc"
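
The core of `apply_one` is a string normalisation: strip any existing `mbps_rd`/`mbps_wr` options from the disk line, then append the wanted pair. The sketch below mirrors that step in Python. The sample disk value is made up, and `same_options` is a hypothetical addition: the script compares strings exactly, so if `qm config` ever reported the options in a different order than they were appended (not verified here), the no-op branch would not be taken. Comparing the comma-separated parts as a set sidesteps that.

```python
import re


def with_caps(disk_value, rd, wr):
    """Strip existing mbps_rd/mbps_wr options, then append the wanted caps."""
    cleaned = re.sub(r",mbps_(?:rd|wr)=[0-9]+", "", disk_value)
    return f"{cleaned},mbps_rd={rd},mbps_wr={wr}"


def same_options(a, b):
    """Compare two option strings, ignoring the order of the parts."""
    return sorted(a.split(",")) == sorted(b.split(","))


# Hypothetical scsi0 value; real volume names and options will differ.
current = "local-lvm:vm-200-disk-0,discard=on,mbps_rd=50,size=64G"
wanted = with_caps(current, 100, 60)
reordered = "local-lvm:vm-200-disk-0,discard=on,mbps_rd=100,mbps_wr=60,size=64G"

print(wanted)
print(same_options(wanted, reordered))
```

Running `with_caps` on its own output returns the same string, which matches the idempotency the script header promises.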


@ -0,0 +1,18 @@
[Unit]
Description=Re-apply per-VM I/O caps periodically + after PVE boot
[Timer]
# After every PVE host reboot — caps survive in /etc/pve/qemu-server/<vmid>.conf
# normally, but a config restore from backup can drop them (see 2026-05-26
# incident where we restored 202.conf + 203.conf from /mnt/backup/pve-config/).
OnBootSec=5min
# Hourly during normal operation — catches manual `qm set` drift or fresh
# VM clones that haven't had caps applied yet.
OnCalendar=hourly
Persistent=true
RandomizedDelaySec=2min
[Install]
WantedBy=timers.target


@ -13,20 +13,21 @@
# destination layout (anca-elements lives at /mnt/backup/anca-elements/),
# but now covers every other critical NFS subtree in one pass.
#
- # SKIP-LIST rationale (paths NOT mirrored — Synology offsite still covers them):
- # immich — 1.2T, doesn't fit on sda; Synology only by design
- # frigate — 14d camera ring, auto-rotates
- # prometheus — TSDB, rebuildable from cluster state
- # loki — log retention is a policy choice, not durable data
- # temp — scratch
- # alertmanager — transient state
- # ollama — LLM model weights, re-downloadable
- # audiblez — re-fetchable from Audible
- # ebook2audiobook — regenerable from book sources
- # *-backup — CronJob output (these ARE backups; backing them up is meta)
+ # SKIP-LIST rationale (2026-05-26 simplification — see commit notes):
+ # immich — 1.5T, doesn't fit on sda; offsite-sync ships it direct to Synology
+ # frigate — camera ring buffer; intentionally NOT backed up anywhere
+ # temp — scratch; intentionally NOT backed up
#
- # Note: /srv/nfs-ssd is intentionally NOT mirrored — after skipping immich
- # (47G), ollama (59G), and llamacpp (26G) there's effectively zero residual.
+ # Everything else (ollama, audiblez, ebook2audiobook, *-backup, …) now
+ # flows sdc → sda (this script) → Synology pve-backup/ via offsite-sync
+ # Step 1. Previously they went sdc → Synology DIRECT via Step 2; the
+ # bypass list got pruned to just `immich` so we have a single canonical
+ # mirror at sda. Prometheus/loki/alertmanager were live-orphan entries
+ # that no longer exist on /srv/nfs (cleaned 2026-05-26) — dropped from
+ # the exclude list as a no-op.
+ #
+ # Note: /srv/nfs-ssd is intentionally NOT mirrored — its three dirs
+ # (immich, ollama, llamacpp) all go direct to Synology nfs-ssd/.
set -euo pipefail
@ -57,27 +58,15 @@ EXCLUDES=(
--exclude='/.lv-pvc-mapping.json'
--exclude='/.nfs-changes.log'
- # ---- anca-elements: photos are being ingested into Immich (2026-05-24),
- # so /srv/nfs/immich/library/ becomes the canonical copy and the separate
- # anca-elements tree is redundant. Excluded from nfs-mirror going forward.
- # The historical 771G at /mnt/backup/anca-elements/ stays put until manual
- # cleanup once Immich ingest completes; offsite-sync Step 1 also excludes
- # it from the Synology pve-backup/ upload so we don't ship the redundant copy.
+ # ---- anca-elements: now in Immich (canonical), /mnt/backup copy deleted
+ # 2026-05-26. Kept in excludes so nfs-mirror doesn't re-populate from sdc
+ # if /srv/nfs/anca-elements is ever re-attached.
--exclude='/anca-elements/'
- # ---- NFS paths: too big / transient / re-fetchable ----
- --exclude='/immich/'
- --exclude='/frigate/'
- --exclude='/prometheus/'
- --exclude='/loki/'
- --exclude='/temp/'
- --exclude='/alertmanager/'
- --exclude='/ollama/'
- --exclude='/audiblez/'
- --exclude='/ebook2audiobook/'
- # ---- *-backup CronJob outputs (don't back up backups) ----
- --exclude='/*-backup/'
+ # ---- NFS paths intentionally NOT backed up ----
+ --exclude='/immich/' # 1.5T — ships sdc → Synology direct (Step 2)
+ --exclude='/frigate/' # ring buffer — no backup anywhere
+ --exclude='/temp/' # scratch — no backup anywhere
# ---- Synology / Windows / macOS cruft ----
--exclude='/@eaDir/'
@ -130,7 +119,7 @@ mountpoint -q /mnt/backup || { log "FATAL: /mnt/backup not mounted"; push_metric
[ -d "$SRC" ] || { log "FATAL: source $SRC missing"; push_metrics 1 0; exit 1; }
log "=== mirror starting: $SRC → $DST ==="
- log "skip: immich, frigate, prometheus, loki, ollama, audiblez, *-backup, temp"
+ log "skip: immich (Synology direct), frigate (no backup), temp (no backup), anca-elements"
# Marker file used to identify files written by this rsync run, so we can append
# their paths to the offsite-sync manifest. Touch BEFORE rsync; `find -newer` AFTER.
@ -149,7 +138,13 @@ DST_BYTES=$(df -B1 --output=used /mnt/backup | tail -1)
if [ "$RSYNC_RC" -eq 0 ]; then
# Capture files that rsync created/modified and feed them to the offsite-sync
# manifest so daily Step 1 incremental picks them up tomorrow morning.
- NEW_COUNT=$(find /mnt/backup -newer "$STAMP" -type f \
+ # Use -cnewer (ctime), not -newer (mtime): rsync -t preserves SOURCE mtime
+ # on the dest, so freshly-written files with old source mtime look "older"
+ # than $STAMP and -newer misses them. ctime is set when the inode is written,
+ # regardless of -t, so it correctly identifies what this run created.
+ # (Bug hit 2026-05-26 full bypass-list mirror: 800k files copied, manifest
+ # captured only 2 entries → forced a .force-full-sync to recover.)
+ NEW_COUNT=$(find /mnt/backup -cnewer "$STAMP" -type f \
! -path '/mnt/backup/.changed-files' \
! -path '/mnt/backup/.changed-files.lock' \
! -path '/mnt/backup/.lv-pvc-mapping.json' \
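
The `-cnewer` reasoning in the comment above can be reproduced without rsync. The sketch below creates a stamp file, writes a second file afterwards, and then backdates that file's mtime the way `rsync -t` would. It assumes a Linux or other POSIX filesystem, where `st_ctime` is the inode change time; the file names are placeholders.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as d:
    stamp = os.path.join(d, "stamp")
    open(stamp, "w").close()
    time.sleep(0.05)  # make sure later timestamps are strictly newer

    copied = os.path.join(d, "copied")
    with open(copied, "w") as f:
        f.write("payload")

    # Mimic rsync -t: give the destination file the source's old mtime.
    one_day_ago = time.time() - 86400
    os.utime(copied, (one_day_ago, one_day_ago))

    stamp_mtime = os.stat(stamp).st_mtime_ns
    st = os.stat(copied)
    # find -newer compares the file's mtime against the stamp's mtime.
    seen_by_newer = st.st_mtime_ns > stamp_mtime
    # find -cnewer compares the file's ctime against the stamp's mtime.
    seen_by_cnewer = st.st_ctime_ns > stamp_mtime

print(seen_by_newer, seen_by_cnewer)
```

The backdated file is invisible to the mtime comparison but still caught by the ctime one, because setting timestamps is itself an inode change.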


@ -76,8 +76,8 @@ if [ "${DAY_OF_MONTH}" -le 7 ] || [ -n "${FORCE_FULL}" ]; then
elif [ -s "${MANIFEST}" ]; then
MANIFEST_LINES=$(wc -l < "${MANIFEST}")
log "Incremental sync (${MANIFEST_LINES} files from manifest)..."
- # /anca-elements is being ingested into Immich (Immich becomes canonical) —
- # skip the redundant copy in /mnt/backup/anca-elements/ until manual cleanup.
+ # anca-elements: now in Immich (canonical); /mnt/backup copy deleted
+ # 2026-05-26. Exclude retained as a safety belt in case it re-appears.
rsync -rlt --chmod=Du=rwx,Dgo=rx,Fu=rw,Fog=r --files-from="${MANIFEST}" \
--exclude='anca-elements/' \
"${BACKUP_ROOT}/" "${PVE_BACKUP_DEST}/" 2>&1 || STATUS=1
@ -89,64 +89,60 @@ fi
# STEP 2: NFS → Synology nfs/ + nfs-ssd/ (inotify change-tracked, FILTERED)
# ============================================================
#
- # DESIGN: Step 2 only carries paths that BYPASS the sda mirror. Paths that ARE
- # mirrored to sda by nfs-mirror reach Synology via Step 1 (sda → Synology
- # pve-backup/) and must NOT also flow through Step 2 — that would duplicate
- # every byte and double Synology consumption.
+ # DESIGN: Step 2 only carries paths that BYPASS the sda mirror. As of
+ # 2026-05-26 that's just /srv/nfs/immich/ (1.5T, doesn't fit on sda).
+ # Everything else under /srv/nfs/ now flows through sda via nfs-mirror,
+ # reaching Synology via Step 1 (sda → pve-backup/). frigate and temp are
+ # excluded from both legs — intentionally NOT backed up.
#
- # The skip-list below MUST stay in sync with EXCLUDES in
- # /usr/local/bin/nfs-mirror (which defines what nfs-mirror does NOT copy to
- # sda). The two are complementary: nfs-mirror EXCLUDES = offsite-sync Step 2
- # INCLUDES. Failing to keep them aligned creates either gaps (data missing
- # from Synology) or duplication (data on Synology via both paths).
- log "--- Step 2: NFS → Synology (skip-list paths only — sda-bypass leg) ---"
+ # nfs-ssd is handled separately below: its three dirs (immich, ollama,
+ # llamacpp) all go direct to Synology since /srv/nfs-ssd is not mirrored
+ # to sda. ollama+llamacpp are small enough (~85G total) that the direct
+ # leg is fine and we don't need to extend nfs-mirror to cover the SSD.
+ #
+ # Keep this aligned with /usr/local/bin/nfs-mirror's EXCLUDES — the
+ # excludes there are { immich (this leg), frigate (no backup), temp
+ # (no backup), anca-elements (deleted), pvc-data and friends (owned by
+ # daily-backup) }. Only the bypass-leg subset matters here: { immich }.
+ log "--- Step 2: NFS → Synology (immich-only direct leg + nfs-ssd) ---"
# Regex matching paths NOT on sda (must reach Synology directly).
# Top-level dirs under /srv/nfs/ — anchored, no nesting allowed.
- NFS_SDA_BYPASS_RE='^/srv/nfs/(immich|frigate|prometheus|loki|temp|alertmanager|ollama|audiblez|ebook2audiobook|[^/]+-backup)/'
+ NFS_SDA_BYPASS_RE='^/srv/nfs/immich/'
# rsync include/exclude args for the monthly full sync (HDD).
# Order matters: --include patterns first, --exclude '*' last.
NFS_FULL_INCLUDES=(
- --include='/immich/' --include='/immich/***'
- --include='/frigate/' --include='/frigate/***'
- --include='/prometheus/' --include='/prometheus/***'
- --include='/loki/' --include='/loki/***'
- --include='/temp/' --include='/temp/***'
- --include='/alertmanager/' --include='/alertmanager/***'
- --include='/ollama/' --include='/ollama/***'
- --include='/audiblez/' --include='/audiblez/***'
- --include='/ebook2audiobook/' --include='/ebook2audiobook/***'
- --include='/*-backup/' --include='/*-backup/***'
+ --include='/immich/' --include='/immich/***'
--exclude='*'
)
if [ "${DAY_OF_MONTH}" -le 7 ]; then
- # Monthly: full sync with --delete for cleanup, restricted to bypass-list.
- log "Monthly full NFS sync (sda-bypass paths only)..."
+ # --delete here will reap legacy dirs on Synology (frigate, ollama,
+ # audiblez, ebook2audiobook, *-backup, prometheus, loki, temp,
+ # alertmanager) since they're no longer in NFS_FULL_INCLUDES.
+ log "Monthly full NFS sync (immich-only — reaps legacy bypass dirs)..."
rsync -rlt --delete "${NFS_FULL_INCLUDES[@]}" /srv/nfs/ "${NFS_DEST}/" 2>&1 \
- && log " OK: nfs/ full sync (bypass-list)" || { warn "nfs/ full sync failed"; STATUS=1; }
- # nfs-ssd: every dir under it (immich/ollama/llamacpp) is in the bypass list,
- # so a plain --delete still applies cleanly.
+ && log " OK: nfs/ full sync (immich-only)" || { warn "nfs/ full sync failed"; STATUS=1; }
+ # nfs-ssd: full sync of all three dirs (immich, ollama, llamacpp).
rsync -rlt --delete /srv/nfs-ssd/ "${NFS_SSD_DEST}/" 2>&1 \
&& log " OK: nfs-ssd/ full sync" || { warn "nfs-ssd/ full sync failed"; STATUS=1; }
> "${NFS_CHANGE_LOG}"
elif [ -s "${NFS_CHANGE_LOG}" ]; then
- # Incremental: only sync changed files in bypass-list paths.
+ # Incremental: only sync changed files matching the bypass leg (immich).
sort -u "${NFS_CHANGE_LOG}" > /tmp/nfs-changes-deduped
- # HDD NFS — include only sda-bypass paths.
+ # HDD NFS — include only /srv/nfs/immich/ paths.
grep -E "${NFS_SDA_BYPASS_RE}" /tmp/nfs-changes-deduped | \
while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs/}"; done \
> /tmp/sync-nfs.list 2>/dev/null
NFS_COUNT=$(wc -l < /tmp/sync-nfs.list 2>/dev/null || echo 0)
if [ "${NFS_COUNT:-0}" -gt 0 ]; then
rsync -rlt --files-from=/tmp/sync-nfs.list /srv/nfs/ "${NFS_DEST}/" 2>&1 \
- && log " OK: nfs/ (${NFS_COUNT} bypass files)" \
+ && log " OK: nfs/ (${NFS_COUNT} immich files)" \
|| { warn "nfs/ incremental failed"; STATUS=1; }
fi
- # SSD NFS — every nfs-ssd path (immich/ollama/llamacpp) is in the bypass list.
+ # SSD NFS — every nfs-ssd path (immich/ollama/llamacpp) ships direct.
grep '^/srv/nfs-ssd/' /tmp/nfs-changes-deduped | \
while IFS= read -r f; do [ -f "$f" ] && echo "${f#/srv/nfs-ssd/}"; done \
> /tmp/sync-nfs-ssd.list 2>/dev/null || true
@ -158,7 +154,7 @@ elif [ -s "${NFS_CHANGE_LOG}" ]; then
fi
TOTAL=$(wc -l < /tmp/nfs-changes-deduped)
- log " Processed ${TOTAL} change events (${NFS_COUNT} nfs + ${SSD_COUNT} nfs-ssd bypass-list files synced)"
+ log " Processed ${TOTAL} change events (${NFS_COUNT} nfs/immich + ${SSD_COUNT} nfs-ssd files synced)"
> "${NFS_CHANGE_LOG}"
rm -f /tmp/nfs-changes-deduped /tmp/sync-nfs.list /tmp/sync-nfs-ssd.list
else
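
The incremental leg of Step 2 boils down to three operations on the change log: de-duplicate, keep only paths matching `NFS_SDA_BYPASS_RE`, and strip the `/srv/nfs/` prefix so the result can feed `rsync --files-from`. The sketch below shows the same filter in Python with made-up paths. `bypass_list` is a hypothetical name, and the existence check that the shell loop performs with `[ -f "$f" ]` is left out so the example runs anywhere.

```python
import re

# Same anchored pattern as the new NFS_SDA_BYPASS_RE in the script.
NFS_SDA_BYPASS_RE = re.compile(r"^/srv/nfs/immich/")


def bypass_list(change_log_lines, prefix="/srv/nfs/"):
    """De-duplicate, keep bypass-leg paths, and make them relative to prefix."""
    unique = sorted({line.strip() for line in change_log_lines if line.strip()})
    return [p[len(prefix):] for p in unique if NFS_SDA_BYPASS_RE.search(p)]


# Hypothetical change-log content.
changes = [
    "/srv/nfs/immich/library/a.jpg",
    "/srv/nfs/immich/library/a.jpg",   # duplicate event
    "/srv/nfs/immich-old/x.jpg",       # similar name, different top-level dir
    "/srv/nfs/ollama/model.bin",       # now reaches Synology via sda
    "/srv/nfs-ssd/immich/b.jpg",       # handled by the nfs-ssd leg
]
print(bypass_list(changes))
```

The trailing slash in the pattern is what keeps a sibling directory such as `immich-old` out of the list.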

scripts/provision-k8s-worker Executable file

@ -0,0 +1,109 @@
#!/usr/bin/env bash
# provision-k8s-worker NAME VMID IP   (bare IPv4 address; /22 is appended below)
#
# Clone PVE template 2000 (ubuntu-2404-cloudinit-k8s-template) into a new
# VM, configure resources to match k8s-node3/4 (32G RAM, 8 vCPU, host CPU,
# 256G disk, VLAN 20 on vmbr1), attach the shared cicustom snippet
# (/var/lib/vz/snippets/k8s_cloud_init.yaml), and start it. Cloud-init
# inside the VM installs containerd + kubelet, applies the bundled
# setup script, and runs the kubeadm join. No manual steps after this.
#
# Hostname comes from the per-node meta-data snippet written below
# (local-hostname: $NAME). DO NOT hard-code it in the shared user-data
# snippet.
#
# Safe to re-run: aborts if VMID already exists or IP already answers ping.
#
# Usage:
# ssh root@192.168.1.127 bash -s -- k8s-node6 206 10.0.20.106 < provision-k8s-worker
# or, if the script lives on the PVE host:
# provision-k8s-worker k8s-node6 206 10.0.20.106
#
# Run on the PVE host (needs qm + /var/lib/vz/snippets access).
set -euo pipefail
if [ $# -ne 3 ]; then
echo "usage: $0 NAME VMID IP" >&2
echo " e.g. $0 k8s-node6 206 10.0.20.106" >&2
exit 2
fi
NAME=$1
VMID=$2
IP=$3
CIDR_IP="${IP}/22"
GW="10.0.20.1"
DNS="10.0.20.201"
SEARCH="viktorbarzin.lan"
TEMPLATE_ID=2000
STORAGE="local-lvm"
USER_SNIPPET="local:snippets/k8s_cloud_init.yaml"
# Per-node meta-data snippet — written below — supplies local-hostname.
# Proxmox's auto-generated metadata DOESN'T include hostname when
# cicustom user=… is set, so the shared user-data snippet alone leaves
# nodes joining as "ubuntu" (image default). Per-node meta-data is the
# clean fix.
META_SNIPPET_FILE="/var/lib/vz/snippets/${NAME}-meta.yaml"
META_SNIPPET="local:snippets/${NAME}-meta.yaml"
BRIDGE="vmbr1"
VLAN=20
# Sanity: VMID must be free
if qm status "$VMID" >/dev/null 2>&1; then
echo "ERROR: VM $VMID already exists. Refusing to clobber." >&2
qm status "$VMID" >&2
exit 1
fi
# Sanity: IP must not be pingable
if ping -c 1 -W 1 "$IP" >/dev/null 2>&1; then
echo "ERROR: $IP is already responding to ping. Refusing to assign." >&2
exit 1
fi
# Sanity: snippet must exist
if [ ! -f "/var/lib/vz/snippets/k8s_cloud_init.yaml" ]; then
echo "ERROR: /var/lib/vz/snippets/k8s_cloud_init.yaml missing." >&2
echo " Run 'tg apply' in infra/stacks/infra/ to regenerate it." >&2
exit 1
fi
# Sanity: template must be a template
if ! qm config "$TEMPLATE_ID" | grep -q '^template: 1'; then
echo "ERROR: VMID $TEMPLATE_ID is not a template." >&2
exit 1
fi
echo "[1/6] write per-node meta-data snippet ($META_SNIPPET_FILE)"
cat > "$META_SNIPPET_FILE" <<META
local-hostname: $NAME
instance-id: $NAME-$(date +%s)
META
echo "[2/6] qm clone $TEMPLATE_ID -> $VMID ($NAME)"
qm clone "$TEMPLATE_ID" "$VMID" --name "$NAME" --full true --storage "$STORAGE"
echo "[3/6] qm set $VMID — VM resources + network + cicustom"
qm set "$VMID" \
--agent 1 \
--balloon 32768 \
--cores 8 \
--cpu host \
--memory 32768 \
--net0 "virtio,bridge=$BRIDGE,tag=$VLAN" \
--ipconfig0 "ip=$CIDR_IP,gw=$GW" \
--nameserver "$DNS" \
--searchdomain "$SEARCH" \
--onboot 1 \
--startup 'order=5,up=45,down=420' \
--cicustom "user=$USER_SNIPPET,meta=$META_SNIPPET"
echo "[4/6] qm resize $VMID scsi0 256G"
qm resize "$VMID" scsi0 256G
echo "[5/6] qm start $VMID"
qm start "$VMID"
echo "[6/6] Done. Cloud-init runs now; node should appear in 'kubectl get nodes' within ~6-10 min."
echo " Tail cloud-init: socat -u UNIX-CONNECT:/var/run/qemu-server/$VMID.serial0 STDOUT | strings"
echo " Final config:"
qm config "$VMID" | grep -E '^(name|cores|memory|net0|ipconfig0|cicustom|scsi0|onboot):'
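
The script builds `--ipconfig0` by appending `/22` to whatever address it is given, so passing an address that already carries a prefix would produce an invalid value. The sketch below shows one way to accept both forms and to check that the gateway sits in the resulting subnet. The `ipconfig0` helper is hypothetical; the gateway and default prefix are the constants from the script.

```python
import ipaddress


def ipconfig0(ip, gateway="10.0.20.1", default_prefix=22):
    """Build a Proxmox-style ipconfig0 value from 'IP' or 'IP/PREFIX'."""
    if "/" not in ip:
        ip = f"{ip}/{default_prefix}"
    iface = ipaddress.ip_interface(ip)  # raises ValueError on bad input
    gw = ipaddress.ip_address(gateway)
    if gw not in iface.network:
        raise ValueError(f"gateway {gw} is outside {iface.network}")
    return f"ip={iface.with_prefixlen},gw={gw}"


print(ipconfig0("10.0.20.106"))
print(ipconfig0("10.0.20.106/24"))
```

Rejecting an address whose subnet does not contain the gateway catches typos before a VM is cloned and started.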


@ -336,7 +336,11 @@ resource "kubernetes_deployment" "workbench" {
spec {
init_container {
name = "seed-config"
- image = "dolthub/dolt-workbench:latest"
+ # Pinned 2026-05-26: Keel rolled :latest → :0.1.0 on 2026-05-17,
+ # which speaks an old GraphQL schema (missing `type` arg on
+ # addDatabaseConnection), so seed-config fails and the UI can't add
+ # the connection. :0.3.73 was the last Keel-resolved good tag.
+ image = "dolthub/dolt-workbench:0.3.73"
command = ["sh", "-c", <<-EOT
# Seed connection store
cp /config/store.json /store/store.json
@ -365,7 +369,11 @@ resource "kubernetes_deployment" "workbench" {
container {
name = "workbench"
- image = "dolthub/dolt-workbench:latest"
+ # Pinned 2026-05-26: Keel rolled :latest → :0.1.0 on 2026-05-17,
+ # which speaks an old GraphQL schema (missing `type` arg on
+ # addDatabaseConnection), so seed-config fails and the UI can't add
+ # the connection. :0.3.73 was the last Keel-resolved good tag.
+ image = "dolthub/dolt-workbench:0.3.73"
command = ["sh", "-c", <<-EOT
# Patch GraphQL server to listen on 0.0.0.0 (IPv4); Node 18+ defaults to IPv6
sed -i 's|app.listen(9002)|app.listen(9002,"0.0.0.0")|g' /app/graphql-server/dist/main.js


@ -1088,6 +1088,7 @@ resource "null_resource" "pg_cluster" {
storage_class = "proxmox-lvm-encrypted"
memory_limit = "3Gi"
pg_params = "v3-shared1024-walcomp-workmem16-max200"
+ affinity = "required-hostname-v1"
}
provisioner "local-exec" {
@ -1106,6 +1107,15 @@ resource "null_resource" "pg_cluster" {
# during a long WAL backlog the failover would stall the drain.
# Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
instances: 3
+ # Hard anti-affinity: force one PG instance per node. Default is
+ # `preferred` which let all 3 pods collapse onto k8s-node1 during
+ # the 2026-05-26 node4 outage; losing node1 would have killed the
+ # whole cluster (no quorum). With 3 instances + 4 worker nodes,
+ # `required` is safe under 1-node drain.
+ affinity:
+   enablePodAntiAffinity: true
+   podAntiAffinityType: required
+   topologyKey: kubernetes.io/hostname
imageName: ghcr.io/cloudnative-pg/postgis:16
postgresql:
parameters:


@ -24,6 +24,22 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [
@ -71,3 +87,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}


@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "excalidraw"
}
}


@ -27,33 +27,14 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
name = "excalidraw-data-proxmox"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
module "nfs_data_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "excalidraw-data-host"
namespace = kubernetes_namespace.excalidraw.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/excalidraw"
storage = "1Gi"
access_modes = ["ReadWriteOnce"]
}
resource "kubernetes_deployment" "excalidraw" {
@ -118,7 +99,7 @@ resource "kubernetes_deployment" "excalidraw" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}


@ -9,6 +9,21 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -31,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}


@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"


@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "f1-stream"
}
}


@ -381,7 +381,15 @@ class PlaybackVerifier:
return PlaybackVerdict(is_playable=False, error="playwright unavailable")
is_m3u8 = stream_type == "m3u8"
if not is_m3u8:
if is_m3u8:
# Route m3u8 fetches through our own /proxy so the verifier gets a
# same-origin response with ACAO:* — matches what the frontend does
# (frontend `getProxyUrl` wraps every m3u8 via /proxy anyway). Without
# this, hosts like oe1.ossfeed.store that only return CORS headers
# for specific Origins (e.g. pushembdz.store) trigger an immediate
# `fatal_network_error` in hls.js and the stream is marked dead.
url = f"{PROXY_BASE}/proxy?url={_b64url(url)}"
else:
url = f"{PROXY_BASE}/embed?url={_b64url(url)}"
async with self._sem:
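The routing rule in this hunk can be sketched in isolation. This is a minimal sketch, not the project's code: `b64url` is a hypothetical stand-in for the `_b64url` helper (assumed here to be unpadded URL-safe base64), `verifier_url` is an invented name, and the `PROXY_BASE` value is a placeholder.

```python
import base64

# Placeholder value; the real PROXY_BASE is not shown in this diff.
PROXY_BASE = "http://localhost:8000"


def b64url(raw: str) -> str:
    """Hypothetical stand-in for _b64url: URL-safe base64 without padding."""
    encoded = base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")


def verifier_url(url: str, stream_type: str) -> str:
    """Pick the same-origin endpoint the verifier should load."""
    # m3u8 playlists go through /proxy so hls.js sees a same-origin response;
    # every other stream type is loaded through the /embed wrapper.
    endpoint = "proxy" if stream_type == "m3u8" else "embed"
    return f"{PROXY_BASE}/{endpoint}?url={b64url(url)}"
```

Keeping the endpoint choice in one place means the verifier and the frontend cannot drift apart on which stream types are proxied, which is the failure mode the comment describes.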


@ -78,33 +78,14 @@ resource "kubernetes_manifest" "chrome_service_client_secret" {
depends_on = [kubernetes_namespace.f1-stream]
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
name = "f1-stream-data-proxmox"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
module "nfs_data_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "f1-stream-data-host"
namespace = kubernetes_namespace.f1-stream.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/f1-stream"
storage = "1Gi"
access_modes = ["ReadWriteOnce"]
}
resource "kubernetes_deployment" "f1-stream" {
@ -196,7 +177,7 @@ resource "kubernetes_deployment" "f1-stream" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}


@ -13,6 +13,13 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
}
}
@ -35,3 +42,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}


@ -52,6 +52,20 @@ provider "registry.terraform.io/goauthentik/authentik" {
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
@ -79,6 +93,18 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}


@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ZCcWMOLCTqb0aV-XyTAZ@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "forgejo"
}
}


@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
@ -79,3 +87,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}


@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "freedify"
}
}


@ -13,6 +13,17 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -35,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}


@ -29,21 +29,6 @@ provider "registry.terraform.io/gavinbunney/kubectl" {
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
"zh:1dec8766336ac5b00b3d8f62e3fff6390f5f60699c9299920fc9861a76f00c71",
"zh:43f101b56b58d7fead6a511728b4e09f7c41dc2e3963f59cf1c146c4767c6cb7",
"zh:4c4fbaa44f60e722f25cc05ee11dfaec282893c5c0ffa27bc88c382dbfbaa35c",
"zh:51dd23238b7b677b8a1abbfcc7deec53ffa5ec79e58e3b54d6be334d3d01bc0e",
"zh:5afc2ebc75b9d708730dbabdc8f94dd559d7f2fc5a31c5101358bd8d016916ba",
"zh:6be6e72d4663776390a82a37e34f7359f726d0120df622f4a2b46619338a168e",
"zh:72642d5fcf1e3febb6e5d4ae7b592bb9ff3cb220af041dbda893588e4bf30c0c",
"zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
"zh:a1da03e3239867b35812ee031a1060fed6e8d8e458e2eaca48b5dd51b35f56f7",
"zh:b98b6a6728fe277fcd133bdfa7237bd733eae233f09653523f14460f608f8ba2",
"zh:bb8b071d0437f4767695c6158a3cb70df9f52e377c67019971d888b99147511f",
"zh:dc89ce4b63bfef708ec29c17e85ad0232a1794336dc54dd88c3ba0b77e764f71",
"zh:dd7dd18f1f8218c6cd19592288fde32dccc743cde05b9feeb2883f37c2ff4b4e",
"zh:ec4bd5ab3872dedb39fe528319b4bba609306e12ee90971495f109e142d66310",
"zh:f610ead42f724c82f5463e0e71fa735a11ffb6101880665d93f48b4a67b9ad82",
]
}
@ -52,39 +37,13 @@ provider "registry.terraform.io/goauthentik/authentik" {
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
version = "3.1.2"
hashes = [
"h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
"zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
"zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
"zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
"zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
"zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
"zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
"zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
"zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
"zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
"zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
"zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"h1:lIuknMfM7+QTzPWs8VBocstZF0B3TpEMIj/bw+dLAOs=",
]
}
@ -92,18 +51,6 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}
@ -126,3 +73,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}


@ -157,7 +157,8 @@ resource "kubernetes_namespace" "immich" {
# Override the kyverno-generated tier-2-gpu quota (12Gi requests.memory).
# Immich-server needs 8Gi to absorb face-detection burst spikes (OOM 2026-04-26)
# without OOM. Plus immich-machine-learning (3.5Gi) + immich-postgresql (3Gi) +
# backup CronJobs ≈ 15.5Gi. 20Gi gives ~4.5Gi headroom.
# backup CronJobs ≈ 15.5Gi. 24Gi gives ~8Gi headroom (raised 2026-05-26; was at
# 88% with VPA bumps creeping up on immich-server burst behaviour).
resource "kubernetes_resource_quota" "immich" {
metadata {
name = "tier-quota"
@ -166,8 +167,8 @@ resource "kubernetes_resource_quota" "immich" {
spec {
hard = {
"requests.cpu" = "8"
"requests.memory" = "20Gi"
"limits.memory" = "32Gi"
"requests.memory" = "24Gi"
"limits.memory" = "40Gi"
pods = "40"
}
}
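The headroom figures in the quota comment can be reproduced with a short calculation. This is a sketch under stated assumptions: the per-workload requests are the ones named in the comment, and the 1Gi share for backup CronJobs is inferred from the stated 15.5Gi total rather than given in the diff. `headroom_gi` is a hypothetical helper.

```python
# Memory requests in Gi, taken from the quota comment.
# The backup-cronjobs value is an assumption derived from the 15.5Gi total.
REQUESTS_GI = {
    "immich-server": 8.0,
    "immich-machine-learning": 3.5,
    "immich-postgresql": 3.0,
    "backup-cronjobs": 1.0,
}


def headroom_gi(quota_gi: float, requests: dict[str, float]) -> float:
    """Return the unallocated part of a requests.memory quota, in Gi."""
    return quota_gi - sum(requests.values())


old_headroom = headroom_gi(20, REQUESTS_GI)  # previous quota
new_headroom = headroom_gi(24, REQUESTS_GI)  # quota after this change
```

If the 5Gi request set for immich-postgres later in this diff counts against the same quota, the total would rise by 2Gi and the headroom under 24Gi would shrink accordingly.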
@ -321,7 +322,12 @@ resource "kubernetes_deployment" "immich_server" {
}
period_seconds = 10
timeout_seconds = 1
failure_threshold = 30
# Bumped 30 → 360 (5min → 1h): after a PG restart, immich-server
# reindexes the clip_index + face_index vector tables before binding
# the API port. Hundreds of thousands of rows take longer than 5min
# on a cold cache, so the old threshold trapped us in a startup
# crashloop after every PG restart (2026-05-24 incident).
failure_threshold = 360
success_threshold = 1
}
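The durations quoted in the probe comment follow from multiplying the probe period by the failure threshold. A minimal sketch of that arithmetic, assuming no initial delay and ignoring the per-attempt timeout; `probe_budget_seconds` is a hypothetical helper, not a Kubernetes API:

```python
def probe_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a container gets before the probe gives up.

    The kubelet probes every `period_seconds` and acts only after
    `failure_threshold` consecutive failures.
    """
    return period_seconds * failure_threshold


old_budget = probe_budget_seconds(period_seconds=10, failure_threshold=30)
new_budget = probe_budget_seconds(period_seconds=10, failure_threshold=360)
```

With `period_seconds = 10`, the old threshold of 30 allows about 300 seconds (5 minutes) and the new threshold of 360 allows about 3600 seconds (1 hour), matching the comment.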
@ -526,10 +532,10 @@ resource "kubernetes_deployment" "immich-postgres" {
resources {
requests = {
cpu = "100m"
memory = "3Gi"
memory = "5Gi"
}
limits = {
memory = "3Gi"
memory = "5Gi"
}
}
}
@ -906,7 +912,7 @@ resource "kubernetes_job_v1" "anca_elements_import" {
wait_for_completion = false
spec {
backoff_limit = 2
backoff_limit = 20
ttl_seconds_after_finished = 604800
template {
metadata {
@ -948,7 +954,7 @@ resource "kubernetes_job_v1" "anca_elements_import" {
--ban-file "csp/" --ban-file "KOREAN/" \
--ban-file "System Volume Information/" \
--pause-immich-jobs=false \
--concurrent-tasks 8 \
--concurrent-tasks 20 \
--client-timeout 1h \
--no-ui \
--on-errors continue


@ -20,6 +20,10 @@ terraform {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}


@ -1,10 +1,117 @@
# This file is maintained automatically by "terraform init".
# Manual edits may be lost in future updates.
provider "registry.terraform.io/cloudflare/cloudflare" {
version = "4.52.7"
constraints = "~> 4.0"
hashes = [
"h1:pPItIWii5oymR+geZB219ROSPuSODPLTlM4S/u8xLvM=",
"zh:0c904ce31a4c6c4a5b3bf7ff1560e77c0cc7e2450c8553ded8e8c90398e1418b",
"zh:36183d310c36373fe4cb936b83c595c6fd3b0a94bc7827f28e5789ccbf59752e",
"zh:556a568a6f0235e8f41647de9e4d3a1e7b1d6502df8b19b54ec441f1c653ea10",
"zh:633ebbd5b0245e75e500ef9be4d9e62288f97e8da3baaa51323892a786d90285",
"zh:6acfe60cf52a65ba8f044f748548d2119e7f4fd7f8ebcb14698960d87c68f529",
"zh:890df766e9b839623b1f0437355032a3c006226a6c200cd911e15ee1a9014e9f",
"zh:904acc31ebb9d6ef68c792074b30532ee61bf515f19e0a3c75b46f126cca1f13",
"zh:a1d0a81246afc8750286d3f6fe7a8fbe6460dd2662407b28dbfbabb612e5fa9d",
"zh:a41a36fe253fc365fe2b7ffc749624688b2693b4634862fda161179ab100029f",
"zh:a7ef269e77ffa8715c8945a2c14322c7ff159ea44c15f62505f3cbb2cae3b32d",
"zh:b01aa3bed30610633b762df64332b26f8844a68c3960cebcb30f04918efc67fe",
"zh:b069cc2cd18cae10757df3ae030508eac8d55de7e49eda7a5e3e11f2f7fe6455",
"zh:b2d2c6313729ebb7465dceece374049e2d08bda34473901be9ff46a8836d42b2",
"zh:db0e114edaf4bc2f3d4769958807c83022bfbc619a00bdf4c4bd17faa4ab2d8b",
"zh:ecc0aa8b9044f664fd2aaf8fa992d976578f78478980555b4b8f6148e8d1a5fe",
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
"zh:1dec8766336ac5b00b3d8f62e3fff6390f5f60699c9299920fc9861a76f00c71",
"zh:43f101b56b58d7fead6a511728b4e09f7c41dc2e3963f59cf1c146c4767c6cb7",
"zh:4c4fbaa44f60e722f25cc05ee11dfaec282893c5c0ffa27bc88c382dbfbaa35c",
"zh:51dd23238b7b677b8a1abbfcc7deec53ffa5ec79e58e3b54d6be334d3d01bc0e",
"zh:5afc2ebc75b9d708730dbabdc8f94dd559d7f2fc5a31c5101358bd8d016916ba",
"zh:6be6e72d4663776390a82a37e34f7359f726d0120df622f4a2b46619338a168e",
"zh:72642d5fcf1e3febb6e5d4ae7b592bb9ff3cb220af041dbda893588e4bf30c0c",
"zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
"zh:a1da03e3239867b35812ee031a1060fed6e8d8e458e2eaca48b5dd51b35f56f7",
"zh:b98b6a6728fe277fcd133bdfa7237bd733eae233f09653523f14460f608f8ba2",
"zh:bb8b071d0437f4767695c6158a3cb70df9f52e377c67019971d888b99147511f",
"zh:dc89ce4b63bfef708ec29c17e85ad0232a1794336dc54dd88c3ba0b77e764f71",
"zh:dd7dd18f1f8218c6cd19592288fde32dccc743cde05b9feeb2883f37c2ff4b4e",
"zh:ec4bd5ab3872dedb39fe528319b4bba609306e12ee90971495f109e142d66310",
"zh:f610ead42f724c82f5463e0e71fa735a11ffb6101880665d93f48b4a67b9ad82",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.2"
hashes = [
"h1:lIuknMfM7+QTzPWs8VBocstZF0B3TpEMIj/bw+dLAOs=",
"zh:1086b24b20d94afc331eb38c52b70848899fd0efaed46d9f4646180b96e9dffd",
"zh:28bebd04f8d0c44291dc961597c89de5be1e62153191b8b466dbbfb254c696aa",
"zh:49a7dd287c2c80621ba0c25834b1afac88c45d47ad3a24cd0aed634d78b1bbd4",
"zh:574e146b128be51cd4d9ee66cb8352eac82c7e3be2dbf53a51516ca701bb8b7c",
"zh:68285c8987affaa635c9590a0cefe238ba277e12532b64cb2d7ffec570ade064",
"zh:6ce12b5eb8f1d9aa61c4d336905e0186f9ea82c8767169533be5b206e4bd33f4",
"zh:83b7743951c989732f191cb429549296bca6faecffed492094bef92bec5c9dcb",
"zh:84fe2d11907b4e9d0c536d8b50bb63ad4056f60a73c4b734d5de7435784e53a7",
"zh:c8a25498bfbde4916f178d6880d9ee56ed9ceb88bef4842cd47360faadbb3dfb",
"zh:dfad553c09b36a7df68c3622c78b835669e69aaf954735802e85375a8df01dff",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:fd6f36da732f442e421d2b90ed3925a1c9ad0992c380a61fe7681d90b34aa5f3",
]
}
provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}
provider "registry.terraform.io/hashicorp/null" {
version = "3.2.4"
hashes = [
"h1:L5V05xwp/Gto1leRryuesxjMfgZwjb7oool4WS1UEFQ=",
"h1:hkf5w5B6q8e2A42ND2CjAvgvSN3puAosDmOJb3zCVQM=",
"zh:59f6b52ab4ff35739647f9509ee6d93d7c032985d9f8c6237d1f8a59471bbbe2",
"zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",
"zh:795c897119ff082133150121d39ff26cb5f89a730a2c8c26f3a9c1abf81a9c43",
@ -25,6 +132,7 @@ provider "registry.terraform.io/hashicorp/vault" {
constraints = "~> 4.0"
hashes = [
"h1:GPfhH6dr1LY0foPBDYv9bEGifx7eSwYqFcEAOWOUxLk=",
"h1:aHqgWQhDBMeZO9iUKwJYMlh4q+xNMUlMIcjRbF4d02Y=",
"zh:269ab13433f67684012ae7e15876532b0312f5d0d2002a9cf9febb1279ce5ea6",
"zh:4babc95bf0c40eb85005db1dc2ca403c46be4a71dd3e409db3711a56f7a5ca0e",
"zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",
@ -45,6 +153,7 @@ provider "registry.terraform.io/telmate/proxmox" {
constraints = "3.0.2-rc07"
hashes = [
"h1:0UpRJ8PFsu9lhD3p2KUdUNVsDPbjZLPR46wYRpt1dxc=",
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
"zh:2ee860cd0a368b3eaa53f4a9ea46f16dab8a97929e813ea6ef55183f8112c2ca",
"zh:415965fd915bae2040d7f79e45f64d6e3ae61149c10114efeac1b34687d7296c",
"zh:6584b2055df0e32062561c615e3b6b2c291ca8c959440adda09ef3ec1e1436bd",


@ -1,6 +1,6 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "local" {
path = "/Users/viktorbarzin/code/infra/state/stacks/infra/terraform.tfstate"
path = "/home/wizard/code/infra/state/stacks/infra/terraform.tfstate"
}
}


@ -10,8 +10,9 @@
variable "proxmox_host" { type = string }
variable "ssh_public_key" {
type = string
default = ""
type = string
default = ""
description = "DEPRECATED: was a tfvars input. Now read from Vault secret/viktor.ssh_public_key directly (see locals.k8s_ssh_public_key) so no apply-time argument can leave the snippet's authorized_keys empty."
}
variable "k8s_join_command" { type = string }
@ -40,6 +41,12 @@ locals {
non_k8s_cloud_init_image_path = "/var/lib/vz/template/iso/noble-server-cloudimg-amd64-non-k8s.img"
cloud_init_image_url = "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"
# Source of truth for the wizard user's SSH key on every cloud-init
# generated VM. Lives in Vault so we never apply with an empty value
# (which silently locked the wizard account on the node5 v1 boot
# 2026-05-26). Falls back to var.ssh_public_key for backward compat.
k8s_ssh_public_key = try(data.vault_kv_secret_v2.viktor.data["ssh_public_key"], var.ssh_public_key)
}
# ---------------------------------------------------------------------------
@ -52,7 +59,7 @@ module "k8s-node-template" {
proxmox_user = "root" # SSH user on Proxmox host
ssh_private_key = data.vault_kv_secret_v2.secrets.data["ssh_private_key"]
ssh_public_key = var.ssh_public_key
ssh_public_key = local.k8s_ssh_public_key
cloud_image_url = local.cloud_init_image_url
image_path = local.k8s_cloud_init_image_path
@ -62,163 +69,10 @@ module "k8s-node-template" {
is_k8s_template = true # provision cloud init file with k8s deps
snippet_name = local.k8s_cloud_init_snippet_name
# Add mirror registry
containerd_config_update_command = <<-EOF
# Set up config_path for per-registry mirror configuration
sed -i 's|config_path = ""|config_path = "/etc/containerd/certs.d"|' /etc/containerd/config.toml
# Create hosts.toml for docker.io (Docker Hub): high traffic, rate-limited
mkdir -p /etc/containerd/certs.d/docker.io
printf 'server = "https://registry-1.docker.io"\n\n[host."http://10.0.20.10:5000"]\n capabilities = ["pull", "resolve"]\n\n[host."https://registry-1.docker.io"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/docker.io/hosts.toml
# Create hosts.toml for ghcr.io: medium traffic
mkdir -p /etc/containerd/certs.d/ghcr.io
printf 'server = "https://ghcr.io"\n\n[host."http://10.0.20.10:5010"]\n capabilities = ["pull", "resolve"]\n\n[host."https://ghcr.io"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/ghcr.io/hosts.toml
# Forgejo OCI registry: redirect to in-cluster Traefik LB (10.0.20.200) so
# pulls don't hairpin out through the WAN gateway. Traefik serves the
# *.viktorbarzin.me wildcard so SNI verification still passes.
# registry.viktorbarzin.me / 10.0.20.10:5050 entries removed in Phase 4 of
# the forgejo-registry-consolidation 2026-05-07; registry-private is gone.
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me
printf 'server = "https://forgejo.viktorbarzin.me"\n\n[host."https://10.0.20.200"]\n capabilities = ["pull", "resolve"]\n' > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml
# Low-traffic registries (registry.k8s.io, quay.io, reg.kyverno.io) pull directly.
# Pull-through cache removed: caused corrupted images (truncated downloads)
# breaking VPA certgen and Kyverno image pulls.
sed -i 's/.*max_concurrent_downloads = 3/max_concurrent_downloads = 20/g' /etc/containerd/config.toml # Enable multiple concurrent downloads
# Configure aggressive garbage collection to prevent disk space exhaustion (node2 incident prevention)
# Set up containerd GC for unused images and containers
cat >> /etc/containerd/config.toml << 'CONTAINERD_GC'
[plugins."io.containerd.gc.v1.scheduler"]
# Run GC every 30 minutes instead of default 1 hour
pause_threshold = 0.02
deletion_threshold = 0
mutation_threshold = 100
schedule_delay = "1800s" # 30 minutes
[plugins."io.containerd.runtime.v2.task"]
# More aggressive container cleanup
exit_timeout = "5m"
[plugins."io.containerd.metadata.v1.bolt"]
# Compact database more frequently
compact_threshold = 5242880 # 5MB instead of default 100MB
CONTAINERD_GC
sudo sed -i '/serializeImagePulls:/d' /var/lib/kubelet/config.yaml && \
sudo sed -i '/maxParallelImagePulls:/d' /var/lib/kubelet/config.yaml && \
echo -e 'serializeImagePulls: false\nmaxParallelImagePulls: 50' | sudo tee -a /var/lib/kubelet/config.yaml
# Memory and disk reservation and eviction: prevent node OOM/disk full
# Aggressive disk eviction settings added after node2 containerd corruption incident (2026-03-13)
# These settings prevent disk space exhaustion that can corrupt containerd image store
sudo sed -i '/systemReserved:/d; /kubeReserved:/d; /evictionHard:/,/^[^ ]/{ /evictionHard:/d; /^ /d }; /evictionSoft:/,/^[^ ]/{ /evictionSoft:/d; /^ /d }; /evictionSoftGracePeriod:/,/^[^ ]/{ /evictionSoftGracePeriod:/d; /^ /d }' /var/lib/kubelet/config.yaml
cat <<'KUBELET_PATCH' | sudo tee -a /var/lib/kubelet/config.yaml
systemReserved:
memory: "512Mi"
cpu: "200m"
kubeReserved:
memory: "512Mi"
cpu: "200m"
evictionHard:
memory.available: "500Mi"
nodefs.available: "15%" # More aggressive: evict at 15% free (was 10%)
imagefs.available: "20%" # Much more aggressive: evict at 20% free to prevent containerd corruption
evictionSoft:
memory.available: "1Gi"
nodefs.available: "20%" # Start warnings at 20% free
imagefs.available: "25%" # Start warnings at 25% free for containerd safety
evictionSoftGracePeriod:
memory.available: "30s"
nodefs.available: "60s" # Grace period for disk space warnings
imagefs.available: "30s" # Shorter grace for critical containerd space
memorySwap:
swapBehavior: "LimitedSwap"
KUBELET_PATCH
# Remove old 2-bucket shutdown config if present (replaced by priority-based)
sudo sed -i '/^shutdownGracePeriod:/d; /^shutdownGracePeriodCriticalPods:/d' /var/lib/kubelet/config.yaml
# Remove old shutdownGracePeriodByPodPriority block if present (idempotent re-apply)
sudo python3 -c "
import yaml, sys
with open('/var/lib/kubelet/config.yaml') as f:
cfg = yaml.safe_load(f)
cfg.pop('shutdownGracePeriod', None)
cfg.pop('shutdownGracePeriodCriticalPods', None)
cfg.pop('shutdownGracePeriodByPodPriority', None)
# Container log rotation limits: reduces root disk writes (~20-30 GB/day savings)
cfg['containerLogMaxSize'] = '10Mi'
cfg['containerLogMaxFiles'] = 3
cfg['shutdownGracePeriodByPodPriority'] = [
{'priority': 0, 'shutdownGracePeriodSeconds': 20},
{'priority': 200000, 'shutdownGracePeriodSeconds': 20},
{'priority': 400000, 'shutdownGracePeriodSeconds': 30},
{'priority': 600000, 'shutdownGracePeriodSeconds': 30},
{'priority': 800000, 'shutdownGracePeriodSeconds': 90},
{'priority': 1000000, 'shutdownGracePeriodSeconds': 30},
{'priority': 1200000, 'shutdownGracePeriodSeconds': 30},
{'priority': 2000000000, 'shutdownGracePeriodSeconds': 30},
{'priority': 2000001000, 'shutdownGracePeriodSeconds': 30},
]
with open('/var/lib/kubelet/config.yaml', 'w') as f:
yaml.dump(cfg, f, default_flow_style=False)
"
# Systemd: increase InhibitDelayMaxSec so logind doesn't force-kill before kubelet finishes graceful shutdown
# Total kubelet shutdown time: 310s. InhibitDelay must exceed this.
mkdir -p /etc/systemd/logind.conf.d
cat <<'LOGIND_CONF' | sudo tee /etc/systemd/logind.conf.d/kubelet-shutdown.conf
[Login]
InhibitDelayMaxSec=480
LOGIND_CONF
sudo systemctl restart systemd-logind
# Systemd: increase kubelet stop timeout to match total shutdown grace period (310s + buffer)
mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'KUBELET_SHUTDOWN' | sudo tee /etc/systemd/system/kubelet.service.d/20-shutdown.conf
[Service]
TimeoutStopSec=420s
KUBELET_SHUTDOWN
sudo systemctl daemon-reload
# Tune controller-manager + apiserver for faster volume detach on node failure
# Only on master node (has static pod manifests)
if [ -f /etc/kubernetes/manifests/kube-controller-manager.yaml ]; then
sudo python3 -c "
import yaml
# Controller-manager: faster attach-detach reconciliation (15s vs 1m default)
with open('/etc/kubernetes/manifests/kube-controller-manager.yaml') as f:
    m = yaml.safe_load(f)
args = m['spec']['containers'][0]['command']
for flag in ['--attach-detach-reconcile-sync-period=15s']:
    key = flag.split('=')[0]
    args = [a for a in args if not a.startswith(key)]
    args.append(flag)
m['spec']['containers'][0]['command'] = args
with open('/etc/kubernetes/manifests/kube-controller-manager.yaml', 'w') as f:
    yaml.dump(m, f, default_flow_style=False)
print('controller-manager: attach-detach-reconcile-sync-period=15s')
"
sudo python3 -c "
import yaml
# API server: faster pod eviction from unreachable nodes (60s vs 300s default)
with open('/etc/kubernetes/manifests/kube-apiserver.yaml') as f:
    m = yaml.safe_load(f)
args = m['spec']['containers'][0]['command']
for flag in ['--default-unreachable-toleration-seconds=60', '--default-not-ready-toleration-seconds=60']:
    key = flag.split('=')[0]
    args = [a for a in args if not a.startswith(key)]
    args.append(flag)
m['spec']['containers'][0]['command'] = args
with open('/etc/kubernetes/manifests/kube-apiserver.yaml', 'w') as f:
    yaml.dump(m, f, default_flow_style=False)
print('apiserver: unreachable+not-ready toleration=60s')
"
fi
EOF
# containerd setup script now bundled in the module
# (k8s-node-containerd-setup.sh); the deprecated variable is
# ignored when is_k8s_template=true.
containerd_config_update_command = ""
k8s_join_command = var.k8s_join_command
}
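The shutdown timing in the bootstrap script above depends on three numbers staying in order: the sum of the per-priority grace periods (310s), the kubelet unit's `TimeoutStopSec` (420s), and logind's `InhibitDelayMaxSec` (480s). A minimal sketch of that sanity check follows, assuming the kubelet works through the priority tiers one after another so the per-tier periods add up. The function names are hypothetical and not part of the repo.

```python
# Grace periods copied from the shutdownGracePeriodByPodPriority list above.
GRACE_SECONDS = [20, 20, 30, 30, 90, 30, 30, 30, 30]
KUBELET_TIMEOUT_STOP_SEC = 420      # TimeoutStopSec in 20-shutdown.conf
LOGIND_INHIBIT_DELAY_MAX_SEC = 480  # InhibitDelayMaxSec in kubelet-shutdown.conf


def shutdown_budget(grace_seconds):
    """Total seconds the kubelet may spend draining pods across all tiers."""
    return sum(grace_seconds)


def budget_is_consistent(grace_seconds, stop_timeout, inhibit_delay):
    """True when systemd and logind both wait longer than the kubelet needs."""
    total = shutdown_budget(grace_seconds)
    return total < stop_timeout <= inhibit_delay


if __name__ == "__main__":
    print("kubelet shutdown budget:", shutdown_budget(GRACE_SECONDS), "s")
    print("timeouts consistent:", budget_is_consistent(
        GRACE_SECONDS, KUBELET_TIMEOUT_STOP_SEC, LOGIND_INHIBIT_DELAY_MAX_SEC))
```

If a tier is added or lengthened later, rerunning a check like this would flag the case where the sum creeps past either timeout.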
@ -395,95 +249,53 @@ UNIT
}
# ---------------------------------------------------------------------------
# Docker registry VM
# ---------------------------------------------------------------------------
module "docker-registry-vm" {
source = "../../modules/create-vm"
vmid = 220
vm_cpus = 4
vm_mem_mb = 4196
vm_disk_size = "64G"
template_name = "docker-registry-template"
vm_name = "docker-registry"
cisnippet_name = "docker-registry.yaml"
agent = 1
# Boot order: after TrueNAS (order=2), before k8s nodes (order=4)
startup_order = 3
startup_delay = 60
shutdown_timeout = 120
vm_mac_address = "DE:AD:BE:EF:22:22" # mapped to 10.0.20.10 in dhcp
bridge = "vmbr1"
vlan_tag = "20"
ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"
# Active pull-through caches (docker.io + ghcr.io only):
# 5000 -> nginx -> registry-dockerhub (docker.io proxy)
# 5001 -> registry-dockerhub direct (Prometheus metrics)
# 5010 -> nginx -> registry-ghcr (ghcr.io proxy)
# Disabled caches (low-traffic, caused corrupted images):
# 5020 -> registry-quay (quay.io) DISABLED
# 5030 -> registry-k8s (registry.k8s.io) DISABLED, broke VPA certgen
# 5040 -> registry-kyverno (reg.kyverno.io) DISABLED
# 5050 -> nginx -> registry-private (R/W registry for CI build cache)
# 8080 -> registry-ui (joxit/docker-registry-ui)
}
# ---------------------------------------------------------------------------
# K8s node VMs (imported from existing Proxmox VMs)
# ---------------------------------------------------------------------------
# ---------------------------------------------------------------------------
# K8s node VMs imported from existing Proxmox VMs
# Docker registry VM (220): INTENTIONALLY NOT MANAGED BY TERRAFORM.
#
# NOTE: Nodes with iSCSI PVC disks (201, 203, 204) cannot be imported yet
# due to telmate/proxmox provider bug: it constructs wrong volume references
# for shared iSCSI disks on update, causing API 500 errors. These nodes will
# be importable after migrating to the bpg/proxmox provider.
# Same telmate/proxmox provider defect as the K8s VMs below: the
# provider doesn't refresh `mbps_*_concurrent` fields back from live
# state, so state perma-shows 0 even when live has 40. Every plan
# then proposes to "fix" mbps from 0 → 40, and the apply errors with
# "the QEMU guest needs to be rebooted" even though the proxmox API
# call ends up being a no-op (live values already match). Pulling
# docker-registry out of TF for the same reason as the K8s VMs:
# bootstrap is reproducible via the docker-registry-template above
# + the cisnippet; VM lifecycle stays in the Proxmox UI.
#
# Pull-through cache port map (for reference; lives on the VM):
# 5000 -> nginx -> registry-dockerhub (docker.io proxy)
# 5001 -> registry-dockerhub direct (Prometheus metrics)
# 5010 -> nginx -> registry-ghcr (ghcr.io proxy)
# 5020 -> registry-quay (quay.io) DISABLED (low traffic, corrupt images)
# 5030 -> registry-k8s (registry.k8s.io) DISABLED (broke VPA certgen)
# 5040 -> registry-kyverno (reg.kyverno.io) DISABLED
# 5050 -> nginx -> registry-private (R/W cache) decom 2026-05-07
# 8080 -> registry-ui (joxit/docker-registry-ui)
# ---------------------------------------------------------------------------
module "k8s-master" {
source = "../../modules/create-vm"
vmid = 200
vm_name = "k8s-master"
vm_cpus = 8
vm_mem_mb = 32768
vm_disk_size = "64G"
balloon = 0
qemu_os = "other"
use_cloud_init = false
boot = "order=scsi0"
vm_mac_address = "00:50:56:b0:a1:39"
bridge = "vmbr1"
vlan_tag = "20"
startup_order = 4
startup_delay = 45
shutdown_timeout = 420
}
module "k8s-node2" {
source = "../../modules/create-vm"
vmid = 202
vm_name = "k8s-node2"
vm_cpus = 8
vm_mem_mb = 32768
vm_disk_size = "256G"
balloon = 0
qemu_os = "other"
use_cloud_init = false
boot = "c"
boot_disk = "scsi0"
vm_mac_address = "00:50:56:b0:a1:36"
bridge = "vmbr1"
vlan_tag = "20"
startup_order = 5
startup_delay = 45
shutdown_timeout = 420
}
# ---------------------------------------------------------------------------
# K8s node VMs: INTENTIONALLY NOT MANAGED BY TERRAFORM.
#
# The telmate/proxmox v3.0.2-rc07 provider's `disks{}` block cannot
# represent dynamically-attached disks: on every update it rewrites
# the entire disk list, and `lifecycle.ignore_changes` does NOT stop
# it. We hit this twice: id=539 (iSCSI, 2026-04-02) and the 2026-05-26
# import attempt where every `vm-9999-pvc-*` slot on k8s-node2 +
# k8s-node3 got rewritten to point at the boot disk. Recovered via the
# /mnt/backup/pve-config/etc-pve/nodes/pve/qemu-server/<vmid>.conf
# nightly backup: no reboots, no data loss, K8s CSI reconciled.
#
# Decision (2026-05-26): k8s-master (200) and k8s-node1-4 (201-204)
# stay out of TF indefinitely. Their cloud-init bootstrap IS in TF
# (via k8s-node-template + non-k8s-node-template above), so a fresh
# node still clones the template and runs the same bootstrap. The VM
# lifecycle itself (create / shutdown / config tweak) stays in the
# Proxmox UI. devvm (102), home-assistant (103), pfSense (101), and
# Windows10 (300) are also hand-managed for the same reason / out of
# scope (BSD, Windows).
#
# I/O caps for all 8 Linux VMs live in /tmp/apply-mbps-caps.sh on the
# PVE host (idempotent qm-set script; beads code-9v2j). The bpg/
# proxmox provider migration (beads code-75ds) would unblock full TF
# adoption, but it's a multi-hour project and the cloud-init coverage
# above already captures the bootstrap-reproducibility goal.
# ---------------------------------------------------------------------------

View file

@ -5,6 +5,21 @@ terraform {
source = "hashicorp/vault"
version = "~> 4.0"
}
cloudflare = {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney): workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
@ -17,18 +32,22 @@ variable "kube_config_path" {
default = "~/.kube/config"
}
variable "proxmox_pm_api_url" { type = string }
variable "proxmox_pm_api_token_id" { type = string }
variable "proxmox_pm_api_token_secret" { type = string }
provider "kubernetes" {
config_path = var.kube_config_path
}
provider "helm" {
kubernetes = {
config_path = var.kube_config_path
}
}
provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "proxmox" {
pm_api_url = var.proxmox_pm_api_url
pm_api_token_id = var.proxmox_pm_api_token_id
pm_api_token_secret = var.proxmox_pm_api_token_secret
pm_tls_insecure = true
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

View file

@ -3,42 +3,30 @@ include "root" {
path = find_in_parent_folders()
}
# Override provider generation to include proxmox + vault (k8s providers not needed)
generate "providers" {
path = "providers.tf"
if_exists = "overwrite"
# The root's `k8s_providers` generate block now declares `telmate/proxmox`
# in required_providers for every stack (harmless for non-infra stacks;
# they just don't instantiate a `provider "proxmox" {}` block).
#
# Here we add the per-stack provider config + the tfvar variable for the
# API URL. Credentials come from Vault `secret/viktor` (same pattern as
# cloudflare_provider.tf at the root). The output file name is distinct
# from `providers.tf` to avoid the same-path conflict that the old
# `generate "providers"` block silently triggered under Terragrunt v0.77.
generate "proxmox_provider" {
path = "proxmox_provider.tf"
if_exists = "overwrite_terragrunt"
contents = <<EOF
terraform {
required_providers {
vault = {
source = "hashicorp/vault"
version = "~> 4.0"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
variable "kube_config_path" {
type = string
default = "~/.kube/config"
}
variable "proxmox_pm_api_url" { type = string }
variable "proxmox_pm_api_token_id" { type = string }
variable "proxmox_pm_api_token_secret" { type = string }
provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
data "vault_kv_secret_v2" "proxmox_pm" {
mount = "secret"
name = "viktor"
}
provider "proxmox" {
pm_api_url = var.proxmox_pm_api_url
pm_api_token_id = var.proxmox_pm_api_token_id
pm_api_token_secret = var.proxmox_pm_api_token_secret
pm_api_token_id = data.vault_kv_secret_v2.proxmox_pm.data["proxmox_pm_api_token_id"]
pm_api_token_secret = data.vault_kv_secret_v2.proxmox_pm.data["proxmox_pm_api_token_secret"]
pm_tls_insecure = true
}
EOF

View file

@ -46,6 +46,16 @@ resource "helm_release" "keel" {
atomic = true
values = [yamlencode({
# EMERGENCY STOP: scaled to 0 on 2026-05-26 16:42 UTC. Keel was actively
# rewriting tag strings (not just digests) despite the
# `keel.sh/match-tag=true` annotation injected by Kyverno that's supposed
# to constrain it to digest-only watches. Known casualties this round:
# uptime-kuma (2 → 1, 4h CrashLoopBackOff), n8n (1.80.5 → 0.1.2, silent
# degradation), beads-server/dolt-workbench (0.3.73 → 0.1.0), and ~10
# other deployments with downgrade-flavored change-cause annotations.
# Re-enable only after root-causing why match-tag isn't being enforced,
# OR after migrating each app to a content-addressed (SHA) tag pin.
replicaCount = 0
# Prometheus pod-annotation scrape picks up Keel-specific metrics
# (pending_approvals, poll_trigger_tracked_images, registries_scanned_total{image,registry})
# on container port 9300 /metrics. The cluster's `kubernetes-pods`

View file

@ -925,19 +925,24 @@ resource "kubectl_manifest" "mutate_gpu_priority" {
]
}
mutate = {
# `op=add` (not replace): incoming pods often lack the
# `/spec/priorityClassName` key entirely; replace fails with
# "doc is missing key" and aborts the mutation chain BEFORE
# Layer 4 (tier injection) can fall back. add works whether
# the path exists or not. Verified 2026-05-26 on frigate.
patchesJson6902 = yamlencode([
{
op = "replace"
op = "add"
path = "/spec/priorityClassName"
value = "gpu-workload"
},
{
op = "replace"
op = "add"
path = "/spec/priority"
value = 1200000
},
{
op = "replace"
op = "add"
path = "/spec/preemptionPolicy"
value = "PreemptLowerPriority"
}
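The comment in this hunk turns on a difference defined in RFC 6902: `replace` requires the target member to exist, while `add` on an object member sets it whether or not it is already there. The sketch below illustrates just that difference for simple object paths. It is a hand-rolled illustration (the `apply_op` helper is hypothetical), not Kyverno's patch engine, and it ignores array indices and `~` escaping.

```python
import copy


def apply_op(doc, op, path, value):
    """Apply one JSON Patch 'add' or 'replace' to an object member path."""
    if op not in ("add", "replace"):
        raise ValueError(f"unsupported op: {op}")
    keys = path.strip("/").split("/")
    result = copy.deepcopy(doc)
    parent = result
    for key in keys[:-1]:
        parent = parent[key]          # parent objects must already exist
    leaf = keys[-1]
    if op == "replace" and leaf not in parent:
        # RFC 6902: the target location of 'replace' must exist
        raise KeyError(f"replace: {path} does not exist")
    parent[leaf] = value              # 'add' creates or overwrites the member
    return result


if __name__ == "__main__":
    pod = {"spec": {"containers": []}}  # no priorityClassName key at all
    patched = apply_op(pod, "add", "/spec/priorityClassName", "gpu-workload")
    print(patched["spec"]["priorityClassName"])
    try:
        apply_op(pod, "replace", "/spec/priorityClassName", "gpu-workload")
    except KeyError as err:
        print("replace failed:", err)
```

Because `add` also overwrites an existing value, switching the three operations from `replace` to `add` keeps the old behaviour for pods that already carry the fields.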

View file

@ -280,7 +280,10 @@ resource "kubernetes_deployment" "llama_swap" {
# for it to be reachable".
wait_for_rollout = false
spec {
replicas = 1
# TEMP-SCALEDOWN-2026-05-25-IO-STORM: scaled to 0 during cluster recovery.
# Restore to 1 when cluster is fully stable. See post-mortem
# docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md.
replicas = 0
strategy { type = "Recreate" }
selector {

View file

@ -1,4 +1,18 @@
alloy:
# Resource limits for the alloy container itself.
# Must be under `alloy.resources` (NOT `controller.resources`) — the chart
# only maps THIS key onto the alloy container. Without it, the container gets
# `resources: {}` and inherits Kyverno LimitRange `tier-defaults` (256Mi),
# which is below Alloy's 400-450Mi steady state and caused page-cache
# thrashing → 185 MB/s sdc reads → host IO saturation (2026-05-26).
# Burstable QoS (request < limit) — workers are at 97-99% memory-request
# saturation; a 1Gi request blocks scheduling on node2/node3.
resources:
requests:
cpu: 50m
memory: 512Mi
limits:
memory: 1Gi
configMap:
content: |-
// Write your Alloy config here:
@ -183,6 +197,14 @@ alloy:
readOnly: true
controller:
# Bump maxUnavailable above the chart default (1) so a 5-node DS finishes its
# rolling update inside the helm_release timeout. Log shipper tolerates the
# brief gap.
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 50%
volumes:
extra:
- name: journal-run
@ -206,13 +228,3 @@ controller:
operator: "Exists"
effect: "NoSchedule"
# Resource limits for DaemonSet pods
# Alloy tails logs from all containers on the node via K8s API and batches
# them to Loki. Memory scales with number of active log streams (~30-50 per node).
# 128Mi was OOMKilled; steady-state usage is ~400-450Mi per pod.
resources:
requests:
cpu: 50m
memory: 512Mi
limits:
memory: 1Gi

View file

@ -28,8 +28,9 @@ resource "helm_release" "alloy" {
repository = "https://grafana.github.io/helm-charts"
chart = "alloy"
values = [file("${path.module}/alloy.yaml")]
atomic = true
values = [file("${path.module}/alloy.yaml")]
atomic = true
timeout = 900 # 5-pod DS rolling update + occasional runc-stuck-Terminating on k8s-master needs >300s default
depends_on = [helm_release.loki]
}

View file

@ -568,6 +568,9 @@ resource "kubernetes_manifest" "yotovski_ingress_route" {
# Custom ResourceQuota for monitoring: larger than the default 1-cluster tier quota
# because monitoring runs 29+ pods (Prometheus, Grafana, Loki, Alloy, exporters, etc.)
# Headroom: cluster grew from 5 → 7 workers (k8s-node5/6 added 2026-05-26); per-pod
# DaemonSets (alloy 562Mi, node-exporter 100Mi, loki-canary 128Mi, sysctl-inotify 4Mi)
# now consume ~+2Gi vs. pre-expansion. 20Gi gives ~3-4Gi safe headroom.
resource "kubernetes_resource_quota" "monitoring" {
metadata {
name = "monitoring-quota"
@ -576,7 +579,7 @@ resource "kubernetes_resource_quota" "monitoring" {
spec {
hard = {
"requests.cpu" = "16"
"requests.memory" = "16Gi"
"requests.memory" = "20Gi"
"limits.memory" = "64Gi"
pods = "100"
}

View file

@ -528,7 +528,7 @@ serverFiles:
action: drop
# Whitelist: only keep essential kube-state-metrics, node-exporter, and coredns metrics
- source_labels: [__name__]
regex: 'kube_cronjob_status_last_successful_time|kube_deployment_spec_replicas|kube_deployment_status_replicas_available|kube_deployment_status_replicas_unavailable|kube_job_status_failed|kube_job_status_start_time|kube_node_info|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolumeclaim_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_running|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_status_phase|kube_pod_status_ready|kube_pod_status_reason|kube_pod_status_conditions|kube_resourcequota|kube_statefulset_replicas|kube_statefulset_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_ready|kube_node_spec_unschedulable|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_disk_reads_completed_total|node_disk_writes_completed_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_filesystem_device_error|node_filesystem_readonly|node_hwmon_chip_names|node_hwmon_temp_celsius|node_load1|node_load15|node_load5|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemFree_bytes|node_memory_SwapTotal_bytes|node_memory_SwapFree_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_nfs_requests_total|node_uname_info|node_vmstat_oom_kill|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_requests_total|coredns_dns_responses_total|coredns_forward_requests_total|coredns_forward_responses_total|coredns_build_info|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|up|pve_.*'
regex: 'kube_cronjob_status_last_successful_time|kube_deployment_spec_replicas|kube_deployment_status_replicas_available|kube_deployment_status_replicas_unavailable|kube_job_status_failed|kube_job_status_start_time|kube_node_info|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolumeclaim_status_phase|kube_volumeattachment_info|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_running|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_status_phase|kube_pod_status_ready|kube_pod_status_reason|kube_pod_status_conditions|kube_resourcequota|kube_statefulset_replicas|kube_statefulset_status_replicas_ready|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_ready|kube_node_spec_unschedulable|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_disk_reads_completed_total|node_disk_writes_completed_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_filesystem_device_error|node_filesystem_readonly|node_hwmon_chip_names|node_hwmon_temp_celsius|node_load1|node_load15|node_load5|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemFree_bytes|node_memory_SwapTotal_bytes|node_memory_SwapFree_bytes|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_nfs_requests_total|node_uname_info|node_vmstat_oom_kill|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_requests_total|coredns_dns_responses_total|coredns_forward_requests_total|coredns_forward_responses_total|coredns_build_info|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|up|pve_.*'
action: keep
- job_name: kubernetes-service-endpoints-slow
honor_labels: true
@ -1290,6 +1290,42 @@ serverFiles:
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull image"
description: "Check the deployment's image reference — often a stale tag, a removed registry, or a credentials mismatch. `kubectl -n {{ $labels.namespace }} describe pod {{ $labels.pod }}` shows the pull error."
# N-1 capacity check: if any non-GPU worker (node2/3/4) died, would
# its memory requests fit on the remaining Ready workers (incl. node1
# GPU node — its taint is PreferNoSchedule, soft)? Fires when the
# most-loaded non-GPU worker holds more memory requests than the rest
# of the cluster has free.
- alert: ClusterCannotTolerateNonGpuNodeLoss
expr: |
max(
sum by (node) (
kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[234]"}
)
)
>
sum(
clamp_min(
kube_node_status_allocatable{resource="memory",unit="byte",node=~"k8s-node[1234]"}
- on(node) group_left() sum by (node) (
kube_pod_container_resource_requests{resource="memory",unit="byte",node=~"k8s-node[1234]"}
),
0
)
and on(node) (kube_node_status_condition{condition="Ready",status="true"} == 1)
)
for: 15m
labels:
severity: warning
annotations:
summary: "Cluster cannot tolerate losing any non-GPU worker — memory requests won't fit on the rest"
description: |
The most-loaded non-GPU worker (k8s-node2/3/4) has more memory
requests pinned to it than the rest of the workers (incl. node1
GPU node) currently have free. If that node went down, its
pods would not reschedule and stay Pending.
Remediation: right-size top reservers via Goldilocks (immich-server,
frigate, prometheus, pg-cluster, paperless) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.
- name: Infrastructure Health
rules:
- alert: HomeAssistantDown
@ -2336,6 +2372,35 @@ serverFiles:
severity: warning
annotations:
summary: "Node {{ $labels.instance }}: NFS RPC retransmission rate {{ $value | printf \"%.1f\" }}/s — NFS server (192.168.1.127) may be degraded or unreachable"
# Proxmox CSI per-node LUN saturation. The plugin enforces
# csi.proxmox.sinextra.dev/max-volume-attachments=28 (set on every k8s-node*
# by stacks/proxmox-csi). QEMU's virtio-scsi-pci hard cap is 30 LUNs.
# When K8s-side VolumeAttachments approach the cap, new PVCs fail to
# attach with "no free lun found" — vaultwarden + 18 pods stuck 2026-05-26.
- alert: ProxmoxCSILunUsageHigh
expr: count by (node) (kube_volumeattachment_info{node=~"k8s-node.*"}) >= 24
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $labels.node }}: {{ $value }}/28 CSI volumes attached (>= 85% of cap)"
description: "Approaching the proxmox-csi-plugin per-node cap of 28 attachments. Workloads scheduled to this node with new PVCs may fail to attach. Consider rebalancing or migrating PVCs to other nodes."
- alert: ProxmoxCSILunUsageCritical
expr: count by (node) (kube_volumeattachment_info{node=~"k8s-node.*"}) >= 27
for: 3m
labels:
severity: critical
annotations:
summary: "{{ $labels.node }}: {{ $value }}/28 CSI volumes attached — 1 slot left"
description: "Only 1 LUN slot remains before the proxmox-csi cap. Next PVC attach to this node will fail with 'no free lun found'."
- alert: ProxmoxCSILunCapReached
expr: count by (node) (kube_volumeattachment_info{node=~"k8s-node.*"}) >= 28
for: 1m
labels:
severity: critical
annotations:
summary: "{{ $labels.node }}: at proxmox-csi LUN cap (28/28) — attaches WILL fail"
description: "Pods needing new PVC attachments on {{ $labels.node }} will fail with 'no free lun found'. Detach unused volumes from this node's Proxmox VM config, or migrate PVCs to a less-loaded node."
- name: "Application Health"
rules:
- alert: MailServerDown
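The `ClusterCannotTolerateNonGpuNodeLoss` expression above can be read as plain arithmetic: take the largest per-node sum of memory requests among the non-GPU workers and compare it with the free (allocatable minus requested) memory summed over Ready workers. The sketch below mirrors the expression as written, using made-up numbers in GiB; the function and variable names are hypothetical. One thing it makes visible is that the right-hand sum, as written, also counts the free memory on the most-loaded node itself.

```python
def cannot_tolerate_node_loss(requests, allocatable, ready, non_gpu_nodes):
    """True when the alert condition would hold for the given per-node values."""
    # Left side: max(sum by (node)(requests{node=~"k8s-node[234]"}))
    worst = max(requests[n] for n in non_gpu_nodes)
    # Right side: sum(clamp_min(allocatable - requests, 0)) over Ready nodes
    free = sum(
        max(allocatable[n] - requests[n], 0)
        for n in allocatable
        if ready.get(n, False)
    )
    return worst > free


if __name__ == "__main__":
    # Hypothetical numbers in GiB, not taken from the cluster.
    allocatable = {"k8s-node1": 46, "k8s-node2": 30, "k8s-node3": 30, "k8s-node4": 30}
    requests = {"k8s-node1": 40, "k8s-node2": 29, "k8s-node3": 29, "k8s-node4": 28}
    ready = {name: True for name in allocatable}
    non_gpu = ["k8s-node2", "k8s-node3", "k8s-node4"]
    print(cannot_tolerate_node_loss(requests, allocatable, ready, non_gpu))
```

In this made-up case the workers have 10 GiB free in total against 29 GiB requested on the busiest non-GPU node, so the condition holds.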

View file

@ -111,3 +111,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}

View file

@ -20,6 +20,10 @@ terraform {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}

View file

@ -41,6 +41,16 @@ driver:
limits:
memory: "2Gi"
# 2026-05-25: extended startup probe from 120 to 300 failures.
# On k8s-node1 (6 vCPUs, 16Gi RAM, Ubuntu 24.04 + 6.8.0-117-generic),
# the full driver install sequence — apt install linux-headers (~2min) +
# gcc make -j16 kernel module compilation (~12min) + nvidia-installer
# file copy (~7min) = ~21min total, which exactly exhausted the default
# 120×10s=20min window (exit 137 = SIGKILL from startup probe).
# 300×10s = 50min gives 2.5× headroom on this hardware.
startupProbe:
failureThreshold: 300
devicePlugin:
config:
name: time-slicing-config
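The startup-probe comment above is a small budget calculation: the probe allows `failureThreshold × periodSeconds` before the container is killed, and the driver install has to finish inside that window. A quick sketch of the arithmetic follows, assuming the 10s probe period quoted in the comment; the helper name is hypothetical.

```python
def probe_window_minutes(failure_threshold, period_seconds=10):
    """Minutes a startup probe tolerates before the container is killed."""
    return failure_threshold * period_seconds / 60


# Install phases quoted in the comment, in minutes.
INSTALL_MINUTES = 2 + 12 + 7  # headers + module compilation + file copy

if __name__ == "__main__":
    old_window = probe_window_minutes(120)
    new_window = probe_window_minutes(300)
    print("install:", INSTALL_MINUTES, "min")
    print("old window:", old_window, "min, fits:", INSTALL_MINUTES <= old_window)
    print("new window:", new_window, "min, fits:", INSTALL_MINUTES <= new_window)
    print("headroom vs install time: %.1fx" % (new_window / INSTALL_MINUTES))
```

Measured against the ~21 min install rather than the old 20 min window, the new threshold leaves roughly 2.4× headroom.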

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
@ -33,22 +41,9 @@ provider "registry.terraform.io/goauthentik/authentik" {
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
version = "3.1.2"
hashes = [
"h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
"h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
"zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
"zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
"zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
"zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
"zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
"zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
"zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
"zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
"zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
"zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
"zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"h1:lIuknMfM7+QTzPWs8VBocstZF0B3TpEMIj/bw+dLAOs=",
]
}
@ -79,3 +74,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "onlyoffice"
}
}

View file

@ -93,33 +93,14 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
name = "onlyoffice-data-proxmox"
namespace = kubernetes_namespace.onlyoffice.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
module "nfs_data_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "onlyoffice-data-host"
namespace = kubernetes_namespace.onlyoffice.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/onlyoffice"
storage = "1Gi"
access_modes = ["ReadWriteOnce"]
}
resource "kubernetes_deployment" "onlyoffice-document-server" {
@ -226,7 +207,7 @@ resource "kubernetes_deployment" "onlyoffice-document-server" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}

View file

@ -13,6 +13,17 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney): workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -35,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

View file

@ -107,25 +107,35 @@ locals {
"k8s-node2" = { vmid = 202, proxmox_node = "pve" }
"k8s-node3" = { vmid = 203, proxmox_node = "pve" }
"k8s-node4" = { vmid = 204, proxmox_node = "pve" }
"k8s-node5" = { vmid = 205, proxmox_node = "pve" }
"k8s-node6" = { vmid = 206, proxmox_node = "pve" }
}
}
resource "null_resource" "node_labels" {
for_each = local.k8s_nodes
# max-volume-attachments: capped at 28 (2 below the plugin's hard ceiling of 30,
# see VolumesPerNodeHardLimit in sergelogvinov/proxmox-csi-plugin pkg/csi/node.go).
# Default is 24; bumping to 28 gives ~4-PVC headroom per node while keeping
# 2 slots for recovery (boot disk + transient attach during reschedule).
# Without this label the plugin reports 24 and node1 cascades through that
# ceiling during evictions; see post-mortem 2026-05-25.
provisioner "local-exec" {
command = <<-EOT
kubectl --kubeconfig=${var.kube_config_path} label node ${each.key} \
topology.kubernetes.io/region=${var.proxmox_cluster_name} \
topology.kubernetes.io/zone=${each.value.proxmox_node} \
node.csi.proxmox.sinextra.dev/name=${each.key} \
csi.proxmox.sinextra.dev/max-volume-attachments=28 \
--overwrite
EOT
}
triggers = {
region = var.proxmox_cluster_name
zone = each.value.proxmox_node
region = var.proxmox_cluster_name
zone = each.value.proxmox_node
max_volumes = "28"
}
}
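The label value of 28 set here is the same cap the `ProxmoxCSILun*` alert rules in this change measure against (thresholds 24, 27 and 28). The sketch below maps a per-node attachment count to the most severe of those alerts, as a way to see how the numbers line up. The function is hypothetical, and in Prometheus the three rules are independent, so at 28 all of them would fire rather than only the last.

```python
MAX_VOLUME_ATTACHMENTS = 28  # csi.proxmox.sinextra.dev/max-volume-attachments
WARN_THRESHOLD = 24          # ProxmoxCSILunUsageHigh
CRITICAL_THRESHOLD = 27      # ProxmoxCSILunUsageCritical


def most_severe_lun_alert(attached):
    """Return the name of the most severe alert a count would trigger, or None."""
    if attached >= MAX_VOLUME_ATTACHMENTS:
        return "ProxmoxCSILunCapReached"
    if attached >= CRITICAL_THRESHOLD:
        return "ProxmoxCSILunUsageCritical"
    if attached >= WARN_THRESHOLD:
        return "ProxmoxCSILunUsageHigh"
    return None


if __name__ == "__main__":
    for count in (23, 24, 27, 28):
        used_pct = 100 * count / MAX_VOLUME_ATTACHMENTS
        print(count, "%.0f%%" % used_pct, most_severe_lun_alert(count))
```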

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
@ -79,3 +87,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "real-estate-crawler"
}
}

View file

@ -13,6 +13,17 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney): workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -35,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

View file

@ -268,6 +268,12 @@ resource "kubernetes_config_map" "redis_v2_conf" {
auto-aof-rewrite-min-size 128mb
aof-load-truncated yes
aof-use-rdb-preamble yes
# Allow loading an AOF with up to 1KB of garbage at the tail (post-2026-05-26
# node2 unclean reboot corrupted redis-v2-2's incremental AOF at offset
# 84799139; without this, redis-v2-2 crashlooped). Redis truncates the
# corrupted tail and continues. Default is 0 (refuse to load any corruption).
aof-load-corrupt-tail-max-size 1024
replica-read-only yes
replica-serve-stale-data yes

View file

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"

View file

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ts7DGcKmTTY-5ujz4mhh@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "resume"
}
}

View file

@ -170,33 +170,14 @@ resource "kubernetes_service" "printer" {
}
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
name = "resume-data-proxmox"
namespace = kubernetes_namespace.resume.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
module "nfs_data_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "resume-data-host"
namespace = kubernetes_namespace.resume.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/resume"
storage = "1Gi"
access_modes = ["ReadWriteOnce"]
}
# Reactive Resume app
@ -339,7 +320,7 @@ resource "kubernetes_deployment" "resume" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}

@ -13,6 +13,13 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
}
}
@ -35,3 +42,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
@ -122,3 +130,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ZCcWMOLCTqb0aV-XyTAZ@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "servarr"
}
}

@ -13,6 +13,17 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -35,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

@ -24,6 +24,22 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "stirling-pdf"
}
}

@ -9,6 +9,17 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
}
}
@ -31,3 +42,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

@ -81,9 +81,23 @@ resource "kubernetes_deployment" "uptime-kuma" {
labels = {
app = "uptime-kuma"
tier = var.tier
# Opt out of Kyverno's inject-keel-annotations ClusterPolicy. The Kyverno
# rule excludes any workload with this LABEL (see
# stacks/kyverno/modules/kyverno/keel-annotations.tf, exclude.any
# matchLabels keel.sh/policy=never). Without the label, Kyverno would
# silently re-add `keel.sh/policy=force` after every reconcile, undoing
# the annotation below.
"keel.sh/policy" = "never"
}
annotations = {
"reloader.stakater.com/search" = "true"
# Stop Keel polling for this workload. Even with match-tag=true,
# Keel auto-downgraded :2 → :1 on 2026-05-26 12:14. The v1 image booted
# into SQLite mode and couldn't read the existing MariaDB store
# (db-config.json), causing a 4h CrashLoopBackOff. Pinning the image
# string alone isn't enough because Keel kept fighting the apply.
# Combined with the matching LABEL above, this fully bypasses Keel.
"keel.sh/policy" = "never"
}
}
spec {
@ -108,7 +122,14 @@ resource "kubernetes_deployment" "uptime-kuma" {
}
spec {
container {
image = "louislam/uptime-kuma:2"
# Pinned to 2.3.2 because Keel auto-downgraded :2 → :1 on 2026-05-26
# 12:14 UTC despite the Kyverno-injected `keel.sh/match-tag=true` +
# `keel.sh/policy=force` annotation pair (which is supposed to gate
# digest changes only). The v1 image opens kuma.db (SQLite) at boot
# and can't read the v2 db-config.json, so the pod sat in
# CrashLoopBackOff for 4h while the MariaDB store sat intact. Until the
# keel-match-tag regression is root-caused, pin the full version tag
# explicitly.
image = "louislam/uptime-kuma:2.3.2"
name = "uptime-kuma"
resources {
@ -167,9 +188,12 @@ resource "kubernetes_deployment" "uptime-kuma" {
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
# `keel.sh/policy` is intentionally NOT ignored: we want TF to own it
# as `never` so a Kyverno reconcile (or manual kubectl) can't flip it
# back to `force` and re-enable auto-updates.
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"], # injected by Kyverno
]
}
}
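
The comments above describe a two-part opt-out: a label that the Kyverno exclude rule matches, and an annotation that Keel itself reads. One way to keep the two from drifting apart is to feed both from a single local. This is only a sketch: the resource name, the `example_image` variable, and the `keel_opt_out` local are hypothetical, and only the `keel.sh/policy = never` pair and the ignored annotation keys come from the diff.

```hcl
# Hypothetical names throughout; only the `keel.sh/policy = never`
# label/annotation pair comes from the diff.
variable "example_image" {
  type        = string
  description = "Fully pinned image reference (version tag or digest)."
}

locals {
  # One map feeds both places so the label and annotation cannot diverge.
  keel_opt_out = {
    "keel.sh/policy" = "never"
  }
}

resource "kubernetes_deployment" "example" {
  metadata {
    name = "example"
    # Label: matched by the Kyverno exclude rule, so nothing is re-injected.
    labels = merge({ app = "example" }, local.keel_opt_out)
    # Annotation: read by Keel itself.
    annotations = merge({ "reloader.stakater.com/search" = "true" }, local.keel_opt_out)
  }
  spec {
    selector {
      match_labels = { app = "example" }
    }
    template {
      metadata {
        labels = { app = "example" }
      }
      spec {
        container {
          name  = "example"
          image = var.example_image
        }
      }
    }
  }
  lifecycle {
    # keel.sh/policy is deliberately absent so Terraform keeps owning it.
    ignore_changes = [
      metadata[0].annotations["keel.sh/trigger"],
      metadata[0].annotations["keel.sh/pollSchedule"],
      metadata[0].annotations["keel.sh/match-tag"],
    ]
  }
}
```

Leaving `keel.sh/policy` out of `ignore_changes` means any out-of-band flip back to `force` shows up as drift on the next plan.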

@ -29,6 +29,21 @@ provider "registry.terraform.io/gavinbunney/kubectl" {
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
"zh:1dec8766336ac5b00b3d8f62e3fff6390f5f60699c9299920fc9861a76f00c71",
"zh:43f101b56b58d7fead6a511728b4e09f7c41dc2e3963f59cf1c146c4767c6cb7",
"zh:4c4fbaa44f60e722f25cc05ee11dfaec282893c5c0ffa27bc88c382dbfbaa35c",
"zh:51dd23238b7b677b8a1abbfcc7deec53ffa5ec79e58e3b54d6be334d3d01bc0e",
"zh:5afc2ebc75b9d708730dbabdc8f94dd559d7f2fc5a31c5101358bd8d016916ba",
"zh:6be6e72d4663776390a82a37e34f7359f726d0120df622f4a2b46619338a168e",
"zh:72642d5fcf1e3febb6e5d4ae7b592bb9ff3cb220af041dbda893588e4bf30c0c",
"zh:9b12af85486a96aedd8d7984b0ff811a4b42e3d88dad1a3fb4c0b580d04fa425",
"zh:a1da03e3239867b35812ee031a1060fed6e8d8e458e2eaca48b5dd51b35f56f7",
"zh:b98b6a6728fe277fcd133bdfa7237bd733eae233f09653523f14460f608f8ba2",
"zh:bb8b071d0437f4767695c6158a3cb70df9f52e377c67019971d888b99147511f",
"zh:dc89ce4b63bfef708ec29c17e85ad0232a1794336dc54dd88c3ba0b77e764f71",
"zh:dd7dd18f1f8218c6cd19592288fde32dccc743cde05b9feeb2883f37c2ff4b4e",
"zh:ec4bd5ab3872dedb39fe528319b4bba609306e12ee90971495f109e142d66310",
"zh:f610ead42f724c82f5463e0e71fa735a11ffb6101880665d93f48b4a67b9ad82",
]
}
@ -37,6 +52,20 @@ provider "registry.terraform.io/goauthentik/authentik" {
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
"zh:090260dc7889ea822ec1d899344e1ee23eba5290461989c0796149c9511f2316",
"zh:13c2655ff824b0dc4b9bb832b5ca6d41dba97cb280330258c5fef4115e236209",
"zh:166a73c3a810c9c895d68a8ff968158f339f8a2c1c03e20ec9fc5ed99cc64e20",
"zh:203777eae1cdc711233315499643180604cff2324411b186b7cf07fdbe16f655",
"zh:3b2f18c9a8d28dac74dc6bbf168c946855ab9c68f053578d4630c50d5eaf30a0",
"zh:4822275985f6b74b6196c47112316a4252db22cf4ceaef7c9ab4c66d488abf2f",
"zh:53ea97562666c8a5a2f6d63d418a302a7f8ee4b7bb7da35dedaa89aa5708b7f0",
"zh:56b8a230901e3550c92a1d3f58ee9dafe9853f30fe4315af3ab28ae63262e15d",
"zh:6293ab7b1fd8206a0c853591f50186aca4a1eff117b2a773e10760a23a2c83e9",
"zh:9433970f79fb92d8aae3ee436db5630ab312c78b6dc9df9c1db3273a18f8aaa1",
"zh:95df406214f79b3b98222d7c7fe8fc319a3d90b7a9d53e1d5abbda5dfb8b9436",
"zh:a85880da0552a42c8f449390fbd7d8b03541d1a13e04bba9f1404fa658754260",
"zh:a95f6e9bd62c67e70eba1b1a14728856b9a6a28cd1e5e3be54a7718882c87e7f",
"zh:dd599b51c5beb34a4c6feece244fde07d2558d69929449ab1fd39a5ebe738781",
]
}
@ -64,6 +93,18 @@ provider "registry.terraform.io/hashicorp/kubernetes" {
version = "3.1.0"
hashes = [
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
"zh:0215c5c60be62028c09a2f22458e89cda3ef5830a632299f1d401eb3538874b0",
"zh:09ebb9f442431e278a310a9423f32caf467cb4b3cad3fe59573ca71fa7b14e20",
"zh:0c4e5912f83bb35846ae0a9ae54fc320706ee61894cd21cc6b4181b1c5a2fa5c",
"zh:1678c982853ad461e65ccb5e79d585e13ed109dd47dab2a66d3a7a304faeef65",
"zh:1c050a5c15e330457a9c18caacf61a923c59d663e13f2962e4b32f04fef523a0",
"zh:2c55bcec83be58ec132c7cb0a1ac644758b800d794fdc636d53a0eada0358a3a",
"zh:a062bb0aa316c08d8460c66a5d68da71da40de5d3bc3b31abcf3a1a9a19650f1",
"zh:a26fdea0afaa9b247c73c0b42843ca51ba7db0ac2571f9d3d50dcabd20ca1b98",
"zh:c872c9385a78d502bf5823d61cd3bb0f9a0585030e025eb12585c83451beeaa1",
"zh:f180879af931182beee4c8c0d9dab62b81d86f17ddcbe3786ef4c7cec9163a4e",
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
"zh:f70f5789264069e0eef06f9b5d5fde955ef7206f7d446d1ce51a4c37a3f3e02f",
]
}
@ -87,3 +128,25 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
"zh:2ee860cd0a368b3eaa53f4a9ea46f16dab8a97929e813ea6ef55183f8112c2ca",
"zh:415965fd915bae2040d7f79e45f64d6e3ae61149c10114efeac1b34687d7296c",
"zh:6584b2055df0e32062561c615e3b6b2c291ca8c959440adda09ef3ec1e1436bd",
"zh:65dcfad71928e0a8dd9befc22524ed686be5020b0024dc5cca5184c7420eeb6b",
"zh:7253dc29bd265d33f2791ac4f779c5413f16720bb717de8e6c5fcb2c858648ea",
"zh:7ec8993da10a47606670f9f67cfd10719a7580641d11c7aa761121c4a2bd66fb",
"zh:999a3f7a9dcf517967fc537e6ec930a8172203642fb01b8e1f78f908373db210",
"zh:a50e6df7280eb6584a5fd2456e3f5b6df13b2ec8a7fa4605511e438e1863be42",
"zh:b25b329a1e42681c509d027fee0365414f0cc5062b65690cfc3386aab16132ae",
"zh:c028877fdb438ece48f7bc02b65bbae9ca7b7befbd260e519ccab6c0cbb39f26",
"zh:cf0eaa3ea9fcc6d62793637947f1b8d7c885b6ad74695ab47e134e4ff132190f",
"zh:d5ade3fae031cc629b7c512a7b60e46570f4c41665e88a595d7efd943dde5ab2",
"zh:f388c15ad1ecfc09e7361e3b98bae9b627a3a85f7b908c9f40650969c949901c",
"zh:f415cc6f735a3971faae6ac24034afdb9ee83373ef8de19a9631c187d5adc7db",
]
}

@ -377,13 +377,16 @@ resource "kubernetes_deployment" "shlink-web" {
memory = "64Mi"
}
}
# shlinkio/shlink-web-client >=0.1.0 listens on port 80 (nginx default);
# prior :latest builds listened on 8080. Keep both probes + service
# target_port aligned with the image.
port {
container_port = 8080
container_port = 80
}
liveness_probe {
http_get {
path = "/"
port = 8080
port = 80
}
initial_delay_seconds = 15
period_seconds = 30
@ -393,7 +396,7 @@ resource "kubernetes_deployment" "shlink-web" {
readiness_probe {
http_get {
path = "/"
port = 8080
port = 80
}
initial_delay_seconds = 5
period_seconds = 30
@ -436,7 +439,7 @@ resource "kubernetes_service" "shlink-web" {
port {
name = "http"
port = 80
target_port = 8080
target_port = 80
}
}
}
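
The hunks above change the same port in four places (container port, both probes, and the service `target_port`). A possible way to keep them aligned is a single local, sketched below. The `shlink_web_port` local, the `shlink_web_image` variable, and the `*_example` resource names are assumptions for illustration and are not part of this diff.

```hcl
# Hypothetical sketch: names are illustrative, not taken from the stack.
variable "shlink_web_image" {
  type        = string
  description = "Pinned shlink-web-client image reference."
}

locals {
  # Single source of truth for the port the image listens on.
  shlink_web_port = 80
}

resource "kubernetes_deployment" "shlink_web_example" {
  metadata {
    name = "shlink-web-example"
  }
  spec {
    selector {
      match_labels = { app = "shlink-web-example" }
    }
    template {
      metadata {
        labels = { app = "shlink-web-example" }
      }
      spec {
        container {
          name  = "shlink-web"
          image = var.shlink_web_image
          port {
            container_port = local.shlink_web_port
          }
          liveness_probe {
            http_get {
              path = "/"
              port = local.shlink_web_port
            }
            initial_delay_seconds = 15
            period_seconds        = 30
          }
          readiness_probe {
            http_get {
              path = "/"
              port = local.shlink_web_port
            }
            initial_delay_seconds = 5
            period_seconds        = 30
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "shlink_web_example" {
  metadata {
    name = "shlink-web-example"
  }
  spec {
    selector = { app = "shlink-web-example" }
    port {
      name        = "http"
      port        = 80
      target_port = local.shlink_web_port
    }
  }
}
```

With this layout, a future image that listens on a different port needs a one-line change to the local.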

@ -20,6 +20,10 @@ terraform {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}

@ -87,8 +87,9 @@ resource "kubernetes_deployment" "vaultwarden" {
}
spec {
container {
image = "vaultwarden/server:1.35.7"
name = "vaultwarden"
image = "vaultwarden/server:latest"
image_pull_policy = "Always"
name = "vaultwarden"
resources {
requests = {
@ -181,7 +182,9 @@ resource "kubernetes_deployment" "vaultwarden" {
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
metadata[0].annotations["keel.sh/policy"],
metadata[0].annotations["keel.sh/trigger"],
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/pollSchedule"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["keel.sh/match-tag"], # KYVERNO_LIFECYCLE_V2
metadata[0].annotations["kubernetes.io/change-cause"], # Keel rewrites this on every rollout
]
}
}

@ -24,6 +24,14 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
@ -99,3 +107,11 @@ provider "registry.terraform.io/hashicorp/vault" {
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
]
}
provider "registry.terraform.io/telmate/proxmox" {
version = "3.0.2-rc07"
constraints = "3.0.2-rc07"
hashes = [
"h1:zp5hpQJQ4t4zROSLqdltVpBO+Riy9VugtfFbpyTw1aM=",
]
}

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:ZCcWMOLCTqb0aV-XyTAZ@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "wealthfolio"
}
}

@ -146,7 +146,10 @@ resource "kubernetes_deployment" "wealthfolio" {
}
spec {
container {
image = "afadil/wealthfolio:3.2"
# Pinned 2026-05-26: prior live was :3.2.1, Keel rolled it to :2.0
# on 2026-05-26 03:13, then truncated to :3.2 at 06:46 (Keel string
# match dropped the patch suffix). Restore the patch version.
image = "afadil/wealthfolio:3.2.1"
name = "wealthfolio"
port {
container_port = 8080

@ -13,6 +13,17 @@ terraform {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}
@ -35,3 +46,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

@ -24,6 +24,22 @@ provider "registry.terraform.io/cloudflare/cloudflare" {
]
}
provider "registry.terraform.io/gavinbunney/kubectl" {
version = "1.19.0"
constraints = "~> 1.14"
hashes = [
"h1:9QkxPjp0x5FZFfJbE+B7hBOoads9gmdfj9aYu5N4Sfc=",
]
}
provider "registry.terraform.io/goauthentik/authentik" {
version = "2024.12.1"
constraints = "~> 2024.10"
hashes = [
"h1:roBMd+gi+TGgikH/bMzEI8JfvJiMAQWt+8FmokCrQIs=",
]
}
provider "registry.terraform.io/hashicorp/helm" {
version = "3.1.1"
hashes = [

@ -1,7 +1,7 @@
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
terraform {
backend "pg" {
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
conn_str = "postgres://terraform_state:LicuZK1nVl4ILE5HF-A9@10.0.20.200:5432/terraform_state?sslmode=disable"
schema_name = "whisper"
}
}

@ -25,33 +25,14 @@ module "tls_secret" {
tls_secret_name = var.tls_secret_name
}
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
wait_until_bound = false
metadata {
name = "whisper-data-proxmox"
namespace = kubernetes_namespace.whisper.metadata[0].name
annotations = {
"resize.topolvm.io/threshold" = "10%"
"resize.topolvm.io/increase" = "100%"
"resize.topolvm.io/storage_limit" = "5Gi"
}
}
spec {
access_modes = ["ReadWriteOnce"]
storage_class_name = "proxmox-lvm"
resources {
requests = {
storage = "1Gi"
}
}
}
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [spec[0].resources[0].requests]
}
module "nfs_data_host" {
source = "../../modules/kubernetes/nfs_volume"
name = "whisper-data-host"
namespace = kubernetes_namespace.whisper.metadata[0].name
nfs_server = var.nfs_server
nfs_path = "/srv/nfs/whisper"
storage = "1Gi"
access_modes = ["ReadWriteMany"]
}
resource "kubernetes_deployment" "whisper" {
@ -118,7 +99,7 @@ resource "kubernetes_deployment" "whisper" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}
@ -244,7 +225,7 @@ resource "kubernetes_deployment" "piper" {
volume {
name = "data"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
claim_name = module.nfs_data_host.claim_name
}
}
}

@ -9,6 +9,17 @@ terraform {
source = "cloudflare/cloudflare"
version = "~> 4"
}
authentik = {
source = "goauthentik/authentik"
version = "~> 2024.10"
}
# kubectl (gavinbunney) workaround for hashicorp/kubernetes
# `kubernetes_manifest` panics on Kyverno CRDs. See beads code-e2dp.
# Declared for all stacks but only used where opted-in.
kubectl = {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
}
}
@ -31,3 +42,8 @@ provider "vault" {
address = "https://vault.viktorbarzin.me"
skip_child_token = true
}
provider "kubectl" {
config_path = var.kube_config_path
load_config_file = true
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -46,8 +46,14 @@ terraform {
}
}
# Generate kubernetes + helm + cloudflare providers for all stacks.
# The infra stack overrides this to add the proxmox provider.
# Generate kubernetes + helm + cloudflare + proxmox providers for all stacks.
# (Stacks that don't use proxmox simply omit any `provider "proxmox" {}` block;
# the required_providers entry is harmless. The pre-2026-05-26 trick of the
# infra stack overriding this block to add proxmox stopped working under
# Terragrunt v0.77, where same-name generate blocks are forbidden, so proxmox
# is declared globally instead. The `provider "proxmox" {}` config lives in
# stacks/infra/terragrunt.hcl, generated under a different filename so it
# doesn't collide with this providers.tf.)
generate "k8s_providers" {
path = "providers.tf"
if_exists = "overwrite_terragrunt"
@ -73,6 +79,10 @@ terraform {
source = "gavinbunney/kubectl"
version = "~> 1.14"
}
proxmox = {
source = "telmate/proxmox"
version = "3.0.2-rc07"
}
}
}