Commit graph

286 commits

Viktor Barzin
e097b7eb29 fix(ingress): wire up backend_protocol, remove dead ssl_redirect variable
Post nginx→Traefik migration cleanup:
- backend_protocol now sets serversscheme + serverstransport annotations
  for HTTPS backends (k8s-dashboard, pfsense, nas, idrac, proxmox, etc.)
- Remove ssl_redirect variable (nginx-only, silently ignored by Traefik)
  and all 9 caller references

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:45:56 +00:00
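A minimal Terraform sketch of what this wiring might look like, assuming an ingress module with a `backend_protocol` variable; the annotation keys are Traefik's kubernetes-ingress service annotations, and the ServersTransport reference is a placeholder:

```hcl
variable "backend_protocol" {
  type    = string
  default = "HTTP" # callers set "HTTPS" for TLS backends like k8s-dashboard
}

locals {
  # HTTPS backends get both annotations; HTTP backends get neither, and the
  # nginx-only ssl_redirect flag has no Traefik equivalent, hence its removal.
  backend_annotations = var.backend_protocol == "HTTPS" ? {
    "traefik.ingress.kubernetes.io/service.serversscheme"    = "https"
    "traefik.ingress.kubernetes.io/service.serverstransport" = "default-skip-verify@kubernetescrd" # placeholder
  } : {}
}
```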
Viktor Barzin
586f8345d1 fix(dbaas,actualbudget): apply OOM fixes — sync live cluster with Terraform code
Live cluster had stale resource limits causing OOMKills:
- actualbudget-http-api: 128Mi → 512Mi (code already correct)
- pg-cluster CNPG: 512Mi → 4Gi (code already correct)
- dbaas ResourceQuota: 20Gi → 24Gi live (TF code has 64Gi)

Formatting cleanup from terraform fmt included.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:04:10 +00:00
Viktor Barzin
7ec627f365 right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: set explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.
2026-03-18 08:04:08 +00:00
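A sketch of the custom LimitRange in Terraform; the namespace and resource names are assumed, and only the 128Mi default request comes from the commit (the 256Mi default limit is illustrative):

```hcl
resource "kubernetes_limit_range" "nvidia" {
  metadata {
    name      = "nvidia-defaults"
    namespace = "nvidia"
  }
  spec {
    limit {
      type = "Container"
      # Containers without explicit resources now default to 128Mi instead
      # of the 1Gi injected via the Kyverno tier-2-gpu LimitRange.
      default_request = {
        memory = "128Mi"
      }
      default = {
        memory = "256Mi" # illustrative limit default
      }
    }
  }
}
```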
Viktor Barzin
263d97bea2 extract remaining 19 modules from platform, complete stack split [ci skip]
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.
2026-03-18 08:04:08 +00:00
Viktor Barzin
f7c3a338a5 extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-18 08:04:07 +00:00
Viktor Barzin
5b11761d22 extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip]
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.
2026-03-18 08:04:06 +00:00
Viktor Barzin
94717dcd32 fix DB password rotation desync in 5 stacks
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.

- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
2026-03-18 08:04:05 +00:00
Viktor Barzin
6656743968 increase DB password rotation from 24h to weekly (604800s) 2026-03-18 08:04:05 +00:00
Viktor Barzin
7282e37294 k8s-portal: use Recreate strategy, limit revision history to 3
Prevents stale pods serving old content during rapid successive deploys.
With 1 replica + RollingUpdate, old and new pods briefly coexist.
2026-03-18 08:04:05 +00:00
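The two settings in Terraform's kubernetes provider, as a minimal sketch (image and labels are placeholders):

```hcl
resource "kubernetes_deployment" "portal" {
  metadata {
    name = "k8s-portal"
  }
  spec {
    replicas               = 1
    revision_history_limit = 3 # keep only the last 3 ReplicaSets around

    strategy {
      # Recreate terminates the old pod before starting the new one, so a
      # single-replica service never briefly serves from two generations.
      type = "Recreate"
    }

    selector {
      match_labels = { app = "k8s-portal" }
    }
    template {
      metadata {
        labels = { app = "k8s-portal" }
      }
      spec {
        container {
          name  = "portal"
          image = "registry.example.com/k8s-portal:latest" # placeholder
        }
      }
    }
  }
}
```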
Viktor Barzin
12918dd491 post-mortem: kured + containerd cascade outage — alerts + report
26h outage caused by unattended-upgrades kernel update → kured reboot →
containerd overlayfs snapshotter corruption → image pull failures →
calico down → cascading cluster outage.

Remediation:
- Add "Node Runtime Health" Prometheus alert group (6 alerts):
  KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating,
  KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady
- Add containerd cascade inhibition rule
- Save post-mortem report as HTML in post-mortems/

Also applied via kubectl (needs Terraform codification):
- Sentinel gate DaemonSet gating kured reboots on cluster health
- Fixed kured Helm values: reboot window + gated sentinel path
2026-03-18 08:04:04 +00:00
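One rule from the new alert group as a hedged sketch, shaped as a locals block that could be yamlencode()'d into Prometheus chart values; the alert name is from the commit, while the metric, threshold, and durations are assumptions based on standard kubelet metrics:

```hcl
locals {
  node_runtime_health_rules = {
    groups = [{
      name = "node-runtime-health"
      rules = [{
        alert = "KubeletImagePullErrors"
        # Kubelet's runtime-operation error counter, filtered to image pulls;
        # a sustained error rate is the early signal in the containerd
        # corruption cascade described above.
        expr  = "rate(kubelet_runtime_operations_errors_total{operation_type=\"pull_image\"}[10m]) > 0"
        "for" = "15m"
        labels = { severity = "critical" }
        annotations = {
          summary = "Kubelet on {{ $labels.instance }} is failing image pulls"
        }
      }]
    }]
  }
}
```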
Viktor Barzin
66c70ce10f fix: improve Slack alert formatting — add values, fix ContainerNearOOM filter
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
  ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
2026-03-18 08:04:04 +00:00
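The filtered, percentage-scaled ContainerNearOOM expression might look like the sketch below; the exact metric pairing is an assumption:

```hcl
locals {
  # container!="" drops cadvisor's pod- and machine-level aggregate series;
  # multiplying by 100 makes the $value in the Slack summary a readable percent.
  container_near_oom_expr = <<-EOT
    100 * container_memory_working_set_bytes{container!=""}
      / on (namespace, pod, container) group_left ()
        kube_pod_container_resource_limits{resource="memory"}
      > 85
  EOT
}
```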
Viktor Barzin
1cd767652d fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret
The woodpecker server was crashing repeatedly with database authentication failures
because Vault rotates the database password every 24 hours, but the Helm release
had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time.

Changes:
- Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically
- Modified Helm values to use envFrom to inject the secret instead of hardcoding
- ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords
- Pod will auto-restart when secret changes (via reloader.stakater.com annotation)
- This eliminates the plan-time password snapshot that goes stale within 24h

The pod still has an unrelated image pull issue on k8s-node4 (containerd blob
corruption), but the database credentials mechanism is now correctly implemented.
2026-03-18 08:04:04 +00:00
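A hedged sketch of the runtime-refreshed secret; the store name and static-creds path follow the pattern used elsewhere in this log, and the datasource URI shape is an assumption:

```hcl
resource "kubernetes_manifest" "woodpecker_db" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "woodpecker-database"
      namespace = "woodpecker"
    }
    spec = {
      refreshInterval = "15m" # re-read Vault so rotated passwords propagate
      secretStoreRef = {
        kind = "ClusterSecretStore"
        name = "vault-database"
      }
      target = {
        name = "woodpecker-database"
        template = {
          data = {
            # Rendered from the rotated credentials on every refresh.
            WOODPECKER_DATABASE_DATASOURCE = "postgres://{{ .username }}:{{ .password }}@pg-cluster-rw.dbaas:5432/woodpecker?sslmode=disable"
          }
        }
      }
      dataFrom = [{
        extract = { key = "static-creds/woodpecker" }
      }]
    }
  }
}
```

The Helm values then inject the secret with envFrom, and the reloader annotation restarts the pod whenever the rendered value changes.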
Viktor Barzin
36850a7e40 scale up f1-stream and changedetection [ci skip] 2026-03-18 08:04:04 +00:00
Viktor Barzin
e383dfb443 fix platform stack: k8s_users.domains and sensitive for_each errors [ci skip]
- Use lookup(user, "domains", []) for missing domains attribute
- Wrap user_domains in nonsensitive() for Cloudflare for_each
2026-03-18 08:04:03 +00:00
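Both fixes in miniature; the k8s_users shape and the Cloudflare resource details are illustrative:

```hcl
data "vault_kv_secret_v2" "k8s_users" {
  mount = "secret"
  name  = "k8s_users"
}

locals {
  # Assumes one JSON object per user under the k8s_users KV key.
  users = { for name, raw in data.vault_kv_secret_v2.k8s_users.data : name => jsondecode(raw) }

  # lookup() tolerates older user entries that predate the domains attribute.
  user_domains = flatten([for u in values(local.users) : lookup(u, "domains", [])])
}

resource "cloudflare_record" "user_domain" {
  # for_each refuses sensitive values, and anything derived from a Vault data
  # source is tainted sensitive; the domain list itself is not secret.
  for_each = toset(nonsensitive(local.user_domains))

  zone_id = var.cloudflare_zone_id # assumed variable
  name    = each.value
  type    = "CNAME"
  content = "ingress.example.com" # placeholder target
}
```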
Viktor Barzin
5b24c8d437 fix woodpecker sync: single $ in heredoc, alpine image for jq, port 80 not 8000 2026-03-18 08:04:03 +00:00
Viktor Barzin
171d03086e right-size 14 services and scale down GPU-heavy workloads [ci skip]
Memory right-sizing based on VPA upperBound analysis:
- Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi,
  dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi,
  navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi,
  ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi,
  mysql-operator 384→512Mi
- Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi,
  dcgm-exporter 2560→1536Mi
- Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure)

Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved,
all ResourceQuota pressures cleared, GPU health green.
2026-03-18 08:04:03 +00:00
Viktor Barzin
642b0e578d fix ollama: remove conditional count on basicAuth (incompatible with ESO data source) 2026-03-18 08:04:03 +00:00
Viktor Barzin
0610ea30d4 add generic multi-user cluster onboarding system
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.

Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.

Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.

Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.

K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.

New: stacks/_template/ with copyable stack template for namespace-owners.
2026-03-18 08:04:03 +00:00
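Illustratively, onboarding then reduces to one KV entry fanned out with for_each; field names beyond domains (used elsewhere in this log) are assumptions:

```hcl
locals {
  # What a decoded k8s_users entry might look like.
  k8s_users = {
    alice = {
      namespace = "alice"
      domains   = ["alice.example.com"]
    }
  }
}

resource "kubernetes_namespace" "user" {
  for_each = local.k8s_users
  metadata {
    name = each.value.namespace
    labels = {
      "k8s-users/owner" = each.key # illustrative label
    }
  }
}
```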
Viktor Barzin
5bc50af99e migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)

This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.

Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-18 08:04:03 +00:00
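The two-step pattern in miniature (names illustrative): ESO materialises the Vault KV entry as a K8s Secret, then Terraform reads that Secret at plan time with only K8s API access:

```hcl
# Step 1: ESO syncs Vault KV secret/app into a K8s Secret.
resource "kubernetes_manifest" "app_secrets" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata   = { name = "app-secrets", namespace = "app" }
    spec = {
      refreshInterval = "1h"
      secretStoreRef  = { kind = "ClusterSecretStore", name = "vault-kv" }
      target          = { name = "app-secrets" }
      dataFrom        = [{ extract = { key = "app" } }]
    }
  }
}

# Step 2: read the ESO-created Secret; no Vault token needed at plan time.
data "kubernetes_secret" "app_secrets" {
  metadata {
    name      = "app-secrets"
    namespace = "app"
  }
  depends_on = [kubernetes_manifest.app_secrets]
}
```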
Viktor Barzin
f7f8e4beba fix health DB ExternalSecret: use pg-health not postgresql-health role name 2026-03-18 08:04:03 +00:00
root
7dad3ec54b Woodpecker CI deploy commit [CI SKIP] 2026-03-18 08:04:03 +00:00
Viktor Barzin
fca99fd418 fix DB password desync + migrate remaining tfvars to Vault
DB desync fix: Stacks with Vault DB engine rotation (24h) now read
the password from vault-database ClusterSecretStore instead of vault-kv.
9 stacks updated with db ExternalSecrets reading from static-creds/*.

Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory,
woodpecker, linkwarden, nextcloud, url.

terraform.tfvars migration:
- plotting-book: google_client_id/secret → Vault KV + secret_key_ref
- tandoor: email_password var removed (was default="", now optional ESO)
- infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password
  → Vault KV at secret/infra + data source
2026-03-18 08:04:03 +00:00
Viktor Barzin
19e0aef67b regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-18 08:04:03 +00:00
Viktor Barzin
9ed19e1b42 fix realestate-crawler: access nested notification_settings correctly
Vault KV stores notification_settings as nested JSON ({"slack":{"webhook_url":""}}).
TF code was passing the map object directly as a string env var value.
Fix: access ["slack"]["webhook_url"] with try() fallback.
2026-03-18 08:04:03 +00:00
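The fix in miniature; the data source and mount are illustrative:

```hcl
data "vault_kv_secret_v2" "crawler" {
  mount = "secret"
  name  = "real-estate-crawler"
}

locals {
  # notification_settings is nested JSON, not a flat string.
  notification_settings = jsondecode(data.vault_kv_secret_v2.crawler.data["notification_settings"])

  # Before: the whole map was passed as the env var value.
  # After: reach into the nested keys, with a safe fallback.
  slack_webhook_url = try(local.notification_settings["slack"]["webhook_url"], "")
}
```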
Viktor Barzin
7e3540e56a fix woodpecker sync script: escape $ and %{} for HCL heredoc
HCL heredocs always interpolate: write $${ for a literal ${
and %%{ for a literal %{. Fixes terraform plan errors.
2026-03-18 08:04:03 +00:00
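The rule in one self-contained heredoc; the doubled sequences reach the shell as single ones:

```hcl
locals {
  # HCL heredocs always interpolate, so shell dollar-brace and percent-brace
  # syntax must be doubled for Terraform to emit it literally.
  sync_script = <<-EOT
    #!/bin/sh
    echo "running as $${USER}"
    echo "literal: %%{not_a_directive}"
  EOT
}
```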
Viktor Barzin
14125c1b9b add pod dependency management via Kyverno init container injection
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).

Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
2026-03-18 08:04:02 +00:00
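How a consuming deployment might opt in: the annotation name is from the commit, while the host:port value format and placement on the pod template are assumptions:

```hcl
resource "kubernetes_deployment" "hackmd" {
  metadata {
    name = "hackmd"
  }
  spec {
    replicas = 1
    selector {
      match_labels = { app = "hackmd" }
    }
    template {
      metadata {
        labels = { app = "hackmd" }
        annotations = {
          # Kyverno reads this and injects a busybox init container that
          # loops `nc -z mysql.dbaas.svc 3306` until the DB is reachable.
          "dependency.kyverno.io/wait-for" = "mysql.dbaas.svc:3306"
        }
      }
      spec {
        container {
          name  = "hackmd"
          image = "docker.io/hackmdio/hackmd:2" # placeholder tag
        }
      }
    }
  }
}
```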
root
8af7e20527 Woodpecker CI deploy commit [CI SKIP] 2026-03-18 08:04:02 +00:00
Viktor Barzin
a833363e1d fix gpu-workload Kyverno policy: use replace with explicit priority value
The API server doesn't re-resolve priority from PriorityClassName after
webhook mutation. Changed from remove+add to replace with explicit
priority=1200000 and preemptionPolicy=PreemptLowerPriority.
2026-03-18 08:04:02 +00:00
Viktor Barzin
f82ece8d1f add Vault→Woodpecker secret sync CronJob (Part E)
Syncs secrets from Vault KV at secret/ci/global to Woodpecker
global secrets via REST API every 6 hours. Authenticates via K8s
SA JWT (woodpecker-sync role). New repos just add secrets to
Vault and use from_secret: in pipeline files.

Also removes k8s-dashboard static admin token — use
vault write kubernetes/creds/dashboard-admin instead.
2026-03-18 08:04:02 +00:00
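A hedged sketch of the sync job's skeleton: the schedule, service account, and Vault role come from the commit, while the image, address, and script body are assumptions following Vault's standard Kubernetes auth flow (the Woodpecker API call itself is elided):

```hcl
resource "kubernetes_cron_job_v1" "vault_woodpecker_sync" {
  metadata {
    name      = "vault-woodpecker-sync"
    namespace = "woodpecker"
  }
  spec {
    schedule = "0 */6 * * *" # every 6 hours
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            service_account_name = "woodpecker-sync"
            restart_policy       = "OnFailure"
            container {
              name  = "sync"
              image = "docker.io/library/alpine:3" # placeholder; needs curl + jq installed
              env {
                name  = "VAULT_ADDR"
                value = "http://vault.vault:8200" # placeholder address
              }
              command = ["/bin/sh", "-c", <<-EOT
                JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
                VAULT_TOKEN=$(curl -s "$${VAULT_ADDR}/v1/auth/kubernetes/login" \
                  -d "{\"role\": \"woodpecker-sync\", \"jwt\": \"$${JWT}\"}" | jq -r .auth.client_token)
                # ...read secret/ci/global with VAULT_TOKEN and push each key
                # to the Woodpecker REST API as a global secret...
              EOT
              ]
            }
          }
        }
      }
    }
  }
}
```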
Viktor Barzin
850ab5277f migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-18 08:04:02 +00:00
Viktor Barzin
0bae93a097 claude-memory: read DB password from Vault KV instead of tfvars
Vault DB engine rotates the password every 24h, so the static tfvars
value was stale. Now reads from secret/claude-memory db_password key.
2026-03-18 08:04:02 +00:00
Viktor Barzin
91e5d728a2 etcd defrag cronjob: add --command-timeout=60s
Default 5s timeout causes defrag to fail on fragmented DBs.
Discovered during manual defrag that took ~7s.
2026-03-18 08:04:02 +00:00
Viktor Barzin
c766d849f8 mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-18 08:04:02 +00:00
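The surge clamp in provider syntax, as a minimal sketch of one multi-replica deployment; with max_surge 0, a rolling update replaces pods one-for-one instead of requesting memory for old and new replicas at once (names, replica count, and image are placeholders):

```hcl
resource "kubernetes_deployment" "grafana" {
  metadata {
    name = "grafana"
  }
  spec {
    replicas = 2
    strategy {
      type = "RollingUpdate"
      rolling_update {
        max_surge       = "0" # never run old + new pods simultaneously
        max_unavailable = "1"
      }
    }
    selector {
      match_labels = { app = "grafana" }
    }
    template {
      metadata {
        labels = { app = "grafana" }
      }
      spec {
        container {
          name  = "grafana"
          image = "docker.io/grafana/grafana:11.0.0" # placeholder
        }
      }
    }
  }
}
```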
Viktor Barzin
750da49c80 fix openclaw init container: escape shell vars, fix image path [ci skip]
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation)
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
2026-03-18 08:04:02 +00:00
Viktor Barzin
afe3a8bf8d remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-18 08:04:01 +00:00
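Part C in miniature: one connection plus one static role as Vault provider resources. Role and connection names follow the commit; the connection URL and bootstrap credentials are placeholders, and the 24h rotation shown here is what commit 6656743968 above later raises to 604800s:

```hcl
resource "vault_database_secret_backend_connection" "postgresql" {
  backend       = "database"
  name          = "postgresql"
  allowed_roles = ["woodpecker", "grafana"] # subset of the 12 roles

  postgresql {
    connection_url = "postgresql://{{username}}:{{password}}@pg-cluster-rw.dbaas:5432/postgres" # placeholder
    username       = "vault"
    password       = "bootstrap-only" # placeholder; rotated away immediately
  }
}

resource "vault_database_secret_backend_static_role" "woodpecker" {
  backend         = "database"
  name            = "woodpecker"
  db_name         = vault_database_secret_backend_connection.postgresql.name
  username        = "woodpecker"
  rotation_period = 86400 # 24h at this point in history
}
```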
Viktor Barzin
ffc04ef9f6 openclaw: replace cc-config NFS with dotfiles repo clone [ci skip]
- Add init container "install-dotfiles" that clones the dotfiles repo
  and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
  that manages the Mac config, eliminating the dual-sync problem
2026-03-18 08:04:01 +00:00
Viktor Barzin
0d596c57f5 fix immich TF drift from Kyverno ndots injection, right-size nvidia GPU operator
- immich: add lifecycle ignore_changes for dns_config on all 3 deployments
  to prevent perpetual plan drift from Kyverno ndots:2 mutation policy
- nvidia dcgm-exporter: 768Mi → 2560Mi (VPA upper 2091Mi, was under-provisioned)
- nvidia cuda-validator: 1024Mi → 256Mi (one-shot job, vastly over-provisioned)
2026-03-18 08:04:01 +00:00
Viktor Barzin
57eccd8a81 vaultwarden: upgrade to 1.35.4, use Recreate strategy
- Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug)
- Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)
2026-03-18 08:04:01 +00:00
Viktor Barzin
24c14fd0c6 claude-memory: pin image to :17, fixes URL-decode crash on sync endpoint
The + in +00:00 timezone offsets was being URL-decoded to a space,
causing ValueError on the /api/memories/sync endpoint. Build :17
includes the fix. Using versioned tag instead of :latest to avoid
pull-through cache serving stale images.
2026-03-18 08:04:01 +00:00
Viktor Barzin
9f1b3a53d3 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-18 08:04:01 +00:00
Viktor Barzin
d6d6290fb7 fix: reduce openclaw memory requests for scheduling
- openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi
  (limit 256Mi). Total request 1408Mi fits available capacity.
2026-03-18 08:04:01 +00:00
Viktor Barzin
bc9f5c3cf1 fix: openclaw policy violation + reduce memory requests for capacity
- openclaw: fix Kyverno policy violation (node:22-alpine ->
  docker.io/library/node:22-alpine), reduce request to 1536Mi
  with 2Gi limit for overcommit
- rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi)
- stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)
2026-03-18 08:04:01 +00:00
Viktor Barzin
8e3d87587d fix: increase memory for OOMKilled services
- hackmd: 64Mi -> 256Mi (Node.js app OOMKilled after 14min)
- n8n: limit 512Mi -> 768Mi (DB timeouts at 88% mem usage)
- speedtest: 128Mi -> 256Mi (OOMKilled during startup)
- shlink: limit 512Mi -> 768Mi (OOMKilled after startup)
2026-03-18 08:04:01 +00:00
Viktor Barzin
54642a0b94 fix: MySQL memory overcommit + shlink OOMKill
- dbaas: MySQL requests 4Gi -> 2Gi (limits stay 4Gi) to free 6Gi
  of request capacity. Actual usage is 1-1.5Gi per instance.
- url/shlink: increase memory limit 512Mi -> 768Mi (OOMKilled)
2026-03-18 08:04:01 +00:00
Viktor Barzin
4872bf2842 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-18 08:04:00 +00:00
root
3189c2bb35 Woodpecker CI deploy commit [CI SKIP] 2026-03-18 08:04:00 +00:00
Viktor Barzin
6f2f4c089c fix cluster health: resolve 21/23 failures from healthcheck
- nvidia: change GPU taint NoSchedule -> PreferNoSchedule to allow
  overflow scheduling on k8s-node1 (frees ~7Gi capacity)
- kyverno: increase reports-controller memory 256Mi -> 512Mi (OOMKilled)
- speedtest: add missing DB_PORT=3306 env var (nc: service "" unknown)
- realestate-crawler: increase API memory 64Mi -> 256Mi (OOMKilled)
- calibre: increase liveness probe timeout 1s -> 5s (false restarts)
2026-03-18 08:04:00 +00:00
Viktor Barzin
af3ba0306c fix calibre slow startup: bake calibre binaries into image, skip chown on NFS
Custom Docker image pre-installs the universal-calibre mod at build time,
eliminating ~10 min apt-get on every container start. Added NO_CHOWN=true
to skip recursive chown that hangs on NFS mounts. Tightened startup probe
since pod now starts in ~2 min instead of 15-20 min.
2026-03-18 08:04:00 +00:00
Viktor Barzin
3be8fff082 prometheus: increase memory to 4Gi and probe delays for TSDB compaction
Compaction of 5 years of TSDB blocks was OOM-killing at 3Gi (18 restarts
in 8h), causing sustained IO pressure on the PVE host spinning disk.
Increase liveness probe delay to 300s so WAL replay completes before
the probe kills the pod.
2026-03-18 08:04:00 +00:00
Viktor Barzin
16ad9cd839 cluster recovery: fix resource limits and node1 memory
- nvidia quota: requests.memory 8Gi → 12Gi (unblock cuda-validator)
- calibre: startup probe initial_delay 60→120s, timeout 1→5s,
  wait_for_rollout=false (DOCKER_MODS install takes 10+ min)
- immich ML: memory 2Gi → 4Gi (OOMKilled loading CLIP models)

Also done outside TF (not in this commit):
- node1 VM: 16 GiB → 24 GiB RAM (Proxmox)
- tigera-operator: kubectl patch 128→256Mi
- nvidia-driver-daemonset: kubectl patch 1→4Gi memory
- kyverno reports-controller: kubectl patch 128→256Mi
- CNPG operator: kubectl rollout restart
2026-03-18 08:04:00 +00:00
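The quota bump in provider syntax; the namespace and resource names are assumed:

```hcl
resource "kubernetes_resource_quota" "nvidia" {
  metadata {
    name      = "nvidia-memory"
    namespace = "nvidia"
  }
  spec {
    hard = {
      "requests.memory" = "12Gi" # was 8Gi; cuda-validator could not be admitted
    }
  }
}
```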