Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
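A bulk change like this can be scripted. The following is a minimal sketch, not the method actually used for this commit: it assumes `terraform fmt`-style files where each variable block opens on its own line and closes with a `}` at the same indent. The function name `add_sensitive` and the `names` argument are hypothetical.

```python
import re

# Matches the opening line of a variable block, e.g. `variable "db_password" {`.
VARIABLE_OPEN = re.compile(r'^(\s*)variable\s+"([^"]+)"\s*\{\s*$')
SENSITIVE_ATTR = re.compile(r'^\s*sensitive\s*=')


def add_sensitive(hcl_text: str, names: set) -> str:
    """Insert `sensitive = true` into the named variable blocks that lack it."""
    lines = hcl_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        match = VARIABLE_OPEN.match(lines[i])
        if not match or match.group(2) not in names:
            out.append(lines[i])
            i += 1
            continue
        indent = match.group(1)
        # Find the closing brace at the same indent as the opening line.
        j = i + 1
        while j < len(lines) and lines[j] != indent + "}":
            j += 1
        body = lines[i + 1:j]
        out.append(lines[i])
        out.extend(body)
        closed = j < len(lines)
        # Only add the attribute when the block is closed and does not set it yet.
        if closed and not any(SENSITIVE_ATTR.match(b) for b in body):
            out.append(indent + "  sensitive = true")
        if closed:
            out.append(lines[j])
        i = j + 1
    trailing_newline = "\n" if hcl_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

The function is idempotent, so it can be re-run safely. Running `terraform fmt` afterwards realigns the `=` signs.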
Note: the CI pipeline's SOPS decryption requires the sops_age_key Woodpecker
secret to exist. Until it is created, the old terraform.tfvars path
continues to function.
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
Pull-through cache at 10.0.20.10 was serving corrupted/truncated images
for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and
previously causing Kyverno image pull failures.
Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic,
Docker Hub rate limits make caching essential.
Removed from cloud-init template and all 5 live nodes:
- registry.k8s.io (port 5030) — 14 system images, very low churn
- quay.io (port 5020) — 11 images
- reg.kyverno.io (port 5040) — 5 images
The registry containers on the 10.0.20.10 VM still run but nodes no
longer route to them. They can be stopped/removed from the VM later.
When the GPU becomes unavailable (overloaded, CUDA context corruption),
Frigate silently falls back to CPU detection burning 4 cores with no
automatic recovery. Add liveness probe checking nvidia-smi + API health
every 60s (3 failures = restart), and startup probe allowing up to 5min
for TensorRT model loading.
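As a rough check of the probe timings above: the kubelet acts after `failureThreshold` consecutive failures spaced `periodSeconds` apart, so the window is approximately their product (probe timeouts add a little on top). The 60s period and 3 failures come from this message. The split of the 5 minute startup budget into 30 attempts at 10s is an assumption for illustration.

```python
def failure_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a probe may keep failing before the kubelet acts."""
    return period_seconds * failure_threshold


# From the commit message: liveness probe every 60s, restart after 3 failures.
liveness_window = failure_window_seconds(period_seconds=60, failure_threshold=3)

# Assumed split of the 5 minute startup budget: 30 attempts, 10s apart.
startup_window = failure_window_seconds(period_seconds=10, failure_threshold=30)

print(liveness_window)
print(startup_window)
```

So a GPU failure should trigger a restart within about 3 minutes, and model loading has about 5 minutes before the startup probe gives up.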
ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were
growing unboundedly on NFS (~10GiB, 1.3B rows) with no TTL, causing
continuous background merge operations that burned ~920m CPU. Mounting
custom config.d XML files crashes ClickHouse (exit code 36) so instead
add a CronJob that truncates the tables via the HTTP API every 6 hours.
Also removed the broken ConfigMap/volume mount that was causing crashes.
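The truncation step can be sketched as below. This is an illustration rather than the CronJob's actual script: it assumes the ClickHouse HTTP interface is reachable without authentication, the `base_url` value is hypothetical, and only the three tables named above are listed (the "etc." is left unfilled).

```python
import urllib.request

# Tables named in the commit message; the real job may cover more.
SYSTEM_LOG_TABLES = ["metric_log", "trace_log", "text_log"]


def build_truncate_requests(base_url, tables=SYSTEM_LOG_TABLES):
    """Build one POST request per table; the SQL statement is the request body."""
    requests = []
    for table in tables:
        sql = f"TRUNCATE TABLE IF EXISTS system.{table}"
        requests.append(
            urllib.request.Request(base_url, data=sql.encode("utf-8"), method="POST")
        )
    return requests


def truncate_system_logs(base_url):
    """Send the truncate statements; urlopen raises on a non-2xx response."""
    for request in build_truncate_requests(base_url):
        with urllib.request.urlopen(request, timeout=30) as response:
            response.read()
```

Calling `truncate_system_logs("http://clickhouse.example:8123/")` on a schedule would have the same effect as the CronJob, under the assumptions above.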
- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
for non-core). Terraform is now sole authority for container resources.
Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
openclaw, trading-bot (now handled globally by Kyverno policy).
Persist font cache (159MB) and theme images (10MB) to NFS volume.
Set GENERATE_FONTS=false to skip regeneration on startup since cache
is warm. Startup time: ~3 min -> 5 seconds.
Startup was throttled by allthemesgen and font generation hitting the 2 CPU
ceiling. Bumped to an 8 CPU burst limit with a custom LimitRange (max 8 CPU)
and a custom ResourceQuota. Disabled VPA and Goldilocks via opt-out labels.
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
Init container clones repo as root but main container runs as node (UID 1000).
Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.
The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
- Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace)
- Enable tools.elevated for unrestricted access
- Enable commands.native and commands.nativeSkills
- All tools, commands, and skills now fully accessible
- Set agents.defaults.sandbox.mode = off
- Combined with exec.host=gateway and exec.security=full,
OpenClaw can now run any command on the container host
- Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home)
- Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state,
device identity, and all runtime files across pod restarts
- Init container still refreshes openclaw.json and kubeconfig on each start
- Deploy modelrelay as sidecar container (auto-routes to fastest free model)
- Configured with NVIDIA NIM + OpenRouter API keys
- Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM),
Fallback 2: modelrelay/auto-fastest (80+ free models)
- Modelrelay web UI available at pod:7352
- Set explicit CPU (2 cores) and memory (2Gi) limits
Root cause: Goldilocks VPA was throttling CPU to 300m, so the gateway
took 5+ minutes to start, and 1Gi of memory was too little, causing OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off)
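For reference, a TCP readiness probe passes as soon as a connection to the port succeeds. The sketch below mimics that check from Python, for example to confirm by hand that the gateway is listening on 18789 before traffic is routed to it. The function name `tcp_ready` is hypothetical and this is not the kubelet's implementation.

```python
import socket


def tcp_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as not ready.
        return False
```

For example, `tcp_ready("127.0.0.1", 18789)` returns False until the gateway has bound the port.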