- Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial'
for non-core). Terraform is now sole authority for container resources.
Goldilocks provides recommendations only.
- Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit)
alongside GPU allocation. Fixes OOMKill from VPA scaling down resources.
- MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi.
- Remove redundant per-namespace VPA opt-out labels from onlyoffice,
openclaw, trading-bot (now handled globally by Kyverno policy).
New skill documenting the NFSv4 idmapd UID mapping crisis where all file
UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers
auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody.
Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.
Persist font cache (159MB) and theme images (10MB) to NFS volume.
Set GENERATE_FONTS=false to skip regeneration on startup since cache
is warm. Startup time: ~3 min -> 5 seconds.
Startup was throttled by allthemesgen and font generation hitting 2 CPU
ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU)
and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
Init container clones repo as root but main container runs as node (UID 1000).
Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.
The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
- Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace)
- Enable tools.elevated for unrestricted access
- Enable commands.native and commands.nativeSkills
- All tools, commands, and skills now fully accessible
- Set agents.defaults.sandbox.mode = off
- Combined with exec.host=gateway and exec.security=full,
OpenClaw can now run any command on the container host
- Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home)
- Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state,
device identity, and all runtime files across pod restarts
- Init container still refreshes openclaw.json and kubeconfig on each start
- Deploy modelrelay as sidecar container (auto-routes to fastest free model)
- Configured with NVIDIA NIM + OpenRouter API keys
- Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM),
Fallback 2: modelrelay/auto-fastest (80+ free models)
- Modelrelay web UI available at pod:7352
- Set explicit CPU (2 cores) and memory (2Gi) limits
Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway
to take 5+ minutes to start, and 1Gi memory caused OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off)
Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live
F1 streams. Extend ExtractedStream model with stream_type/embed_url
fields, skip health checks for embed streams, fix broken Akamai demo
stream, add variant playlist validation, and add iframe player support
in the frontend for embed-type streams.
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
P0: Set updateMaxFailure=-1 (fail-open)
Previously defaulted to 0 which blocked ALL traffic on first LAPI
failure. Now serves from cached decisions when LAPI is unreachable.
P1: Enable Redis cache for CrowdSec decisions
Decisions are now shared across all 3 Traefik replicas and survive
pod restarts. redisCacheUnreachableBlock=false prevents Redis from
becoming another SPOF.
P1: Add clientTrustedIPs for internal cluster traffic
Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
CrowdSec entirely, preventing internal cascade failures.
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.
PostgreSQL and Redis remain on local disk with replication.
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).
Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.
- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations