name: openclaw-k8s-deployment
description: Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when: (1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet", (2) exec fails with "requires a paired node (none available)", (3) gateway shows "Config invalid" for exec.host or exec.security values, (4) OpenClaw can't write files (EACCES on workspace or home), (5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange), (6) 502 Bad Gateway from Traefik after pod restart, (7) setting up Telegram bot channel, (8) configuring modelrelay sidecar for free model routing. Covers all non-obvious deployment gotchas discovered through trial and error.
author: Claude Code
version: 1.0.0
date: 2026-03-01

OpenClaw Kubernetes Deployment

Problem

Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration requirements. The gateway process, Telegram integration, exec permissions, and file ownership all have specific constraints not documented together.

Context / Trigger Conditions

  • Deploying OpenClaw from ghcr.io/openclaw/openclaw container image
  • Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
  • Want Telegram bot integration, tool execution, and persistent state

Solution

1. Gateway Configuration (openclaw.json)

Required fields that aren't obvious:

{
  "gateway": {
    "mode": "local",
    "bind": "lan",
    "controlUi": {
      "dangerouslyDisableDeviceAuth": true,
      "dangerouslyAllowHostHeaderOriginFallback": true
    }
  },
  "wizard": {
    "lastRunAt": "2026-03-01T00:00:00.000Z",
    "lastRunVersion": "2026.2.26",
    "lastRunCommand": "configure",
    "lastRunMode": "local"
  }
}
  • gateway.mode = "local" is required; without it the gateway refuses to start
  • dangerouslyAllowHostHeaderOriginFallback = true — required in v2026.2.26+ for non-loopback Control UI (error: "non-loopback Control UI requires gateway.controlUi.allowedOrigins")
  • wizard block — required for Telegram to start. Without it, gateway logs "Telegram configured, not enabled yet" on every startup. The wizard block signals that initial setup was completed.
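The three requirements above can be checked before the ConfigMap is applied. The following is a minimal sketch, not part of OpenClaw: gateway_config_problems is a hypothetical helper, and it only checks that a wizard block is present, because which wizard keys the gateway actually reads is not established here.

```python
def gateway_config_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    gateway = config.get("gateway", {})
    if gateway.get("mode") != "local":
        problems.append('gateway.mode must be "local"')

    control_ui = gateway.get("controlUi", {})
    if control_ui.get("dangerouslyAllowHostHeaderOriginFallback") is not True:
        problems.append(
            "gateway.controlUi.dangerouslyAllowHostHeaderOriginFallback must be true"
        )

    # Assumption: presence of a non-empty wizard object is what matters.
    wizard = config.get("wizard")
    if not isinstance(wizard, dict) or not wizard:
        problems.append("wizard block is missing")

    return problems
```

Loading openclaw.json with json.load and passing the result to this function gives a quick pre-deploy check; an empty list means none of the three known startup blockers is present.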

2. Exec Configuration

Valid values for tools.exec:

| Field | Valid Values | Notes |
|---|---|---|
| host | sandbox, gateway, node | NOT "local" (invalid) |
| security | deny, allowlist, full | NOT "off" (invalid) |
| ask | "off" | Disables confirmation prompts |
  • host = "gateway" — runs commands on the container host directly
  • host = "node" — requires a "paired node" companion app (doesn't work in containers)
  • host = "sandbox" — requires Docker-in-Docker
  • security = "full" — most permissive valid option
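A small validator can catch the "Config invalid" cases before the gateway does. This is a sketch under stated assumptions: validate_exec is a hypothetical helper, the allowed values are copied from the table above, and the nesting of SAMPLE under tools.exec is inferred from the dotted path rather than confirmed against OpenClaw's schema.

```python
# Allowed values copied from the table above.
VALID_EXEC_VALUES = {
    "host": {"sandbox", "gateway", "node"},
    "security": {"deny", "allowlist", "full"},
}

# Assumed shape of the config fragment for a containerized gateway.
SAMPLE = {"tools": {"exec": {"host": "gateway", "security": "full", "ask": "off"}}}


def validate_exec(exec_config: dict) -> list[str]:
    """Return one error string per field whose value is not in the allowed set."""
    errors = []
    for field, allowed in VALID_EXEC_VALUES.items():
        value = exec_config.get(field)
        if value is not None and value not in allowed:
            errors.append(
                f"tools.exec.{field}={value!r} is invalid; "
                f"expected one of {sorted(allowed)}"
            )
    return errors
```

Calling validate_exec(SAMPLE["tools"]["exec"]) returns an empty list, while the two common mistakes (host "local", security "off") each produce an error.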

3. Sandbox Mode

{
  "agents": {
    "defaults": {
      "sandbox": { "mode": "off" },
      "workspace": "/workspace/infra"
    }
  }
}
  • sandbox.mode = "off" disables Docker sandboxing
  • workspace must be set explicitly — defaults to ~/.openclaw/workspace

4. File Permissions

The init container runs as root but the main container runs as node (UID 1000).

Must chown in init container:

chown -R 1000:1000 /workspace/infra
chown -R 1000:1000 /openclaw-home
chmod 700 /openclaw-home

Must create directories:

mkdir -p /openclaw-home/agents/main/sessions \
         /openclaw-home/credentials \
         /openclaw-home/canvas \
         /openclaw-home/devices \
         /openclaw-home/cron

Without these: EACCES: permission denied errors for AGENTS.md, canvas, cron/jobs.json, devices, and other runtime files.
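When EACCES errors persist after the init container has run, it can help to list exactly which paths still have the wrong owner. The following is a diagnostic sketch only; paths_not_owned_by is a hypothetical helper and assumes a Python interpreter is available wherever the volume is mounted.

```python
import os


def paths_not_owned_by(root: str, uid: int) -> list[str]:
    """Return every path under root (including root) whose owner is not uid."""
    wrong = []
    if os.lstat(root).st_uid != uid:
        wrong.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself instead of following it.
            if os.lstat(path).st_uid != uid:
                wrong.append(path)
    return wrong
```

For the layout described here, paths_not_owned_by("/openclaw-home", 1000) and paths_not_owned_by("/workspace/infra", 1000) should both return empty lists once the chown commands have been applied.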

5. Startup Command

node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan

Run doctor --fix before the gateway to auto-enable Telegram and fix config issues. Without this, Telegram stays "not enabled yet".

6. Resource Requirements

  • CPU limit: 2 cores minimum — the Node.js gateway startup is CPU-intensive. With 150-300m CPU, startup takes 5+ minutes.
  • Memory limit: 2Gi minimum — the gateway OOM-kills at 1Gi during startup (V8 heap exhaustion).
  • Goldilocks VPA will override these (see the Goldilocks VPA item under Notes).

7. Readiness Probe

readiness_probe {
  tcp_socket { port = 18789 }
  initial_delay_seconds = 30
  period_seconds        = 10
}

Do NOT use a startup probe: the gateway can take 2-3 minutes to start listening, and a startup probe that exhausts its failure threshold before then restarts the container. A readiness-only probe keeps Traefik from routing to the pod during startup, which prevents 502s without killing the container.
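The tcp_socket probe only asks whether a TCP connection to the port can be opened. The same check can be reproduced by hand, for example against a port forwarded with kubectl port-forward. This is a minimal sketch; tcp_ready is a hypothetical helper, not something the kubelet or OpenClaw provides.

```python
import socket


def tcp_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

With the gateway port forwarded to localhost, tcp_ready("127.0.0.1", 18789) turning True corresponds to the moment the readiness probe would start passing.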

8. Telegram Integration

{
  "channels": {
    "telegram": {
      "enabled": true,
      "botToken": "...",
      "dmPolicy": "allowlist",
      "allowFrom": ["tg:USER_ID"],
      "groupPolicy": "allowlist",
      "streamMode": "partial"
    }
  }
}

Telegram won't start without:

  1. The wizard block in config (signals setup was run)
  2. doctor --fix at startup (auto-enables the channel)
  3. Both groupPolicy and streamMode fields

9. NFS Volume Strategy

| Volume | Purpose | Type |
|---|---|---|
| /home/node/.openclaw | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
| /tools | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
| /workspace | Infra repo clone | NFS |
| /data | General data | NFS |

Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping binary downloads and pip installs on subsequent starts.

10. ModelRelay Sidecar

Deploy as a sidecar container for automatic free model routing:

container {
  name    = "modelrelay"
  image   = "node:22-alpine"
  command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]
  # HCL does not accept ";" between arguments, so each env block is multi-line.
  env {
    name  = "NVIDIA_API_KEY"
    value = "..."
  }
  env {
    name  = "OPENROUTER_API_KEY"
    value = "..."
  }
}

Configure as provider: baseUrl = "http://127.0.0.1:7352/v1", model auto-fastest.

Verification

  1. kubectl logs -c openclaw should show [gateway] listening on ws://0.0.0.0:18789
  2. No "Telegram configured, not enabled yet" message
  3. No EACCES permission errors
  4. kubectl exec ... -- cat /proc/net/tcp shows a socket in LISTEN state (st column 0A) on port 18789 (hex 4965 in the local_address column)
  5. Telegram bot responds to /start
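The /proc/net/tcp output in step 4 is hexadecimal and easy to misread, so a small parser can help. The sketch below is an illustration only: listening_ports is a hypothetical helper, and SAMPLE is hand-written data in the kernel's format, not captured from a real pod. If the gateway binds over IPv6, the equivalent data would be in /proc/net/tcp6, which this sketch does not parse.

```python
LISTEN_STATE = "0A"  # TCP_LISTEN in the "st" column of /proc/net/tcp

# Hand-written illustrative sample: one listener on 18789 (0x4965) and one
# established connection involving port 7352 (0x1CB8).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:4965 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:1CB8 0100007F:9C40 01 00000000:00000000 00:00000000 00000000  1000        0 12346 1 0000000000000000 20 4 30 10 -1
"""


def listening_ports(proc_net_tcp: str) -> set[int]:
    """Extract the ports of sockets in LISTEN state from /proc/net/tcp text."""
    ports = set()
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first field is "sl".
        if len(fields) < 4 or fields[0] == "sl":
            continue
        local_address, state = fields[1], fields[3]
        if state == LISTEN_STATE:
            # local_address is HEXIP:HEXPORT; the port is plain hexadecimal.
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports
```

Feeding the output of the kubectl exec command into listening_ports and checking that 18789 is in the result confirms the gateway socket is open.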

Notes

  • ConfigMap changes require pod restart (init container copies config at start)
  • ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
  • Goldilocks recreates the VPA from namespace labels, so the VPA must be deleted on every pod recreation if the namespace carries the goldilocks.fairwinds.com/vpa-update-mode label
  • The --allow-unconfigured flag is needed for the gateway command
  • v2026.2.26 introduced a breaking change that requires dangerouslyAllowHostHeaderOriginFallback

See also

  • openclaw-custom-model-provider — basic model provider configuration
  • k8s-limitrange-oom-silent-kill — LimitRange causing OOM (related but different)