2026-03-07 14:30:36 +00:00
variable "tls_secret_name" {
2026-03-14 08:51:45 +00:00
  type = string
2026-03-07 14:30:36 +00:00
  sensitive = true
}
2026-03-14 17:15:48 +00:00
variable "nfs_server" { type = string }

data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name = "openclaw"
2026-03-07 21:09:31 +00:00
}
2026-03-14 17:15:48 +00:00
locals {
  skill_secrets = jsondecode(data.vault_kv_secret_v2.secrets.data["skill_secrets"])
2026-03-14 16:01:41 +00:00
}
2026-02-22 13:56:34 +00:00
2026-02-22 15:13:55 +00:00
resource "kubernetes_namespace" "openclaw" {
  metadata {
    name = "openclaw"
    labels = {
2026-03-15 16:04:02 +00:00
      tier = local.tiers.aux
      "resource-governance/custom-limitrange" = "true"
      "resource-governance/custom-quota" = "true"
2026-02-22 15:13:55 +00:00
    }
  }
[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
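A minimal sketch of the brace-depth pass described above, for readers who want to reproduce it. This is not the original `/tmp/add_goldilocks_ignore.py`; the function name `inject_lifecycle` and the constants are assumptions, and the brace counting is naive (it does not skip braces inside strings or comments, and it does not handle a resource opened and closed on one line).

```python
import re

# Assumed constants for illustration; the marker doubles as the idempotency check.
MARKER = "goldilocks.fairwinds.com/vpa-update-mode"
LIFECYCLE = (
    "  lifecycle {\n"
    "    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace\n"
    f'    ignore_changes = [metadata[0].labels["{MARKER}"]]\n'
    "  }\n"
)
RESOURCE_RE = re.compile(r'^resource "kubernetes_namespace" ')


def inject_lifecycle(text: str) -> str:
    """Insert the lifecycle block before the closing brace of each namespace resource."""
    if MARKER in text:
        # Idempotency guard: the file already mentions the label, leave it alone.
        return text
    out = []
    depth = 0
    inside = False
    for line in text.splitlines(keepends=True):
        if not inside and RESOURCE_RE.match(line):
            inside, depth = True, 0
        if inside:
            # Naive depth tracking: count every brace on the line.
            depth += line.count("{") - line.count("}")
            if depth == 0:
                # This line holds the outermost closing brace.
                out.append(LIFECYCLE)
                inside = False
        out.append(line)
    return "".join(out)
```

Because the guard is evaluated once per file before any insertion, a file with two namespace resources (as in the Vault stack) gets both blocks on the first run and is skipped on later runs.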
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
  }
2026-02-22 15:13:55 +00:00
}

module "tls_secret" {
  source = "../../modules/kubernetes/setup_tls_secret"
  namespace = kubernetes_namespace.openclaw.metadata[0].name
  tls_secret_name = var.tls_secret_name
}
migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:
Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)
Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)
17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.
Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
resource "kubernetes_manifest" "external_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind = "ExternalSecret"
    metadata = {
      name = "openclaw-secrets"
      namespace = "openclaw"
    }
    spec = {
      refreshInterval = "15m"
      secretStoreRef = {
        name = "vault-kv"
        kind = "ClusterSecretStore"
      }
      target = {
        name = "openclaw-secrets"
      }
      dataFrom = [{
        extract = {
          key = "openclaw"
        }
      }]
    }
  }
  depends_on = [kubernetes_namespace.openclaw]
}
2026-02-22 15:13:55 +00:00
resource "kubernetes_service_account" "openclaw" {
  metadata {
    name = "openclaw"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
}

resource "kubernetes_cluster_role_binding" "openclaw" {
  metadata {
    name = "openclaw-cluster-admin"
  }
  subject {
    kind = "ServiceAccount"
    name = kubernetes_service_account.openclaw.metadata[0].name
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind = "ClusterRole"
    name = "cluster-admin"
  }
}

resource "kubernetes_secret" "ssh_key" {
  metadata {
    name = "ssh-key"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  data = {
2026-03-14 17:15:48 +00:00
    "id_rsa" = data.vault_kv_secret_v2.secrets.data["ssh_key"]
2026-02-22 15:13:55 +00:00
  }
  type = "generic"
}

resource "kubernetes_config_map" "git_crypt_key" {
  metadata {
    name = "git-crypt-key"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  data = {
    "key" = filebase64("${path.root}/../../.git/git-crypt/keys/default")
  }
}

resource "kubernetes_config_map" "openclaw_config" {
  metadata {
    name = "openclaw-config"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  data = {
    "openclaw.json" = jsonencode({
      gateway = {
2026-03-01 15:47:54 +00:00
        mode = "local"
2026-02-22 15:13:55 +00:00
        bind = "lan"
        trustedProxies = ["10.0.0.0/8"]
        controlUi = {
2026-03-01 15:47:54 +00:00
          dangerouslyDisableDeviceAuth = true
          dangerouslyAllowHostHeaderOriginFallback = true
2026-02-22 15:13:55 +00:00
        }
      }
      agents = {
        defaults = {
          contextTokens = 1000000
          bootstrapMaxChars = 30000
2026-03-19 23:32:28 +00:00
          workspace = "/workspace"
2026-03-01 16:51:35 +00:00
          sandbox = {
            mode = "off"
          }
2026-02-22 15:13:55 +00:00
          model = {
2026-05-06 22:06:32 +00:00
            # ChatGPT Plus OAuth via openai-codex plugin (account: ancaelena98@gmail.com).
            # gpt-5.4-mini is the only mini variant the Codex backend accepts for Plus tier;
            # gpt-5-mini / gpt-5.1-codex-mini return model_not_found / "not supported with
            # ChatGPT account". Plus rate-card: 1,200–7,000 local msgs / 5h on gpt-5.4-mini.
            primary = "openai-codex/gpt-5.4-mini"
            fallbacks = ["openai-codex/gpt-5.5", "nim/qwen/qwen3-coder-480b-a35b-instruct", "modelrelay/auto-fastest"]
2026-02-22 15:13:55 +00:00
          }
          models = {
2026-03-01 15:57:31 +00:00
            "modelrelay/auto-fastest" = {}
2026-03-01 13:22:47 +00:00
            "nim/deepseek-ai/deepseek-v3.2" = {}
            "nim/qwen/qwen3.5-397b-a17b" = {}
            "nim/mistralai/mistral-large-3-675b-instruct-2512" = {}
            "nim/qwen/qwen3-coder-480b-a35b-instruct" = {}
            "nim/nvidia/llama-3.1-nemotron-ultra-253b-v1" = {}
            "nim/z-ai/glm5" = {}
            "llama-as-openai/Llama-4-Maverick-17B-128E-Instruct-FP8" = {}
            "llama-as-openai/Llama-4-Scout-17B-16E-Instruct-FP8" = {}
            "openrouter/stepfun/step-3.5-flash:free" = {}
            "openrouter/arcee-ai/trinity-large-preview:free" = {}
2026-05-06 22:06:32 +00:00
            "openai-codex/gpt-5.4-mini" = {}
            "openai-codex/gpt-5.5" = {}
2026-02-22 15:13:55 +00:00
          }
        }
      }
      tools = {
        profile = "full"
2026-03-01 15:58:12 +00:00
        deny = []
2026-03-01 17:12:03 +00:00
        elevated = {
          enabled = true
        }
2026-02-22 15:13:55 +00:00
        exec = {
2026-03-01 16:47:14 +00:00
          host = "gateway"
2026-03-01 16:42:22 +00:00
          security = "full"
2026-02-22 15:13:55 +00:00
          ask = "off"
2026-03-19 23:32:28 +00:00
          pathPrepend = ["/tools", "/workspace"]
2026-02-22 15:13:55 +00:00
        }
        web = {
          search = {
            enabled = true
            provider = "brave"
2026-03-14 17:15:48 +00:00
            apiKey = data.vault_kv_secret_v2.secrets.data["brave_api_key"]
2026-02-22 15:13:55 +00:00
            maxResults = 5
          }
          fetch = {
            enabled = true
            maxChars = 50000
            timeoutSeconds = 30
          }
        }
      }
2026-03-14 16:01:41 +00:00
      plugins = {
2026-03-15 02:39:14 +00:00
        allow = ["memory-core"]
        slots = { memory = "memory-core" }
2026-03-14 16:01:41 +00:00
        load = {
2026-03-15 02:39:14 +00:00
          paths = ["/home/node/.openclaw/extensions", "/app/extensions"]
2026-03-14 16:01:41 +00:00
        }
      }
2026-03-01 17:12:03 +00:00
      commands = {
        native = true
        nativeSkills = true
      }
2026-03-01 13:44:58 +00:00
      channels = {
        telegram = {
2026-03-01 15:47:54 +00:00
          enabled = true
2026-03-14 17:15:48 +00:00
          botToken = data.vault_kv_secret_v2.secrets.data["telegram_bot_token"]
2026-03-01 15:47:54 +00:00
          dmPolicy = "allowlist"
          allowFrom = ["tg:8281953845"]
          groupPolicy = "allowlist"
          streamMode = "partial"
2026-03-01 13:44:58 +00:00
        }
      }
2026-02-22 15:13:55 +00:00
      models = {
        mode = "merge"
        providers = {
2026-03-01 15:57:31 +00:00
          modelrelay = {
            baseUrl = "http://127.0.0.1:7352/v1"
            api = "openai-completions"
            apiKey = "modelrelay"
            models = [
              { id = "auto-fastest", name = "Auto (Fastest)", reasoning = false, input = ["text"], contextWindow = 200000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
            ]
          }
2026-03-01 13:22:47 +00:00
          nim = {
            baseUrl = "https://integrate.api.nvidia.com/v1"
2026-02-22 15:13:55 +00:00
            api = "openai-completions"
2026-03-14 17:15:48 +00:00
            apiKey = data.vault_kv_secret_v2.secrets.data["nvidia_api_key"]
2026-02-22 15:13:55 +00:00
            models = [
2026-03-01 13:22:47 +00:00
              { id = "deepseek-ai/deepseek-v3.2", name = "DeepSeek V3.2", reasoning = false, input = ["text"], contextWindow = 164000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "qwen/qwen3.5-397b-a17b", name = "Qwen 3.5", reasoning = true, input = ["text"], contextWindow = 262000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "mistralai/mistral-large-3-675b-instruct-2512", name = "Mistral Large 3", reasoning = false, input = ["text"], contextWindow = 262000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "qwen/qwen3-coder-480b-a35b-instruct", name = "Qwen 3 Coder", reasoning = false, input = ["text"], contextWindow = 262000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "nvidia/llama-3.1-nemotron-ultra-253b-v1", name = "Nemotron Ultra 253B", reasoning = true, input = ["text"], contextWindow = 128000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "z-ai/glm5", name = "GLM-5", reasoning = false, input = ["text"], contextWindow = 128000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
2026-02-22 15:13:55 +00:00
            ]
          }
2026-03-01 13:22:47 +00:00
          openrouter = {
            baseUrl = "https://openrouter.ai/api/v1"
2026-02-22 15:13:55 +00:00
            api = "openai-completions"
2026-03-14 17:15:48 +00:00
            apiKey = data.vault_kv_secret_v2.secrets.data["openrouter_api_key"]
2026-02-22 15:13:55 +00:00
            models = [
2026-03-01 13:22:47 +00:00
              { id = "stepfun/step-3.5-flash:free", name = "Step 3.5 Flash", reasoning = true, input = ["text"], contextWindow = 256000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "arcee-ai/trinity-large-preview:free", name = "Trinity Large", reasoning = false, input = ["text"], contextWindow = 131000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
2026-02-22 15:13:55 +00:00
            ]
          }
          llama-as-openai = {
            baseUrl = "https://api.llama.com/compat/v1"
2026-03-14 17:15:48 +00:00
            apiKey = data.vault_kv_secret_v2.secrets.data["llama_api_key"]
2026-02-22 15:13:55 +00:00
            api = "openai-completions"
            models = [
2026-03-01 13:22:47 +00:00
              { id = "Llama-4-Maverick-17B-128E-Instruct-FP8", name = "Llama 4 Maverick", reasoning = false, input = ["text"], contextWindow = 200000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
              { id = "Llama-4-Scout-17B-16E-Instruct-FP8", name = "Llama 4 Scout", reasoning = false, input = ["text"], contextWindow = 200000, maxTokens = 16384, cost = { input = 0, output = 0, cacheRead = 0, cacheWrite = 0 } },
2026-02-22 15:13:55 +00:00
            ]
          }
        }
      }
2026-03-01 15:47:54 +00:00
      wizard = {
        lastRunAt = "2026-03-01T15:11:54.176Z"
        lastRunVersion = "2026.2.9"
        lastRunCommand = "configure"
        lastRunMode = "local"
      }
2026-02-22 15:13:55 +00:00
    })
  }
}

resource "random_password" "gateway_token" {
  length = 32
  special = false
}
openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
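A minimal stdlib-only sketch of the aggregation and rendering step such an exporter performs. It is not the shipped `exporter.py`: the JSONL field names (`role`, `provider`, `model`, `usage.input`) are assumptions about the session format, the `session_kind` label is left out for brevity, and the HTTP serving on :9099 is omitted.

```python
import json
from collections import Counter


def collect(jsonl_lines):
    """Count assistant messages and input tokens per (provider, model).

    Field names are hypothetical; adjust them to the real session schema.
    """
    messages = Counter()
    input_tokens = Counter()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate partially written or corrupt lines
        if not isinstance(record, dict):
            continue
        usage = record.get("usage")
        if record.get("role") != "assistant" or not isinstance(usage, dict):
            continue
        key = (record.get("provider", "unknown"), record.get("model", "unknown"))
        messages[key] += 1
        input_tokens[key] += int(usage.get("input", 0))
    return messages, input_tokens


def render(messages, input_tokens):
    """Render both counters in the Prometheus text exposition format."""
    out = ["# TYPE openclaw_codex_messages_total counter"]
    for (provider, model), n in sorted(messages.items()):
        out.append(
            f'openclaw_codex_messages_total{{provider="{provider}",model="{model}"}} {n}'
        )
    out.append("# TYPE openclaw_codex_input_tokens_total counter")
    for (provider, model), n in sorted(input_tokens.items()):
        out.append(
            f'openclaw_codex_input_tokens_total{{provider="{provider}",model="{model}"}} {n}'
        )
    return "\n".join(out) + "\n"
```

Skipping lines that fail to parse keeps the scrape working while a session file is still being appended to.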
2026-05-07 09:04:25 +00:00
# Prometheus exporter script — read by the openclaw-exporter sidecar.
# Stdlib-only Python so no pip install at startup. Reads sessions JSONL +
# auth-profiles.json from the NFS-backed openclaw home volume (mounted ro).
resource "kubernetes_config_map" "openclaw_exporter" {
  metadata {
    name = "openclaw-exporter"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  data = {
    "exporter.py" = file("${path.module}/files/exporter.py")
  }
}
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
module "nfs_tools_host" {
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
  source = "../../modules/kubernetes/nfs_volume"
2026-04-12 14:35:39 +01:00
  name = "openclaw-tools-host"
2026-03-02 01:24:07 +00:00
  namespace = kubernetes_namespace.openclaw.metadata[0].name
2026-04-12 14:35:39 +01:00
  nfs_server = "192.168.1.127"
  nfs_path = "/srv/nfs/openclaw/tools"
2026-03-02 01:24:07 +00:00
}
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
resource "kubernetes_persistent_volume_claim" "home_proxmox" {
  wait_until_bound = false
  metadata {
    name = "openclaw-home-proxmox"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    annotations = {
2026-05-10 19:56:16 +00:00
      "resize.topolvm.io/threshold" = "10%"
2026-04-04 19:25:12 +03:00
      "resize.topolvm.io/increase" = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
  }
  spec {
    access_modes = ["ReadWriteOnce"]
    storage_class_name = "proxmox-lvm"
    resources {
      requests = {
        storage = "1Gi"
      }
    }
  }
2026-05-10 21:57:01 +00:00
  lifecycle {
    # The autoresizer expands requests.storage up to storage_limit and
    # PVCs can't shrink. Without this, every TF apply tries to revert
    # to the spec value, K8s rejects the shrink, and the PVC ends up
    # in Terminating-but-in-use limbo.
    ignore_changes = [spec[0].resources[0].requests]
  }
2026-04-04 19:25:12 +03:00
}
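To see why the `ignore_changes` rule on `home_proxmox` matters, here is a small sketch of how the request could grow under the annotations above (`increase` of 100%, `storage_limit` of 5Gi, starting at 1Gi). It assumes the autoresizer adds `increase` percent of the current size on each resize and clamps the result at `storage_limit`; the helper names are hypothetical and sizes are plain numbers in Gi.

```python
def grow(size_gi, increase_pct, limit_gi):
    """One resize step: add increase_pct of the current size, capped at limit_gi."""
    return min(size_gi * (1 + increase_pct / 100), limit_gi)


def resize_path(start_gi, increase_pct, limit_gi):
    """List every size the PVC passes through until it reaches the limit."""
    if increase_pct <= 0:
        raise ValueError("increase_pct must be positive")
    sizes = [start_gi]
    while sizes[-1] < limit_gi:
        sizes.append(grow(sizes[-1], increase_pct, limit_gi))
    return sizes


# Values taken from the PVC above: 1Gi request, 100% increase, 5Gi limit.
path = resize_path(1, 100, 5)
print(path)
```

After the first resize the live request already exceeds the `1Gi` written in the Terraform spec, so without `ignore_changes` the next plan would propose a shrink that Kubernetes rejects.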
2026-04-12 14:35:39 +01:00
module "nfs_workspace_host" {
2026-03-02 01:24:07 +00:00
  source = "../../modules/kubernetes/nfs_volume"
2026-04-12 14:35:39 +01:00
  name = "openclaw-workspace-host"
2026-03-02 01:24:07 +00:00
  namespace = kubernetes_namespace.openclaw.metadata[0].name
2026-04-12 14:35:39 +01:00
  nfs_server = "192.168.1.127"
  nfs_path = "/srv/nfs/openclaw/workspace"
2026-03-02 01:24:07 +00:00
}
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
  wait_until_bound = false
  metadata {
    name      = "openclaw-data-proxmox"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    annotations = {
2026-05-10 19:56:16 +00:00
      "resize.topolvm.io/threshold"     = "10%"
2026-04-04 19:25:12 +03:00
      "resize.topolvm.io/increase"      = "100%"
      "resize.topolvm.io/storage_limit" = "5Gi"
    }
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "proxmox-lvm"
    resources {
      requests = {
        storage = "1Gi"
      }
    }
  }
2026-05-10 21:57:01 +00:00
  lifecycle {
    # The autoresizer expands requests.storage up to storage_limit and
    # PVCs can't shrink. Without this, every TF apply tries to revert
    # to the spec value, K8s rejects the shrink, and the PVC ends up
    # in Terminating-but-in-use limbo.
    ignore_changes = [spec[0].resources[0].requests]
  }
2026-04-04 19:25:12 +03:00
}
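# Illustrative sketch, not part of the original stack: a worked example of how
# the autoresizer annotations on openclaw-data-proxmox are expected to play out.
# It assumes pvc-autoresizer's documented semantics: a resize triggers when free
# space drops below the threshold (10%), grows the request by `increase` (100%
# of the current size), and caps the result at storage_limit (5Gi). The local
# name is hypothetical and nothing else references it.
locals {
  # Requested size in Gi after 0, 1, 2, and 3 resizes, starting from 1Gi.
  # Evaluates to [1, 2, 4, 5]: the third doubling would reach 8Gi, so it is capped.
  openclaw_data_growth_gi_sketch = [
    for step in range(4) : min(1 * pow(2, step), 5)
  ]
}
# After the first resize the live request no longer matches the 1Gi in the spec,
# which is why the PVC's lifecycle block ignores changes to the storage request.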
2026-03-15 16:04:02 +00:00
## cc-config NFS volume removed. It was replaced by a dotfiles repo clone in the
## "install-dotfiles" init container, which has since been removed as well
## (see the "Init 2 removed" note in the deployment).
2026-03-14 08:51:45 +00:00
|
|
|
|
|
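# Illustrative sketch, not part of the original stack: the kubeconfig written by
# the setup-kubeconfig init container in the deployment below could also be
# rendered with yamlencode(), which avoids hand-maintaining YAML indentation
# inside a heredoc. The local name is hypothetical and nothing references it;
# wiring it into the pod (for example through a ConfigMap) is deliberately left out.
# yamlencode() emits keys in sorted order, so the output differs from the heredoc
# in ordering only.
locals {
  openclaw_kubeconfig_sketch = yamlencode({
    apiVersion = "v1"
    kind       = "Config"
    clusters = [{
      name = "in-cluster"
      cluster = {
        "certificate-authority" = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        server                  = "https://kubernetes.default.svc"
      }
    }]
    contexts = [{
      name = "in-cluster"
      context = {
        cluster   = "in-cluster"
        user      = "openclaw"
        namespace = "openclaw"
      }
    }]
    "current-context" = "in-cluster"
    users = [{
      name = "openclaw"
      user = {
        # tokenFile makes kubectl re-read the kubelet-rotated token on each call.
        tokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }
    }]
  })
}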
2026-02-22 15:13:55 +00:00
resource "kubernetes_deployment" "openclaw" {
  metadata {
    name      = "openclaw"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app  = "openclaw"
      tier = local.tiers.aux
    }
  }
  spec {
    strategy {
      type = "Recreate"
    }
    replicas = 1
    selector {
      match_labels = {
        app = "openclaw"
      }
    }
    template {
      metadata {
        labels = {
          app = "openclaw"
        }
migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:
Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)
Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)
17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.
Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
        annotations = {
          "reloader.stakater.com/search" = "true"
openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 09:04:25 +00:00
          # Prometheus auto-discovers pods with these annotations.
          # The scrape target is the openclaw-exporter sidecar, which exposes /metrics on :9099.
          "prometheus.io/scrape" = "true"
          "prometheus.io/port"   = "9099"
          "prometheus.io/path"   = "/metrics"
2026-03-15 19:05:04 +00:00
        }
2026-02-22 15:13:55 +00:00
      }
      spec {
        service_account_name = kubernetes_service_account.openclaw.metadata[0].name
2026-03-22 02:56:04 +02:00
        # Init 0: fix /workspace ownership so node user can write
        init_container {
          name    = "fix-workspace-perms"
          image   = "busybox:1.37"
          command = ["sh", "-c", "chown 1000:1000 /workspace"]
          volume_mount {
            name       = "workspace"
            mount_path = "/workspace"
          }
        }
2026-03-15 16:04:02 +00:00
        # Init 1: copy openclaw.json from the ConfigMap into the writable home volume
2026-02-22 15:13:55 +00:00
        init_container {
2026-03-14 23:42:08 +00:00
          name    = "copy-config"
          image   = "busybox:1.37"
          command = ["sh", "-c", "cp /config/openclaw.json /home/node/.openclaw/openclaw.json && chown 1000:1000 /home/node/.openclaw/openclaw.json"]
2026-02-22 15:13:55 +00:00
          volume_mount {
            name       = "openclaw-config"
2026-03-14 23:42:08 +00:00
            mount_path = "/config"
2026-02-22 15:13:55 +00:00
          }
2026-03-14 08:51:45 +00:00
          volume_mount {
2026-03-14 23:42:08 +00:00
            name       = "openclaw-home"
            mount_path = "/home/node/.openclaw"
2026-03-14 08:51:45 +00:00
          }
2026-02-22 15:13:55 +00:00
        }
2026-05-08 08:07:38 +00:00
        # Init 1b: regenerate kubeconfig pointing at the projected SA tokenFile
        # so kubectl always reads the fresh, kubelet-rotated token. Without
        # this the previously-baked kubeconfig retains a SA token bound to a
        # long-dead pod and kubectl returns "must be logged in to the server".
        init_container {
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
          name  = "setup-kubeconfig"
          image = "busybox:1.37"
2026-05-08 08:07:38 +00:00
          command = ["sh", "-c", <<-EOT
            cat > /home/node/.openclaw/kubeconfig <<'KUBECONFIG_EOF'
            apiVersion: v1
            kind: Config
            clusters:
            - cluster:
                certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                server: https://kubernetes.default.svc
              name: in-cluster
            contexts:
            - context:
                cluster: in-cluster
                user: openclaw
                namespace: openclaw
              name: in-cluster
            current-context: in-cluster
            users:
            - name: openclaw
              user:
                tokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
            KUBECONFIG_EOF
            chown 1000:1000 /home/node/.openclaw/kubeconfig
            chmod 0644 /home/node/.openclaw/kubeconfig
          EOT
          ]
          volume_mount {
            name       = "openclaw-home"
            mount_path = "/home/node/.openclaw"
          }
        }
2026-03-29 01:11:33 +02:00
        # Init 2 removed: install-dotfiles init container was cloning dotfiles
        # repo via git on every pod start, causing 200+ small NFS writes.
        # Dotfiles already exist on NFS at /home/node/.openclaw/dotfiles from
        # a previous clone. To update, run git pull manually or via CronJob.
2026-03-15 16:04:02 +00:00
|
|
|
|
|
2026-02-22 15:13:55 +00:00
        # Main container: OpenClaw
        container {
2026-05-10 18:53:49 +00:00
          name  = "openclaw"
          image = "ghcr.io/openclaw/openclaw:2026.5.4"
2026-05-06 22:06:32 +00:00
          # Doctor --fix auto-promotes the highest-tier codex model (gpt-5-pro) after
          # auth-profile-based model discovery; pin gpt-5.4-mini back to default after it.
          command = ["sh", "-c", "node openclaw.mjs doctor --fix 2>/dev/null; node openclaw.mjs models set openai-codex/gpt-5.4-mini 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan"]
2026-02-22 15:13:55 +00:00
          port {
            container_port = 18789
          }
2026-03-01 14:44:22 +00:00
          readiness_probe {
            tcp_socket {
              port = 18789
            }
            initial_delay_seconds = 30
            period_seconds        = 10
          }
2026-03-14 23:42:08 +00:00
          env {
            name  = "NODE_OPTIONS"
            value = "--max-old-space-size=1536"
          }
2026-02-22 15:13:55 +00:00
          env {
            name  = "OPENCLAW_GATEWAY_TOKEN"
            value = random_password.gateway_token.result
          }
          env {
            name  = "PATH"
            value = "/tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
          }
          env {
            name  = "TF_VAR_prod"
            value = "true"
          }
          env {
            name  = "KUBECONFIG"
            value = "/home/node/.openclaw/kubeconfig"
          }
          env {
            name  = "GIT_CONFIG_GLOBAL"
            value = "/home/node/.openclaw/.gitconfig"
          }
          # Skill secrets - Home Assistant
          env {
            name  = "HOME_ASSISTANT_URL"
            value = "https://ha-london.viktorbarzin.me"
          }
          env {
            name  = "HOME_ASSISTANT_TOKEN"
2026-03-14 17:15:48 +00:00
            value = local.skill_secrets["home_assistant_token"]
2026-02-22 15:13:55 +00:00
          }
          env {
            name  = "HOME_ASSISTANT_SOFIA_URL"
            value = "https://ha-sofia.viktorbarzin.me"
          }
          env {
            name  = "HOME_ASSISTANT_SOFIA_TOKEN"
2026-03-14 17:15:48 +00:00
            value = local.skill_secrets["home_assistant_sofia_token"]
2026-02-22 15:13:55 +00:00
          }
          # Skill secrets - Uptime Kuma
          env {
            name  = "UPTIME_KUMA_PASSWORD"
2026-03-14 17:15:48 +00:00
            value = local.skill_secrets["uptime_kuma_password"]
2026-02-22 15:13:55 +00:00
          }
2026-03-14 16:01:41 +00:00
          # Memory API
          env {
            name  = "MEMORY_API_URL"
            value = "http://claude-memory.claude-memory.svc.cluster.local"
          }
          env {
2026-03-15 19:05:04 +00:00
            name = "MEMORY_API_KEY"
            value_from {
              secret_key_ref {
                name = "openclaw-secrets"
                key  = "claude_memory_api_key"
              }
            }
2026-03-14 16:01:41 +00:00
          }
2026-02-22 15:13:55 +00:00
          # Python packages path for skills
          env {
            name  = "PYTHONPATH"
            value = "/tools/python-libs"
          }
          volume_mount {
            name       = "tools"
            mount_path = "/tools"
          }
          volume_mount {
            name       = "workspace"
            mount_path = "/workspace"
          }
          volume_mount {
            name       = "data"
            mount_path = "/data"
          }
          volume_mount {
            name       = "ssh-key"
            mount_path = "/ssh"
          }
          volume_mount {
            name       = "openclaw-home"
            mount_path = "/home/node/.openclaw"
          }
          resources {
            limits = {
2026-03-14 23:42:08 +00:00
              memory = "2Gi"
2026-02-22 15:13:55 +00:00
            }
            requests = {
2026-03-01 14:44:22 +00:00
              cpu    = "100m"
2026-03-15 15:30:18 +00:00
              memory = "2Gi"
2026-02-22 15:13:55 +00:00
            }
          }
        }
2026-03-22 02:56:04 +02:00
        # Sidecar: playwright-mcp, a headless browser for agents
        container {
          name  = "playwright-mcp"
          image = "docker.io/viktorbarzin/playwright-mcp:v1"
          args  = ["--headless", "--browser", "chromium", "--no-sandbox", "--port", "3000", "--host", "0.0.0.0"]
          port {
            container_port = 3000
          }
          resources {
            requests = {
              cpu    = "50m"
              memory = "256Mi"
            }
            limits = {
              memory = "512Mi"
            }
          }
        }
2026-05-07 09:04:25 +00:00
        # Sidecar: openclaw-exporter, a Prometheus exporter for Codex/OAuth usage.
        # Reads sessions JSONL files + auth-profiles.json, exposes /metrics on :9099.
        # Stdlib-only Python; no pip install at startup.
        container {
          name    = "openclaw-exporter"
          image   = "docker.io/library/python:3.12-slim"
          command = ["python3", "/scripts/exporter.py"]
          port {
            container_port = 9099
            name           = "metrics"
          }
          env {
            name  = "OPENCLAW_HOME"
            value = "/home/node/.openclaw"
          }
          env {
            name  = "METRICS_PORT"
            value = "9099"
          }
          volume_mount {
            name       = "openclaw-exporter-script"
            mount_path = "/scripts"
            read_only  = true
          }
          volume_mount {
            name       = "openclaw-home"
            mount_path = "/home/node/.openclaw"
            read_only  = true
          }
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 9099
            }
            initial_delay_seconds = 5
            period_seconds        = 30
          }
          resources {
            requests = {
              cpu    = "10m"
              memory = "64Mi"
            }
            limits = {
              memory = "128Mi"
            }
          }
        }
2026-03-01 15:57:31 +00:00
        # Sidecar: modelrelay, which auto-routes to the fastest healthy free model
        container {
          name  = "modelrelay"
2026-03-15 10:37:58 +00:00
          image = "docker.io/library/node:22-alpine"
2026-03-01 15:57:31 +00:00
          command = ["sh", "-c", <<-EOF
            if [ ! -f /tools/modelrelay/node_modules/.package-lock.json ]; then
              mkdir -p /tools/modelrelay
              cd /tools/modelrelay
              npm init -y > /dev/null 2>&1
              npm install modelrelay > /dev/null 2>&1
            fi
            cd /tools/modelrelay
            exec npx modelrelay --port 7352
          EOF
          ]
          port {
            container_port = 7352
          }
          env {
2026-03-15 19:05:04 +00:00
            name = "NVIDIA_API_KEY"
            value_from {
              secret_key_ref {
                name = "openclaw-secrets"
                key  = "nvidia_api_key"
              }
            }
2026-03-01 15:57:31 +00:00
          }
          env {
2026-03-15 19:05:04 +00:00
            name = "OPENROUTER_API_KEY"
            value_from {
              secret_key_ref {
                name = "openclaw-secrets"
                key  = "openrouter_api_key"
              }
            }
2026-03-01 15:57:31 +00:00
          }
          volume_mount {
            name       = "tools"
            mount_path = "/tools"
          }
          resources {
            limits = {
right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
              memory = "256Mi"
2026-03-01 15:57:31 +00:00
            }
            requests = {
              cpu    = "25m"
2026-03-15 10:47:34 +00:00
              memory = "128Mi"
2026-03-01 15:57:31 +00:00
            }
          }
        }
2026-02-22 15:13:55 +00:00
        volume {
          name = "tools"
2026-03-02 01:24:07 +00:00
          persistent_volume_claim {
2026-04-12 14:35:39 +01:00
            claim_name = module.nfs_tools_host.claim_name
2026-03-01 13:59:07 +00:00
          }
2026-02-22 15:13:55 +00:00
        }
        volume {
          name = "openclaw-home"
2026-03-02 01:24:07 +00:00
|
|
|
|
persistent_volume_claim {
|
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
|
|
|
|
claim_name = kubernetes_persistent_volume_claim.home_proxmox.metadata[0].name
|
2026-03-01 16:12:07 +00:00
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
|
|
|
|
|
|
volume {
|
|
|
|
|
|
name = "workspace"
|
2026-03-02 01:24:07 +00:00
|
|
|
|
persistent_volume_claim {
|
2026-04-12 14:35:39 +01:00
|
|
|
|
claim_name = module.nfs_workspace_host.claim_name
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
volume {
|
|
|
|
|
|
name = "data"
|
2026-03-02 01:24:07 +00:00
|
|
|
|
persistent_volume_claim {
|
2026-04-04 19:25:12 +03:00
|
|
|
|
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
volume {
|
|
|
|
|
|
name = "ssh-key"
|
|
|
|
|
|
secret {
|
|
|
|
|
|
secret_name = kubernetes_secret.ssh_key.metadata[0].name
|
|
|
|
|
|
default_mode = "0600"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
volume {
|
|
|
|
|
|
name = "openclaw-config"
|
|
|
|
|
|
config_map {
|
|
|
|
|
|
name = kubernetes_config_map.openclaw_config.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 09:04:25 +00:00
|
|
|
|
volume {
|
|
|
|
|
|
name = "openclaw-exporter-script"
|
|
|
|
|
|
config_map {
|
|
|
|
|
|
name = kubernetes_config_map.openclaw_exporter.metadata[0].name
|
|
|
|
|
|
default_mode = "0555"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
|
lifecycle {
|
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
|
|
|
|
|
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
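# Sketch (not part of this stack's applied config): the sweep commit message above
# gives a deeper ignore_changes path for CronJobs than for Deployments. Assuming a
# hypothetical resource named "example", the CronJob form would look roughly like:
#
# resource "kubernetes_cron_job_v1" "example" {
#   # metadata and spec omitted for brevity
#   lifecycle {
#     # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
#     ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
#   }
# }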
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "openclaw" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "openclaw"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
labels = {
|
|
|
|
|
|
app = "openclaw"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
selector = {
|
|
|
|
|
|
app = "openclaw"
|
|
|
|
|
|
}
|
|
|
|
|
|
port {
|
|
|
|
|
|
port = 80
|
|
|
|
|
|
target_port = 18789
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
module "ingress" {
|
|
|
|
|
|
source = "../../modules/kubernetes/ingress_factory"
|
2026-04-16 13:45:04 +00:00
|
|
|
|
dns_type = "non-proxied"
|
2026-02-22 15:13:55 +00:00
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
name = "openclaw"
|
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
|
port = 80
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
|
auth = "required"
|
2026-03-07 16:41:36 +00:00
|
|
|
|
extra_annotations = {
|
|
|
|
|
|
"gethomepage.dev/enabled" = "true"
|
|
|
|
|
|
"gethomepage.dev/name" = "OpenClaw"
|
|
|
|
|
|
"gethomepage.dev/description" = "AI assistant"
|
|
|
|
|
|
"gethomepage.dev/icon" = "openai.png"
|
|
|
|
|
|
"gethomepage.dev/group" = "AI & Data"
|
|
|
|
|
|
"gethomepage.dev/pod-selector" = ""
|
|
|
|
|
|
}
|
2026-02-22 15:13:55 +00:00
|
|
|
|
}
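# Sketch, assuming the ingress_factory module declares `auth` roughly like this.
# The type, the default, and the three accepted values come from the commit
# message above; the validation block is an assumption, not confirmed module source.
#
# variable "auth" {
#   type    = string
#   default = "required" # fail-closed: callers must opt out explicitly
#   validation {
#     condition     = contains(["required", "public", "none"], var.auth)
#     error_message = "auth must be one of: required, public, none."
#   }
# }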
|
|
|
|
|
|
|
2026-03-07 21:09:31 +00:00
|
|
|
|
# --- Webhook receiver: triggers task-processor Job on Forgejo issue events ---
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_config_map" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
data = {
|
|
|
|
|
|
"server.py" = <<-PYEOF
|
|
|
|
|
|
from http.server import HTTPServer, BaseHTTPRequestHandler
|
|
|
|
|
|
import subprocess, time, json, os
|
|
|
|
|
|
|
|
|
|
|
|
BOT_USER = os.environ.get('FORGEJO_BOT_USER', 'viktor')
|
|
|
|
|
|
|
|
|
|
|
|
class Handler(BaseHTTPRequestHandler):
|
|
|
|
|
|
def do_POST(self):
|
|
|
|
|
|
try:
|
|
|
|
|
|
body = self.rfile.read(int(self.headers.get('Content-Length', 0)))
|
|
|
|
|
|
data = json.loads(body)
|
|
|
|
|
|
action = data.get('action', '')
|
|
|
|
|
|
|
|
|
|
|
|
# Trigger on: new issue, reopened issue, or new comment
|
|
|
|
|
|
trigger = False
|
|
|
|
|
|
if action in ('opened', 'reopened'):
|
|
|
|
|
|
issue = data.get('issue', {})
|
|
|
|
|
|
print(f"Issue #{issue.get('number','?')} {action}: {issue.get('title','?')}")
|
|
|
|
|
|
trigger = True
|
|
|
|
|
|
elif action == 'created' and 'comment' in data:
|
|
|
|
|
|
comment = data.get('comment', {})
|
|
|
|
|
|
commenter = comment.get('user', {}).get('login', '')
|
|
|
|
|
|
# Skip comments from the bot itself to avoid loops
|
|
|
|
|
|
if commenter != BOT_USER:
|
|
|
|
|
|
issue = data.get('issue', {})
|
|
|
|
|
|
print(f"Comment on #{issue.get('number','?')} by {commenter}")
|
|
|
|
|
|
trigger = True
|
|
|
|
|
|
else:
|
|
|
|
|
|
print(f"Skipping own comment on #{data.get('issue',{}).get('number','?')}")
|
|
|
|
|
|
|
|
|
|
|
|
if trigger:
|
|
|
|
|
|
job_name = f"task-processor-{int(time.time())}"
|
|
|
|
|
|
subprocess.run([
|
|
|
|
|
|
'kubectl', 'create', 'job', job_name,
|
|
|
|
|
|
'--from=cronjob/task-processor',
|
|
|
|
|
|
'-n', 'openclaw'
|
|
|
|
|
|
], check=True)
|
|
|
|
|
|
self.send_response(200)
|
|
|
|
|
|
self.end_headers()
|
|
|
|
|
|
self.wfile.write(b'{"ok":true}')
|
|
|
|
|
|
else:
|
|
|
|
|
|
self.send_response(200)
|
|
|
|
|
|
self.end_headers()
|
|
|
|
|
|
self.wfile.write(b'{"ok":true,"skipped":true}')
|
|
|
|
|
|
except Exception as e:
|
|
|
|
|
|
print(f"Error: {e}")
|
|
|
|
|
|
self.send_response(500)
|
|
|
|
|
|
self.end_headers()
|
|
|
|
|
|
# json.dumps escapes quotes and newlines in the message so the body stays valid JSON
self.wfile.write(json.dumps({"error": str(e)}).encode())
|
|
|
|
|
|
|
|
|
|
|
|
def do_GET(self):
|
|
|
|
|
|
self.send_response(200)
|
|
|
|
|
|
self.end_headers()
|
|
|
|
|
|
self.wfile.write(b'{"status":"ok"}')
|
|
|
|
|
|
|
|
|
|
|
|
def log_message(self, fmt, *args):
|
|
|
|
|
|
# Format with the supplied args; log_error() passes fewer than three, so indexing args[2] can raise
print(f"[webhook] {fmt % args}")
|
|
|
|
|
|
|
|
|
|
|
|
print("Task webhook receiver listening on :8080")
|
|
|
|
|
|
HTTPServer(('', 8080), Handler).serve_forever()
|
|
|
|
|
|
PYEOF
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service_account" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_role" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook-job-creator"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
rule {
|
|
|
|
|
|
api_groups = ["batch"]
|
|
|
|
|
|
resources = ["jobs", "cronjobs"]
|
|
|
|
|
|
verbs = ["get", "list", "create"]
|
|
|
|
|
|
}
|
|
|
|
|
|
}
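# Possible tightening, sketched under the assumption that the webhook only ever
# runs `kubectl create job --from=cronjob/task-processor`: that should need read
# access to the one CronJob plus create on Jobs. Untested against this cluster;
# these rules would replace the single rule in the role above.
#
# rule {
#   api_groups     = ["batch"]
#   resources      = ["cronjobs"]
#   resource_names = ["task-processor"]
#   verbs          = ["get"]
# }
# rule {
#   api_groups = ["batch"]
#   resources  = ["jobs"]
#   verbs      = ["create"]
# }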
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_role_binding" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook-job-creator"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
subject {
|
|
|
|
|
|
kind = "ServiceAccount"
|
|
|
|
|
|
name = kubernetes_service_account.task_webhook.metadata[0].name
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
role_ref {
|
|
|
|
|
|
api_group = "rbac.authorization.k8s.io"
|
|
|
|
|
|
kind = "Role"
|
|
|
|
|
|
name = kubernetes_role.task_webhook.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_deployment" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
labels = {
|
|
|
|
|
|
app = "task-webhook"
|
|
|
|
|
|
tier = local.tiers.aux
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
replicas = 1
|
|
|
|
|
|
selector {
|
|
|
|
|
|
match_labels = {
|
|
|
|
|
|
app = "task-webhook"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
template {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
labels = {
|
|
|
|
|
|
app = "task-webhook"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
service_account_name = kubernetes_service_account.task_webhook.metadata[0].name
|
|
|
|
|
|
container {
|
2026-03-14 08:51:45 +00:00
|
|
|
|
name = "webhook"
|
|
|
|
|
|
image = "python:3-alpine"
|
2026-03-07 21:09:31 +00:00
|
|
|
|
command = ["sh", "-c", "apk add --no-cache curl > /dev/null 2>&1 && curl -sfL https://dl.k8s.io/release/v1.34.2/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl && exec python3 -u /app/server.py"]
|
|
|
|
|
|
port {
|
|
|
|
|
|
container_port = 8080
|
|
|
|
|
|
}
|
|
|
|
|
|
volume_mount {
|
|
|
|
|
|
name = "app"
|
|
|
|
|
|
mount_path = "/app"
|
|
|
|
|
|
}
|
|
|
|
|
|
resources {
|
|
|
|
|
|
requests = {
|
|
|
|
|
|
cpu = "5m"
|
right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
|
|
|
|
memory = "64Mi"
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
limits = {
|
|
|
|
|
|
memory = "64Mi"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
volume {
|
|
|
|
|
|
name = "app"
|
|
|
|
|
|
config_map {
|
|
|
|
|
|
name = kubernetes_config_map.task_webhook.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
2026-04-18 21:19:48 +00:00
|
|
|
|
lifecycle {
|
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
|
|
|
|
|
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
|
|
|
|
|
}
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "task_webhook" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "task-webhook"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
labels = {
|
|
|
|
|
|
app = "task-webhook"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
selector = {
|
|
|
|
|
|
app = "task-webhook"
|
|
|
|
|
|
}
|
|
|
|
|
|
port {
|
|
|
|
|
|
port = 80
|
|
|
|
|
|
target_port = 8080
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
module "task_webhook_ingress" {
|
2026-04-19 13:01:36 +00:00
|
|
|
|
source = "../../modules/kubernetes/ingress_factory"
|
2026-05-10 18:53:49 +00:00
|
|
|
|
auth = "required"
|
2026-04-19 13:01:36 +00:00
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
name = "task-webhook"
|
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
|
host = "task-webhook"
|
|
|
|
|
|
port = 80
|
|
|
|
|
|
external_monitor = false
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
|
2026-04-19 15:13:03 +00:00
|
|
|
|
# --- Shared ServiceAccount: grants pod-exec into the openclaw pod ---
|
|
|
|
|
|
# Used by the task_processor CronJob (below). Previously also used by the
|
|
|
|
|
|
# cluster_healthcheck CronJob, which has been decommissioned — the local
|
|
|
|
|
|
# `scripts/cluster_healthcheck.sh` is now the single authoritative runner.
|
2026-02-22 15:13:55 +00:00
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service_account" "healthcheck" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "cluster-healthcheck"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_role" "healthcheck_exec" {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "healthcheck-pod-exec"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
}
|
|
|
|
|
|
rule {
|
|
|
|
|
|
api_groups = [""]
|
|
|
|
|
|
resources = ["pods"]
|
|
|
|
|
|
verbs = ["get", "list"]
|
|
|
|
|
|
}
|
|
|
|
|
|
rule {
|
|
|
|
|
|
api_groups = [""]
|
|
|
|
|
|
resources = ["pods/exec"]
|
|
|
|
|
|
verbs = ["create"]
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_role_binding" "healthcheck_exec" {
  metadata {
    name      = "healthcheck-pod-exec"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.healthcheck.metadata[0].name
    namespace = kubernetes_namespace.openclaw.metadata[0].name
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.healthcheck_exec.metadata[0].name
  }
}
|
|
|
|
|
|
|
2026-03-07 21:09:31 +00:00
|
|
|
|
# --- CronJob: Task processor — polls Forgejo issues and triggers OpenClaw ---
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_cron_job_v1" "task_processor" {
  metadata {
    name      = "task-processor"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app  = "task-processor"
      tier = local.tiers.aux
    }
  }
  spec {
    schedule                      = "*/5 * * * *"
    concurrency_policy            = "Forbid"
    failed_jobs_history_limit     = 3
    successful_jobs_history_limit = 3

    job_template {
      metadata {
        labels = {
          app = "task-processor"
        }
      }
      spec {
|
2026-04-19 15:13:03 +00:00
|
|
|
|
active_deadline_seconds = 600
|
|
|
|
|
|
backoff_limit = 0
|
|
|
|
|
|
ttl_seconds_after_finished = 86400
|
2026-03-07 21:09:31 +00:00
|
|
|
|
template {
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
labels = {
|
|
|
|
|
|
app = "task-processor"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
service_account_name = kubernetes_service_account.healthcheck.metadata[0].name
|
|
|
|
|
|
restart_policy = "Never"
|
|
|
|
|
|
|
|
|
|
|
|
container {
|
|
|
|
|
|
name = "task-processor"
|
|
|
|
|
|
image = "bitnami/kubectl:latest"
|
|
|
|
|
|
command = ["bash", "-c", <<-EOF
|
|
|
|
|
|
# Find the openclaw pod
|
|
|
|
|
|
POD=$(kubectl get pods -n openclaw -l app=openclaw -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
|
|
|
|
|
|
if [ -z "$POD" ]; then
|
|
|
|
|
|
echo "ERROR: OpenClaw pod not found"
|
|
|
|
|
|
exit 1
|
|
|
|
|
|
fi
|
|
|
|
|
|
echo "Executing task processor in pod $POD..."
|
|
|
|
|
|
kubectl exec -n openclaw "$POD" -c openclaw -- \
|
|
|
|
|
|
env FORGEJO_TOKEN="$FORGEJO_TOKEN" \
|
2026-03-24 19:40:15 +02:00
|
|
|
|
FORGEJO_URL="http://forgejo.forgejo.svc.cluster.local" \
|
2026-03-07 21:09:31 +00:00
|
|
|
|
OPENCLAW_TOKEN="$OPENCLAW_TOKEN" \
|
|
|
|
|
|
OPENCLAW_URL="https://integrate.api.nvidia.com" \
|
|
|
|
|
|
bash /workspace/infra/scripts/task-processor.sh
|
|
|
|
|
|
EOF
|
|
|
|
|
|
]
|
|
|
|
|
|
|
|
|
|
|
|
env {
|
migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:
Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)
Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)
17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.
Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
|
|
|
|
name = "FORGEJO_TOKEN"
|
|
|
|
|
|
value_from {
|
|
|
|
|
|
secret_key_ref {
|
|
|
|
|
|
name = "openclaw-secrets"
|
|
|
|
|
|
key = "forgejo_api_token"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
env {
|
2026-03-15 19:05:04 +00:00
|
|
|
|
name = "OPENCLAW_TOKEN"
|
|
|
|
|
|
value_from {
|
|
|
|
|
|
secret_key_ref {
|
|
|
|
|
|
name = "openclaw-secrets"
|
|
|
|
|
|
key = "nvidia_api_key"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resources {
|
|
|
|
|
|
requests = {
|
|
|
|
|
|
cpu = "50m"
|
|
|
|
|
|
memory = "64Mi"
|
|
|
|
|
|
}
|
|
|
|
|
|
limits = {
|
right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
|
|
|
|
memory = "64Mi"
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
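As a rough illustration of the CronJob path listed above, the sketch below shows a minimal `kubernetes_cron_job_v1` carrying the suppression. The resource name, image, and schedule are placeholders chosen for this example and do not come from any stack in the repo; only the `ignore_changes` path and the marker comment follow the convention described here.

```hcl
# Hypothetical CronJob; name, schedule, and image are illustrative only.
resource "kubernetes_cron_job_v1" "example" {
  metadata {
    name = "example"
  }
  spec {
    schedule = "*/5 * * * *"
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name  = "example"
              image = "registry.example/job:1.0"
            }
          }
        }
      }
    }
  }
  lifecycle {
    # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
    # Note the extra job_template[0].spec[0] segment compared with a Deployment,
    # whose path is spec[0].template[0].spec[0].dns_config.
    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
  }
}
```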
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
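For the second insertion path, a sketch of what an extended multiline list might look like is shown below. The image path `spec[0].template[0].spec[0].container[0].image` is an assumption about what the Wave 4 workaround ignores (the text above only says "CI owns image tag"), and the resource itself is a placeholder.

```hcl
# Hypothetical Deployment with a pre-existing lifecycle block that has been extended.
resource "kubernetes_deployment" "example" {
  metadata {
    name = "example"
  }
  spec {
    selector {
      match_labels = { app = "example" }
    }
    template {
      metadata {
        labels = { app = "example" }
      }
      spec {
        container {
          name  = "example"
          image = "registry.example/app:1.0"
        }
      }
    }
  }
  lifecycle {
    ignore_changes = [
      # Assumed pre-existing entry; the trailing comma keeps the extended list valid HCL.
      spec[0].template[0].spec[0].container[0].image, # DRIFT_WORKAROUND: CI owns image tag
      # KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
      spec[0].template[0].spec[0].dns_config,
    ]
  }
}
```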
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
|
|
|
|
lifecycle {
|
|
|
|
|
|
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
|
|
|
|
|
|
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
|
|
|
|
|
}
|
2026-03-07 21:09:31 +00:00
|
|
|
|
}
|
2026-03-19 20:23:59 +00:00
|
|
|
|
|
|
|
|
|
|
# --- OpenLobster: Multi-user Telegram AI assistant (trial) ---
|
|
|
|
|
|
|
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
|
|
|
|
resource "kubernetes_persistent_volume_claim" "openlobster_data_proxmox" {
|
|
|
|
|
|
wait_until_bound = false
|
|
|
|
|
|
metadata {
|
|
|
|
|
|
name = "openlobster-data-proxmox"
|
|
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
annotations = {
|
2026-05-10 19:56:16 +00:00
|
|
|
|
"resize.topolvm.io/threshold" = "10%"
|
2026-04-04 19:25:12 +03:00
|
|
|
|
"resize.topolvm.io/increase" = "100%"
|
|
|
|
|
|
"resize.topolvm.io/storage_limit" = "5Gi"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
spec {
|
|
|
|
|
|
access_modes = ["ReadWriteOnce"]
|
|
|
|
|
|
storage_class_name = "proxmox-lvm"
|
|
|
|
|
|
resources {
|
|
|
|
|
|
requests = {
|
|
|
|
|
|
storage = "1Gi"
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
2026-05-10 21:57:01 +00:00
|
|
|
|
lifecycle {
|
|
|
|
|
|
# The autoresizer expands requests.storage up to storage_limit and
|
|
|
|
|
|
# PVCs can't shrink. Without this, every TF apply tries to revert
|
|
|
|
|
|
# to the spec value, K8s rejects the shrink, and the PVC ends up
|
|
|
|
|
|
# in Terminating-but-in-use limbo.
|
|
|
|
|
|
ignore_changes = [spec[0].resources[0].requests]
|
|
|
|
|
|
}
|
2026-04-04 19:25:12 +03:00
|
|
|
|
}
|
|
|
|
|
|
|
2026-03-19 20:23:59 +00:00
|
|
|
|
resource "random_password" "openlobster_graphql_token" {
  length  = 32
  special = false
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_deployment" "openlobster" {
  metadata {
    name      = "openlobster"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app  = "openlobster"
      tier = local.tiers.aux
    }
  }
  spec {
    strategy {
      type = "Recreate"
    }
    replicas = 0
    selector {
      match_labels = {
        app = "openlobster"
      }
    }
    template {
      metadata {
        labels = {
          app = "openlobster"
        }
      }
      spec {
        # node4 has a corrupted containerd content store; avoid it
        affinity {
          node_affinity {
            required_during_scheduling_ignored_during_execution {
              node_selector_term {
                match_expressions {
                  key      = "kubernetes.io/hostname"
                  operator = "NotIn"
                  values   = ["k8s-node4"]
                }
              }
            }
          }
        }
        container {
          name  = "openlobster"
          image = "ghcr.io/neirth/openlobster/openlobster:latest"
          port {
            container_port = 8080
          }
          env {
            name  = "OPENLOBSTER_GRAPHQL_AUTH_TOKEN"
            value = random_password.openlobster_graphql_token.result
          }
          env {
            name = "OPENLOBSTER_PROVIDERS_ANTHROPIC_API_KEY"
            value_from {
              secret_key_ref {
                name = "openclaw-secrets"
                key  = "anthropic_api_key"
              }
            }
          }
          env {
            name  = "OPENLOBSTER_PROVIDERS_ANTHROPIC_MODEL"
            value = "claude-sonnet-4-20250514"
          }
          env {
            name = "OPENLOBSTER_CHANNELS_TELEGRAM_TOKEN"
            value_from {
              secret_key_ref {
                name = "openclaw-secrets"
                key  = "telegram_bot_token"
              }
            }
          }
          env {
            name  = "OPENLOBSTER_DATABASE_DRIVER"
            value = "sqlite"
          }
          env {
            name  = "OPENLOBSTER_DATABASE_DSN"
            value = "/app/data/openlobster.db"
          }
          env {
            name  = "OPENLOBSTER_AGENT_NAME"
            value = "Lobster"
          }
          env {
            name  = "OPENLOBSTER_MEMORY_BACKEND"
            value = "file"
          }
          volume_mount {
            name       = "openlobster-data"
            mount_path = "/app/data"
          }
          resources {
            requests = {
              cpu    = "10m"
              memory = "64Mi"
            }
            limits = {
              memory = "256Mi"
            }
          }
        }
        volume {
          name = "openlobster-data"
          persistent_volume_claim {
|
2026-04-04 19:25:12 +03:00
|
|
|
|
claim_name = kubernetes_persistent_volume_claim.openlobster_data_proxmox.metadata[0].name
|
2026-03-19 20:23:59 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
lifecycle {
|
[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.
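To make the constraint concrete, here is a minimal sketch contrasting the rejected and accepted forms. The local name `kyverno_ignore_paths` and the resource are hypothetical, used only for illustration; the commented-out line is the form Terraform refuses because `ignore_changes` requires a static list.

```hcl
locals {
  # Hypothetical shared list; shown only to illustrate what does NOT work.
  kyverno_ignore_paths = ["spec[0].template[0].spec[0].dns_config"]
}

resource "kubernetes_deployment" "example" {
  metadata {
    name = "example"
  }
  spec {
    selector {
      match_labels = { app = "example" }
    }
    template {
      metadata {
        labels = { app = "example" }
      }
      spec {
        container {
          name  = "example"
          image = "registry.example/app:1.0"
        }
      }
    }
  }
  lifecycle {
    # Rejected at validation time (a static list expression is required):
    # ignore_changes = local.kyverno_ignore_paths

    # Accepted: literal attribute paths written out at every site.
    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
  }
}
```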
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
|
|
|
|
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
2026-03-19 20:23:59 +00:00
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
resource "kubernetes_service" "openlobster" {
  metadata {
    name      = "openlobster"
    namespace = kubernetes_namespace.openclaw.metadata[0].name
    labels = {
      app = "openlobster"
    }
  }
  spec {
    selector = {
      app = "openlobster"
    }
    port {
      port        = 80
      target_port = 8080
    }
  }
}
|
|
|
|
|
|
|
|
|
|
|
|
module "openlobster_ingress" {
|
|
|
|
|
|
source = "../../modules/kubernetes/ingress_factory"
|
2026-04-16 13:45:04 +00:00
|
|
|
|
dns_type = "proxied"
|
2026-03-19 20:23:59 +00:00
|
|
|
|
namespace = kubernetes_namespace.openclaw.metadata[0].name
|
|
|
|
|
|
name = "openlobster"
|
|
|
|
|
|
tls_secret_name = var.tls_secret_name
|
|
|
|
|
|
host = "openlobster"
|
|
|
|
|
|
port = 80
|
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
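A fail-closed enum like the one described above could be declared roughly as follows. This is a sketch under assumptions: the actual variable block in `modules/kubernetes/ingress_factory` is not shown here, so the description text and validation wording are illustrative, while the three values and the default come from the list above.

```hcl
# Hypothetical declaration; only the allowed values and default are taken from the commit text.
variable "auth" {
  type        = string
  default     = "required" # fail-closed: callers must opt out explicitly
  description = "Authentication mode for the ingress: required, public, or none."

  validation {
    condition     = contains(["required", "public", "none"], var.auth)
    error_message = "The auth value must be one of: required, public, none."
  }
}
```

With a validation block like this, a typo such as `auth = "requierd"` would fail at plan time instead of silently producing an ingress without the intended middleware.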
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
|
|
|
|
auth = "required"
|
2026-03-19 20:23:59 +00:00
|
|
|
|
}
|