The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.
Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:
kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
-l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
-c openclaw -- node /app/openclaw.mjs models auth login \
--provider openai-codex
modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."
The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:
1. **SOUL.md** — new marker-delimited section "Learning is your
identity" inserted before ## Boundaries. AGENTS.md tells the
agent to read SOUL.md first every session, so this is now the
FIRST thing the agent loads about itself. Frames learning as
character, not procedure.
2. **TOOLS.md v4** — section moved from the END of the file to
right after the `# TOOLS.md` title (first substantive content
on file load). Title strengthened: "THE FLOW — run this on
EVERY task. Not just hard ones." Concrete examples explicitly
call out diverse domains (calendar, frigate restart, disk
usage, inbox summary, deploys) so the universality is
unmistakable.
3. **learn-from-tasks skill** — opens with "This is universal.
EVERY task runs through this flow — not just hard ones, not
just unfamiliar ones. The save at the end is mandatory."
The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.
Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.
## The flow
1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
share the steps + credentials needed so I can do it on my own
next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
that's OK.
## Storage moved to memory-indexed location
Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:
- `scripts/<task>.md` runnable recipes
- `knowledge/<topic>.md` decisions, paths, gotchas
- `credentials/<name>.md` **POINTERS to Vault, never values**
## Credentials = Vault pointers only
Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.
## Init container also migrates
Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:
1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
- "TRY DEVVM before giving up" — when stuck, ssh devvm before
telling the user "I can't do that".
- "After every task, introspect → save a faster way" — for any
non-trivial task (especially recurring ones), save the recipe
to /workspace/learned/ and update INDEX.md.
2. New cc-skill `learn-from-tasks` at
/home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
both triggers: (A) you're stuck → check INDEX → ask devvm → save;
(B) you just finished → introspect → save if recurring.
3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
INDEX.md BEFORE reaching for devvm, so saved recipes are
findable on the next run.
4. Marker migration: strips both v1 and v2 markers before re-inserting
so user edits outside the markers always survive future restarts.
Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).
Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.
The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Give the OpenClaw pod two new capabilities:
1. Host-tools bundle. New init container `install-host-tools` extracts
openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
friends into /tools/host-tools/, with the bookworm-slim libs the
binaries need. PATH + LD_LIBRARY_PATH on the main container point
ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
marker; smoke test (ldd-based) fails the init at deploy time if any
binary has unresolved deps. Bundle is ~558 MB on the existing
/srv/nfs/openclaw/tools NFS.
2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
container startup symlinks /home/node/.ssh → there. New
/usr/local/bin/openclaw-task wrapper on devvm manages long-running
work as tmux sessions on devvm (sessions and logs survive pod
restarts — they live on devvm, not in the pod). New init container
`seed-devvm-memory-note` drops a markdown note teaching the pattern;
main container startup now runs `openclaw memory index --force` so
the note is searchable on first boot.
Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.
Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.
- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
manage their own Authentik objects
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
83ffd9fa).
Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.
Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.
Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.
claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
(300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
recruiter-responder image otherwise drops to uid 10001 which can't
write or chown.
Smoke-tested end-to-end 2026-05-15 ~23:15:
Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
(12.1s, JSON output complete with company/role/salary/location/tech)
-> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three coupled changes for the new recruiter-responder pipeline:
1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
download Job script + cmd renderer to handle text_only=true (skip
mmproj download + --mmproj flag). The 3 existing vision models stay
on text_only=false; no behaviour change for them.
2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
(app secrets from secret/recruiter-responder, DB creds from Vault DB
engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
Recreate -- IMAP IDLE + APScheduler want single leader), Service
ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.
3. stacks/openclaw/: add init container `install-recruiter-plugin` that
uses the recruiter-responder image to copy the .mjs plugin into
/home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
version to the recruiter-responder image tag. Also injects
RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
from openclaw-secrets.recruiter_responder_bearer_token, optional).
Pre-apply checklist for recruiter-responder stack:
- Vault: seed secret/recruiter-responder with webhook_bearer_token,
imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
task_webhook_token.
- Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
above webhook_bearer_token).
- dbaas: create DB recruiter_responder + role recruiter_responder,
and Vault DB-engine role static-creds/pg-recruiter-responder.
- Build + push image via Woodpecker (recruiter-responder repo CI).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.
Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.
Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.
Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).
The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previously-baked kubeconfig at /home/node/.openclaw/kubeconfig retained
a service-account token bound to the original (long-dead) pod, so kubectl
calls from inside the openclaw container failed with "the server has asked
for the client to provide credentials" even though the openclaw SA has
cluster-admin and kubelet projects a fresh token at
/var/run/secrets/kubernetes.io/serviceaccount/token.
Add init-container "setup-kubeconfig" that writes a kubeconfig with
tokenFile + certificate-authority paths pointing at the projected
SA volume — kubelet auto-rotates the token, kubectl always reads
fresh creds, no Vault K8s-creds-engine refresh needed.
Verified end-to-end: agent ran `kubectl get nodes -o wide` inside the
pod and delivered a correct one-line summary to Telegram via
openai-codex/gpt-5.4-mini.
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in
2026.4.21+). Auth profile is OAuth via the device-pairing flow against the
Codex backend (account ancaelena98@gmail.com); token persists in
/home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives
pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h);
gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin
gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model
(gpt-5-pro) after model discovery, so the container command pins the mini
back as default after doctor runs but before gateway start.
NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent
TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
returns 404. With both gone the openclaw failover never reached
mistral-large-3 in time, so every message hung until the 120s embedded-run
timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP
~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
+authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `external_monitor = false` to the ingress_factory calls for
task-webhook and torrserver so the `external-monitor-sync` CronJob
stops auto-creating `[External] <name>` monitors for them. Both
services remain deployed and reachable; only the Uptime Kuma monitors
are dropped.
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.
Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation)
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
- Add init container "install-dotfiles" that clones the dotfiles repo
and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
that manages the Mac config, eliminating the dual-sync problem
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
- Replace subPath ConfigMap mount with init container that copies openclaw.json
to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.