Commit graph

82 commits

Author SHA1 Message Date
Viktor Barzin
d4c76a07a2 openclaw: revert model swap + document codex re-auth path
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc. — exactly
what the v4 learning loop is supposed to leverage.

Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:

  kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
    -l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
    -c openclaw -- node /app/openclaw.mjs models auth login \
    --provider openai-codex

modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
dbb3dc04d3 openclaw: engrain the learning loop at the identity level
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."

The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:

1. **SOUL.md** — new marker-delimited section "Learning is your
   identity" inserted before ## Boundaries. AGENTS.md tells the
   agent to read SOUL.md first every session, so this is now the
   FIRST thing the agent loads about itself. Frames learning as
   character, not procedure.

2. **TOOLS.md v4** — section moved from the END of the file to
   right after the `# TOOLS.md` title (first substantive content
   on file load). Title strengthened: "THE FLOW — run this on
   EVERY task. Not just hard ones." Concrete examples explicitly
   call out diverse domains (calendar, frigate restart, disk
   usage, inbox summary, deploys) so the universality is
   unmistakable.

3. **learn-from-tasks skill** — opens with "This is universal.
   EVERY task runs through this flow — not just hard ones, not
   just unfamiliar ones. The save at the end is mandatory."

The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.

Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
d0a4876825 openclaw: v3 flow — know → ask devvm → (rarely) try yourself
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.

## The flow

1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
   share the steps + credentials needed so I can do it on my own
   next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. Most likely fail —
   that's OK.

## Storage moved to memory-indexed location

Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:

- `scripts/<task>.md`       runnable recipes
- `knowledge/<topic>.md`    decisions, paths, gotchas
- `credentials/<name>.md`   **POINTERS to Vault, never values**

## Credentials = Vault pointers only

Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.

## Init container also migrates

Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
ef67a53676 openclaw: explicit "use devvm + learn" default behaviour
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:

1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
   - "TRY DEVVM before giving up" — when stuck, ssh devvm before
     telling the user "I can't do that".
   - "After every task, introspect → save a faster way" — for any
     non-trivial task (especially recurring ones), save the recipe
     to /workspace/learned/ and update INDEX.md.

2. New cc-skill `learn-from-tasks` at
   /home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
   both triggers: (A) you're stuck → check INDEX → ask devvm → save;
   (B) you just finished → introspect → save if recurring.

3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
   scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
   INDEX.md BEFORE reaching for devvm, so saved recipes are
   findable on the next run.

4. Marker migration: strips both v1 and v2 markers before re-inserting
   so user edits outside the markers always survive future restarts.

Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
43802d2452 openclaw: also write devvm section to /workspace/TOOLS.md
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).

Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.

The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
7e558de8f0 openclaw: SSH + tmux task fallback to devvm
Give the OpenClaw pod two new capabilities:

1. Host-tools bundle. New init container `install-host-tools` extracts
   openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
   friends into /tools/host-tools/, with the bookworm-slim libs the
   binaries need. PATH + LD_LIBRARY_PATH on the main container point
   ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
   marker; smoke test (ldd-based) fails the init at deploy time if any
   binary has unresolved deps. Bundle is ~558 MB on the existing
   /srv/nfs/openclaw/tools NFS.

2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
   id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
   container startup symlinks /home/node/.ssh → there. New
   /usr/local/bin/openclaw-task wrapper on devvm manages long-running
   work as tmux sessions on devvm (sessions and logs survive pod
   restarts — they live on devvm, not in the pod). New init container
   `seed-devvm-memory-note` drops a markdown note teaching the pattern;
   main container startup now runs `openclaw memory index --force` so
   the note is searchable on first boot.

Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.

Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:01 +00:00
Viktor Barzin
a90ce27923 infra: add kubectl + authentik providers across 6 stacks
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.

- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
  workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
  manage their own Authentik objects
2026-05-22 14:17:00 +00:00
fa2b57f177 openclaw: enable recruiter-api plugin (allowlist + manifest contracts)
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
   ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
   in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
   83ffd9fa).

Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
6417c770c1 recruiter-responder + openclaw: wire gpt-mini secret keys + VIKTOR_CHAT_ID
recruiter-responder ExternalSecret gains GPT_MINI_ENDPOINT/_API_KEY/_MODEL
(NIM-served qwen3-coder-480b — gpt-5.4-mini in OpenClaw is OAuth-only and
not HTTP-accessible to external services). OpenClaw gains VIKTOR_CHAT_ID
env consumed by the recruiter-api plugin's announcement loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:17:00 +00:00
Viktor Barzin
1dd8f4e2bf openclaw: native MCP servers + daily claude-memory sync
Wire ha-mcp, context7, and the in-pod playwright sidecar as native
MCP servers on OpenClaw via `mcp set` in the container startup
(ConfigMap-baked mcp.servers gets stripped by `doctor --fix`; CLI-set
entries persist). HA URL pulled from new Vault key
secret/openclaw.ha_sofia_mcp_url and passed via the
HA_SOFIA_MCP_URL env var.

Add a daily 03:00 UTC `memory-sync` CronJob in the openclaw
namespace: pulls all non-sensitive memories from
claude-memory.claude-memory.svc:80/api/memories, groups by category,
writes 18 Markdown files into /workspace/memory/projects/claude-
memory-sync/ (the path memory-core indexes), then triggers
`openclaw memory index --force` via kubectl exec. Reuses the
existing cluster-healthcheck SA (pods+pods/exec). Smoke test: 1488
memories synced, 25/25 files indexed, search returns hits.

Also drops the legacy /app/extensions entry from
plugins.load.paths (doctor warning), wires HA_SOFIA_MCP_URL env,
and one-shot deletes the stale 2026-02-28 metaclaw-export.json from
the openclaw home volume.

claude_memory MCP intentionally NOT wired — its /mcp/mcp transport
404s on the deployed claude-memory-mcp:17 image (tracked as
code-z1so). Shared knowledge is delivered via the CronJob's REST
sync instead. Adding claude_memory to mcp.servers is a one-line
follow-up once that's fixed.
2026-05-22 14:16:53 +00:00
3027ab85a8 recruiter-responder: bump image_tag to 189ef901
OpenClaw can now answer 'what do we know about <company>?' from cache
via the new recruiter_company_research tool, and recruiter_get embeds
the cached research payload inline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:49 +00:00
root
1177a82452 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:47 +00:00
a98b00324d recruiter-responder: pin image tag + run plugin installer init as root
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
  (300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
  runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
  recruiter-responder image otherwise drops to uid 10001 which can't
  write or chown.

Smoke-tested end-to-end 2026-05-15 ~23:15:
  Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
  (12.1s, JSON output complete with company/role/salary/location/tech)
  -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
7e1580ba8c recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount
Three coupled changes for the new recruiter-responder pipeline:

1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
   unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
   download Job script + cmd renderer to handle text_only=true (skip
   mmproj download + --mmproj flag). The 3 existing vision models stay
   on text_only=false; no behaviour change for them.

2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
   (app secrets from secret/recruiter-responder, DB creds from Vault DB
   engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
   Recreate -- IMAP IDLE + APScheduler want single leader), Service
   ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.

3. stacks/openclaw/: add init container `install-recruiter-plugin` that
   uses the recruiter-responder image to copy the .mjs plugin into
   /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
   version to the recruiter-responder image tag. Also injects
   RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
   from openclaw-secrets.recruiter_responder_bearer_token, optional).

Pre-apply checklist for recruiter-responder stack:
  - Vault: seed secret/recruiter-responder with webhook_bearer_token,
    imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
    task_webhook_token.
  - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
    above webhook_bearer_token).
  - dbaas: create DB recruiter_responder + role recruiter_responder,
    and Vault DB-engine role static-creds/pg-recruiter-responder.
  - Build + push image via Woodpecker (recruiter-responder repo CI).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
dc134011eb fix: pvc-autoresizer + TF drift safety — bulk add ignore_changes
After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.

Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.

Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
3f2b2f9d32 fix: pvc-autoresizer threshold should be 10%, not 80%
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.

Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).

The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
ff5538a667 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
29bb434e1e ig-poster: 69e395f2 + sync IMMICH_PG_* via ESO for CLIP scoring; postiz publish-notify n8n workflow 2026-05-10 11:12:38 +00:00
Viktor Barzin
badc341669 openclaw: regenerate kubeconfig at pod start using projected SA tokenFile
The previously-baked kubeconfig at /home/node/.openclaw/kubeconfig retained
a service-account token bound to the original (long-dead) pod, so kubectl
calls from inside the openclaw container failed with "the server has asked
for the client to provide credentials" even though the openclaw SA has
cluster-admin and kubelet projects a fresh token at
/var/run/secrets/kubernetes.io/serviceaccount/token.

Add init-container "setup-kubeconfig" that writes a kubeconfig with
tokenFile + certificate-authority paths pointing at the projected
SA volume — kubelet auto-rotates the token, kubectl always reads
fresh creds, no Vault K8s-creds-engine refresh needed.

Verified end-to-end: agent ran `kubectl get nodes -o wide` inside the
pod and delivered a correct one-line summary to Telegram via
openai-codex/gpt-5.4-mini.
2026-05-10 11:12:37 +00:00
Viktor Barzin
41655096c7 openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.

Metrics exported:
  openclaw_codex_messages_total{provider,model,session_kind}    counter
  openclaw_codex_input/output/cache_read/cache_write_tokens_total
  openclaw_codex_message_errors_total{reason}
  openclaw_codex_active_sessions{kind}                          gauge
  openclaw_codex_oauth_expiry_seconds{provider,account,plan}    gauge
  openclaw_codex_last_run_timestamp                             gauge

Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.

Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 23:29:32 +00:00
Viktor Barzin
115ca184ff openclaw: switch primary to ChatGPT Plus OAuth (openai-codex/gpt-5.4-mini)
Bumps image 2026.2.26 → 2026.5.4 (openai-codex provider plugin landed in
2026.4.21+). Auth profile is OAuth via the device-pairing flow against the
Codex backend (account ancaelena98@gmail.com); token persists in
/home/node/.openclaw/agents/main/agent/auth-state.json on NFS so it survives
pod restarts. Plus tier accepts gpt-5.4-mini (1,200–7,000 local msgs/5h);
gpt-5-mini and gpt-5.1-codex-mini both return errors on Plus, so we pin
gpt-5.4-mini explicitly. doctor --fix auto-promotes the highest-tier model
(gpt-5-pro) after model discovery, so the container command pins the mini
back as default after doctor runs but before gateway start.
2026-05-07 23:29:32 +00:00
Viktor Barzin
8b180f7662 openclaw: switch primary model to qwen3-coder-480b (qwen3.5-397b dead on NIM)
NVIDIA retired nim/qwen/qwen3.5-397b-a17b — modelrelay shows consistent
TIMEOUTs over 24h+ of pings, and nim/nvidia/llama-3.1-nemotron-ultra-253b-v1
returns 404. With both gone the openclaw failover never reached
mistral-large-3 in time, so every message hung until the 120s embedded-run
timeout. Promote qwen3-coder-480b-a35b-instruct (already in models list, UP
~1-2s, 256k ctx) to primary; drop the dead nemotron-ultra fallback.
2026-05-07 23:29:31 +00:00
Viktor Barzin
a0d770d9a7 [cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:13:03 +00:00
Viktor Barzin
581aed5fcc [openclaw,tor-proxy] Opt task-webhook + torrserver out of external monitoring
Adds `external_monitor = false` to the ingress_factory calls for
task-webhook and torrserver so the `external-monitor-sync` CronJob
stops auto-creating `[External] <name>` monitors for them. Both
services remain deployed and reachable; only the Uptime Kuma monitors
are dropped.
2026-04-19 13:01:36 +00:00
Viktor Barzin
327ce215b9 [infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context

Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.

Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.

## This change

Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:

- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
  `spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
  `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
  (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
  one level deeper)

Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.

Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):

1. **No existing `lifecycle {}`**: inject a brand-new block just before the
   resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
   from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
   dns_config path. Handles both inline (`= [x]`) and multiline
   (`= [\n  x,\n]`) forms; ensures the last pre-existing list item carries
   a trailing comma so the extended list is valid HCL. 34 extensions.

The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.

## Scale

- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
  `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
  Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
  future stack created from it should either inherit the Wave 3A one-line
  form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
  nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
  separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
  `kubernetes_manifest`, etc.) — they don't own pods so they don't get
  Kyverno dns_config mutation.

## Verification

Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan  → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan  → No changes.
$ cd stacks/frigate && ../../scripts/tg plan    → No changes.

$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
169
```

## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
   the deployment's dns_config field.

Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
Viktor Barzin
8b43692af0 [infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context

Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.

Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.

This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.

## This change

107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:

```hcl
lifecycle {
  # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
  ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```

Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.

Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).

## What is NOT in this change

- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
  (paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
  minimal. User keeps it that way. Not touched by the script (file
  has no real `resource "kubernetes_namespace"` — only a placeholder
  comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
  gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
  authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
  to keep the commit scoped to the Goldilocks sweep. Those files will
  need a separate fmt-only commit or will be cleaned up on next real
  apply to that stack.

## Verification

Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:

```
$ cd stacks/dawarich && ../../scripts/tg plan

Before:
  Plan: 0 to add, 2 to change, 0 to destroy.
   # kubernetes_namespace.dawarich will be updated in-place
     (goldilocks.fairwinds.com/vpa-update-mode -> null)
   # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
     (Kyverno generate.* labels — fixed in 8d94688d)

After:
  No changes. Your infrastructure matches the configuration.
```

Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```

## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.

Closes: code-dwx

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
216d4240c9 [infra] Add Cloudflare provider to all stack lock files and generated providers
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:31:36 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                     |
  cloudflare_record               auto-creates
  tunnel per-hostname             cloudflare_record + annotation
```

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
30cdeefb1c chore: sync terraform state after nfsvers=4 convergence
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:20:18 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
46444e0306 openclaw: remove install-dotfiles init container to reduce NFS writes
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.

Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
2026-03-29 01:11:33 +02:00
Viktor Barzin
f0eb4fae8b fix: openclaw task-processor use internal Forgejo URL
The task-processor CronJob was failing every 5min because
it used https://forgejo.viktorbarzin.me (external, via Cloudflare
tunnel) which is unreachable from within the cluster. Changed to
http://forgejo.forgejo.svc.cluster.local (internal ClusterIP).
2026-03-24 19:40:15 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
6b8ce04d44 fix(openclaw): change agent workspace from /workspace/infra to /workspace
Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.
2026-03-19 23:32:28 +00:00
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
1acf8cc4e8 migrate consuming stacks to ESO + remove k8s-dashboard static token
Phase 9: ExternalSecret migration across 26 stacks:

Fully migrated (vault data source removed, ESO delivers secrets):
- speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor
- n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge
- hackmd (ESO template for DB URL), health (ESO template for DB URL)
- trading-bot (ESO template for DATABASE_URL + 7 secret env vars)
- forgejo (removed unused vault data source)

Partially migrated (vault kept for plan-time, ESO added for runtime):
- immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage)
- claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs)
- woodpecker, openclaw, resume (plan-time in helm values/jobs/modules)

17 stacks unchanged (all plan-time: homepage annotations, configmaps,
module inputs) — vault data source works with OIDC auth.

Phase 17a: Remove k8s-dashboard static admin token secret.
Users now get tokens via: vault write kubernetes/creds/dashboard-admin
2026-03-15 19:05:04 +00:00
Viktor Barzin
1fe7798609 fix openclaw init container: escape shell vars, fix image path [ci skip]
- Use $$ for shell variable escaping in Terraform ($ is Terraform interpolation)
- Fix image: docker.io/alpine/git (not library/alpine/git)
- Inline command instead of heredoc to avoid Terraform interpolation issues
2026-03-15 17:19:03 +00:00
Viktor Barzin
3aba29e7a3 remove SOPS pipeline, deploy ESO + Vault DB/K8s engines
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.

Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
  add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml

Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
  (vault-kv for KV v2, vault-database for DB engine)

Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)

Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig

K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
2026-03-15 16:37:38 +00:00
Viktor Barzin
deeea5edab openclaw: replace cc-config NFS with dotfiles repo clone [ci skip]
- Add init container "install-dotfiles" that clones the dotfiles repo
  and installs skills/agents/hooks to OpenClaw's home directory
- Remove nfs_cc_config module and its volume mount
- Skills/agents now come from the same chezmoi-managed dotfiles repo
  that manages the Mac config, eliminating the dual-sync problem
2026-03-15 16:04:02 +00:00
Viktor Barzin
194281e527 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-15 15:30:18 +00:00
Viktor Barzin
18d012db11 fix: reduce openclaw memory requests for scheduling
- openclaw: request 1280Mi (limit 2Gi), modelrelay request 128Mi
  (limit 256Mi). Total request 1408Mi fits available capacity.
2026-03-15 10:47:34 +00:00
Viktor Barzin
56ddee457a fix: openclaw policy violation + reduce memory requests for capacity
- openclaw: fix Kyverno policy violation (node:22-alpine ->
  docker.io/library/node:22-alpine), reduce request to 1536Mi
  with 2Gi limit for overcommit
- rybbit/clickhouse: reduce 1Gi -> 768Mi (frees 256Mi)
- stirling-pdf: reduce 1536Mi -> 1200Mi (frees 336Mi)
2026-03-15 10:37:58 +00:00
Viktor Barzin
4a27345057 enable memory-core plugin for OpenClaw [ci skip]
- Add memory-core to plugins.allow and plugins.slots.memory
- Add /app/extensions to plugin load paths
- Update CLAUDE.md memory instructions to reference native tools
2026-03-15 03:22:07 +00:00
Viktor Barzin
6f562b5da6 add vaultwarden daily backup CronJob to NFS
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
2026-03-15 00:03:59 +00:00
Viktor Barzin
46afa85b01 fix openclaw config mount and OOM: use init container, increase memory to 2Gi
- Replace subPath ConfigMap mount with init container that copies openclaw.json
  to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
2026-03-14 23:42:17 +00:00
Viktor Barzin
eb0301b02b lower memory limits closer to actual usage
openclaw: 1536Mi -> 768Mi, affine: 256Mi -> 128Mi, rybbit: 512Mi -> 384Mi.
Also patched via kubectl: aiostreams, cloudflared, crowdsec, uptime-kuma,
vaultwarden, pgadmin, phpmyadmin, goflow2, sealed-secrets, ebook2audiobook.
2026-03-14 21:15:26 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00