Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion
to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on
2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply
(b59acbc1 agent memory bump) re-rendered the deployment from chart
defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI
watcher-auth bug against the current api.crowdsec.net behaviour, so
every fresh replica started 403'ing on startup.
Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives
future TF applies independently of the chart's appVersion. Verified
CAPI auth succeeds on all 3 fresh pods with v1.7.8.
Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…`
is single-shot and was already consumed by the first replica;
subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT
console enrollment (separate flows). Re-enable console reporting by
generating a fresh enroll key at app.crowdsec.net (procedure
documented in the values.yaml comment block).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CAPI auth at api.crowdsec.net is rejecting watcher logins from inside
the cluster within ~1h of registration, even after rotating creds via
`cscli capi register`. The same login successfully authenticates from
devvm but fails from cluster pods → IP-throttle or account-state issue
at the central API. Until that's resolved with CrowdSec support (or
the throttle window resets), running with CAPI on is just chronic
crashloops on every fresh replica.
`DISABLE_ONLINE_API=true` makes the chart entrypoint run
`conf_set 'del(.api.server.online_client)'`, removing the online_client
block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off:
no community blocklists. Local scenarios + bouncers continue
unchanged.
Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml`
is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job`
is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So
the hook runs without the Role it needs, and atomic apply rolls back.
Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret
manually (the hook short-circuits when the secret already exists) and
re-applied the missing Role for future re-enablement.
Re-enable path documented in the comment block.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every `tg apply` was reverting the annotations that keel patches when it
detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped),
`keel.sh/update-time` (on the pod template; what actually triggers the
rollout), plus the K8s-managed `kubernetes.io/change-cause` and
`deployment.kubernetes.io/revision`. The revert forced a rollout, then
the next keel poll re-stamped the annotations, forcing another. With
llama-swap's ~10s cold load on each pod recreate, the user noticed.
Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag —
keel still drives one legitimate rollout per day at ~07:25 UTC; this
patch stops the apply-driven extra rollouts on top of that.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5th worker container running in audit-only mode. Writes
kevin_signal_bridge_state rows showing what it WOULD trade but never
publishes to signals:generated. Kill-switch flipped in Phase 2.
The xray-vless ingress, Service port 6443, and container port 6443 had
no backing listener — xray.config.json only binds 7443 (REALITY), 8443
(WS) and 9443 (XHTTP). The "xray-vless" hostname has returned 502
ever since the module was created.
Side effect: removing the first Service port slot ("vless"/6443) caused
the kubernetes provider to shift targetPort values on the remaining
two ports (defaulting only worked at create time, not on port removal).
Pinning target_port explicitly makes Service routing deterministic.
End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080
-> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all
three transports proxied successfully through a test pod, and the egress
IP correctly resolves to the home WAN.
krr 2026-05-22 flagged postiz-app as critically under-requested when it
was running (gap 2.2 GiB above the 512Mi request). Postiz is currently
uninstalled in the cluster — this change is only for when the stack is
re-deployed later. No apply triggered now.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as
under-requested by ~588 MiB across the cluster. Live usage sits around
80-128 MiB during active log parsing, so the 64 MiB request risked
eviction ahead of more-needed pods. Limit stays at 512 MiB.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged nvidia-driver-daemonset as critically
under-requested (~566 MiB gap). Live driver process holds ~600-800Mi
once the kernel module is loaded. Limit stays at 2Gi so the DKMS build
during a kernel upgrade still has headroom (documented in values.yaml
to need ~1.4 GiB peak).
May help unblock code-8vr0 (GPU driver crashloop on node1) if the
crashloop was OOM-driven.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working
set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB
request meant the scheduler didn't reserve enough room and the pod risked
eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB
during unlock (memory id=712, and already documented next to the 1280Mi limit).
Request was 64 MiB — meaning the unlock burst ran "best-effort", first in
line for OOM under node pressure. krr 2026-05-22 flagged this as a top
under-request. Bumping request matches the documented requirement.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The www subdomain was internal-only (no Cloudflare DNS record) but the
external uptime-kuma monitor still flagged it as down because public DNS
resolution failed. Removing the ingress along with the Technitium CNAME
makes the failure mode disappear and lets the cluster reach an
autonomous-clean state.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest
was cache headroom. Primary keeps more (1 GiB) since it owns authoritative
zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the
MetalLB external IP (10.0.20.201) post-rollout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request)
so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation
was measured idle and is not safe for an active job.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with
the live Keel-managed version to avoid an inadvertent downgrade on apply.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod
working set is ~80 MiB; 512Mi leaves comfortable headroom for the
Symfony+RoadRunner footprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The chain wasn't idempotent — re-running on a partially-upgraded cluster
would re-drain + re-kubeadm + re-apt an already-upgraded node, causing
unnecessary disruption (5-10 min per no-op node) and risking alert
re-fires during the unnecessary drain.
Today's chain hit this twice: after fixing the version-detection bug
(commit a0f3e155), the chain correctly resumed but re-did master AND
node4 even though both were already on v1.34.8. node4 got cordoned,
drained, and is now soaking for 10 min for no reason.
Fix: at the top of phase_master and phase_worker, read the node's
current kubelet version. If it equals TARGET_VERSION, skip the whole
phase (return 0 — spawn_next will fire downstream). Chain advances
without disturbing the already-upgraded node.
In-flight effect: the current node4 worker pod has the old script
mounted from configmap snapshot, so it'll continue. If it fails and
retries, the new pod will see node4 on v1.34.8 and short-circuit.
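A minimal sketch of the skip decision, assuming the check is a plain string comparison between the node's kubelet version and TARGET_VERSION. `plan_phases` and the node name `node3` are hypothetical, used only to illustrate; the real check lives in the shell phases.

```python
def plan_phases(node_versions: dict[str, str], target_version: str) -> dict[str, str]:
    """Map each node to 'skip' if it already runs the target, else 'upgrade'."""
    return {
        node: "skip" if version == target_version else "upgrade"
        for node, version in node_versions.items()
    }


# Hypothetical cluster state mirroring today's run: master and node4 are
# already on the target, one worker is still behind.
nodes = {"k8s-master": "v1.34.8", "node4": "v1.34.8", "node3": "v1.34.7"}
plan = plan_phases(nodes, "v1.34.8")
print(plan)
```

With this shape a re-run only drains nodes marked `upgrade`; the already-upgraded ones return early and the chain moves on.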
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous version-check read RUNNING from .items[0].nodeInfo.kubeletVersion
— which is just k8s-master. If master is upgraded but workers aren't
(e.g. a chain that completed master phase but failed mid-worker), the
version-check sees v1.34.8 and decides "no upgrade needed", never
spawning the resume phase. Workers stay behind forever.
Today's chain hit exactly this: master + node4 upgraded to v1.34.8,
worker-node4 Failed mid-soak (alert sensitivity, since loosened),
chain dead. Re-triggering the version-check looked at master only,
decided cluster was "done", and refused to resume worker chain.
Fix: read all node kubelet versions, sort -V, take head -1 (oldest).
A partial chain now correctly reports the un-upgraded version and the
chain resumes.
Trivial change; tested live — chain now correctly reports v1.34.7
(workers' version) and spawns preflight → master → worker chain.
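For reference, a Python sketch of the same "oldest version wins" logic that `sort -V | head -1` gives in the script. The helper names are hypothetical and the parsing assumes plain `vMAJOR.MINOR.PATCH` kubelet versions with no pre-release suffix.

```python
import re


def version_key(version: str) -> tuple[int, ...]:
    """Turn 'v1.34.8' into (1, 34, 8) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def oldest_kubelet_version(versions: list[str]) -> str:
    """Return the lowest version across all nodes (the cluster's floor)."""
    return min(versions, key=version_key)


# Partial chain: master + node4 upgraded, remaining workers still behind.
print(oldest_kubelet_version(["v1.34.8", "v1.34.8", "v1.34.7", "v1.34.7"]))
```

The numeric key matters: a plain string sort would rank `v1.34.10` below `v1.34.9`.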
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier in this session, commit 503ac4c1 brought the for: from 5m → 2m
based on a brief I wrote inaccurately. The brief said the alert "fires
immediately" but it was actually already at 5m. The subagent followed
the explicit "2m" target and tightened it — opposite of what we wanted.
10m is the right value for our chain: a full drain + kubeadm + apt +
kubelet restart + uncordon cycle can take a worker out of MetalLB
rotation for 5-7 min in the worst case (PDB stickiness on some pods).
10m suppresses upgrade-induced blips while still catching real
speaker-down conditions.
node4 worker phase tripped this alert mid-soak today, aborted the
chain (Job retry), succeeded on the 2nd attempt only because alerts
didn't re-fire fast enough. With 10m the next workers shouldn't need
the retry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Today's worker-phase rolling upgrade tripped MysqlStandaloneDown,
MetalLBSpeakerDown, KubeletRunningContainersDrop, and
IngressErrorRate5xxHigh even though every affected workload
recovered within 30-60s. Loosen `for:` (and one threshold) on each so
they only fire on persistent faults, not on routine drain+kubelet-
restart cycles.
- MysqlStandaloneDown: for 2m -> 3m (single-replica StatefulSet,
drain re-scheduling routinely takes 1-3m).
- MetalLBSpeakerDown: for 5m -> 2m (kubelet restart drops the
speaker pod for 30-45s; 2m suppresses that blip).
- KubeletRunningContainersDrop: absolute `< -10` threshold replaced
with relative `< -0.5` (>50% drop vs. 10m ago); routine drains
regularly shed 10-30 containers and tripped the old rule.
- IngressErrorRate5xxHigh: for 5m -> 10m (rolling pod migrations
cause brief 5xx spikes that clear in 1-2m).
Severity, labels, and annotation structure preserved; only `for:`
durations and the one expression changed. Tactical loosening of
four specific alerts -- broader observability audit tracked
separately in beads.
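To illustrate the KubeletRunningContainersDrop change, a small sketch comparing the old absolute rule with the new relative one. The container counts are made-up illustration values, and the functions only mirror the thresholds quoted above, not the actual PromQL.

```python
def old_rule_fires(now: float, ten_min_ago: float) -> bool:
    """Old rule: fire when the absolute change is below -10 containers."""
    return (now - ten_min_ago) < -10


def new_rule_fires(now: float, ten_min_ago: float) -> bool:
    """New rule: fire when more than half the containers are gone."""
    if ten_min_ago == 0:
        return False  # nothing was running, so there is nothing to drop
    return (now - ten_min_ago) / ten_min_ago < -0.5


# Routine drain: 120 -> 100 containers (hypothetical numbers).
print(old_rule_fires(100, 120), new_rule_fires(100, 120))
# Real fault: 120 -> 50 containers (hypothetical numbers).
print(old_rule_fires(50, 120), new_rule_fires(50, 120))
```

A 20-container drain trips only the old rule, while losing more than half the node's containers trips both.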
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refactored halt_on_alert_query from a denylist ("ignore these noisy alerts")
to an allowlist ("only halt on severity=critical"). Today's blocking
alerts were all warning/info-level and not actual upgrade blockers:
- PodCrashLooping (gpu-operator on the GPU node, code-8vr0, long-standing)
- IngressTTFBHigh (Traefik latency, transient)
- NodeHighIOWait (chicken-and-egg with our own upgrade I/O)
- RecentNodeReboot (chain causes this itself)
severity=critical filtering is more robust than maintaining a denylist
of every noisy alert that crops up. extra_ignore parameter kept for
backwards compatibility but is rarely needed now (critical alerts are
the only ones that should actually halt the chain).
Tested end-to-end this session — master successfully upgraded to v1.34.8
via the autonomous chain after the apiserver state-repair (apiserver
manifest had been pinned at v1.34.2 from a previous month's rollback;
required a one-time manual edit + kubelet reload to bring back to v1.34.7,
after which the chain ran cleanly).
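A rough sketch of the allowlist idea, assuming alerts arrive as dicts with `alertname` and `severity` keys (a hypothetical shape; the real function queries Alertmanager from shell). The `warning` severities below are placeholders for "warning/info-level", and IngressTTFBCritical is assumed to be severity=critical from its name.

```python
import re
from typing import Optional


def blocking_alerts(alerts: list[dict], extra_ignore: Optional[str] = None) -> list[str]:
    """Return names of alerts that should halt the chain.

    Only severity=critical alerts count; extra_ignore is an optional
    regex for critical alerts that are known to be unrelated.
    """
    ignore = re.compile(extra_ignore) if extra_ignore else None
    return [
        alert["alertname"]
        for alert in alerts
        if alert.get("severity") == "critical"
        and not (ignore and ignore.search(alert["alertname"]))
    ]


firing = [
    {"alertname": "PodCrashLooping", "severity": "warning"},
    {"alertname": "NodeHighIOWait", "severity": "warning"},
    {"alertname": "RecentNodeReboot", "severity": "warning"},
    {"alertname": "IngressTTFBCritical", "severity": "critical"},
]
print(blocking_alerts(firing))
print(blocking_alerts(firing, extra_ignore="IngressTTFBCritical"))
```

New noisy warning-level alerts need no code change under this scheme; only a critical alert that is known to be benign needs an `extra_ignore` entry.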
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave-1 trusted-registries allowlist was missing woodpeckerci/* which is
used by every .woodpecker.yml's clone step (woodpeckerci/plugin-git) and
build steps (woodpeckerci/plugin-docker-buildx). Result: ALL Woodpecker
pipelines have been failing at the git step since the Audit→Enforce flip
on 2026-05-19. First surfaced via code-da4h (recruiter-responder pushes
not building).
Added between viren070/* and zelest/* in the same DockerHub-user-repos
block as the 2026-05-22 batch (commit 2d35d72a).
Closes: code-da4h
Every internal *.viktorbarzin.me hostname (~80 services) chains through the
split-horizon `viktorbarzin.me` apex A record. If the apex drifts (ISP
rollover, accidental edit), every internal service breaks at once — the
2026-05-22 ha-sofia incident was exactly this.
This adds a backstop probe so the next drift surfaces in <10 min instead
of via user-reported outage:
- CronJob `viktorbarzin-apex-probe` in `technitium` namespace, every 5 min,
resolves `viktorbarzin.me A` against the Technitium LB IP (10.0.20.201)
and pushes `viktorbarzin_apex_correct` + `_last_correct_timestamp` to
Pushgateway. Python+dnspython, ~30 LOC.
- 3 Prometheus alerts:
- `ViktorBarzinApexDrift` (critical, 10m) — apex resolved to anything
other than 10.0.20.200.
- `ViktorBarzinApexProbeStale` (warning, 5m on 15m gap) — probe stopped
succeeding.
- `ViktorBarzinApexProbeNeverRun` (warning, 30m absent) — probe never
reported.
- Added the new alert names to the Slack receiver matcher in both routes
alongside EmailRoundtrip*.
Verified: rules loaded as inactive (apex is correct), metric flowing, manual
probe job pass observed.
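A condensed sketch of what the probe does, not the deployed script. The dnspython calls are real, but the function names are hypothetical, the full metric name `viktorbarzin_apex_last_correct_timestamp` is an assumption expanded from the `_last_correct_timestamp` suffix, and the Pushgateway push itself is left out.

```python
import time

EXPECTED_APEX = "10.0.20.200"  # address the apex must resolve to


def resolve_apex(nameserver: str = "10.0.20.201", name: str = "viktorbarzin.me") -> list[str]:
    """Resolve the apex A record against the Technitium LB IP."""
    import dns.resolver  # dnspython, imported lazily so the rest runs without it

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    return sorted(record.address for record in resolver.resolve(name, "A"))


def build_metrics(addresses: list[str], now: float) -> dict[str, int]:
    """Build the gauge values to push; the timestamp only moves on success."""
    correct = addresses == [EXPECTED_APEX]
    metrics = {"viktorbarzin_apex_correct": 1 if correct else 0}
    if correct:
        metrics["viktorbarzin_apex_last_correct_timestamp"] = int(now)
    return metrics


def to_exposition(metrics: dict[str, int]) -> str:
    """Render metrics in the Prometheus text format Pushgateway accepts."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


print(to_exposition(build_metrics(["10.0.20.200"], time.time())))
```

Only advancing the timestamp on a correct answer is what would let a staleness alert distinguish "probe stopped succeeding" from "probe ran and saw drift".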
Three changes unblocking the autonomous chain for k8s patch upgrades:
1. **phase_master quiesces tigera-operator before drain, restores after.**
Tigera crashes immediately if apiserver is unreachable (no retry logic),
and its crashloop during master static-pod swaps generates ~500MB/s
disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its
limit. Quiesce removes the storm contributor; calico data plane keeps
running unchanged (data plane is the DaemonSet+Typha, operator is just
the reconciler).
2. **update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt.**
For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change — kubeadm
writes an identical manifest, hash doesn't update, watch times out and
rolls back forever. The skip-etcd retry sidesteps it for the legitimate
no-change case while still doing a full etcd upgrade on the first
attempt (correct for minor-version bumps).
3. **halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait.**
Both are symptoms-not-causes: ingress latency spikes briefly during any
pod-restart wave; high IOwait is exactly what upgrade activity causes
(chicken-and-egg). The inline quiet-baseline check (Ready transition
<10min) is the real cluster-churn gate.
RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale
subresource so the chain can do the scale-to-0/back-to-1 on tigera.
These three together get the chain past the cascade that's been blocking
1.34.7→1.34.8 for a week. Long-term fix is still HA control plane
(beads code-n0ow); these are the bridge.
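The retry shape from change 2, sketched in Python for clarity. This assumes update_k8s.sh ends up invoking `kubeadm upgrade apply`; the helper name and the exact argument list are hypothetical, only the `--etcd-upgrade=false` flag comes from the change itself.

```python
def kubeadm_args(target_version: str, attempt: int) -> list[str]:
    """Build the kubeadm command line for a given attempt number.

    Attempt 1 does a full upgrade (etcd included); later attempts skip
    etcd so an unchanged etcd manifest cannot stall the hash watch.
    """
    args = ["kubeadm", "upgrade", "apply", target_version, "--yes"]
    if attempt >= 2:
        args.append("--etcd-upgrade=false")
    return args


print(" ".join(kubeadm_args("v1.34.8", attempt=1)))
print(" ".join(kubeadm_args("v1.34.8", attempt=2)))
```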
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Browsers accumulate one authentik_proxy_<random> cookie per Authentik
Proxy Provider under viktorbarzin.me (Path=/). With 30+ services the
combined Cookie header exceeds nginx's default 4 x 8k
large_client_header_buffers and trips '431 Request Header Fields Too
Large' at the forward-auth nginx (traefik/auth-proxy).
Bumped to:
client_header_buffer_size 8k
large_client_header_buffers 8 64k
Matches the pattern used on the London Flint 2 router nginx
(memory id=647).
Sonnet-4-5 trips Anthropic per-account rate_limit_error on the OAuth
bearer (sk-ant-oat01) tokens after 5-10 burst calls — sticky multi-hour
quota. Haiku-4-5 has much higher RPM and processes the 16-video
backfill cleanly (~30s/video with inter-call throttle).
Comment above the env line documents the rationale for future re-evaluation.
Remove leftover SendGrid references after the Brevo migration was completed:
- Delete TF `cloudflare_record.mail_domainkey` (TXT at `s1._domainkey`,
SendGrid-era DKIM, hidden behind the SendGrid CNAME but would re-emerge
once the CNAME is removed).
- Clean up commented-out `smtp.sendgrid.net` relayhost references and the
`# For sendgrid` comment on `sasl_passwd` in the mailserver module.
DNS records deleted out-of-band (not TF-managed):
- CF: `s1._domainkey CNAME` + `s2._domainkey CNAME` → sendgrid.net (manual entries)
- Technitium internal `viktorbarzin.me`: `em7107`, `s1._domainkey`,
`s2._domainkey` CNAMEs → sendgrid.net
Verified end-to-end mail flow unaffected (Brevo outbound + IMAP receive,
roundtrip 20.4s — identical to baseline). Active DKIM (`mail._domainkey`
local + `brevo1/brevo2._domainkey` Brevo) untouched.
Auth audit on 2026-05-22 — all the broken paths and the one that works:
- openai-codex OAuth: EXPIRED (ChatGPT Plus, ancaelena98@gmail.com)
- secret/openclaw → openai_api_key (sk-svcacct): insufficient_quota
- openrouter_api_key: "Key limit exceeded (total limit)"
- llama_api_key: region-blocked
- anthropic_api_key: sk-ant-oat-… (OAuth refresh token, not a real
x-api-key — won't auth via x-api-key header)
- nvidia_api_key (NIM): WORKS. The key was already baked into the
openclaw.json providers.nim.apiKey from secret/openclaw → nvidia_api_key.
Two NIM models verified end-to-end (call from inside openclaw pod
with tool-call schema, both returned proper {tool_calls:[…]} JSON):
- meta/llama-3.1-70b-instruct — 0.58s, primary
- meta/llama-4-maverick-17b-128e — 16s, smarter, fallback
Fallback chain: maverick → openai-codex (auto-promotes once re-authed)
→ modelrelay/auto-fastest (last resort, hallucinates instead of
tool-calling, but at least responds).
Models registered in both `agents.defaults.models` (allowlist) and
`models.providers.nim.models` (capability declarations) so the agent
sees them as available tools. Startup `models set` updated to pin
the new primary across `doctor --fix` runs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Discovered via W1.5 enforcement when querying live cluster state:
PolicyViolation events on 5 deployments (council-complaints, ebook2audiobook,
hermes-agent, netbox, whisper/piper) trying to admit images from registries
not in the original enumeration.
Added entries:
- amruthpillai/* (resume — reactive-resume)
- athomasson2/* (ebook2audiobook)
- netboxcommunity/* (netbox)
- nousresearch/* (hermes-agent)
- opentripplanner/* (osm-routing)
- rhasspy/* (whisper, piper)
- registry.viktorbarzin.me/* (legacy private registry — council-complaints
still references; should migrate to forgejo)
The legacy registry.viktorbarzin.me was supposedly decommissioned 2026-05-07
per CLAUDE.md but council-complaints still uses it — separate cleanup task.
## Verification
- kubectl delete + reapply (kubectl_manifest resourceVersion=0 patch gotcha,
same as 2026-05-18 inject-keel-annotations)
- Dry-run admission of previously-blocked images now passes:
- netboxcommunity/netbox:v4.5.0-beta1 ✓
- rhasspy/wyoming-whisper:3.1.0 ✓
- registry.viktorbarzin.me/council-complaints:1c56f8f ✓
- Policy still in Enforce mode
## Observation status (W1.6)
- Calico GNP wave1-egress-observe-tier34 still applied, 82 ns selected
- Loki `{job="node-journal"} |~ "calico-packet"` returns ~5000 lines/hour
- No errors from observation infrastructure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the recurring 'cnpg-webhook-cert' TLS expiry warn:
CNPG default 'expiringCheckThreshold = 7' means the operator only
regenerates the self-signed webhook cert when remaining lifetime drops
BELOW 7 days. Our cluster-health check #22 alerts at <30d. Result:
~23 days of WARN before CNPG would even attempt rotation.
Set EXPIRING_CHECK_THRESHOLD=30 via the chart's config.data map so the
operator now regenerates with 30d buffer, aligning with our monitoring
threshold. Cert lifetime stays at chart default 90d.
Verified after apply: operator runtime config shows
'expiringCheckThreshold:30'. Companion in-session action: deleted the
existing soon-to-expire secret and bounced the operator to force an
immediate fresh 90-day cert (notBefore=May 22, notAfter=Aug 20).
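The arithmetic behind the WARN window and the new cert dates, as a quick check. The function name is hypothetical, and the year 2026 is assumed from the other dates in this log.

```python
from datetime import date, timedelta


def warn_days_before_rotation(alert_below_days: int, rotate_below_days: int) -> int:
    """Days the health check warns before the operator even tries to rotate."""
    return max(0, alert_below_days - rotate_below_days)


print(warn_days_before_rotation(30, 7))   # old default threshold
print(warn_days_before_rotation(30, 30))  # after EXPIRING_CHECK_THRESHOLD=30

# A 90-day cert issued on May 22 expires on Aug 20.
print(date(2026, 5, 22) + timedelta(days=90))
```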
Two latent issues found while diagnosing why the May 2026 META vest
didn't land:
1. broker-sync-imap CronJob's command was 'broker-sync imap', but the
actual CLI subcommand is 'imap-ingest'. Every scheduled run had
been failing with 'No such command imap' since day one.
2. Pod runs as uid=10001 gid=999; PVC /data dir is mode 2775
group=10001. Without fsGroup in the pod's securityContext the
pod gets only 'other' (r-x) perms on the dir, so sqlite3 can't
create journal/WAL files next to sync.db -- hits
'attempt to write a readonly database'. fsGroup=10001 adds the
matching gid to the pod's supplemental groups so writes work.
Schwab email-sender regex fix is in broker-sync@d860aef.
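To make issue 2 concrete, a sketch of how the kernel picks the permission class for the /data dir. The directory owner uid (0) is an assumption, since only the mode and group are recorded above; `effective_perms` is a hypothetical helper.

```python
def effective_perms(mode: int, owner_uid: int, owner_gid: int,
                    uid: int, gids: set[int]) -> str:
    """Return the rwx triple that applies to a process with uid/gids."""
    if uid == owner_uid:
        bits = (mode >> 6) & 0o7   # owner class
    elif owner_gid in gids:
        bits = (mode >> 3) & 0o7   # group class
    else:
        bits = mode & 0o7          # other class
    return "".join(ch if bits & mask else "-"
                   for ch, mask in (("r", 4), ("w", 2), ("x", 1)))


DIR_MODE = 0o2775  # setgid + rwxrwxr-x, group=10001

# Without fsGroup: uid=10001, gid=999 only -> falls into 'other'.
print(effective_perms(DIR_MODE, 0, 10001, uid=10001, gids={999}))
# With fsGroup=10001: 10001 joins the supplemental groups -> 'group'.
print(effective_perms(DIR_MODE, 0, 10001, uid=10001, gids={999, 10001}))
```

The uid matching the directory's gid number does not help on its own; only group membership selects the group class.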
Operational layer for the new col_snapshot cache shipped in
fire-planner@e72fd22:
stacks/fire-planner:
- fire-planner-col-refresh CronJob — Sun 04:00 UTC, no-op until rows
age toward the 1-year TTL boundary (within 7 days). Calls
python -m fire_planner col-refresh-stale, upserts via cache.upsert.
monitoring/dashboards/cost-of-living.json (Finance folder):
- Two template variables: $city (single-select from col_snapshot),
$baseline_city (for COL ratio computation, defaults London).
- Stat row: total w/rent, w/o rent, 1-bed rent, ratio (color-coded).
- All-cities ranked table with gradient-gauged total + colored ratio.
- Cache-freshness table flags rows approaching TTL expiry.
Initial population needs a one-shot: post-Keel-rollout,
kubectl -n fire-planner exec deploy/fire-planner -- \
python -m fire_planner col-seed
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit promoted modelrelay/auto-fastest to primary as a
workaround for the expired openai-codex OAuth token. But modelrelay
routes to small tool-call-shy models (nvidia/stepfun-ai/step-3.5-flash)
that hallucinate answers instead of using ssh / curl / etc., the very
tools the v4 learning loop is supposed to leverage.
Revert primary back to openai-codex/gpt-5.4-mini (gpt-5.4-mini is the
only mini variant the Codex backend accepts for ChatGPT Plus tier),
and inline the re-auth command in the model-block comment so future
sessions know exactly what to do when the OAuth token expires:
kubectl -n openclaw exec -it $(kubectl -n openclaw get pods \
-l app=openclaw -o jsonpath='{.items[0].metadata.name}') \
-c openclaw -- node /app/openclaw.mjs models auth login \
--provider openai-codex
modelrelay/auto-fastest stays in the fallback chain so the agent
remains partially usable while the token is expired.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User feedback: "this should work for any task, not just calendar.
this learning flow must be strongly engrained to ensure openclaw
gets better over time."
The v3 rules were buried at the bottom of TOOLS.md and only stated
in workflow language. Three changes to make the rule unavoidable:
1. **SOUL.md** — new marker-delimited section "Learning is your
identity" inserted before ## Boundaries. AGENTS.md tells the
agent to read SOUL.md first every session, so this is now the
FIRST thing the agent loads about itself. Frames learning as
character, not procedure.
2. **TOOLS.md v4** — section moved from the END of the file to
right after the `# TOOLS.md` title (first substantive content
on file load). Title strengthened: "THE FLOW — run this on
EVERY task. Not just hard ones." Concrete examples explicitly
call out diverse domains (calendar, frigate restart, disk
usage, inbox summary, deploys) so the universality is
unmistakable.
3. **learn-from-tasks skill** — opens with "This is universal.
EVERY task runs through this flow — not just hard ones, not
just unfamiliar ones. The save at the end is mandatory."
The actual flow (know → ask devvm → save) is unchanged. What
changed is salience: the rule is now the first thing the agent
encounters in three independent surfaces, with stronger framing
that makes "skipping the save" feel like a violation of identity
rather than a missed optimisation.
Marker bumped v3 → v4. Stripper handles v1-v9 idempotently.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Uncomment the trading-bot stack (disabled 2026-04-06 due to resource
consumption) and add the new meet_kevin_watcher service container.
Changes:
- Uncomment the /* ... */ block enclosing the entire stack
- Fix db_init job: add -d postgres to psql commands (root user has no
root-named database — matches pattern used in claude-memory + others)
- Remove 3 disabled containers from trading-bot-workers Pod spec:
news-fetcher, sentiment-analyzer, trade-executor
- Add new meet-kevin-watcher container (image
viktorbarzin/trading-bot-service:latest, command
python -m services.meet_kevin_watcher.main, mem 128Mi/256Mi)
- Extend ExternalSecret with TRADING_OPENROUTER_API_KEY and
TRADING_MEET_KEVIN_CHANNEL_ID keys (sourced from Vault
secret/trading-bot)
- Add 4 common_env entries for the Meet Kevin pipeline
(poll interval, daily cost cap, model slug, prompt version)
- Update lifecycle.ignore_changes to 4 image indices
vault: re-enable pg-trading static role
- Add pg-trading to vault_database_secret_backend_connection allowed_roles
- Uncomment vault_database_secret_backend_static_role.pg_trading
(was disabled 2026-04-06 with the rest of trading-bot stack)
kyverno: add postgres* to trusted-registries allowlist
- trading-bot db_init uses postgres:16-alpine (Docker Hub library image)
- postgres* was not in the DockerHub bare-name allowlist (unlike mysql*,
alpine*, nginx*, python* which were already there)
Final workers Pod containers (in order):
[0] signal-generator
[1] learning-engine
[2] market-data
[3] meet-kevin-watcher (NEW)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refines the devvm-fallback into an explicit triage flow that the
agent runs on every task. The default path is to ASK devvm-claude
when uncertain — don't brute-force. Most tasks are solvable there.
## The flow
1. Do I KNOW how? Check `memory_recall` and INDEX.md.
2. If not, SSH devvm and ask claude — and crucially, ask it to
share the steps + credentials needed so I can do it on my own
next time. Save the answer in openclaw memory.
3. (RARE) If devvm-claude says no, try in-pod. It will most likely
fail, and that's OK.
## Storage moved to memory-indexed location
Learnings now live under
`/workspace/memory/projects/openclaw-learned/` (was
`/workspace/learned/`) so memory-core indexes them and
`memory_recall` surfaces them. Layout:
- `scripts/<task>.md` runnable recipes
- `knowledge/<topic>.md` decisions, paths, gotchas
- `credentials/<name>.md` **POINTERS to Vault, never values**
## Credentials = Vault pointers only
Previous v2 design saved cred values to plaintext NFS files. v3
flips to pointer-only: cred file documents the Vault path + fetch
command (`ssh devvm 'vault kv get -field=foo secret/bar'`), the
consumer, and rotation expectations. The secret stays in Vault.
## Init container also migrates
Strips v1/v2/v3 markers from TOOLS.md before re-inserting v3,
moves any files from the legacy `/workspace/learned/` tree into
the new location, removes the empty legacy dir. User edits
outside the markers always survive.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refine the init container's devvm-fallback seeding so the OpenClaw
agent treats devvm as its DEFAULT teacher and saves recipes locally
to become independent over time:
1. TOOLS.md v2 section now has two emphatic CRITICAL rules:
- "TRY DEVVM before giving up" — when stuck, ssh devvm before
telling the user "I can't do that".
- "After every task, introspect → save a faster way" — for any
non-trivial task (especially recurring ones), save the recipe
to /workspace/learned/ and update INDEX.md.
2. New cc-skill `learn-from-tasks` at
/home/node/.openclaw/cc-skills/learn-from-tasks/SKILL.md formalises
both triggers: (A) you're stuck → check INDEX → ask devvm → save;
(B) you just finished → introspect → save if recurring.
3. /workspace/learned/ scaffold: INDEX.md table-of-contents +
scripts/, knowledge/, credentials/ (0700) subdirs. Agent checks
INDEX.md BEFORE reaching for devvm, so saved recipes are
findable on the next run.
4. Marker migration: strips both v1 and v2 markers before re-inserting
so user edits outside the markers always survive future restarts.
Security caveat documented inline: credentials in
/workspace/learned/credentials/ are NFS plaintext — acceptable for
home-lab personal scope, NOT for anything more sensitive than what
`ssh devvm` already gives the pod (wizard's access).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OpenClaw agent reads TOOLS.md on every session per AGENTS.md
("environment-specific notes"), but it does NOT auto-search the
memory-core index for "devvm" before answering. Result: the agent
said "I don't have access to the devvm" even though ssh + the
openclaw-task wrapper were fully wired up (verified e2e in
9ad52dfd).
Updated init 6 (seed-devvm-memory-note) to ALSO append a
marker-delimited section to /workspace/TOOLS.md describing the
devvm SSH capability + openclaw-task usage. Idempotent: strips
any prior v1 section before re-inserting, so user edits outside
the markers survive future pod restarts.
The /workspace/memory/projects/openclaw-runtime/devvm-fallback.md
memory note stays in place — it's still indexed by memory-core
and surfaces for memory_recall queries.
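A sketch of the strip-then-reinsert pattern behind the idempotency claim. The HTML-comment marker format and the `upsert_section` name are hypothetical; the init container does this in shell with its own markers.

```python
import re

# Hypothetical marker format; matches any versioned section, lazily.
MARKER_RE = re.compile(
    r"<!-- devvm-fallback:v\d+:start -->.*?<!-- devvm-fallback:v\d+:end -->\n?",
    re.DOTALL,
)


def upsert_section(text: str, body: str, version: int) -> str:
    """Remove any prior marker-delimited section, then append a fresh one."""
    stripped = MARKER_RE.sub("", text)
    if stripped and not stripped.endswith("\n"):
        stripped += "\n"
    section = (
        f"<!-- devvm-fallback:v{version}:start -->\n"
        f"{body}\n"
        f"<!-- devvm-fallback:v{version}:end -->\n"
    )
    return stripped + section


tools_md = "# TOOLS.md\nuser notes stay untouched\n"
once = upsert_section(tools_md, "ssh devvm is available", 1)
twice = upsert_section(once, "ssh devvm is available", 1)
print(once == twice)
```

Because the old section is always removed before the new one is written, repeated pod restarts leave exactly one section, and text outside the markers is never rewritten.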
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Give the OpenClaw pod two new capabilities:
1. Host-tools bundle. New init container `install-host-tools` extracts
openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
friends into /tools/host-tools/, with the bookworm-slim libs the
binaries need. PATH + LD_LIBRARY_PATH on the main container point
ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
marker; smoke test (ldd-based) fails the init at deploy time if any
binary has unresolved deps. Bundle is ~558 MB on the existing
/srv/nfs/openclaw/tools NFS.
2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
container startup symlinks /home/node/.ssh → there. New
/usr/local/bin/openclaw-task wrapper on devvm manages long-running
work as tmux sessions on devvm (sessions and logs survive pod
restarts — they live on devvm, not in the pod). New init container
`seed-devvm-memory-note` drops a markdown note teaching the pattern;
main container startup now runs `openclaw memory index --force` so
the note is searchable on first boot.
Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.
Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes in one commit because they are coupled — the DISABLED_PROVIDERS
addition cannot land safely without the Keel exclusion on temporal:
1. Add DISABLED_PROVIDERS env on postiz Helm chart. Live DB audit showed
only 'instagram-standalone' connected; all other Postiz providers
were idle-polling Temporal task queues. The list disables x, linkedin,
reddit, threads, youtube, tiktok, pinterest, dribbble, slack,
discord, mastodon, bluesky, lemmy, warpcast, vk, beehiiv, telegram,
wordpress, nostr, farcaster. Keeps facebook + instagram + the
standalone variant active.
2. temporal deployment needs keel.sh/policy=never (set live via kubectl
annotate). Keel was rolling temporalio/auto-setup 1.28.1 -> 0.20.0
on every helm reconcile because :0.20.0 is published in the same
registry path but is a DIFFERENT (legacy Cassandra-based) image
stream. Memory id 1933 trap; new variant captured in id 2315-2319.
The annotation is set live (not in TF) because the existing TF block
has lifecycle.ignore_changes = [keel.sh/policy] so the chart
reconcile won't reset it. Long-term fix: add temporal to the
Kyverno keel-mutate-existing exclude list so it survives a
namespace re-label.
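
A sketch of the live annotation from item 2, wrapped in two small
helpers. The function names are hypothetical and the namespace is left
as a parameter because the commit does not state it.

```shell
# pin_keel_never NAMESPACE DEPLOYMENT: set keel.sh/policy=never on a live
# Deployment, overwriting any existing value of the annotation.
pin_keel_never() {
  kubectl -n "$1" annotate deployment "$2" keel.sh/policy=never --overwrite
}

# show_keel_policy NAMESPACE DEPLOYMENT: print the current annotation value.
# The dot in "keel.sh" is escaped so jsonpath treats it as part of the key.
show_keel_policy() {
  kubectl -n "$1" get deployment "$2" \
    -o jsonpath='{.metadata.annotations.keel\.sh/policy}'
}

# Example (namespace is an assumption, not taken from the commit):
#   pin_keel_never postiz temporal && show_keel_policy postiz temporal
```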
Three changes from today's autonomous-pipeline validation session:
1. **Kill-switch ConfigMap** — chain checks for `k8s-upgrade-killswitch`
ConfigMap in `k8s-upgrade` namespace at the top of every phase + at the
start of version-check. Existence halts the chain (exit 0) with a Slack
message. Single-command emergency stop:
    kubectl -n k8s-upgrade create configmap k8s-upgrade-killswitch \
      --from-literal=reason="storm response"
Resume:
    kubectl -n k8s-upgrade delete cm k8s-upgrade-killswitch
Role rule for `configmaps` get/list/watch added (resourceName-scoped).
2. **Ignore RecentNodeReboot in halt_on_alert_query everywhere** — the
chain itself causes reboots. The pre-drain master check, post-upgrade
worker check, postflight check, and preflight halt-on-alert all now
pass `RecentNodeReboot` as the extra-ignore. Previously only the worker
phase's post-upgrade gate did this. The master phase failed silently this
morning on the pre-drain check after my own master reboot.
3. **Preflight quiet-baseline 3600s → 600s** — the 1h cooldown after any
Ready transition meant the chain refused to run for an hour after
every kured reboot. 10 min is enough for kubelet/control-plane to
settle; the 24h-between-cluster-reboots invariant lives in
kured-sentinel-gate, not here.
Validated by running the chain end-to-end: preflight passed in 5s,
master phase now in drain. Today's storm post-mortem (snapshot CoW
amplification + tigera-operator crashloop feedback loop) drove the
kill-switch design.
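
The kill-switch check at the top of each phase could be sketched as
follows. The ConfigMap name, namespace and exit-0 behaviour come from
the description above; the function names and the `notify` placeholder
(standing in for the Slack hook) are assumptions.

```shell
# killswitch_active: succeed if the kill-switch ConfigMap exists.
killswitch_active() {
  kubectl -n k8s-upgrade get configmap k8s-upgrade-killswitch >/dev/null 2>&1
}

# killswitch_reason: print the operator-supplied reason, if any.
killswitch_reason() {
  kubectl -n k8s-upgrade get configmap k8s-upgrade-killswitch \
    -o jsonpath='{.data.reason}' 2>/dev/null
}

# check_killswitch PHASE: call at the top of every phase. Halts the chain
# with exit 0 when the switch is set, otherwise returns and lets the
# phase continue.
check_killswitch() {
  if killswitch_active; then
    notify "k8s-upgrade halted before $1: $(killswitch_reason)"
    exit 0
  fi
}

# notify is a placeholder for the chain's Slack hook (hypothetical name).
notify() { echo "$1" >&2; }
```

Note that this sketch treats any `kubectl get` failure, including an
unreachable apiserver, as "switch not set"; whether the real chain fails
open or closed in that case is not stated in this commit.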
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.
Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
Bot crawlers were hitting /<owner>/<repo>/archive/<sha>.zip on the
dot_files repo (vim-plugin source trees) — each request synthesised a
fresh ZIP from git history, taking 9.9s and returning 500 under
sustained load. Cost: ~440m sustained forgejo CPU.
Toggle: FORGEJO__repository__DISABLE_DOWNLOAD_SOURCE_ARCHIVES=true.
/archive/* URLs now 404; git clone / OCI registry / API unaffected.
Measured: forgejo pod 440-573m -> 60m steady-state (~85% drop).
(Pod rollout took ~7min on the new RS due to kubelet's recursive
chown of the 2700+ files in the data PVC — fsGroupChangePolicy is
unset and defaults to Always; could be set to OnRootMismatch later.)
Halves sample volume on all default-scrape jobs (cAdvisor, node-exporter,
service-endpoints, etc.). Memory id 559's earlier scrape-2m tuning was
applied live but not codified — this restores the Helm template.
Companion changes to keep alerting fidelity:
- evaluation_interval kept at 1m (alerts evaluate every minute)
- snmp-ups job pinned to scrape_interval=30s so PowerOutage /
LowUPSBattery detect within ~30s instead of 2m
- 3 alerts bumped from for:1m to for:3m (HighGPUTemp, LowUPSBattery,
PowerOutage) for stability above the new 2m global cadence
Other jobs that already had per-job overrides (snmp-idrac 1m,
redfish-idrac 3m, kubernetes-pods 5m, kubernetes-services 5m) unaffected.
Expected: 50-150m sustained CPU saving on Prometheus + apiserver.
Verification ongoing: the apiserver takes a few minutes to settle after a
Prometheus config reload because of the initial target-scrape burst.
Replaced 'loki.source.kubernetes' with 'loki.source.file' in alloy DS
config. discovery.relabel.pod_logs already sets __path__ to the kubelet
log path (/var/log/pods/*<uid>/<container>/*.log) and varlog host-mount
was already present, so this is a one-line swap.
Why: apiserver was burning ~700m sustained on 'CONNECT pods/log' streams
(13 req/s, ~2200 sec/s of long-lived TCP connections). Streaming pod
logs through the apiserver instead of tailing kubelet's log files was
the dominant residual cost after the recent Loki/Alloy onboarding.
Measured before/after:
- Alloy DS: ~620m total (5 x ~125m) -> ~92m total (5 x ~18m)
- kube-apiserver: peak 1959m midnight burst, settled 632m
(Stuck-pod recovery: alloy-7zg7t on k8s-master needed --force delete
during rollout — FailedKillPod 'unable to signal init: permission denied'
on runc, transient runtime issue, unrelated to this change.)
Both static-roles existed in Vault state (created out-of-band) but
were missing from the postgresql connection's allowed_roles list. Vault
was logging 'is not an allowed role' rotation errors every 10s for both,
sustained CPU waste ~40-70m.
Adopted both via 'import {}' (import blocks removed after first apply
per the canonical adoption pattern).
- pg-matrix: username=matrix, rotation_period=86400 (1d)
- pg-technitium: username=technitium, rotation_period=604800 (7d)
Verified: 'is not an allowed role' errors stopped in vault-0 logs
immediately after apply.
kubectl drain --ignore-daemonsets needs to GET each pod's owner
reference (DaemonSet/StatefulSet/ReplicaSet/Deployment) to classify
which pods can be drained vs ignored. Without these RBAC verbs, drain
bails with 'cannot delete daemonsets ... is forbidden' for every
daemonset-managed pod on the node.
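
One way to confirm the new verbs before the next drain is
`kubectl auth can-i` with impersonation. This is a sketch: `can_drain`
is a hypothetical helper, the resource list mirrors the owner kinds
named above, and the subject to impersonate is an assumption to be
replaced with the chain's real service account.

```shell
# can_drain SUBJECT: succeed only if SUBJECT may GET every owner kind
# listed above, in all namespaces. SUBJECT is passed to --as, e.g.
# system:serviceaccount:<namespace>:<name>.
can_drain() (
  subject=$1
  missing=0
  for res in daemonsets.apps statefulsets.apps replicasets.apps deployments.apps; do
    if ! kubectl auth can-i get "$res" --all-namespaces --as="$subject" >/dev/null 2>&1; then
      echo "missing: get $res" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

The caller needs impersonation rights for `--as` to work, so this is
best run with cluster-admin credentials rather than from the chain itself.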
Provider declarations were applied across freshrss, linkwarden,
navidrome, openclaw, tandoor, vault in prior sessions; lock files
regenerated for the 4 stacks where init had run. Commits the WIP so
downstream Terraform plans can proceed.
- kubectl (gavinbunney/kubectl ~> 1.14): kubernetes_manifest panic
workaround for Kyverno CRDs (beads code-e2dp)
- authentik (goauthentik/authentik ~> 2024.10): used where stacks
manage their own Authentik objects
Plugin needs three things to load under OpenClaw 2026.5.x:
1. plugins.allow includes 'recruiter-api' (doctor --fix overwrites the
ConfigMap-baked value, so re-patch via 'openclaw config patch --stdin'
in the startup command after doctor runs).
2. 'openclaw plugins enable recruiter-api' to flip its registry entry.
3. manifest declares contracts.tools (added in recruiter-responder commit
83ffd9fa).
Plus: VIKTOR_CHAT_ID env wired from secret/openclaw.viktor_chat_id so the
plugin's polling loop knows which Telegram chat to deliver into.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>