Give the OpenClaw pod two new capabilities:
1. Host-tools bundle. New init container `install-host-tools` extracts
openssh-client + dnsutils + tmux + jq + ripgrep + fd + vault + yq +
friends into /tools/host-tools/, with the bookworm-slim libs the
binaries need. PATH + LD_LIBRARY_PATH on the main container point
ld.so at the bundle. Idempotent via /tools/host-tools/.installed-v1
marker; smoke test (ldd-based) fails the init at deploy time if any
binary has unresolved deps. Bundle is ~558 MB on the existing
/srv/nfs/openclaw/tools NFS.
2. devvm SSH + async task pattern. New init `setup-ssh-config` writes
id_rsa/config/known_hosts under /home/node/.openclaw/.ssh; main
container startup symlinks /home/node/.ssh → there. New
/usr/local/bin/openclaw-task wrapper on devvm manages long-running
work as tmux sessions on devvm (sessions and logs survive pod
restarts — they live on devvm, not in the pod). New init container
`seed-devvm-memory-note` drops a markdown note teaching the pattern;
main container startup now runs `openclaw memory index --force` so
the note is searchable on first boot.
Design + verified E2E flow in
docs/plans/2026-05-22-openclaw-devvm-access-design.md. Persistence test
green: spawned a 50s task from pod A, deleted pod A, new pod B saw the
task finish and read its full log.
Pre-existing keel.sh annotation drift on openclaw/{openlobster,
task_webhook} cleaned up in the same apply.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures today's k8s-upgrade-pipeline session findings — root cause
of repeated upgrade failures is the single-master apiserver outage
window cascading into operator crashloops + storm I/O. HA control
plane with 3 masters + apiserver LB removes the cascade entirely.
Tracked in beads code-n0ow. Plan doc to follow.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the wipe+reinit strategy (sidestep the broken DD upgrade
path), the IO config bump (innodb_io_capacity 100→2000), root-cause
analysis with explicit uncertainty, verification gates, and rollback.
Not scheduled yet. Tracked in beads code-963q.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15-task plan for a shared presence board so Claude Code sessions can
see which shared infra resources are being actively mutated by other
sessions. Resource-scoped claims on the existing Dolt server,
heartbeat-driven TTL, agent-driven via CLAUDE.md rule + Python CLI.
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.
- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
Polls registries hourly per design decision #8. Default schedule
overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed
their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit
log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the
proxmox-lvm/encrypted side stays out of scope.
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.
Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.
Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.
Closes: code-gy7h
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).
vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).
Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly,
survives future chart bumps. vault-1 and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.
Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).
Refs: code-gy7h
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).
The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
initdb's cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.
Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.
See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.
Closes: code-ahr7
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint
TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files including the empty tfstate,
27MB terraform zip, commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.