`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.
Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.
CLI bumped to v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.
Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
never emitted — the seed is more powerful than a one-time code, so the
only seed-derived path stays the specially-audited `vault code`. Tests
assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
a bulk dump is distinguishable from a single-field read.
`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:
  homelab edges --ns <ns>            edges touching <ns> (either direction)
  homelab edges --src/--dst <ns>     directional egress / ingress peers
  homelab edges --peers-of <ns>      distinct peer namespaces of <ns>
  homelab edges --new-since 24h      first seen since a duration or date (YYYY-MM-DD)
  homelab edges --denied             only action='deny' (blocked / lateral movement)
  homelab edges --json --limit N     machine-readable / row cap (default 200)
Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.
TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.
Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.
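The validate-then-render flow can be sketched roughly as below. The column names `src_ns` and `dst_ns` are hypothetical (only the `edge` table and `action='deny'` appear in this message), and the real query builder covers more filters than this sketch does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsRe matches the RFC 1123 label charset used for k8s namespace names.
var nsRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func validNamespace(ns string) bool {
	return len(ns) <= 63 && nsRe.MatchString(ns)
}

// filters is a reduced, illustrative subset of the CLI flags.
type filters struct {
	Src, Dst string
	Denied   bool
	Limit    int
}

// buildQuery renders one read-only SELECT. Namespace values are validated
// before they are interpolated, so quotes and semicolons never reach SQL.
func buildQuery(f filters) (string, error) {
	var where []string
	add := func(col, ns string) error {
		if ns == "" {
			return nil
		}
		if !validNamespace(ns) {
			return fmt.Errorf("invalid namespace %q", ns)
		}
		where = append(where, fmt.Sprintf("%s = '%s'", col, ns))
		return nil
	}
	if err := add("src_ns", f.Src); err != nil { // src_ns: hypothetical column
		return "", err
	}
	if err := add("dst_ns", f.Dst); err != nil { // dst_ns: hypothetical column
		return "", err
	}
	if f.Denied {
		where = append(where, "action = 'deny'")
	}
	q := "SELECT * FROM edge"
	if len(where) > 0 {
		q += " WHERE " + strings.Join(where, " AND ")
	}
	if f.Limit <= 0 {
		f.Limit = 200 // default row cap
	}
	return fmt.Sprintf("%s LIMIT %d", q, f.Limit), nil
}

func main() {
	fmt.Println(buildQuery(filters{Src: "dbaas", Denied: true}))
	fmt.Println(buildQuery(filters{Src: "x'; DROP TABLE edge;--"}))
}
```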
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It is Cyrillic-specific
(ASCII has no multibyte chars to split), so it bit emo (Bulgarian prompts) every
turn while English users almost never saw it.
Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
honor the script's own "best-effort, exit 0" contract — so a truncated or
otherwise odd payload can never again surface as a hook error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
While setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". The cause was
two latent CLI bugs, both of which any non-admin AFK invocation can hit:
1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
value. It IS in /etc/environment (login shells), but emo runs his
agents from long-lived tmux / non-login shells that never sourced it,
so every `vault` child hit the 127.0.0.1:8200 default -> connection
refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
now does the same.
2. Token precedence was env > ~/.vault-token > scoped. A power-user who
ran `vault login -method=oidc` carries a read-only ~/.vault-token
(policy `default`, capability `deny` on their workstation path), which
shadowed the purpose-built scoped token -> 403 permission denied on
the user's OWN path. This tool only ever touches
secret/workstation/claude-users/<user>, which the scoped token covers
exactly, so precedence is now env > scoped > ~/.vault-token. Verified
the scoped tokens for both emo and wizard hold create/read/update on
their own paths, so admins are unaffected.
Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.
Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes the two remaining gaps that kept non-admins (emo) from using `homelab vault`:
- setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw`
failed, which an admin's own ~/.local/bin/bw satisfied — so the
system-wide copy was never installed and non-admins had no `bw`
backend. Install to the npm /usr prefix and guard on the system path
(/usr/bin/bw) instead.
- Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the
shared Organization/Collection flow for sharing passwords, admin
deploy + verification, security model) and repoint the two code
comments that cited a design-spec path which never existed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`homelab vault` was effectively admin-only: two bugs blocked every
non-admin (e.g. emo) from using it for their own Vaultwarden vault.
1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token
/ $VAULT_TOKEN), which only admins have. Non-admins carry a scoped
token at ~/.config/claude-auth-sync/vault-token (policy
workstation-claude-<user>). Add ensureVaultToken(): explicit env >
~/.vault-token > scoped fallback, wired into every vault verb. Admins
are unaffected (their ambient token wins).
2. Write capability: `homelab vault setup` used plain `vault kv patch`,
which needs the `patch` capability the scoped policy does not grant
(only create/read/update) — so setup 403'd for non-admins. Switch to
`kv patch -method=rw` (read-modify-write; same approach
claude-auth-sync already uses), with `kv put` only when the path
doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json).
Enables onboarding emo onto the per-user Vaultwarden access tool.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` now refuses on a TTY (it was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection now parses the status JSON instead of substring matching.
M5: the clipboard path now refuses when stderr is not a TTY (it was silently failing).
M6: realRunner now trims only the trailing newline, preserving secret whitespace;
secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).
- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
the :9222 NetworkPolicy), assert non-headless via /json/version,
connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
outside the cluster" section.
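One plausible way to implement the non-headless assertion is to inspect the DevTools `/json/version` response, where headless builds typically identify as "HeadlessChrome". This is a sketch under that assumption; the actual check in cmd_browser.go is not shown in this message and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// versionInfo holds the two /json/version keys this sketch looks at.
type versionInfo struct {
	Browser   string `json:"Browser"`
	UserAgent string `json:"User-Agent"`
}

// isHeadless reports whether the endpoint identifies as a headless build.
func isHeadless(body []byte) (bool, error) {
	var v versionInfo
	if err := json.Unmarshal(body, &v); err != nil {
		return false, fmt.Errorf("parse /json/version: %w", err)
	}
	headless := strings.Contains(v.Browser, "Headless") ||
		strings.Contains(v.UserAgent, "Headless")
	return headless, nil
}

func main() {
	// Sample bodies, shaped like a DevTools /json/version response.
	headful := []byte(`{"Browser":"Chrome/130.0.0.0","User-Agent":"Mozilla/5.0 Chrome/130.0.0.0"}`)
	headless := []byte(`{"Browser":"HeadlessChrome/130.0.0.0","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
	fmt.Println(isHeadless(headful))
	fmt.Println(isHeadless(headless))
}
```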
Closes: code-nepg
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).
Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.
- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.
Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
non-interactive ssh to the HA host using the invoking user's key.
Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).
- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
"exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified (HTTP 204, no
auth). ADR docs/adr/0011.
Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)
- net check <host> [path]: two-legged reachability — external (public DNS→CF)
vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
(prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.
All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.
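The "Go form of curl --resolve" can be approximated with a custom DialContext that swaps the dial host for the LB IP while leaving the request URL untouched, so TLS SNI and certificate verification still use the hostname. The helper names and timeouts below are assumptions; this is a sketch, not the CLI's transport.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// pinAddr replaces the host part of "host:port" with lbIP, keeping the port.
func pinAddr(addr, lbIP string) (string, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(lbIP, port), nil
}

// newPinnedClient dials lbIP for every request, whatever the URL host is.
// The URL host is still what http.Transport uses for the TLS ServerName.
func newPinnedClient(lbIP string) *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	return &http.Client{
		Timeout: 10 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				pinned, err := pinAddr(addr, lbIP)
				if err != nil {
					return nil, err
				}
				return dialer.DialContext(ctx, network, pinned)
			},
		},
	}
}

func main() {
	pinned, err := pinAddr("prometheus-query.viktorbarzin.lan:443", "10.0.20.203")
	fmt.Println(pinned, err)
	_ = newPinnedClient("10.0.20.203") // ready for client.Get("https://<host>/...")
}
```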
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions: watching a build/deploy to completion (proven in the session that
built it, which spent hours hand-rolling Woodpecker polling + DB-schema
spelunking for one CI incident).
- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
so the cert verifies — the Go form of the house `curl --resolve` pattern),
token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
retries that ride Woodpecker's intermittent empty responses. watch matches the
HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
reliable; status/watch use the working list endpoint).
Live-verified ci status/watch against the live API.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.
Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.
Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.
The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).
Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.
Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes v0.1: documentation, build/install path, and version stamping.
- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
work/tf behaviour (native worktree entry, verification-gated auto-land,
presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
(t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the infra-loop verb surface. work start creates .worktrees/<topic>
on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is
ignored) and prints the path for native EnterWorktree entry. work land fetches,
merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and
falls back to pushing the feature branch for a PR when the direct push is
rejected (branch protection). work clean removes the worktree + branch.
Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no
auto-detected suite) unless --no-verify is passed. This was added after an
accidental smoke-test invocation pushed unverified WIP to master (benign — the
infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an
unverified land a deliberate choice, not the default).
Known v0.1 limitation: land does not yet block on CI to green; that arrives with
the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the tf verb-group and the resolver substrate beneath it, continuing the
v0.1 infra-loop build.
- substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir
resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin,
hasGitCryptAttr, gitCryptFlags) — the last is for `work` next.
- tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and
delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock,
and the ingress auth-comment check) rather than calling terragrunt directly.
- tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit
(normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing
the documented ~200-claim leak — and prints an out-of-band reminder since CI
applies canonically on push.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Begin evolving the existing infra/cli into the agent-facing "homelab" CLI
decided in the design/grilling session: one composable, JSON-capable surface
for the operations agents run over and over (mined from 51k commands across
2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that
loop — work/tf/claim — and ships here, in place, in infra/cli.
This first slice:
- command registry + dispatcher (longest-prefix verb matching) and a
`manifest`/`manifest --json` progressive-discovery entrypoint; every verb
declares a read|write tier so write-gating can be added later (everything is
allowed for now).
- claim/release verbs wrapping the existing presence script (not reimplemented),
with label-taxonomy validation.
- main() front-dispatches the homelab verb surface but falls through to the
legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is
unaffected.
- fix a pre-existing vet error (glog.Infof missing format directive) that
blocked `go test`.
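Longest-prefix verb matching, as used by the dispatcher, can be sketched as follows. The registry contents and the `command` struct are hypothetical; the point is only that the longest registered verb path wins and the rest is passed through as arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// command carries the declared tier of a verb ("read" or "write").
type command struct {
	Tier string
}

// match tries the longest candidate verb path first, then shorter ones,
// and returns the matched verb plus the untouched remaining args.
func match(reg map[string]command, args []string) (verb string, rest []string, ok bool) {
	for n := len(args); n > 0; n-- {
		key := strings.Join(args[:n], " ")
		if _, found := reg[key]; found {
			return key, args[n:], true
		}
	}
	return "", args, false
}

func main() {
	// Example registry; the real verb set lives in the CLI.
	reg := map[string]command{
		"manifest":  {Tier: "read"},
		"work":      {Tier: "read"},
		"work land": {Tier: "write"},
	}
	fmt.Println(match(reg, []string{"work", "land", "--no-verify"}))
	fmt.Println(match(reg, []string{"unknown"}))
}
```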
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules