Compare commits

209 commits

Author SHA1 Message Date
Viktor Barzin
cf42042cba monitoring: re-trigger apply to persist state after CI cancel-race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:58:49 +00:00
Viktor Barzin
f92075b7c5 fire-planner: solve FIRE targets to age 100 (horizon 60→72)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (to about age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:49:20 +00:00
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
279b88d2bc docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:25:10 +00:00
Viktor Barzin
6f042ee239 fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation
Some checks failed
ci/woodpecker/push/default Pipeline failed
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.

Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
  match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
  rotation so the provisioned datasource never goes stale

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:14:42 +00:00
Viktor Barzin
35c0057d83 chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:39:17 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
50077b43d4 paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.

The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:06:46 +00:00
Viktor Barzin
8236ae309d postiz: reconcile HCL to live (adopt unmerged stack config), keep parked
All checks were successful
ci/woodpecker/push/default Pipeline was successful
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).

Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.

Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
  live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
  caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
  into state (adopted, not recreated — no duplicate AK objects); the
  obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
  synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
  deliberately NOT included here (deferred); postiz uses a static
  secret/postiz database_url.

State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:54:59 +00:00
Viktor Barzin
250d0fc334 docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.

Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:24:40 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
4fc09b7a61 Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social
Some checks failed
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-28 11:53:04 +00:00
Viktor Barzin
916516eeab authentik overlay patch3: SFE for ALL old iOS browsers + social-login links
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):

1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
   not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
   a non-Safari UA family, so the Safari-only check missed them and they still got
   the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.

2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
   /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
   architecturally can't render Identification-stage sources (authentik docs), and
   emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
   SFE's username/password form was a dead end. The links are plain redirects that
   work on any browser. Slugs are static; re-verify on source changes.
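
The version check in item 1 can be sketched as below. This is an illustrative sketch only, not authentik's compat_needs_sfe(): the function name `needs_sfe`, its parameters, and the UA family strings are hypothetical, and the existing IE/old-Edge/PKeyAuth branches are left out.

```python
# Hypothetical sketch of the "old Safari OR any iOS browser on iOS<=16.3" rule.
# Names and UA family strings are assumptions, not authentik's real code.

OLD_WEBKIT_MAX = (16, 3)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '15.8.2'."""
    parts = (version.split(".") + ["0", "0"])[:2]
    return int(parts[0]), int(parts[1])


def needs_sfe(ua_family: str, ua_version: str, os_family: str, os_version: str) -> bool:
    # Original patch2 rule: old Safari itself.
    if ua_family == "Safari" and parse_major_minor(ua_version) <= OLD_WEBKIT_MAX:
        return True
    # patch3 rule: any browser on old iOS, since iOS browsers are WebKit skins.
    if os_family == "iOS" and parse_major_minor(os_version) <= OLD_WEBKIT_MAX:
        return True
    return False
```

Comparing (major, minor) tuples avoids the string-comparison trap where "9.0" sorts after "16.3".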

Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:03 +00:00
Viktor Barzin
08bdf32aa0 feat(fire-planner): FIRE Countdown dashboard section + monthly target solve
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.

Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.

- wealth.json: new country / with_home / savings_per_year template vars + a
  per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
  projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
  £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
  fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
  UTC) runs `recompute-fire-targets --countries all`.

Targets come from the solver shipped in fire-planner edb4d11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:52:17 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
5fb2004de5 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:38:07 +00:00
Viktor Barzin
f10bb71562 authentik overlay: serve the no-JS SFE login to old Safari (patch #2)
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).

Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.

Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:38:05 +00:00
Viktor Barzin
ec681ba6e1 ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
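
A minimal sketch of the skip / retry / fail-fast classification described above, written in Python for illustration (the real logic is shell in default.yml). The two lock strings are the ones quoted in this message; the transient regexes and the `apply` callable are assumptions, since the actual patterns are not shown here.

```python
import re
from enum import Enum


class Outcome(Enum):
    SKIP = "skip"    # lock-wait: not a failure
    RETRY = "retry"  # transient: bounded retry
    FAIL = "fail"    # everything else fails fast


# Lock signatures quoted in the commit message (Tier-0 Vault, Tier-1 PG).
LOCK_PATTERNS = [
    re.compile(r"is locked by"),
    re.compile(r"Error acquiring the state lock"),
]

# ASSUMED transient signatures; the real regexes are not shown in the commit.
TRANSIENT_PATTERNS = [
    re.compile(r"provider.*(timeout|timed out)", re.IGNORECASE),
    re.compile(r"vault.*\b5\d\d\b", re.IGNORECASE),
]


def classify(log: str) -> Outcome:
    if any(p.search(log) for p in LOCK_PATTERNS):
        return Outcome.SKIP
    if any(p.search(log) for p in TRANSIENT_PATTERNS):
        return Outcome.RETRY
    return Outcome.FAIL


def run_with_retry(apply, max_attempts: int = 3) -> str:
    """apply() is a hypothetical callable returning (exit_code, log_text)."""
    for _ in range(max_attempts):
        code, log = apply()
        if code == 0:
            return "ok"
        outcome = classify(log)
        if outcome is Outcome.SKIP:
            return "skipped"
        if outcome is Outcome.FAIL:
            return "failed"
        # Outcome.RETRY: loop again while attempts remain.
    return "failed"
```

Checking the lock patterns first means a lock-wait is never retried or counted as a failure, even if the same log also mentions a timeout.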

Rejected (documented in docs/architecture/ci-cd.md):
- An off-infra GHA validate gate: it catches ~0 of the real failures, which are
  runtime/Vault-data/SSA/lock errors; `terraform validate` was reproduced
  passing the exact stacks that fail at apply.
- Lock-reaping/force-unlock: PG advisory locks are session-scoped and
  auto-release, so force-unlock can't free them and would corrupt a live
  concurrent apply.

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
Viktor Barzin
69e35efd95 Merge remote-tracking branch 'origin/master' into wizard/vault-kv
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:09:38 +00:00
Viktor Barzin
e03e4719ad vault: distinguish Vaultwarden vs HashiCorp Vault, add vault kv
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.

Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.

New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
  clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
  no-echo prompt, never argv); creates the path or merges (never
  clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).

Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.

Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:09:33 +00:00
Viktor Barzin
460f2ad42f state(vault): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:07:22 +00:00
Viktor Barzin
87a450e9a3 vault: grant emo full read/write on his own secret/emo tree
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.

Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:07:22 +00:00
Viktor Barzin
a1cf7ccaf6 authentik: repoint to the SLOW-1a overlay image + un-enroll Keel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.

Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:46:21 +00:00
Viktor Barzin
7ec64ed5ff authentik: custom-image overlay to fix the 1.4s login-flow query (SLOW-1a)
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.

Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:42:58 +00:00
Viktor Barzin
12a45fa94e vault: bw sync on every read so reads show the latest values
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`bw unlock` only decrypts the LOCAL cache, so a persisted (already
logged-in) session served stale data — a password changed in the web
vault wouldn't appear until the next fresh login. Add a best-effort
`bw sync` in openSession (the chokepoint every read shares: get, get
--all, list, code, status), so reads reflect current server-side values.

Best-effort by design: a transient sync failure warns on stderr and
falls back to the cached vault rather than failing the read (an AFK
agent shouldn't break on a network blip). status keeps its own explicit
sync so a reachability failure still surfaces in its report.
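
The best-effort ordering (unlock, then sync, then read, with a sync failure downgraded to a warning) can be sketched as follows. This is a Python illustration of the idea, not the CLI's Go openSession; the three callables are hypothetical stand-ins for the `bw` invocations.

```python
import sys


def open_session(unlock, sync, read):
    """Unlock, attempt a sync, then read. All three callables are hypothetical."""
    unlock()
    try:
        sync()
    except Exception as exc:  # best-effort: never fail the read on a sync error
        print(f"warning: sync failed, using cached vault: {exc}", file=sys.stderr)
    return read()
```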

CLI v0.10.1. Tests assert the sync runs after unlock and before the read,
and that a read still succeeds when sync fails.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:19:54 +00:00
Viktor Barzin
3d948c7033 Merge remote-tracking branch 'origin/master' into wizard/upgrade-gate-held
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 10:09:42 +00:00
Viktor Barzin
2880fe1c29 docs: update k8s-version-upgrade runbook for actionable-vs-held gate
Reflect the classification change in the operational runbook: the gate's three
refusal classes (actionable/waiting/pinned), held wins on a mix, refusals now
Complete cleanly (no Failed Job), k8s_upgrade_held gauge + the deliberate
no-alert-for-held, the dropped K8sUpgradeChainJobFailed suppression clause, the
nightly report ⏸️ HELD outcome, and the detector's silent nightly re-evaluation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:09:34 +00:00
Viktor Barzin
eebb6c8594 k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.
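
A minimal sketch of the classification and exit-code rule above. It is illustrative only: the `Addon` fields are hypothetical (the addon-compat.json schema is not shown here), exit 0 for "no blockers" is an assumption, and so is checking `pinned` before looking for a newer version.

```python
from dataclasses import dataclass
from enum import Enum


class BlockClass(Enum):
    ACTIONABLE = "actionable"
    WAITING = "waiting"
    PINNED = "pinned"


EXIT_OK, EXIT_BLOCKED, EXIT_HELD = 0, 2, 4  # 2 and 4 are from the commit; 0 is assumed


@dataclass
class Addon:
    # Hypothetical shape; the real matrix format is not shown in the commit.
    name: str
    current_supports_target: bool
    newer_version_supports_target: bool
    pinned: bool = False


def classify(addon: Addon):
    """Return the blocker class for an addon, or None if it does not block."""
    if addon.current_supports_target:
        return None
    if addon.pinned:
        return BlockClass.PINNED
    if addon.newer_version_supports_target:
        return BlockClass.ACTIONABLE
    return BlockClass.WAITING


def gate(addons):
    """Return (exit_code, {addon_name: BlockClass}) for the target version."""
    blockers = {}
    for addon in addons:
        cls = classify(addon)
        if cls is not None:
            blockers[addon.name] = cls
    if not blockers:
        return EXIT_OK, blockers
    held = any(c in (BlockClass.WAITING, BlockClass.PINNED) for c in blockers.values())
    # Held wins on a mix: one waiting or pinned blocker quiets the whole run.
    return (EXIT_HELD if held else EXIT_BLOCKED), blockers
```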

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00
Viktor Barzin
ccee443790 vault: add get --all to browse every field of an item
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`homelab vault get` could only fetch one of five allow-listed fields and
had no way to see what fields an item even has — in particular it could
not reach arbitrary user-defined custom fields. Add a `--all` flag that
dumps the whole item as a normalized JSON object
(`{name, username?, password?, uris?, totp?, notes?, fields?}`), so a
Claude session can discover and read every field, custom ones included,
in a single call.

Security model preserved:
- Like `get --json`, the dump is all secret values, so it refuses a bare
  TTY (pipe it, e.g. `| jq`); the machine/agent path is stdout.
- The TOTP *seed* is reduced to a presence flag (`"totp": true`) and
  never emitted — the seed is more powerful than a one-time code, so the
  only seed-derived path stays the specially-audited `vault code`. Tests
  assert the seed and password-history never appear in the dump.
- Op-log uses a distinct `get-all` verb (item name still never logged) so
  a bulk dump is distinguishable from a single-field read.
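
A sketch of the normalization described above, in Python for illustration (the real `normalizeItem` is Go). The input layout assumes the Bitwarden CLI item JSON (`login.username`, `login.totp`, `passwordHistory`, and so on), and rendering `fields` as a name-to-value mapping is an assumption.

```python
def normalize_item(item: dict) -> dict:
    """Flatten an item to {name, username?, password?, uris?, totp?, notes?, fields?}."""
    login = item.get("login") or {}
    out = {"name": item.get("name", "")}
    if login.get("username"):
        out["username"] = login["username"]
    if login.get("password"):
        out["password"] = login["password"]
    uris = [u["uri"] for u in (login.get("uris") or []) if u.get("uri")]
    if uris:
        out["uris"] = uris
    if login.get("totp"):
        out["totp"] = True  # presence flag only; the seed is never copied
    if item.get("notes"):
        out["notes"] = item["notes"]
    fields = {f["name"]: f.get("value") for f in (item.get("fields") or []) if f.get("name")}
    if fields:
        out["fields"] = fields
    # passwordHistory and every other key are dropped by construction.
    return out
```

Building the output from an allow-list of keys, rather than deleting sensitive keys from a copy, means a newly added upstream field cannot leak by default.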

`normalizeItem` is a pure, unit-tested core; `getItem` is the
session+fetch seam. CLI bumped to v0.10.0. Docs: README changelog,
onboarding runbook, design spec §16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:49 +00:00
Viktor Barzin
afcd463f39 k8s-upgrade: design doc for actionable-vs-held compat-gate classification
The nightly upgrade chain fails a preflight Job and raises K8sUpgradeBlocked
every night for the 1.36 target, even though the block is unactionable: no
kyverno/ESO release supports 1.36 yet and gpu-operator is deliberately pinned
(NVIDIA driver/Ubuntu coupling). Viktor asked to teach the checker to tell
'we can fix this' apart from 'nothing to do but wait', and stop the nightly
Failed-Job + alert noise for the latter.

This documents the design: classify each blocker as actionable / waiting-
upstream / pinned, keep the alert only for actionable, quiet the held case to
the nightly report, and make deliberate gate decisions Complete cleanly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:01:36 +00:00
Viktor Barzin
b3c419e108 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 09:55:25 +00:00
Viktor Barzin
9a1ab6247b cli: add homelab edges — who-talks-to-whom investigation helper (v0.9.0)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Makes the goldmane_edges east-west trail (ADR-0014) reachable during incident
investigations without remembering the DB/creds/SQL. New top-level verb:

  homelab edges --ns <ns>         edges touching <ns> (either direction)
  homelab edges --src/--dst <ns>  directional egress / ingress peers
  homelab edges --peers-of <ns>   distinct peer namespaces of <ns>
  homelab edges --new-since 24h   first seen since a duration or date (YYYY-MM-DD)
  homelab edges --denied          only action='deny' (blocked / lateral movement)
  homelab edges --json --limit N  machine-readable / row cap (default 200)

Filters render to a single read-only SELECT against the `edge` table, run via
the dbaas CNPG primary pod (same exec path as `k8s db`). Namespace values are
validated to the k8s name charset (injection guard) before they reach SQL.
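
A sketch of validate-then-render for the namespace filters, in Python for illustration (the CLI itself is Go). The column names `src_ns` and `dst_ns` are hypothetical; only the `edge` table and `action='deny'` come from this message. The namespace pattern is the RFC 1123 label rule Kubernetes applies to namespace names.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', alphanumeric at both ends, max 63.
_NS_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def validate_namespace(ns: str) -> str:
    if len(ns) > 63 or not _NS_RE.fullmatch(ns):
        raise ValueError(f"invalid namespace: {ns!r}")
    return ns


def build_edges_query(ns=None, src=None, dst=None, denied=False, limit=200) -> str:
    """Render one read-only SELECT; column names are assumptions."""
    clauses = []
    if ns is not None:
        v = validate_namespace(ns)
        clauses.append(f"(src_ns = '{v}' OR dst_ns = '{v}')")
    if src is not None:
        clauses.append(f"src_ns = '{validate_namespace(src)}'")
    if dst is not None:
        clauses.append(f"dst_ns = '{validate_namespace(dst)}'")
    if denied:
        clauses.append("action = 'deny'")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT * FROM edge{where} LIMIT {int(limit)}"
```

Because the validated charset excludes quotes, spaces, and semicolons, a value that passes the check cannot break out of the quoted literal.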

TDD: edges_test.go covers flag parsing, query building (each filter, AND
combination, peers-of shape, JSON wrapper), the new-since duration/date parser,
and namespace-validation / injection rejection. Smoke-tested live: --peers-of,
--new-since 24h, --denied, and --json all return correct rows.

Docs: runbook query section now leads with the CLI; cli/README gains a v0.9
section. VERSION v0.8.2 -> v0.9.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:51:41 +00:00
Viktor Barzin
0fa5852ec6 homelab v0.8.2: fix memory recall truncating multibyte UTF-8 mid-character
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
emo's Claude Code sessions hit "UserPromptSubmit hook error" on almost every
prompt. Root cause: the homelab-memory-recall.py UserPromptSubmit hook runs
`homelab memory recall <prompt>` and strict-decodes its stdout. printMemories
truncated each memory's preview with a BYTE slice (c[:240]), which cuts through
the middle of a 2-byte Cyrillic character and emits invalid UTF-8 (a dangling
0xd0 lead byte). The hook's subprocess.run(text=True) then raised
UnicodeDecodeError — not caught by its `except (TimeoutExpired, OSError)` — so
the hook exited non-zero and Claude surfaced the error. It is Cyrillic-specific
(ASCII has no multibyte chars to split), so it bit emo (Bulgarian prompts) every
turn while English users almost never saw it.

Two-layer fix:
- cli: truncatePreview() now counts RUNES, not bytes, so the preview never
  splits a character. Regression test asserts valid UTF-8 on a long Cyrillic
  string. Fixes the root for every consumer of `memory recall` / `memory list`.
- hook: subprocess.run gains errors="replace" and the except is broadened to
  honor the script's own "best-effort, exit 0" contract — so a truncated or
  otherwise odd payload can never again surface as a hook error.
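
The failure mode and both layers of the fix can be reproduced in a few lines. This is an illustrative Python sketch, not the Go `truncatePreview()` itself: Python `str` slicing works on code points, which plays the role of Go's rune counting. The helper names and the sample string are made up for the example.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Buggy variant: slices the encoded bytes, so it can cut a character in half."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> bytes:
    """Fixed variant: slices code points first, then encodes."""
    return text[:limit].encode("utf-8")


sample = "здравей"  # 7 Cyrillic letters, 2 bytes each in UTF-8

broken = truncate_bytes(sample, 5)  # 2 full letters plus a lone lead byte
fixed = truncate_chars(sample, 5)   # 5 full letters, 10 bytes

# Layer 1 (cli): the character-based cut always decodes cleanly.
assert fixed.decode("utf-8") == "здрав"

# Layer 2 (hook): a lenient decode turns a stray byte into U+FFFD instead of raising.
lenient = broken.decode("utf-8", errors="replace")
```

A strict `broken.decode("utf-8")` raises `UnicodeDecodeError`, which is the exception the hook's original `except` clause did not cover.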

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:40:51 +00:00
Viktor Barzin
a3eb309e26 calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:28 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.
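
The ~30s and ~80s figures follow from simple probe arithmetic. A rough sketch, assuming `periodSeconds` is 10 (the Kubernetes default; the chart's actual value is not stated in this commit) and ignoring per-probe timeouts:

```python
def readiness_tolerance_seconds(failure_threshold: int, period_seconds: int = 10) -> int:
    """Approximate time a pod may fail readiness before leaving the Service.

    period_seconds=10 is an assumption, not a value read from the chart.
    """
    return failure_threshold * period_seconds


before = readiness_tolerance_seconds(3)  # old failureThreshold
after = readiness_tolerance_seconds(8)   # new failureThreshold
```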

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
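
The effect of narrowing the drop-regex can be checked offline. A small sketch, assuming the narrowed pattern looks like the one below (the exact regex in the scrape config is not quoted in this commit); `fullmatch` is used because Prometheus anchors relabel regexes at both ends.

```python
import re

BLANKET = re.compile(r"traefik_router_.*")
# Hypothetical narrowed pattern; the real one in the scrape job may differ.
NARROWED = re.compile(r"traefik_router_.*_duration_seconds_bucket")


def is_dropped(pattern: re.Pattern, metric_name: str) -> bool:
    """True if a `drop` relabel rule with this regex would discard the metric."""
    return pattern.fullmatch(metric_name) is not None


counter = "traefik_router_requests_total"
histogram = "traefik_router_request_duration_seconds_bucket"

results = {
    "blanket_drops_counter": is_dropped(BLANKET, counter),
    "narrowed_drops_counter": is_dropped(NARROWED, counter),
    "narrowed_drops_histogram": is_dropped(NARROWED, histogram),
}
```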

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
65a09dcbc4 docs(homelab-vault): rebuild snippet uses cli/VERSION, not git describe
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The onboarding runbook's "rebuild the binary" command stamped the version
from `git describe --tags --always`, but setup-devvm.sh stamps it from
`cli/VERSION`. The v0.8.1 tag is no longer reachable from master, so the
describe form silently produced a bare commit sha — diverging from what a
provisioner reconcile stamps. Match the canonical source.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:05:49 +00:00
Viktor Barzin
c53e7839e1 Merge remote-tracking branch 'origin/master' into wizard/vault-addr-default
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-28 09:04:43 +00:00
Viktor Barzin
0525f0b12d homelab vault: self-default VAULT_ADDR + prefer scoped token over ~/.vault-token
Setting up emo's Bitwarden access via `homelab vault`, his one-time
`homelab vault setup` failed with an opaque "exit status 2". Two latent
CLI bugs, both of which any non-admin AFK invocation can hit:

1. The CLI set VAULT_TOKEN but never VAULT_ADDR, relying on the ambient
   value. It IS in /etc/environment (login shells), but emo runs his
   agents from long-lived tmux / non-login shells that never sourced it,
   so every `vault` child hit the 127.0.0.1:8200 default -> connection
   refused. claude-auth-sync already self-defaults VAULT_ADDR; the CLI
   now does the same.

2. Token precedence was env > ~/.vault-token > scoped. A power-user who
   ran `vault login -method=oidc` carries a read-only ~/.vault-token
   (policy `default`, capability `deny` on their workstation path), which
   shadowed the purpose-built scoped token -> 403 permission denied on
   the user's OWN path. This tool only ever touches
   secret/workstation/claude-users/<user>, which the scoped token covers
   exactly, so precedence is now env > scoped > ~/.vault-token. Verified
   the scoped tokens for both emo and wizard hold create/read/update on
   their own paths, so admins are unaffected.
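
The two fixes above amount to a small resolution routine. A Python sketch of the idea, not the CLI's actual code: `resolve_vault_env` and the placeholder default address are hypothetical, while the scoped-token path is the one named in the earlier `homelab vault` commit.

```python
import tempfile
from pathlib import Path

# Placeholder: the real default address baked into the CLI is not shown in this log.
DEFAULT_VAULT_ADDR = "https://vault.example.invalid:8200"


def resolve_vault_env(env: dict, home: Path) -> tuple:
    """Return (addr, token) with token precedence env > scoped > ~/.vault-token."""
    addr = env.get("VAULT_ADDR") or DEFAULT_VAULT_ADDR
    token = env.get("VAULT_TOKEN", "")
    candidates = [
        home / ".config" / "claude-auth-sync" / "vault-token",  # scoped token
        home / ".vault-token",                                   # ambient OIDC token
    ]
    for path in candidates:
        if token:
            break
        if path.is_file():
            token = path.read_text().strip()
    return addr, token


# Demo: a user with both token files present and no VAULT_* variables set.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    scoped = home / ".config" / "claude-auth-sync" / "vault-token"
    scoped.parent.mkdir(parents=True)
    scoped.write_text("scoped-token\n")
    (home / ".vault-token").write_text("read-only-oidc-token\n")

    from_files = resolve_vault_env({}, home)
    from_env = resolve_vault_env(
        {"VAULT_TOKEN": "explicit", "VAULT_ADDR": "http://x:8200"}, home
    )
```

With both files present the scoped token wins, so a read-only `~/.vault-token` can no longer shadow it; an explicit environment value still overrides everything.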

Also stop swallowing the shelled `vault`/`bw` stderr: errors now carry
the real message (connection refused / permission denied) instead of a
bare "exit status N" — without that, (1) and (2) were indistinguishable.

Verified end-to-end as emo (VAULT_ADDR unset + his read-only
~/.vault-token present): writeCreds now succeeds.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:04:28 +00:00
Viktor Barzin
8d1d2fb999 calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.
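
The restart decision reduces to a two-condition check. A hedged Python sketch of that logic: the threshold of 10 and the Goldmane-Ready guard come from this commit, while the function names and the exact log marker the real watchdog matches on are assumptions.

```python
ERROR_THRESHOLD = 10
# Assumed marker, taken from the error text quoted in this commit message.
ERROR_MARKER = "failed to stream flows"


def count_errors(log_lines) -> int:
    """Count whisker-backend log lines that look like goldmane-connection errors."""
    return sum(1 for line in log_lines if ERROR_MARKER in line)


def should_restart(log_lines, goldmane_ready: bool) -> bool:
    """Restart only when errors pile up AND Goldmane itself is healthy."""
    return goldmane_ready and count_errors(log_lines) >= ERROR_THRESHOLD


wedged_logs = [f"ts={i} {ERROR_MARKER}: code = Unavailable" for i in range(12)]
healthy_logs = ["ts=0 streaming flows"]
```

The guard matters because restarting whisker cannot help while Goldmane is down; it would only produce a restart every ten minutes.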

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:31 +00:00
Viktor Barzin
c70810a51b workstation: per-user long-lived Claude token to end concurrent-refresh logout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A heavy user (emo) runs 8+ always-on `claude` agents + their t3-serve instance,
all sharing one ~/.claude/.credentials.json. When the shared access token expires
the processes refresh simultaneously; OAuth refresh-token rotation makes the
losing writer persist an EMPTY refresh token, logging the user out roughly every
access-token lifetime (~8h). Re-issuing the credential never sticks — the race
recurs (this is why emo's "standalone token" fix kept regressing).

Fix: an opt-in, per-user, non-rotating setup-token (sk-ant-oat01, ~1y, scope
user:inference) kept in the user's OWN Vault path (field `setup_token`).
claude-auth-sync materializes it to a user-owned
~/.config/claude-auth-sync/claude-oauth.env and, while it is present, SKIPS the
rotating-credential validate/backup/restore (so no false
WorkstationClaudeAuthInvalid). start-claude.sh and t3-serve@.service load it as
CLAUDE_CODE_OAUTH_TOKEN, so every session of that user uses the non-rotating
token and there is nothing to race on.

Fail-safe + opt-in: with no `setup_token` in Vault, every path is a no-op, so
users on the normal per-user Enterprise-SSO flow are unaffected. This is each
user's OWN identity, never the forbidden shared CLAUDE_CODE_OAUTH_TOKEN. Runbook
documents enable/disable/rotate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:07:43 +00:00
Viktor Barzin
3cc8f9f661 paperless-ngx: keep mem limit at 8Gi (tier LimitRange caps containers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit set the limit to 10Gi, but the shared tier-defaults
LimitRange caps per-container memory at 8Gi, so the rollout's new pod was
forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for
6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored
service live via kubectl patch; this commit matches TF to the live 8Gi so
drift detection won't re-revert it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:37:59 +00:00
Viktor Barzin
21d20dccf8 paperless-ngx: bulk-import via PVC consume dir (restart-safe) + 6 workers
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-document import was going through the API upload path, which
stages each file on the pod's EPHEMERAL scratch before queuing it. Any
paperless pod or redis restart therefore destroyed all in-flight work
(the "File not found" failures we hit) and required manual re-uploads.

Move bulk ingest to paperless's consume directory placed on the encrypted
PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned
periodically (and on startup) with a file-stability check. Files now live
on durable storage and survive any restart — the folder is the queue and
self-heals, so we can copy everything in fast and let it process over
time with zero retry/integrity risk. RECURSIVE preserves the source tree
(avoids basename collisions); owner+tag come from a consumption workflow.

Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6
has the core headroom for one pod) and mem limit 8->10Gi for the extra
workers. Revert workers/mem/consume envs to defaults once the import ends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:35:10 +00:00
Viktor Barzin
2cb37d51d4 paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
  office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
  OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
  (requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born-
  digital/office docs (big IO saver for the work-doc set).
Guard + etcd watch stay in place; revert to defaults after the import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 18:45:25 +00:00
Viktor Barzin
d6bd9486e3 Merge remote-tracking branch 'origin/master' into wizard/portal-onboarding-paths
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
2026-06-27 16:34:44 +00:00
Viktor Barzin
fca948a23d k8s-portal: document all three cluster-access paths in onboarding
The Getting Started portal only walked through the heaviest path (local VPN +
kubectl + Vault + sops install) and never mentioned the two zero-setup routes
that users actually reach first. Restructure onboarding to lead with all three,
recommendation first: (A) the t3 web terminal, which drops you into a ready
shell with kubectl/Vault/repos preinstalled; (B) the k8s web dashboard,
auto-authenticated per user; and (C) the existing own-machine setup.

Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable,
reframe the misleading home-page 'VPN required' banner (only path C needs it),
add the access endpoints to the service catalog, and fix a stale Vaultwarden URL
(was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is
vaultwarden.viktorbarzin.me).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:34:36 +00:00
Viktor Barzin
9599beadc9 paperless-ngx: 2 task workers + 2 threads/worker + 4Gi limit for the Emo bulk import
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc =
multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x
throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs.
Kept deliberately modest: archive writes hit the shared sdc HDD that etcd
also lives on (IO-storm risk, code-oflt) — watch etcd apply latency and
revert workers to 1 if it degrades. Revert to defaults once the import is done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:33:43 +00:00
d4f564e8d5 Merge pull request 'docs(ci-cd): plotting-book build→ghcr→deploy flow diagram' (#16) from wizard/plotting-doc into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:50:02 +00:00
Viktor Barzin
0097bddf9f docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram
ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's
repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker
redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno
admission / Keel backstop / Vault pull-cred supporting cast and the
serving path. Appended to the existing plotting-book section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:49:58 +00:00
Viktor Barzin
bbc797b30e ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
  - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
    Invalid for_each (local.k8s_users from secret/platform is deferred)
  - drift: terragrunt plan exits 1 → fails the whole nightly run

vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
  PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows

Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:48:41 +00:00
81c2b14e29 Merge pull request 'plotting-book: pull image from private ghcr instead of public DockerHub' (#15) from wizard/plotting-ghcr into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 15:32:35 +00:00
Viktor Barzin
c13a3f1694 plotting-book: pull image from private ghcr instead of public DockerHub
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:

- stacks/plotting-book: point the deployment baseline image at the ghcr
  package and add imagePullSecrets {ghcr-credentials} so the pod can pull
  the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
  allowlist so the Kyverno generate policy clones the pull secret into it.
  Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
  can read the private package before wiring this.

Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:32:19 +00:00
Viktor Barzin
bf40409141 docs(security): note crowdsec-cf-sync rate-limit resilience
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the
cf-sync architecture description, with the why (the backoff_limit=2
retry-storm that escalated Cloudflare's Lists-API throttle into a stuck
state). Follow-up to 5b49634f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:27:44 +00:00
Viktor Barzin
5b49634fe0 rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within
seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's
per-60s Lists-API write limit. That kept the throttle perpetually tripped
(it stopped clearing even after minutes of quiet) — a self-inflicted DoS.

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
  retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
  cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
  retry. Any other CF error still fails loud.
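
The soft-skip rule can be pictured as a mapping from the Cloudflare response status to the job's exit code. This is a hypothetical sketch, not the contents of `lapi_kv_sync.py`; the function name is invented for illustration.

```python
def exit_code_for_cf_status(status: int) -> int:
    """Map a Cloudflare Lists-API HTTP status to the CronJob's exit code."""
    if 200 <= status < 300:
        return 0  # sync succeeded
    if status == 429:
        return 0  # rate-limited: soft-skip, the next */2 run is the retry
    return 1      # any other error still fails loud


codes = {status: exit_code_for_cf_status(status) for status in (200, 429, 403, 500)}
```

Exiting 0 on a 429, combined with `backoff_limit=0`, means a throttled cycle produces exactly one request and no pod-level retries.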

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:23:42 +00:00
Viktor Barzin
7c72368243 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-27 13:54:23 +00:00
Viktor Barzin
f92ab04dae vault: grant emo read-only access to his own secret/emo
emo (power-user tier) had no Vault policy granting his personal secret
path, so `vault kv get secret/emo` failed. Viktor asked to give him that
access. Adds a read-only `personal-emo` policy (read on secret/data/emo +
metadata) and attaches it to emo's OIDC identity by adopting the
entity/alias Vault auto-created on his first login. Scoped explicitly to
emo; does not widen the power-user tier (which stays secret-less).

Verified live: a personal-emo token reads secret/emo, is denied writes,
and is denied other paths (secret/viktor -> 403).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:35:57 +00:00
Viktor Barzin
90f5425cdc state(vault): update encrypted state
2026-06-27 13:33:34 +00:00
a7117e0bfe immich(frame-emo): bump photo-frame Interval 30->45s
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame
slideshow from 30s to 45s per image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:07:00 +00:00
Viktor Barzin
d50962b00e immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to
Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years
photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app.
Dedicated API key minted on Emo's account and stored in Vault
(secret/immich -> frame_api_key_emo).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:40:29 +00:00
Viktor Barzin
e8b72019b5 paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Emo's import scope now includes his work-PC document set (C/Documents,
Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office
files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless
can only archive/OCR/index those if it can convert them, so add the standard
Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their
services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them.
Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC.

Total in-scope document set across all NAS locations is now ~13,700 PDF+Office
files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC
autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm
autoresizer grows on demand up to the ceiling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:02:04 +00:00
Viktor Barzin
041aedc486 Merge remote-tracking branch 'origin/master' into wizard/paperless-emo
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-27 08:17:28 +00:00
Viktor Barzin
7988a690ed paperless-ngx: add Bulgarian OCR (bul+eng) + raise data PVC ceiling to 30Gi
Preparing Paperless for Emo's document import from the NAS. His archive is
Bulgarian (Cyrillic) + English, but OCR was English-only (tesseract had no
'bul' pack and PAPERLESS_OCR_LANGUAGE was unset/defaulted to eng), so scanned
BG documents would OCR to garbage and be unsearchable. Add bul to the install
list and set OCR_LANGUAGE=bul+eng.

Also raise the data PVC autoresize ceiling from 5Gi to 30Gi: everything
(originals + archive via PAPERLESS_MEDIA_ROOT=../data) lives on the single
encrypted PVC, and the ~2.7GB in-scope import would blow past the 5Gi cap
mid-ingest. The topolvm autoresizer grows the volume on demand up to the
ceiling; 30Gi gives ample headroom.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:17:13 +00:00
Viktor Barzin
6415f77fed Merge remote-tracking branch 'origin/master' into wizard/emo-vault-onboard
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was canceled
2026-06-27 08:17:06 +00:00
Viktor Barzin
b371ae6eee homelab vault: install bw system-wide + onboarding runbook
Two remaining gaps to let non-admins (emo) use `homelab vault`:

- setup-devvm.sh installed `@bitwarden/cli` only when `command -v bw`
  failed, which an admin's own ~/.local/bin/bw satisfied — so the
  system-wide copy was never installed and non-admins had no `bw`
  backend. Install to the npm /usr prefix and guard on the system path
  (/usr/bin/bw) instead.

- Add docs/runbooks/homelab-vault-onboarding.md (per-user setup, the
  shared Organization/Collection flow for sharing passwords, admin
  deploy + verification, security model) and repoint the two code
  comments that cited a design-spec path which never existed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:16:52 +00:00
Viktor Barzin
51dc5d031c homelab vault: make it work for non-admin workstation users
`homelab vault` was effectively admin-only: two bugs blocked every
non-admin (e.g. emo) from using it for their own Vaultwarden vault.

1. Token: the CLI relied purely on ambient `vault` auth (~/.vault-token
   / $VAULT_TOKEN), which only admins have. Non-admins carry a scoped
   token at ~/.config/claude-auth-sync/vault-token (policy
   workstation-claude-<user>). Add ensureVaultToken(): explicit env >
   ~/.vault-token > scoped fallback, wired into every vault verb. Admins
   are unaffected (their ambient token wins).

2. Write capability: `homelab vault setup` used plain `vault kv patch`,
   which needs the `patch` capability the scoped policy does not grant
   (only create/read/update) — so setup 403'd for non-admins. Switch to
   `kv patch -method=rw` (read-modify-write; same approach
   claude-auth-sync already uses), with `kv put` only when the path
   doesn't exist yet. Preserves co-located keys (claude_ai_oauth_json).
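
Read-modify-write means the client reads the whole secret, merges in the new keys, and writes the result back, which needs only read plus create/update. A small Python sketch of the merge step: `claude_ai_oauth_json` is the co-located key named above, while the function name and the `bw_*` key names are hypothetical.

```python
def read_modify_write(existing, updates: dict) -> dict:
    """Return the payload to write back.

    existing is None when the path does not exist yet (the plain `kv put` case);
    otherwise the new keys are merged over the current ones.
    """
    if existing is None:
        return dict(updates)
    merged = dict(existing)
    merged.update(updates)
    return merged


current = {"claude_ai_oauth_json": "{...}"}
new_creds = {"bw_client_id": "id", "bw_client_secret": "secret"}  # hypothetical keys

written = read_modify_write(current, new_creds)
first_write = read_modify_write(None, new_creds)
```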

Enables onboarding emo onto the per-user Vaultwarden access tool.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:42 +00:00
Viktor Barzin
82a7b2585b chrome-service: reconcile state after pipeline #366 was killed mid-apply + document cancel-previous hazard
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline #366 (the SHA-pin apply, commit 7b4a8ba8) was SIGKILLed mid-apply by
Woodpecker cancel-previous when I pushed the next commit (#367, docs) while it
was still running — the apply log ends at '[chrome-service] Starting apply...'
with no 'Apply complete!', so the terraform state write did not finish. The live
deployment is correct (image = the supervised SHA, verified, self-healing), but
the stored state may be stale; this commit re-triggers a clean changed-stack
apply to reconcile it (comment-only change → 0 resource changes, no rollout).

Also adds a caution to the novnc image comment: after bumping the SHA, WAIT for
the apply pipeline to finish before pushing again (memory id=1957).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:15:41 +00:00
Viktor Barzin
006f97ef58 docs: bless local terragrunt apply, but require committing every applied change
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.

- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
  apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
  (not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
  requirement that every applied change is committed + pushed to master the same
  session so the repo stays the source of truth and CI drift-detection doesn't
  revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
  an option alongside CI auto-apply, with the same 'always committed, never
  applied uncommitted' rule.

Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:10:20 +00:00
Viktor Barzin
7b4a8ba867 chrome-service: pin noVNC image to the x11vnc-supervision build
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Deploys the self-heal fix from the previous commit. Keel is off for this
deployment (keel.sh/policy=never, because the browser container's playwright
image is version-pinned to f1-stream) and the novnc image was :latest with
imagePullPolicy=IfNotPresent, so a rebuilt :latest would NOT be re-pulled on a
rollout — the supervised entrypoint would never reach the running pod.

Pin novnc to :19d0f0933a (the build of the prior
commit; ghcr digest sha256:5b783ac6, == :latest) so the stack apply rolls the
sidecar onto the new image. Future novnc entrypoint changes deploy by bumping
this tag after build-chrome-service-novnc.yml publishes a new SHA tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:04:55 +00:00
Viktor Barzin
19d0f0933a chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
relaunched: x11vnc lingered as a defunct (zombie) process, nothing listened on
:5900, websockify kept returning 'Connection refused', and the view stayed black
until a manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.
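
The supervision pattern described above can be sketched roughly as follows.
This is an illustration only, not the actual entrypoint: `supervise_pair` and
the stand-in commands are hypothetical, and the real script would launch x11vnc
and websockify (after waiting for Xvfb) in their place. It needs bash for
`wait -n`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run two children in the background and fail as soon as
# either one exits, so the kubelet restarts the whole container.
supervise_pair() {
  local cmd_a="$1" cmd_b="$2"

  bash -c "$cmd_a" &   # stand-in for x11vnc
  local pid_a=$!
  bash -c "$cmd_b" &   # stand-in for websockify
  local pid_b=$!

  # wait -n (bash 4.3+) returns when the first background job exits.
  wait -n
  local status=$?

  # Stop the surviving child and reap both before reporting failure.
  kill "$pid_a" "$pid_b" 2>/dev/null
  wait "$pid_a" "$pid_b" 2>/dev/null

  # Any child exit, even a clean one, counts as a container failure.
  echo "a supervised child exited (status $status); exiting non-zero"
  return 1
}
```

An entrypoint built this way would end with something like
`supervise_pair "<x11vnc command>" "<websockify command>" || exit 1`, so neither
process is exec'd as PID 1 and the death of either one is visible to the kubelet.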

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:03:29 +00:00
Viktor Barzin
abb15cd49d devvm: personalize emo's cluster-health skill for ha-sofia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo cares about ha-sofia + his Sofia smart-home devices (Tuya, the MPPT
ATS, the Барзини → Статус dashboard), and only about the cluster when it's
breaking those. Rewrite his vendored cluster-health into an ha-sofia-focused,
read-only variant:
- leads with ha-sofia's in-cluster dependency chain (tuya-bridge + the
  cloudflared/Traefik/DNS/TLS reachability path), all checkable read-only;
- fixes the script path to emo's own clone (/home/emo/code) — he can't read
  wizard's tree — and runs it --no-fix (he's cluster read-only);
- loads emo's own HA token (see below) so the ha-sofia checks (26-29, 45)
  actually run for him; documents the host-SSH/Vault checks that skip;
- triages: cluster FAIL/WARN matters only if on his chain; everything else is
  a one-line "admin's area"; escalate via /file-issue since he can't fix.

This snapshot copy is now an emo-specific variant, intentionally diverged
from the canonical 47-check admin skill — README updated to say "do not
re-sync from canonical".

Token: a dedicated long-lived HA token (client_name emo-cluster-health) was
minted on ha-sofia via the admin account and stored emo-readable at
/home/emo/.config/cluster-health/haos_token (600). It carries admin HA scope
(HA only mints tokens for the authenticating account); independently revocable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 16:03:14 +00:00
Viktor Barzin
fc83595f5e devvm: vendor cluster-health into per-user agent-skill snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make cluster-health a user-global skill for emo (the lone entry in the
provisioner's SKILL_USERS allowlist), so it's available from any directory
— not only when working inside the infra clone where it already exists as a
project skill (.claude/skills/cluster-health). install_skills() in
t3-provision-users.sh copies the vendored snapshot into ~/.agents/skills/ and
symlinks ~/.claude/skills/, so this is the durable, rebuild-surviving path.

cluster-health is homelab-local (vendored from this repo's own
.claude/skills/), unlike the other snapshot entries which mirror upstream
mattpocock/skills + vercel-labs/skills; README documents its provenance and
the explicit re-sync step so the vendored copy doesn't silently drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:20:19 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; the value
  was already switched and applied in the previous change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
196d0db4bd rbac/apiserver-oidc: back up the apiserver manifest OUTSIDE /etc/kubernetes/manifests
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The SSO restore script backed up the live manifest with
`cp "$MANIFEST" "$MANIFEST.bak.$TS"` — i.e. INSIDE /etc/kubernetes/manifests/.
The kubelet treats every file in that dir as a static pod, so the .bak became a
SECOND kube-apiserver static pod. While both copies were identical it was
harmless, but the instant `kubeadm upgrade` changed the real manifest's image to
v1.35.6, the kubelet saw two same-named pods with different specs and flip-flopped
(pod attempt count hit 13) — the new apiserver never stabilised, so kubeadm timed
out on "static Pod hash did not change after 5m" and rolled back. THIS was the
real cause of the 1.34->1.35 upgrade stalling for days (not etcd IO, which was a
downstream symptom of the flip-flopping apiserver hammering etcd).

Fix: write backups to a dedicated dir OUTSIDE the static-pod dir
(/etc/kubernetes/apiserver-oidc-bak/) and read the rollback copy from there. The
stray .bak that planted the landmine on 2026-06-18 was moved out manually
2026-06-26; this prevents the SSO script (and the upgrade chain's restore.sh,
which is the same script) from ever re-creating it.
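
A minimal sketch of the safer backup step, assuming a hypothetical helper named
`backup_manifest` (the real restore script is not shown here). The guard is a
simple lexical prefix check on an absolute backup path, so it is illustrative
rather than exhaustive.

```shell
#!/usr/bin/env bash
# backup_manifest MANIFEST BACKUP_DIR  (hypothetical helper)
# Copies MANIFEST to BACKUP_DIR/<name>.bak.<timestamp> and refuses to write
# anywhere under the manifest's own directory, because the kubelet treats
# every file in the static-pod directory as a pod manifest.
backup_manifest() {
  local manifest="$1" backup_dir="$2"
  local manifest_dir ts dest

  manifest_dir=$(cd "$(dirname "$manifest")" && pwd) || return 1

  # Lexical check only: BACKUP_DIR is expected to be an absolute path.
  case "$backup_dir/" in
    "$manifest_dir"/*)
      echo "refusing to back up inside $manifest_dir" >&2
      return 1
      ;;
  esac

  mkdir -p "$backup_dir" || return 1
  ts=$(date +%Y%m%d-%H%M%S)
  dest="$backup_dir/$(basename "$manifest").bak.$ts"
  cp "$manifest" "$dest" || return 1
  echo "$dest"   # print the backup path so a rollback step can read it back
}
```

With the paths named in this commit, the call would look like
`backup_manifest /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/apiserver-oidc-bak`;
the manifest filename is an assumption here.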

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 10:29:19 +00:00
Viktor Barzin
5d33327c30 postiz: repoint postgres-backup CronJob at CNPG (was failing on removed host)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The postiz-postgres-backup CronJob still dumped from the chart's bundled
`postiz-postgresql` host with a hardcoded `postiz-password`. That bundled
PostgreSQL was removed when postiz migrated to the shared CNPG cluster, so
the host no longer resolves (NXDOMAIN) and every nightly run failed —
firing BackupCronJobFailed, and leaving the postiz DB with no logical dump
in the offsite pipeline.

Connect via the app's own DATABASE_URL (from the postiz-secrets Secret,
postgresql://postiz:…@pg-cluster-rw.dbaas.svc.cluster.local/postiz) instead
of a hardcoded host/user/password, so the backup tracks the live DB and
credentials. Verified with a one-off test job: psql + pg_dump 16.4 connect
to CNPG 16.9 and produce a 180K custom-format dump.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:34:42 +00:00
Viktor Barzin
1bca799bb4 monitoring: give kube-state-metrics a 512Mi memory limit (Burstable)
Some checks failed
ci/woodpecker/push/default Pipeline failed
kube-state-metrics had no explicit resources, so the monitoring-namespace
LimitRange pinned it to requests=limits=256Mi (Guaranteed QoS). KSM idles
around 45Mi but momentarily spikes past 256Mi during a full object relist
(450+ pods, 150+ jobs, all secrets/endpoints) and gets OOMKilled. Each OOM
blacks out the KSM-exported series that ~10 alert rules read, so they all
fire false "<svc>Down" criticals at once and self-resolve when KSM recovers
~5 min later — exactly the alert storm seen at 2026-06-26 08:42 UTC.

Set explicit Burstable resources: keep the request low (64Mi, just above
idle) so we don't reserve memory we don't use, and raise only the limit to
512Mi to absorb the relist peak. No CPU limit, per the cluster-wide policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 09:06:31 +00:00
Viktor Barzin
d105713ae7 fix(workstation): claude-auth-sync must merge, not overwrite, the shared Vault path
All checks were successful
ci/woodpecker/push/default Pipeline was successful
cas_backup did `vault kv put secret/workstation/claude-users/<user>`, a full
KV-v2 replace that rewrote the document with only its 3 OAuth keys. Because
`homelab vault setup` co-locates the user's vaultwarden_* credentials on that
same path, every six-hourly sync silently deleted them — so `homelab vault`
reported "not configured" within hours of each setup. (Reported as: homelab
vault "keeps getting reset / logged out", set up 3 times.)

Switch the backup to a merge: `kv patch -method=rw` (read+update, needs no
`patch` capability) when the path exists, and `kv put` only to create it on the
first backup. Add a regression test with a fake vault asserting a pre-existing
sibling key survives a backup, and document the merge requirement in the
renewal runbook.
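
The merge-instead-of-replace logic might look roughly like this. `backup_merge`
and the `VAULT_BIN` override are hypothetical names introduced for illustration
(the override is what lets a fake vault stand in during a test); `kv get`,
`kv patch -method=rw` and `kv put` are the Vault CLI calls named above.

```shell
#!/usr/bin/env bash
# VAULT_BIN is an assumption: it allows a test to swap in a fake vault binary.
VAULT_BIN=${VAULT_BIN:-vault}

# backup_merge PATH KEY=VALUE...  (hypothetical helper)
backup_merge() {
  local path="$1"
  shift
  if "$VAULT_BIN" kv get "$path" >/dev/null 2>&1; then
    # Path exists: read-modify-write, so sibling keys on the path survive.
    "$VAULT_BIN" kv patch -method=rw "$path" "$@"
  else
    # First backup: nothing to merge with, so create the document.
    # Caveat: a transient `kv get` failure would also land here.
    "$VAULT_BIN" kv put "$path" "$@"
  fi
}
```

In this sketch the key/value pairs are plain arguments; real credential values
would be better passed in the `key=@file` or `key=-` forms so they stay out of
argv.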

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:33:41 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
8121d8a4ac docs(adr): add ADR-0015 (OS/sudo is the authorization boundary), supersede ADR-0011 privacy norm
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor (owner) wants agents to stop refusing file reads the OS already permits. wizard holds passwordless root ((ALL) NOPASSWD: ALL), so the managed-settings rule 'never read another user's ~/.claude' was stricter than the OS itself. The managed-settings policy (/etc/claude-code/managed-settings.json) was updated out-of-band to defer to OS/sudo authorization with no extra prompt; backup kept at .bak-2026-06-26. This ADR records the decision, its symmetry across sudo-holders, and the larger blast radius.

ADR-0011's usage-telemetry design is unchanged; only the cross-user privacy norm it referenced is superseded. The original ask was to delete ADR-0011 — superseded instead to preserve the audit trail and the ADR-0012/0013 references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:22:29 +00:00
Viktor Barzin
ebc8b6588f ESO: add force_conflicts to all ExternalSecret manifests (fleet sweep)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The 2026-06-22 external-secrets v1 migration made the ESO controller the
server-side-apply owner of .spec.refreshInterval on every ExternalSecret, so any
stack defining one via kubernetes_manifest fails `terraform apply` with a
field-manager conflict the next time it's applied (instagram-poster + grafana hit
this on 2026-06-24; it was latent across the whole fleet). Add
field_manager { force_conflicts = true } to all 101 remaining ExternalSecret
manifests across 70 stacks, matching the fix already on grafana / woodpecker /
traefik / k8s-version-upgrade / instagram-poster. TF and ESO set the same value,
so it's stable (no perpetual drift). Defuses the landmine before each stack's
next apply trips it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 21:28:11 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.
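
A preflight prune along these lines could be sketched as below. The function
name and its default are hypothetical; only the directory layout
(/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/) and the 3-day cutoff come from
this commit.

```shell
#!/usr/bin/env bash
# prune_kubeadm_backups TMP_DIR [MAX_AGE_DAYS]  (hypothetical helper)
# Removes kubeadm-backup-etcd-* directories directly under TMP_DIR whose
# mtime is older than MAX_AGE_DAYS. find truncates age to whole days, so
# "+3" matches directories at least four full days old.
prune_kubeadm_backups() {
  local tmp_dir="$1" max_age_days="${2:-3}"

  # Nothing to do if kubeadm has never created the directory.
  [ -d "$tmp_dir" ] || return 0

  find "$tmp_dir" -mindepth 1 -maxdepth 1 -type d \
    -name 'kubeadm-backup-etcd-*' -mtime +"$max_age_days" \
    -exec rm -rf {} +
}
```

Restricting the match to the `kubeadm-backup-etcd-*` name and to depth 1 keeps
the prune from touching anything else kubeadm may leave in the same directory.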

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.
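
The preflight gate's core check could be sketched like this, assuming the diff
arrives as unified-diff text on stdin. `drops_auth_config` is a hypothetical
name, and the pattern is deliberately simple: it would also flag a changed
(removed and re-added) flag value.

```shell
#!/usr/bin/env bash
# drops_auth_config  (hypothetical helper)
# Reads `kubeadm upgrade diff` output on stdin and exits 0 when a removed
# line ("-" prefix, but not the "---" file header) mentions
# --authentication-config; exits 1 otherwise.
drops_auth_config() {
  awk '
    /^-[^-].*--authentication-config/ { found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

A gate would pipe the diff output into the function and block (or alert) when it
succeeds, before any drain of the master is attempted.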

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current:
- it claimed HA 2025.9.1 on a docker-run container; it is actually HAOS 2026.5.2 (managed);
- it listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities; the integration was revived via elsbrock/cowboy-ha v1.2.0, so the entities are now sensor.classic_performance_*;
- it didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken), or that Tapo P100 is failing.

Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
Viktor Barzin
1d0388da12 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:22:58 +00:00
Viktor Barzin
92361f36db calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability)
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.

Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.

Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:22:48 +00:00
Viktor Barzin
e711b2f971 feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build infra CLI / build (push) Has been cancelled
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
2026-06-24 10:31:32 +00:00
Viktor Barzin
64104e56e9 feat(devvm): install Bitwarden CLI for homelab vault 2026-06-24 10:29:57 +00:00
Viktor Barzin
15643d1f44 feat(cli): bare homelab vault help command 2026-06-24 10:29:32 +00:00
Viktor Barzin
772aed5370 fix(cli): vault security review fixes
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` now refuses to run on a TTY (it was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
    secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
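
The C1 fix (secret via stdin, not argv) can be illustrated with a small shell
sketch. The CLI itself is not shell, so this only mirrors the idea:
`put_secret_via_stdin` and `VAULT_BIN` are hypothetical names, while the `key=-`
form is the Vault CLI's way of reading a value from stdin.

```shell
#!/usr/bin/env bash
# VAULT_BIN is an assumption: it allows a test to swap in a fake vault binary.
VAULT_BIN=${VAULT_BIN:-vault}

# put_secret_via_stdin PATH KEY  (hypothetical helper)
# The secret value is read from this function's stdin. Only the path and the
# key name appear in argv, so /proc/<pid>/cmdline never contains the secret.
put_secret_via_stdin() {
  local path="$1" key="$2"
  "$VAULT_BIN" kv patch "$path" "$key=-"
}
```

A caller would feed the value with a shell builtin, for example
`printf '%s' "$secret" | put_secret_via_stdin <path> <key>`, since builtins do
not spawn a process whose argv could be inspected.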
2026-06-24 10:28:31 +00:00
Viktor Barzin
5a864cf19c feat(cli): homelab vault setup onboarding (one-time, self-service) 2026-06-24 10:21:57 +00:00
Viktor Barzin
e20033855d feat(cli): vault list/search/code/status/lock 2026-06-24 10:21:07 +00:00
Viktor Barzin
365340b37d feat(cli): homelab vault get with TTY-aware return 2026-06-24 10:20:05 +00:00
Viktor Barzin
2dd12fc6be feat(cli): vault session bootstrap with per-user flock + no-coredump 2026-06-24 10:18:36 +00:00
Viktor Barzin
5bae2a3907 feat(cli): privacy-aware vault op-log (process, never the secret) 2026-06-24 10:17:50 +00:00
Viktor Barzin
81122f8607 feat(cli): TTY-aware return + OSC52 clipboard with terminal gating 2026-06-24 10:17:13 +00:00
Viktor Barzin
06f4b87af1 feat(cli): vault bw engine env/arg builders + unlock 2026-06-24 10:16:19 +00:00
Viktor Barzin
cd44ca5921 feat(cli): vault creds loading from per-user Vault path 2026-06-24 10:15:32 +00:00
Viktor Barzin
6c53ee10b1 feat(cli): register homelab vault command group skeleton 2026-06-24 10:14:24 +00:00
Viktor Barzin
ae0d7984c4 docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.

Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.

Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:00:36 +00:00
Viktor Barzin
0293b5c634 android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)

Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.

Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.
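
The capture-then-parse shape of the fix can be sketched as follows. The key
names and sample text are made up for illustration (they are not real
`dumpsys power` fields), and `first_value` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# first_value KEY TEXT  (hypothetical helper)
# Prints the value of the first "KEY=VALUE" line in TEXT. awk keeps reading
# to END instead of exiting on the first match, so the writer feeding it can
# never be hit by SIGPIPE.
first_value() {
  local key="$1" text="$2"
  printf '%s\n' "$text" | awk -F= -v k="$key" '
    $1 == k && !seen { val = $2; seen = 1 }   # remember first match, keep reading
    END { if (seen) print val }
  '
}

# Capture the whole output first; a failed command just leaves raw empty.
# The printf here is a stand-in for the real `kubectl exec ... dumpsys` call.
raw=$(printf 'uptime_ms=900\nlast_activity_ms=100\nlast_activity_ms=500\n') || true

last=$(first_value last_activity_ms "$raw")
if [ -z "$last" ]; then
  echo "no reading; do not sleep"   # fail-safe branch
else
  echo "last_activity_ms=$last"
fi
```

Because nothing in the pipeline exits early, the sketch behaves the same with or
without `pipefail`, and an empty capture falls through to the fail-safe branch.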

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:36 +00:00
Viktor Barzin
839fdb33c2 android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.
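
A rough sketch of the activity-based decision and the direct replicas patch. The
function names, the millisecond units and the argument order are assumptions
for illustration; the 6h threshold and the fail-safe behaviour come from the
description above.

```shell
#!/usr/bin/env bash
# should_sleep UPTIME_MS LAST_ACTIVITY_MS [THRESHOLD_HOURS]  (hypothetical)
# Succeeds only when both readings are plain non-negative integers and the
# time since the last activity is at least the threshold.
should_sleep() {
  local uptime_ms="$1" last_ms="$2" hours="${3:-6}"

  # Fail-safe: a missing or non-numeric reading means "do not sleep".
  [ -n "$uptime_ms" ] && [ -n "$last_ms" ] || return 1
  case "$uptime_ms$last_ms" in *[!0-9]*) return 1 ;; esac

  local idle_ms=$(( uptime_ms - last_ms ))
  [ "$idle_ms" -ge $(( hours * 3600 * 1000 )) ]
}

# scale_to_zero NAMESPACE DEPLOYMENT  (hypothetical)
# Patches spec.replicas on the Deployment itself, which needs only the
# `deployments` patch grant rather than `deployments/scale`.
scale_to_zero() {
  local ns="$1" deploy="$2"
  kubectl -n "$ns" patch deployment "$deploy" --type=merge \
    -p '{"spec":{"replicas":0}}'
}
```

The two would be combined as
`should_sleep "$uptime" "$last" && scale_to_zero <namespace> <deployment>`, so
any unreadable value leaves the emulator running.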

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00
Viktor Barzin
566447a698 k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.
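
The gate's comparison could be sketched as below, assuming the plan output is
piped in on stdin. `plan_target` and `check_plan_target` are hypothetical
names; the "kubeadm upgrade apply vX.Y.Z" line they look for is the one quoted
above.

```shell
#!/usr/bin/env bash
# plan_target  (hypothetical helper)
# Reads `kubeadm upgrade plan` output on stdin and prints the version from
# the last "kubeadm upgrade apply vX.Y.Z" line, without the leading "v".
plan_target() {
  awk '
    {
      for (i = 1; i + 3 <= NF; i++)
        if ($i == "kubeadm" && $(i + 1) == "upgrade" && $(i + 2) == "apply")
          target = $(i + 3)
    }
    END { sub(/^v/, "", target); if (target != "") print target }
  '
}

# check_plan_target WANT  (hypothetical helper)
# Succeeds only when the plan on stdin proposes exactly WANT.
check_plan_target() {
  local want="$1" got
  got=$(plan_target)
  [ "$got" = "$want" ]
}
```

The fixed gate would then run
`kubeadm upgrade plan "v$TARGET_VERSION" | check_plan_target "$TARGET_VERSION"`,
passing the explicit version so an older kubeadm does not fall back to the
current minor.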

Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 06:06:14 +00:00
Viktor Barzin
98d2b89614 calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi
startup spike (re-listing resources to build informer caches), both at/over the
256Mi limit, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 12:46:28 +00:00
Viktor Barzin
68c240b8de Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-23 09:56:25 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
ff4b01a674 state(external-secrets): update encrypted state 2026-06-23 09:53:36 +00:00
Viktor Barzin
e1a85dd727 state(external-secrets): update encrypted state 2026-06-23 09:52:30 +00:00
Viktor Barzin
af22416d6f state(external-secrets): update encrypted state 2026-06-23 09:51:21 +00:00
Viktor Barzin
c75982f408 state(external-secrets): update encrypted state 2026-06-23 09:50:11 +00:00
Viktor Barzin
0407e3c578 state(external-secrets): update encrypted state 2026-06-23 09:48:33 +00:00
Viktor Barzin
dab8f9446f state(external-secrets): update encrypted state 2026-06-23 09:47:24 +00:00
Viktor Barzin
e815bb0295 state(external-secrets): update encrypted state 2026-06-23 09:46:17 +00:00
Viktor Barzin
8412cd7d54 state(external-secrets): update encrypted state 2026-06-23 09:45:04 +00:00
Viktor Barzin
f2956e1e62 state(external-secrets): update encrypted state 2026-06-23 09:43:57 +00:00
Viktor Barzin
bf2f865eee state(external-secrets): update encrypted state 2026-06-23 09:42:52 +00:00
Viktor Barzin
6f3cfb18c7 state(external-secrets): update encrypted state 2026-06-23 09:41:46 +00:00
Viktor Barzin
6e8e066215 state(external-secrets): update encrypted state 2026-06-23 09:40:14 +00:00
Viktor Barzin
de1fb04d9f state(external-secrets): update encrypted state 2026-06-23 09:39:12 +00:00
Viktor Barzin
606cfdb544 state(external-secrets): update encrypted state 2026-06-23 09:38:12 +00:00
Viktor Barzin
72464e7880 state(external-secrets): update encrypted state 2026-06-23 09:37:11 +00:00
Viktor Barzin
e88ea50304 docs(multi-tenancy): document install_skills (vendored per-user agent skills)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:30:27 +00:00
Viktor Barzin
1c8dc6bd6c t3-provision-users: install_skills heals stale symlinks + owns ~/.agents
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Follow-up to the vendored-skills change, from verifying the emo rollout:

- The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry
  as "installed", so a manual cross-user symlink emo already had (grill-me ->
  /home/wizard/.claude/skills/grill-me) was skipped — leaving the requested
  skill depending on the admin's home instead of emo's own copy. The guard now
  keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points
  the ~/.claude/skills symlink at it, healing a stale/cross-user link while
  still never clobbering a real dir.
- install -d left the intermediate ~/.agents owned by root; now owned by the user.
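
The guard described in the first bullet can be sketched roughly as below. This
is not the code in t3-provision-users; `link_skill` and its arguments are
hypothetical, and ownership handling (chown to the user) is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# (Re)point ~/.claude/skills/<name> at the user's own copy under ~/.agents.
# Keys on the own copy being a real directory, replaces any existing symlink,
# and never touches a real directory or file at the link path.
link_skill() {
  local home="$1" name="$2"
  local own="$home/.agents/skills/$name"
  local link="$home/.claude/skills/$name"

  # No own copy (or it is itself a symlink): nothing safe to point at.
  if [[ ! -d "$own" || -L "$own" ]]; then
    return 0
  fi
  # A real dir or file already sits at the link path: leave it alone.
  if [[ -e "$link" && ! -L "$link" ]]; then
    return 0
  fi
  mkdir -p "$home/.claude/skills"
  # -n replaces an existing symlink instead of descending into its target.
  ln -sfn "../../.agents/skills/$name" "$link"
}
```

Running it against a home whose `grill-me` link points into another user's home
rewrites the link to the relative `../../.agents/skills/grill-me` target.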

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:27:31 +00:00
Viktor Barzin
987fdd16db t3-provision-users: vendor agent skills + per-user install_skills (emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make the admin's Claude Code agent skills available to the `emo` devvm user.
Viktor asked to install Matt Pocock's skills for emo, starting with grill-me
but covering the full set the admin already uses.

The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs
and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are
no longer published), so reproducing it via npx is impossible and would also
spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI
dependency to the hourly root reconcile. Instead vendor a point-in-time
snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them
per-user, mirroring install_memory: install_skills() copies each skill into
~/.agents/skills/<name> (owned by the user) and symlinks
~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive,
best-effort, scoped to the SKILL_USERS allowlist (emo).

find-skills is from vercel-labs/skills (not Matt Pocock) but included since it
is part of the admin's current set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:23:37 +00:00
Viktor Barzin
59f2beda21 chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:15:36 +00:00
Viktor Barzin
df1ec1879d chrome-service: build a real-Chrome browser image (H.264/AAC codecs)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-browser / build (push) Has been cancelled
Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA
build workflow. The bundled Chromium ships proprietary codecs compiled out, so
H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs
(libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds
the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips
main.tf's launch to it once the image exists + is public.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:01:17 +00:00
Viktor Barzin
7061b1dfc6 state(external-secrets): update encrypted state 2026-06-22 20:55:27 +00:00
Viktor Barzin
e2f328ff4a state(external-secrets): update encrypted state 2026-06-22 20:45:24 +00:00
Viktor Barzin
a735be9ba4 state(external-secrets): update encrypted state 2026-06-22 20:45:08 +00:00
Viktor Barzin
c670cb7118 eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1
Some checks failed
ci/woodpecker/push/default Pipeline failed
The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.

Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.

CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 19:13:04 +00:00
Viktor Barzin
98cd535b97 authentik: lock chrome.viktorbarzin.me noVNC to Viktor only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chrome-service noVNC exposes Viktor's live logged-in browser sessions
(Instagram etc. — he'll sign in there for homelab browser to reuse). It was
auth="required" = any authenticated user, and "Home Server Admins" includes emo
(emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a
host-specific case to the domain-wide forward-auth restriction allowing only
Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else,
incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser
(read-only RBAC blocks port-forward); this closes the human noVNC path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:09:27 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).
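
For illustration, a slightly more defensive variant of the cap: lower the limit
only when the current one is above the ceiling. `cap_nofile` is a hypothetical
helper; the commit itself uses a plain `ulimit -n 65536`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Lower RLIMIT_NOFILE to at most $1 for this shell and its children.
# Without -S/-H, bash's `ulimit -n N` sets both the soft and hard limit.
cap_nofile() {
  local cap="$1" cur
  cur="$(ulimit -n)"
  if [[ "$cur" == "unlimited" || "$cur" -gt "$cap" ]]; then
    ulimit -n "$cap"
  fi
}

# Usage in an entrypoint (sketch; the x11vnc arguments are omitted on purpose):
#   cap_nofile 65536
#   exec x11vnc ...
```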

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
20ca5ee624 tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns
a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY
Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env
(covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the
real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/
caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional
optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were
already real (prior commits); this was the last fake link in the chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:30:47 +00:00
Viktor Barzin
f46b69f372 tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans
CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs
ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real
extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current
llama-swap image) with the ADR-0033 claude-agent-service fallback, plus
REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in
app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus:
this also un-fakes the in-app booking *share* import, which used the same fake LLM.
MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 16:50:04 +00:00
Viktor Barzin
59f2070e56 tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image
("common_init_from_params: failed to create context"; llama-swap returns 502),
which broke ALL tripit mail ingest text extraction — booking emails AND forwarded
reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid
header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather
than pin the SHARED llama-swap image (cross-user blast radius), repoint the
ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads
fine and extracts flight numbers + places reliably. Restores the auto-path
(reels resolve via the Nominatim geocoder; bookings parse again). The broken
qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:52:09 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
f96cde35bd tripit: enable Nominatim POI geocoding for reel→Wishlist ingest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a
country + city, but the reel route was wired to the global geocoder, which here is
GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for
a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved
Place was created. The app fix (tripit 3c62d596) gave the reel route its own
geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans
CronJob (the only one running the reel route) so forwarded reels resolve to real
venue coords + city + country. Isolated from the global geocoder, which stays
openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:59:37 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
de163aa6af workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.
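
To make the trigger concrete, a small sketch of the signal earlyoom watches:
MemAvailable as a percentage of MemTotal, read from /proc/meminfo. This is not
earlyoom's own code; the helper names are hypothetical, and whether earlyoom
compares with "<=" or "<" at the threshold is an assumption here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print MemAvailable as an integer percentage of MemTotal.
# Takes an optional meminfo-format file so it can be exercised with a fixture.
mem_available_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END { if (total > 0) printf "%d\n", (avail * 100) / total }' "$meminfo"
}

# Map a percentage onto the thresholds from this commit (5% term, 3% kill).
oom_action() {
  local pct="$1" term_at="${2:-5}" kill_at="${3:-3}"
  if (( pct <= kill_at )); then
    echo SIGKILL
  elif (( pct <= term_at )); then
    echo SIGTERM
  else
    echo none
  fi
}
```

Because the input is free RAM rather than reclaim activity, the check still
fires with MemorySwapMax=0, which is the property oomd lacked on this box.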

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
Viktor Barzin
3a59f4a8bf workstation: per-user memory caps + systemd-oomd backstop on devvm
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
Viktor Barzin
2169e0de5f workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the
retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive
pass. install_memory() rm -rf's that dir, so those entries are dangling — and a
missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt
and froze emo's sessions (2026-06-22). The plugin shares basenames with the
homelab hooks, so the old additive-only logic saw the dead plugin path as
"already present" and skipped installing the real ~/.claude/hooks/ copy; pruning
first fixes that. Verified against emo's exact original config: yields the
correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun.

auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct
HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor;
never local files). No-ops silently when no API key is in env (e.g. ancamilea).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:24:42 +00:00
Viktor Barzin
aeed461591 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 1595bddfc2.
2026-06-22 08:31:17 +00:00
Viktor Barzin
1595bddfc2 feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
92f35550f2 workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip]
Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users
except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e,
Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the
pre-memory binary and never wired the new memory hooks for emo/anca.

Close the gap the same way the script already treats managed-settings.json and
start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the
authoring surface. At the top of the run, if the repo copy differs from the deployed
binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop),
bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when
unchanged). Verified across all four paths in isolation.
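
The four guards can be sketched as one decision function. This is an
illustration, not the code in t3-provision-users: `self_deploy_action` and the
`SELF_DEPLOY_REEXEC` flag name are hypothetical, and the exact DRY_RUN
semantics are assumed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide what a self-deploy step should do for a repo copy ($1) and the
# deployed binary ($2). Prints one of:
#   skip-reexec | skip-missing | skip-unchanged | reject-syntax | dry-run | deploy
self_deploy_action() {
  local src="$1" dst="$2"
  if [[ "${SELF_DEPLOY_REEXEC:-0}" == "1" ]]; then
    echo skip-reexec          # already the re-exec'd copy: no loop
  elif [[ ! -f "$src" ]]; then
    echo skip-missing         # no repo copy to deploy from
  elif [[ -f "$dst" ]] && cmp -s "$src" "$dst"; then
    echo skip-unchanged       # identical bytes: no churn
  elif ! bash -n "$src" 2>/dev/null; then
    echo reject-syntax        # never deploy a script that fails to parse
  elif [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo dry-run              # report only, no mutation
  else
    echo deploy
  fi
}

# Caller side (sketch only, not executed here):
#   if [[ "$(self_deploy_action "$repo_copy" "$0")" == deploy ]]; then
#     install -m 0755 "$repo_copy" "$0"
#     SELF_DEPLOY_REEXEC=1 exec "$0" "$@"
#   fi
```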

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:02:31 +00:00
Viktor Barzin
0b11a28d66 workstation: stop install_memory aborting the reconcile under set -e
install_memory (added in 44562535) ended with `[[ -d <plugin-dir> ]] && rm && log`
and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir
or settings file is absent — the normal case for users who never had the
claude-memory plugin — those return non-zero, and under `set -euo pipefail` the
function returns non-zero and kills the whole hourly reconcile after the FIRST
user, before the rest are processed.

It never fired before because the rollout was committed but the deployed
/usr/local/bin/t3-provision-users was never updated, so install_memory had never
run. On first real run it aborted right after ancamilea, so emo (and wizard)
never got their memory hooks wired — the reason emo's sessions lost memory. Wrap
the cleanup in an if-block, guard the chmod, and end the function with return 0.
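
A minimal reproduction of the pitfall and the fix, with hypothetical function
names standing in for the tail of install_memory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Broken shape: when the directory is absent, [[ ]] fails, the && list
# evaluates to non-zero, and as the last command that becomes the function's
# exit status. A bare call to this function then trips `set -e` in the caller.
cleanup_broken() {
  local dir="$1"
  [[ -d "$dir" ]] && rm -rf "$dir"
}

# Fixed shape: an if-block whose false branch is not an error, plus an
# explicit return 0 so the function's status never depends on the test.
cleanup_fixed() {
  local dir="$1"
  if [[ -d "$dir" ]]; then
    rm -rf "$dir"
  fi
  return 0
}
```

Calling `cleanup_broken` on a missing path as a plain statement ends the script
under `set -e`; `cleanup_fixed` returns 0 in both the present and absent case.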

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 07:59:47 +00:00
Viktor Barzin
464e0bfb97 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 7513468a2d.
2026-06-22 06:46:56 +00:00
Viktor Barzin
72dcb125d5 Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources"
This reverts commit a02782d11f.
2026-06-22 06:46:56 +00:00
Viktor Barzin
a02782d11f fix(monitoring): tempo OOMKilled — move resources under tempo.resources
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The
single-binary grafana/tempo chart (v1.24.4) takes container resources at
tempo.resources, not a top-level resources: — so my block was ignored and the pod
fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly
(req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:44:31 +00:00
Viktor Barzin
7513468a2d feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
  Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector

Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:31:11 +00:00
Viktor Barzin
1a32c07ffe docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:44:50 +00:00
Viktor Barzin
ac27e41fde Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 20:41:35 +00:00
Viktor Barzin
296deda3b4 eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic
First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md),
clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on
k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative
tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2.
Each hop applied + verified: controller healthy, all 108 live ExternalSecrets
stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest —
missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no
rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now
["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before
0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:41:30 +00:00
Viktor Barzin
0cd59d2c55 state(external-secrets): update encrypted state 2026-06-21 20:41:10 +00:00
Viktor Barzin
b8612e788d state(external-secrets): update encrypted state 2026-06-21 20:39:45 +00:00
Viktor Barzin
877e5c73b2 state(external-secrets): update encrypted state 2026-06-21 20:38:34 +00:00
Viktor Barzin
de2250f667 immich-frame: set photo date format to dd/MM/yyyy
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:36:43 +00:00
Viktor Barzin
8e6eff03dd state(external-secrets): update encrypted state 2026-06-21 20:36:37 +00:00
Viktor Barzin
0bae025b9b wealth dashboard: spend-down figures in today's money (inflation-adjusted)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked whether the spend-down numbers were inflation-adjusted —
they were not (all nominal). He chose to switch the card to today's
money, so every row now shows constant purchasing power for life.

Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1
(3% inflation), spending a constant inflation-adjusted amount (the
actual pounds withdrawn rise with inflation) until net worth hits £0
at age 100:
  • No growth (0%)  → £12/day, £370/mo,   £4,446/yr   (negative real: loses to inflation)
  • Inflation (3%)  → £43/day, £1,315/mo, £15,776/yr  (0% real: holds value)
  • Market (7%)     → £130/day, £3,942/mo, £47,300/yr (~3.9% real)

Title now flags "(today's £)". Same panel/layout; only the SQL, title,
and tooltip changed.
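
The nominal-to-real conversion quoted above can be sketched as follows. This is a minimal illustration, not the dashboard's SQL; the function name `real_rate` is hypothetical, and the 3% inflation default is the value hard-coded in this commit.

```python
def real_rate(nominal_growth: float, inflation: float = 0.03) -> float:
    """Convert a nominal growth rate g into a real rate: (1 + g) / (1 + i) - 1.

    Hypothetical helper for illustration; the dashboard does this in SQL.
    """
    return (1.0 + nominal_growth) / (1.0 + inflation) - 1.0


if __name__ == "__main__":
    # The three scenarios from the card: no growth, inflation-matching, market.
    for label, g in [("No growth", 0.00), ("Inflation", 0.03), ("Market", 0.07)]:
        print(f"{label}: nominal {g:.0%} -> real {real_rate(g):.2%}")
```

A nominal rate equal to inflation gives a real rate of zero, which is why that row matches the plain net-worth-divided-by-years figure.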

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:13:59 +00:00
Viktor Barzin
3fb6284e2b immich-frame: use 24-hour clock (ClockFormat HH:mm)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. ImmichFrame defaults ClockFormat
to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:10:51 +00:00
Viktor Barzin
e89de86af0 wealth dashboard: spend-down table → three growth scenarios
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.

The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
  • No growth (0%)      → £43/day,  £1,315/mo, £15,776/yr  (= NW ÷ years)
  • Inflation (3%)      → £106/day, £3,233/mo, £38,792/yr  (NEW)
  • Avg market (7%)     → £220/day, £6,703/mo, £80,435/yr

This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:06:29 +00:00
Viktor Barzin
85d42f2c13 wealth dashboard: merge spend-down tiles into one compact table
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.

Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.

Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:55:57 +00:00
Viktor Barzin
63add2a126 feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Now that the native-auth rollout is complete:

(1) AUTH_MODE hybrid->normal: the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment.
(2) SMTP_FROM plans@->trips@: the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250).
(3) TRUST_FORWARDED_FOR=true: so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP.

Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:50:20 +00:00
Viktor Barzin
166a2bcab4 wealth dashboard: add "spend-down to £0 at 100" stat tiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.

Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):

  • FLOOR  — net worth ÷ time remaining to age 100. Treats the money as
    cash (no growth, no inflation): a conservative lower bound.
    ≈ £43/day, £1.3k/mo, £15.8k/yr.
  • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
    spend that drains the balance to £0 at 100 while it keeps earning
    4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.

Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.
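
The two equations above can be sketched in a few lines. This is an illustrative sketch only: the tiles compute this in SQL, the function names are hypothetical, the net-worth figure is a placeholder rather than a real balance, and the year length of 365.25 days is an assumption.

```python
from datetime import date

DOB = date(1998, 10, 4)            # from the commit message
HUNDREDTH = date(2098, 10, 4)      # 100th birthday, the spend-down horizon


def years_to_100(today: date) -> float:
    """Years remaining until the horizon (assumes a 365.25-day year)."""
    return (HUNDREDTH - today).days / 365.25


def floor_spend(net_worth: float, years: float) -> float:
    """FLOOR: treat the money as cash and divide by the time remaining."""
    return net_worth / years


def annuity_payment(net_worth: float, rate: float, years: float) -> float:
    """Die-with-zero annuity: PMT = NW * r / (1 - (1 + r) ** -n)."""
    if rate == 0:
        return net_worth / years   # the formula's limit as r -> 0
    return net_worth * rate / (1.0 - (1.0 + rate) ** -years)


if __name__ == "__main__":
    nw = 1_000_000.0               # hypothetical placeholder, not the live value
    n = years_to_100(date(2026, 6, 21))
    print("floor per year:", round(floor_spend(nw, n)))
    print("4% real per year:", round(annuity_payment(nw, 0.04, n)))
```

Dividing either yearly figure by 12 or by 365.25 gives the monthly and daily cadences; the exact divisors the tiles use are not stated in the commit.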

Layout: tiles inserted at y=9; all sections below shifted down 4 rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:48:30 +00:00
c830f9f462 Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:45:39 +00:00
Viktor Barzin
9aa2438e75 workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring)
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:45:36 +00:00
f318773cb0 Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:42:51 +00:00
Viktor Barzin
44562535a2 workstation: provision homelab-memory hooks for all users (retire claude-memory MCP)
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:42:42 +00:00
Viktor Barzin
79749d7324 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:27:42 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
3f81b20fa6 Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:24:10 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
51838a4ec7 kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block
All checks were successful
ci/woodpecker/push/default Pipeline was successful
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.

Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).

Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:21:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.
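
The effect of the regex scoping can be checked with a small sketch. The job names below are hypothetical examples, and Python's `fullmatch` is used because Prometheus anchors label regexes at both ends.

```python
import re

# Old and new selectors from the alert rule, as quoted in this commit.
OLD = re.compile(r"k8s-upgrade-.*")
NEW = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")


def selects(pattern: re.Pattern, job_name: str) -> bool:
    """True if the fully anchored pattern matches the whole job name."""
    return pattern.fullmatch(job_name) is not None


if __name__ == "__main__":
    # Hypothetical Job names, for illustration only.
    for name in ("k8s-upgrade-preflight-123", "k8s-upgrade-nightly-report-123"):
        print(name, "old:", selects(OLD, name), "new:", selects(NEW, name))
```

Under the old pattern the report job would have been selected by the chain-failed alert; under the new one only the four chain phases are.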

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
b0ccaf1c65 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
f84e6818b2 state(vault): update encrypted state 2026-06-21 15:07:01 +00:00
Viktor Barzin
cc4bb8ffe8 wealth dashboard: show price freshness for all 3 holdings, not just worst
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:49:33 +00:00
6c2c56ab3b Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:40:41 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
4df741f6de Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:36:03 +00:00
Viktor Barzin
c23b03864e traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:35:13 +00:00
df86075c3d Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:33:23 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
6dc3ce139f wealth dashboard: expand all rows by default + inline the freshness stat
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two follow-ups Viktor asked for on the Price freshness panel:

- Expand every section by default. Grafana's collapsed rows hide their
  child panels; just flipping collapsed=false leaves a non-canonical shape
  (confirmed via the Grafana API that it keeps the panels nested rather
  than hoisting them), so each row is now collapsed=false + panels=[] with
  its children hoisted to top-level -- the exact form Grafana writes when
  you expand-and-save. Row headers revert to their original y (the child
  y-coords were already expanded-layout coordinates).

- Stop the freshness stat from taking its own line. It's now the 6th tile
  in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
  at y=5; the collapsed-row y-shift from the previous commit is undone.

No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:29:25 +00:00
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
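
A rough Python sketch of the gate logic follows. The real implementation is a bash script; the table name `threads`, the `updated_at` column (epoch seconds), and the function name are assumptions made for illustration. Only `active_turn_id` and the 15-minute default come from the commit message.

```python
import sqlite3
import time

QUIET_BUFFER_SECONDS = 15 * 60  # default quiet buffer from the commit message


def is_safe_to_restart(db_path, now=None, quiet=QUIET_BUFFER_SECONDS):
    """Return True only when no turn is in flight and activity is old enough.

    Assumed schema: threads(active_turn_id, updated_at as epoch seconds).
    Any query or parse error returns False (fail-closed).
    """
    now = time.time() if now is None else now
    try:
        # Open read-only so the gate can never modify the state file.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            in_flight, last_activity = conn.execute(
                "SELECT COUNT(active_turn_id), MAX(updated_at) FROM threads"
            ).fetchone()
        finally:
            conn.close()
        if in_flight != 0:
            return False           # at least one thread has an active turn
        if last_activity is None:
            return True            # assumption: no threads means nothing to interrupt
        return (now - float(last_activity)) > quiet
    except (sqlite3.Error, TypeError, ValueError):
        return False               # fail-closed on any error
```

Passing `now` explicitly keeps the boundary cases testable against fixture databases, in the same spirit as the pure-bash tests described above.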

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
ddbdbca7e9 wealth dashboard: add "Price freshness" stat for stalest held quote
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.

Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.
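
The colour bands described above amount to a simple threshold mapping. This sketch assumes the age is a whole number of days and uses a hypothetical function name; in the dashboard the mapping lives in the Grafana panel thresholds, not in code.

```python
def freshness_colour(age_days):
    """Map a held position's quote age in whole days to a tile colour.

    None models a held symbol with no quote at all, which the panel
    treats as maximally stale.
    """
    if age_days is None:
        return "red"
    if age_days <= 4:
        return "green"   # tolerates weekends and bank holidays
    if age_days <= 9:
        return "amber"
    return "red"


if __name__ == "__main__":
    # META's frozen quote was 65 days old when this commit was written.
    print("65 days:", freshness_colour(65))
```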

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:23:45 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
a091689603 Merge pull request 'traefik/crowdsec: remove dead plugin middleware reference (PR1/2)' (#8) from wizard/cs-deplugin-refs into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-21 00:17:51 +00:00
Viktor Barzin
71d0af084e traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2)
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:17:40 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
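
The argument injection can be pictured roughly as below. This is a hypothetical Python rendering, not the contents of scripts/tg: it assumes the verb is the first argument, and leaving an explicitly supplied `-lock-timeout` untouched is an added assumption the commit does not state.

```python
import os

STATE_LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}
DEFAULT_LOCK_TIMEOUT = "5m"


def inject_lock_timeout(args, env=None):
    """Return a copy of args with -lock-timeout added for state-locking verbs.

    Hypothetical sketch: the timeout comes from TG_LOCK_TIMEOUT, defaulting to 5m.
    """
    env = os.environ if env is None else env
    args = list(args)
    if not args or args[0] not in STATE_LOCKING_VERBS:
        return args                      # e.g. "output" or "validate": unchanged
    if any(a.startswith("-lock-timeout") for a in args):
        return args                      # assumption: caller's explicit value wins
    timeout = env.get("TG_LOCK_TIMEOUT", DEFAULT_LOCK_TIMEOUT)
    return [args[0], f"-lock-timeout={timeout}", *args[1:]]


if __name__ == "__main__":
    print(inject_lock_timeout(["apply", "-auto-approve"], env={}))
```

Keeping the injection in one pure function is what makes it easy to pin with a hermetic test of the kind the commit adds.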

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
84a18a5529 traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2)
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.

This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:12 +00:00
9774ae3d19 Merge pull request 'crowdsec: firewall-bouncer cluster-wide (remove node2 pin)' (#7) from wizard/cs-fw-allnodes into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:08:15 +00:00
292 changed files with 27307 additions and 9483 deletions


@ -16,6 +16,7 @@
**ALL infrastructure changes MUST go through Terraform/Terragrunt.** Never use `kubectl apply/edit/patch/set`, `helm install/upgrade`, or any manual cluster mutation as the final state.
- **No exceptions for "quick fixes"** — even one-line changes must be in `.tf` files and applied via `scripts/tg apply`
- **Apply locally OR let CI do it — but ALWAYS commit.** You don't have to wait for CI: with apply access you MAY run the apply yourself (`scripts/tg apply <stack>` / `homelab tf apply <stack>`), but **from the main checkout, never a worktree** (git-crypt'd `*.tfvars` come through as ciphertext under the worktree filter-bypass, so a worktree apply reads garbage). **Every applied change MUST be committed and pushed to `master` the same session** — the repo is the source of truth, so applied-but-uncommitted HCL is drift that the next CI apply / daily drift-detection will try to revert. Order either way: apply locally then commit + push (CI's changed-stack apply then no-ops), or commit + push and let CI apply. Never apply an uncommitted edit; never leave a committed change unapplied.
- **kubectl is for read-only operations and temporary debugging only** (get, describe, logs, exec, port-forward)
- **If a resource isn't in Terraform yet**, evaluate whether it can be added before making manual changes. If manual change is unavoidable (e.g., emergency), document it immediately and create the Terraform resource in the same session
- **kubectl scale/patch during migrations is acceptable** as a transient step, but the final state must be in Terraform and applied via `scripts/tg apply`
@ -24,8 +25,8 @@
Violations cause state drift, which causes future applies to break or silently revert changes.
## Instructions
- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
- **"remember X"**: store to the remote claude-memory store via the **`homelab memory` CLI**: `homelab memory store "content" --category facts --tags "tag1,tag2"` (also `recall "query"` / `update <id>` / `list` / `delete <id>`). For shared knowledge, also update the relevant CLAUDE.md / `AGENTS.md`. (Supersedes the old `memory-tool` CLI **and** the claude-memory MCP — both retired 2026-06-21; the homelab CLI hits the same remote HTTP API. Recall also runs automatically each turn via a UserPromptSubmit hook.)
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
- **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
@ -47,7 +48,7 @@ Violations cause state drift, which causes future applies to break or silently r
## Terraform State — Two-Tier Backend
- **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
- **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
- **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
- **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
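The `-lock-timeout` behaviour described in the Apply and Tier 1 bullets can be sketched roughly as below. This is an illustration only, not the actual `scripts/tg` source: the helper name `tg_args` is hypothetical, and the only details taken from this file are the four state-locking verbs, the `5m` default, and the `TG_LOCK_TIMEOUT` override.

```python
import os

# Verbs that take the state lock, as listed in the Apply bullet.
STATE_LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}


def tg_args(verb, extra_args=(), env=None):
    """Return the argument list a wrapper might hand to terragrunt.

    Hypothetical helper: it only illustrates the documented behaviour
    (append -lock-timeout on state-locking verbs, default 5m).
    """
    env = os.environ if env is None else env
    args = [verb, *extra_args]
    if verb in STATE_LOCKING_VERBS:
        # A contended lock then waits up to this long instead of failing at once.
        timeout = env.get("TG_LOCK_TIMEOUT", "5m")
        args.append(f"-lock-timeout={timeout}")
    return args
```

Under these assumptions, `tg_args("plan")` carries the default timeout, while a verb that never locks state (for example `output`) is passed through untouched.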
@ -63,7 +64,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
- **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
- **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
- **ESO (External Secrets Operator)**: `stacks/external-secrets/`, 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
- **ESO (External Secrets Operator)**: `stacks/external-secrets/`, chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`. (2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.)
- **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
- **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: <secret>`) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
@ -130,7 +131,7 @@ ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest,
broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder,
x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website,
apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search, council-complaints) now also land on ghcr.
audiobook-search) now also land on ghcr.
- **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service,
claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway,
chrome-service-novnc, android-emulator.
@ -202,8 +203,8 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
- **PDBs**: minAvailable=2 on Traefik and Authentik.
- **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations), authentik 100/1000 on `/`+`/static` (login SPA cold-loads ~70 flow chunks from `/static`; default burst 429'd them → blank login screen).
- **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
- **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
- **HTTP/3 (QUIC)**: Enabled on Traefik. Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge.
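The rate-limit numbers in the bullets above can be checked with a small token-bucket model. This is a sketch under stated assumptions: it treats the `average`/`burst` pair as a plain token bucket and assumes all ~70 cold-load requests arrive at the same instant; the class and function names are illustrative and are not Traefik code.

```python
class TokenBucket:
    """Minimal token bucket: starts full, refills at `average` tokens/second."""

    def __init__(self, average, burst):
        self.average = average
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the previous request, capped at burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.average)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # this request would be answered with 429


def rejected(bucket, arrivals):
    """Count requests (given as arrival times in seconds) that get rejected."""
    return sum(1 for t in arrivals if not bucket.allow(t))
```

With 70 simultaneous requests, the default 10/s burst 50 bucket rejects 20 of them, while the 100/1000 bucket rejects none, which matches the blank-login symptom and its fix as described.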
@ -216,9 +217,9 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
|---------|--------------------------|
| Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
| Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. 
**Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain).
**Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
| Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control; `/`+`/static` use a dedicated `authentik-rate-limit` (100/1000) so the cold-load chunk burst isn't 429'd into a blank screen. **Reliability (2026-06-28): the chart key is `deploymentStrategy`, NOT `strategy`** — the old `strategy:` key was inert, so live ran the chart default 25%/25% and dropped a server pod out of rotation on every roll; now `maxSurge:1/maxUnavailable:0`. Readiness `failureThreshold:8` (~80s, was 30s): the DB-coupled `/-/health/ready/` returns 503 on a PG/pgbouncer blip, and with too-tight tolerance all 3 server pods left the Service at once → Traefik 502/504 (the episodic blank-screen + 30s-hang). gunicorn `max_requests=10000`/jitter=1000 decorrelates worker recycles from DB blips. Redis is GONE since 2026.2 (sessions+cache+channels on PostgreSQL, no external-cache option) — a short PG transient is now survived, but a TOTAL CNPG outage still takes authentik down. 
**Custom overlay image (2026-06-28):** server+worker run `ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3` (built by `.github/workflows/build-authentik.yml` from `stacks/authentik/Dockerfile` + `patch-compat-sfe.py`) with TWO guarded patches: **#1 SLOW-1a** — narrows the identification-stage `select_subclasses()` query (~1.4s→~14ms; bare upstream call LEFT-JOINs every source subtype); **#2 old-browser blank login** — `patch-compat-sfe.py` (a) extends `compat_needs_sfe()` to serve authentik's built-in no-JS **SFE** login to old Safari/WebKit AND **any iOS browser** (Chrome/CriOS, Firefox/FxiOS — all share the system WebKit) on iOS≤16.3, and (b) **injects static social-login `<a>` links into the SFE shell** (`flow-sfe.html`) since the SFE can't render Identification-stage sources — required for password-less accounts (e.g. emo = Google-only). The modern flow SPA is ES2022 (needs Safari 16.4+) and renders BLANK on older WebKit; every iOS browser shares that WebKit, so it's not browser-choice (emo's iPadOS-15.8 iPad hit this). SFE = the *real* authentik login (password + MFA + reputation, no auth downgrade) — chosen over a Traefik basic-auth fallback which would have put a spoofable-UA single password in front of `vbarzin→wizard` passwordless-root. Social link = plain redirect to `/source/oauth/login/<slug>/` (works on any browser); slugs (google/github/facebook) are static — re-verify on source changes. **Keel un-enrolled** for the ns → image pinned in `global.image` (repo+tag), **upgraded manually**: bump the Dockerfile `FROM` + the values tag (+ re-verify both patches) together, GHA rebuilds, then apply. |
| Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
| MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. |
| phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). |
@ -231,9 +232,10 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable.
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → `#security` Slack). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
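The two egress alert conditions that carry the most reasoning, `InternetEgressDown` and `EgressOnlyDivergence`, can be expressed as plain predicates. This is a sketch of the logic only, not the rules in `alerting_rules.yml`; the function and parameter names are hypothetical, and probe results are modelled as 1 (success) or 0 (failure).

```python
def internet_egress_down(probe_success):
    """True only when every provider probe failed.

    `probe_success` maps a target (e.g. "9.9.9.9") to 1 or 0. Taking the
    max means one provider blipping does not fire the alert.
    """
    return max(probe_success.values()) == 0


def egress_only_divergence(cloudflare_leg_up, internal_leg_up):
    """True for the incident signature: external leg down while internal is up."""
    return internal_leg_up and not cloudflare_leg_up
```

For example, `{"9.9.9.9": 0, "1.1.1.1": 1}` does not count as egress down, whereas both at 0 does; an outage that also takes the internal leg down is not reported as a divergence.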
## Security Posture (Wave 1 — locked 2026-05-18)
@ -241,9 +243,10 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming.
- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
- **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts` (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out). Severity labels carried in the alert (critical/warning/info). No paging. The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`.
- **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it, so a `#security` override 404s; see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. **WHISKER-WEDGE GOTCHA** (2026-06-28): the operator's `whisker` NetworkPolicy allows DNS egress only to kube-dns *pods*, but whisker-backend resolves goldmane via the kube-dns *ClusterIP* — Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule, so when whisker-backend's gRPC stream breaks and it re-resolves, it wedges and the UI goes **empty** (the aggregator, a separate pod, is unaffected). FIX = additive egress NP `whisker-allow-dns-clusterip` (`stacks/calico`, allows whisker→10.96.0.10/32:53); the `whisker-watchdog` CronJob is a backstop. Manual heal `kubectl -n calico-system delete pod -l k8s-app=whisker`. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
- **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
## Storage & Backup Architecture

File diff suppressed because one or more lines are too long


@@ -11,8 +11,8 @@ description: |
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0
date: 2026-02-07
version: 2.1.0
date: 2026-06-24
---
# Home Assistant Control
@@ -395,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/` (requires `sudo` for file access)
- **Platform**: Raspberry Pi 4, HA OS
- **Access from the Sofia devvm**: london is **remote**, so `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@@ -424,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike
- `sensor.bike_state_of_charge`: Battery %
- `sensor.bike_total_distance`: Total km
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@@ -446,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components
- **cowboy**: Cowboy e-bike integration (HACS)
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
### Custom Components (HACS integrations)
- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@@ -466,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup
```bash
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### Platform (HAOS — ignore any legacy `docker run` snippet)
ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
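
The restart call described above can be sketched in Go as follows. This is a minimal illustration, not part of the skill: the helper name `newRestartRequest` is hypothetical, the token is assumed to come from `homelab ha token --instance london`, and the example only builds and prints the request rather than sending it.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newRestartRequest (hypothetical helper) builds the POST that asks Home
// Assistant to restart via its REST service-call endpoint.
// baseURL and token are supplied by the caller.
func newRestartRequest(baseURL, token string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/api/services/homeassistant/restart"
	// An empty JSON object is sent as the service-call payload.
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// "<token>" is a placeholder; a real token would be read at run time.
	req, err := newRestartRequest("https://ha-london.viktorbarzin.me", "<token>")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// Printed rather than sent: http.DefaultClient.Do(req) would restart HA.
	fmt.Println(req.Method, req.URL.String())
}
```

After actually sending the request, polling `sensor.uptime` for a reset is the "back up" marker the paragraph above suggests.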
### SSH Access
```bash

.github/workflows/build-authentik.yml vendored Normal file

@@ -0,0 +1,39 @@
name: Build Custom Authentik Image
# ADR-0002: infra-owned image built off-infra on GHA → ghcr.
# Thin SLOW-1a overlay over the official authentik server (narrows the login
# identification stage's select_subclasses() to the login-capable source subtypes;
# see stacks/authentik/Dockerfile). Rebuild only when the Dockerfile changes — on
# every authentik bump, edit the FROM tag + the patchN suffix here + the image tag
# in modules/authentik/values.yaml together.
on:
push:
branches: [master]
paths:
- 'stacks/authentik/Dockerfile'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/authentik
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch3
ghcr.io/viktorbarzin/authentik-server:latest


@@ -0,0 +1,39 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
push:
branches: [master]
paths:
- 'stacks/chrome-service/files/chrome/**'
workflow_dispatch: {}
permissions:
contents: read
packages: write
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v6
with:
context: stacks/chrome-service/files/chrome
platforms: linux/amd64
provenance: false
push: true
tags: |
ghcr.io/viktorbarzin/chrome-service-browser:latest
ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}


@@ -65,6 +65,21 @@ steps:
# don't need explicit token propagation.
VAULT_ADDR: http://vault-active.vault.svc.cluster.local:8200
commands:
# ── Forge guard: apply ONLY on the canonical Forgejo forge ──
# infra is registered in Woodpecker on BOTH the Forgejo canonical repo and
# the legacy GitHub mirror, and BOTH fire this push pipeline. Without this
# guard both run `terragrunt apply` on every push and race each other for
# the per-stack PG state lock — the dominant cause of the "Error acquiring
# the state lock" failures + push-supersede "killed" runs. The GitHub-mirror
# registration keeps running the CRONS (drift-detection, renew-tls, …) — only
# its duplicate push-apply no-ops here. Fail-open: an unknown forge (neither
# env var set) still applies, preserving prior behaviour.
- |
if echo "${CI_REPO_URL:-}${CI_FORGE_URL:-}" | grep -qi 'github\.com'; then
echo "[forge-guard] GitHub-mirror push — apply runs only on the Forgejo canonical repo (avoids double-apply + state-lock races). Skipping."
exit 0
fi
# ── Skip CI commits ──
- |
if echo "$CI_COMMIT_MESSAGE" | grep -q '\[CI SKIP\]\|\[ci skip\]'; then
@@ -213,23 +228,40 @@ steps:
if [ -s .platform_apply ]; then
echo "=== Applying platform stacks (serial, locked) ==="
while read -r stack; do
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role
# lacks Vault-admin perms (sys/mounts + sys/policies/acl), so a CI
# apply always 403s and fails the pipeline. Kept in PLATFORM_STACKS
# (so the app-stack detector still excludes it) but skipped here.
# (2026-06-27 — see docs/architecture/ci-cd.md)
if [ "$stack" = "vault" ]; then echo "[vault] SKIPPED (Tier-0, human-applied via OIDC)"; continue; fi
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
# Lock contention → SKIP, not fail. Match BOTH the Tier-0 Vault lock
# ("is locked by", from scripts/tg) AND the Tier-1 PG-backend lock
# ("Error acquiring the state lock" / "already locked"). The PG case
# was previously counted as a failure — the #1 source of false reds.
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient: provider-registry download timeout / Vault 5xx → bounded
# retry. Deliberately NOT helm atomic-timeouts or config errors
# (missing arg, invalid index) — those must fail fast, retry can't fix
# them and can worsen a stuck helm release.
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_PLATFORM_STACKS="$FAILED_PLATFORM_STACKS $stack"; break
done
done < .platform_apply
fi
# Deferred until after app stacks so both lists get a chance to run.
@@ -242,22 +274,27 @@ steps:
echo "=== Applying app stacks (serial, locked) ==="
while read -r stack; do
echo "[$stack] Starting apply..."
ATTEMPT=0
while :; do
ATTEMPT=$((ATTEMPT + 1))
set +e
OUTPUT=$(cd "stacks/$stack" && ../../scripts/tg apply --non-interactive 2>&1)
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
if echo "$OUTPUT" | grep -q "is locked by"; then
echo "[$stack] SKIPPED (locked by another session)"
else
echo "$OUTPUT" | tail -50
echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"
if [ $EXIT -eq 0 ]; then
echo "$OUTPUT" | tail -3; echo "[$stack] OK"; break
fi
else
echo "$OUTPUT" | tail -3
echo "[$stack] OK"
# Lock contention → SKIP, not fail (Tier-0 Vault + Tier-1 PG; see platform loop).
if echo "$OUTPUT" | grep -qE 'is locked by|Error acquiring the state lock|already locked'; then
echo "[$stack] SKIPPED (locked by another session/run)"; break
fi
# Transient provider-download / Vault 5xx → bounded retry (see platform loop).
if [ $ATTEMPT -lt 3 ] && echo "$OUTPUT" | grep -qE 'Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]'; then
echo "[$stack] transient error (attempt $ATTEMPT/3) — retrying in 15s..."; sleep 15; continue
fi
echo "$OUTPUT" | tail -50; echo "[$stack] FAILED (exit $EXIT)"
FAILED_APP_STACKS="$FAILED_APP_STACKS $stack"; break
done
done < .app_apply
fi
# Fail the step loudly so the pipeline `default` workflow state

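The outcome handling in the two apply loops above can be read as a small decision function. The Go sketch below is illustrative only: the pipeline itself is shell, and the names `classify`, `lockRe`, `transientRe` and `maxAttempts` are assumptions introduced here. The patterns and the three-attempt cap are copied from the hunks.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Lock contention: Tier-0 Vault lock or Tier-1 PG-backend state lock.
	lockRe = regexp.MustCompile(`is locked by|Error acquiring the state lock|already locked`)
	// Transient: provider-registry download timeout or a Vault 5xx.
	transientRe = regexp.MustCompile(`Failed to install provider|Client\.Timeout exceeded while awaiting headers|error reading from Vault.*Code: 5[0-9][0-9]`)
)

// maxAttempts mirrors the `[ $ATTEMPT -lt 3 ]` bound in the pipeline.
const maxAttempts = 3

// classify maps one apply attempt to "ok", "skipped", "retry" or "failed",
// checking the conditions in the same order as the shell loop.
func classify(exitCode, attempt int, output string) string {
	switch {
	case exitCode == 0:
		return "ok"
	case lockRe.MatchString(output):
		return "skipped" // lock contention is a skip, not a failure
	case attempt < maxAttempts && transientRe.MatchString(output):
		return "retry" // bounded retry for transient errors only
	default:
		return "failed" // config errors and helm timeouts fail fast
	}
}

func main() {
	fmt.Println(classify(1, 1, "Error: Error acquiring the state lock"))
	fmt.Println(classify(1, 1, "Failed to install provider"))
	fmt.Println(classify(1, 3, "Failed to install provider"))
}
```

The ordering matters: a lock message wins over a transient match, so a contended stack is never retried.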

@@ -85,6 +85,13 @@ steps:
stack=$(basename "$stack_dir")
[ -f "$stack_dir/terragrunt.hcl" ] || continue
# Tier-0 `vault` is human-applied via OIDC; the CI `ci` Vault role lacks
# Vault-admin perms (sys/mounts + sys/policies/acl), so `terragrunt plan`
# on it ERRORs (detailed-exitcode 1) and fails the whole nightly drift
# run. Skip it — drift on Tier-0 vault is caught at human apply time.
# (2026-06-27)
[ "$stack" = "vault" ] && continue
echo -n "[$stack] planning... "
OUTPUT=$(cd "$stack_dir" && terragrunt plan -detailed-exitcode -input=false 2>&1)
EXIT=$?


@@ -9,7 +9,7 @@
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@@ -273,8 +273,11 @@ To land a finished change from such a clone:
Slack audit feed; a no-op CI apply on a docs-only commit is harmless.
4. Leave the clone on clean `master` so auto-refresh keeps working.
5. Tell the user in plain language what happened. Stack changes are
auto-applied by CI — verify the live result with the user's read-only
kubectl before saying "it's live".
auto-applied by CI on push — or, with apply access, applied locally yourself
(`scripts/tg apply`, from the main checkout, not a worktree); either path is
fine, but the change must always be committed here, never applied
uncommitted. Verify the live result with the user's read-only kubectl before
saying "it's live".
If a push to `master` is rejected by branch protection (user not on the
whitelist — e.g. new users before Viktor grants it), fall back to a


@@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#alerts` digest (the `#security` channel was abandoned 2026-06-25). As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:


@@ -162,7 +162,7 @@ and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from k8s Secret `openclaw/openclaw-secrets` (`skill_secrets` JSON) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …` |
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
@@ -171,6 +171,100 @@ prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
### v0.9 verbs — edges (east-west "who-talks-to-whom" trail)
Read-only investigation helper over the `goldmane_edges` CNPG trail (ADR-0014):
filters render to a single safe `SELECT` (namespace values validated to the k8s
name charset) run via the dbaas primary pod — the same exec path as `k8s db`.
| Command | Tier | What it does |
| --- | --- | --- |
| `edges --ns <ns>` | read | edges touching `<ns>` (either direction) |
| `edges --src <ns>` / `--dst <ns>` | read | directional: `<ns>`'s egress / ingress peers |
| `edges --peers-of <ns>` | read | distinct peer namespaces of `<ns>` (both directions) |
| `edges --new-since <24h\|7d\|YYYY-MM-DD>` | read | edges first seen since a duration or date |
| `edges --denied` | read | only `action='deny'` edges (blocked / lateral-movement) |
| `edges --json` / `--limit N` | read | JSON array output / row cap (default 200) |
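
One way the "filters render to a single safe `SELECT`" idea could look is sketched below. This is an assumption-laden illustration, not the CLI's actual code: `validNamespace` and `edgesQuery` are hypothetical names, the regex is the standard Kubernetes DNS-1123 label pattern, the `ORDER BY` clause is an addition of this sketch, and the `edge` table and column names are taken from the trail description.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsRe is the DNS-1123 label pattern used for Kubernetes namespace names.
var nsRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validNamespace rejects anything outside the k8s name charset, which also
// excludes quotes, spaces and semicolons.
func validNamespace(ns string) bool {
	return len(ns) <= 63 && nsRe.MatchString(ns)
}

// edgesQuery renders the filters to one SELECT. Values are inlined only
// after validation; an empty src or dst means "no filter on that side".
func edgesQuery(src, dst string, deniedOnly bool, limit int) (string, error) {
	var where []string
	add := func(col, ns string) error {
		if ns == "" {
			return nil
		}
		if !validNamespace(ns) {
			return fmt.Errorf("invalid namespace %q", ns)
		}
		where = append(where, fmt.Sprintf("%s = '%s'", col, ns))
		return nil
	}
	if err := add("src_ns", src); err != nil {
		return "", err
	}
	if err := add("dst_ns", dst); err != nil {
		return "", err
	}
	if deniedOnly {
		where = append(where, "action = 'deny'")
	}
	if limit <= 0 {
		limit = 200 // documented default row cap
	}
	q := "SELECT src_ns, dst_ns, action, first_seen, last_seen, flow_count FROM edge"
	if len(where) > 0 {
		q += " WHERE " + strings.Join(where, " AND ")
	}
	return q + fmt.Sprintf(" ORDER BY last_seen DESC LIMIT %d", limit), nil
}

func main() {
	q, err := edgesQuery("recruiter-responder", "", true, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(q)
}
```

Validating against a closed charset, rather than escaping, keeps the rendered statement safe without needing a parameterized driver on the exec path.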
### v0.10 — `vault get --all` (browse every field)
`vault get <name> --all` returns the **whole item** as a normalized JSON object,
so an agent can discover and read fields the single-field `--field` allowlist
can't reach — notably arbitrary **custom fields**.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault get <name> --all` | read | all fields as JSON: `{name, username?, password?, uris?, totp?, notes?, fields?}` |
Shape notes: present standard fields only (empty ones omitted); `fields` is a
custom `name→value` map (duplicate names → last-wins; `linked` fields skipped).
The TOTP **seed is never emitted**: `totp` is a presence flag (`true`), so the
only seed-derived path stays the specially-audited `vault code`. Like
`get --json`, the dump is all secret values, so it **refuses a terminal** — pipe
it (`homelab vault get <name> --all | jq`).
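
The shape notes above can be illustrated with a small Go sketch. It assumes nothing about the real implementation: `item`, `customField`, `normalize` and `render` are hypothetical names, and only the documented rules are modelled (empty fields omitted, last-wins custom fields, `linked` fields skipped, TOTP reduced to a presence flag).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// customField is a hypothetical stand-in for one raw custom field.
type customField struct {
	Name   string
	Value  string
	Linked bool
}

// item mirrors the documented output shape; omitempty drops absent fields.
type item struct {
	Name     string            `json:"name"`
	Username string            `json:"username,omitempty"`
	Password string            `json:"password,omitempty"`
	URIs     []string          `json:"uris,omitempty"`
	TOTP     bool              `json:"totp,omitempty"`
	Notes    string            `json:"notes,omitempty"`
	Fields   map[string]string `json:"fields,omitempty"`
}

// normalize applies the shape rules. The TOTP seed is only tested for
// presence and never copied into the result.
func normalize(name, username, password, totpSeed string, custom []customField) item {
	it := item{Name: name, Username: username, Password: password, TOTP: totpSeed != ""}
	for _, f := range custom {
		if f.Linked {
			continue // linked fields are skipped
		}
		if it.Fields == nil {
			it.Fields = map[string]string{}
		}
		it.Fields[f.Name] = f.Value // duplicate names: last one wins
	}
	return it
}

// render serializes the normalized item to compact JSON.
func render(it item) (string, error) {
	b, err := json.Marshal(it)
	return string(b), err
}

func main() {
	it := normalize("demo", "alice", "", "SEED", []customField{
		{Name: "pin", Value: "1"},
		{Name: "pin", Value: "2"},
		{Name: "ref", Value: "x", Linked: true},
	})
	out, err := render(it)
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(out)
}
```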
### v0.10.1 — reads `bw sync` first (always fresh)
Every vault read (`get`, `get --all`, `list`, `code`, `status`) now runs `bw
sync` when opening its session, so it reflects the latest server-side values.
`bw unlock` only decrypts the *local* cache, so without this a persisted
(already-logged-in) session served stale data — a password changed in the web
vault wouldn't show up until the next login. The sync is **best-effort**: a
transient failure warns on stderr and falls back to the cached vault rather than
failing the read.
### v0.11 — `vault kv` (HashiCorp Vault / OpenBao infra secrets)
`homelab vault` now fronts **two unrelated stores**, made explicit in the bare
`homelab vault` help and via `[vaultwarden]` / `[hashicorp-vault]` summary tags:
- **Vaultwarden** — your personal password manager (`vault get/list/code/…`, unchanged).
- **HashiCorp Vault / OpenBao** — homelab infra secrets, the `secret/…` KV store, under `vault kv`.
| Command | Tier | What it does |
| --- | --- | --- |
| `vault kv get <path> [--field K]` | read | read a secret: `--field K` → one value (TTY-aware clipboard/stdout); no field → all fields as JSON (refuses a bare TTY) |
| `vault kv list <path>` | read | list sub-paths under `<path>` (no values) |
| `vault kv put <path> <key>` | write | write one key; **value via stdin** (piped or no-echo prompt, never argv); creates the path or **merges** (never clobbers siblings) |
**Different credentials:** the Vaultwarden verbs use the per-user *scoped* token
(bound to `claude-users/<user>`); `vault kv` uses your **own** Vault token
(`vault login -method=oidc` → `~/.vault-token`, or `$VAULT_TOKEN`); the kv
handlers set `VAULT_ADDR` but never inject the scoped token (which would 403 off
its own path). Access is whatever your policy grants. Writes are merge-only;
a replacing `put` and `delete` are out of scope (use the raw `vault` CLI for those).
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
@@ -190,4 +284,4 @@ original flag-based path unchanged, so the webhook handler is unaffected.
## Design
See `infra/docs/adr/0004`–`0012` for the architecture decisions.
See `infra/docs/adr/0004`–`0013` for the architecture decisions.


@ -1 +1 @@
v0.7.0
v0.11.0

cli/browser.go Normal file

@@ -0,0 +1,388 @@
package main
import (
_ "embed"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"os/signal"
"path/filepath"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60
const (
chromeServiceNamespace = "chrome-service"
chromeServiceName = "chrome-service"
chromeServiceCDPPort = 9222
)
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string
// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
mode string // "run" | "open"
script string // path to the user Playwright script (run mode)
url string // initial URL (run: optional; open: required positional)
sharedCtx bool // use the warmed persistent profile instead of a fresh context
keepOpen bool // leave the created context/pages open on exit
port int // explicit local port for the forward (0 = auto)
timeout int // CDP readiness timeout, seconds
help bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
var positionals []string
atoi := func(s, flag string) (int, error) {
n, err := strconv.Atoi(s)
if err != nil {
return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
}
return n, nil
}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-h" || a == "--help":
o.help = true
case a == "--shared-context":
o.sharedCtx = true
case a == "--keep-open":
o.keepOpen = true
case a == "--url":
if i+1 < len(args) {
o.url = args[i+1]
i++
}
case strings.HasPrefix(a, "--url="):
o.url = strings.TrimPrefix(a, "--url=")
case a == "--port":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--port")
if err != nil {
return o, err
}
o.port = n
i++
}
case strings.HasPrefix(a, "--port="):
n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
if err != nil {
return o, err
}
o.port = n
case a == "--timeout":
if i+1 < len(args) {
n, err := atoi(args[i+1], "--timeout")
if err != nil {
return o, err
}
o.timeout = n
i++
}
case strings.HasPrefix(a, "--timeout="):
n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
if err != nil {
return o, err
}
o.timeout = n
case strings.HasPrefix(a, "-"):
return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
default:
positionals = append(positionals, a)
}
}
if o.help {
return o, nil
}
switch mode {
case "run":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
}
o.script = positionals[0]
case "open":
if len(positionals) == 0 {
return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
}
o.url = positionals[0]
}
return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
deadline := time.Now().Add(timeout)
client := &http.Client{Timeout: 3 * time.Second}
var lastErr error
for time.Now().Before(deadline) {
resp, err := client.Get(cdpURL + "/json/version")
if err != nil {
lastErr = err
time.Sleep(300 * time.Millisecond)
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
browser, healthy, herr := cdpHealthy(body)
if herr != nil {
lastErr = herr
time.Sleep(300 * time.Millisecond)
continue
}
if !healthy {
return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
}
return browser, nil
}
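// Name the timeout explicitly and keep the last probe error as its cause.
if lastErr == nil {
return "", fmt.Errorf("timed out after %s", timeout)
}
return "", fmt.Errorf("timed out after %s: %w", timeout, lastErr)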
}
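
The readiness loop in waitForCDP can be exercised without a cluster. The sketch below is a simplified stand-in, not the code from this diff: `pollVersion`, `fetchBrowser` and `fakeCDP` are hypothetical names, the health check is reduced to a `Chrome/` prefix test, and a local `httptest` server plays the role of the port-forwarded endpoint by failing a few probes before answering.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
	"time"
)

// pollVersion (hypothetical) polls base+"/json/version" until the body parses
// and names a Chrome build, or until the timeout elapses.
func pollVersion(base string, timeout, interval time.Duration) (string, error) {
	client := &http.Client{Timeout: time.Second}
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		browser, err := fetchBrowser(client, base+"/json/version")
		if err == nil {
			if !strings.HasPrefix(browser, "Chrome/") {
				// A wrong target will not fix itself, so stop retrying.
				return browser, fmt.Errorf("unexpected browser %q", browser)
			}
			return browser, nil
		}
		lastErr = err
		time.Sleep(interval)
	}
	return "", fmt.Errorf("timed out after %s: %v", timeout, lastErr)
}

// fetchBrowser performs one probe and extracts the Browser field.
func fetchBrowser(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var v struct {
		Browser string `json:"Browser"`
	}
	if err := json.Unmarshal(body, &v); err != nil {
		return "", err
	}
	if v.Browser == "" {
		return "", fmt.Errorf("no Browser field")
	}
	return v.Browser, nil
}

// fakeCDP returns a test server that answers 503 for the first `failures`
// requests and a healthy /json/version body afterwards.
func fakeCDP(failures int) *httptest.Server {
	var n int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&n, 1) <= int32(failures) {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, `{"Browser":"Chrome/130.0.6723.31"}`)
	}))
}

func main() {
	srv := fakeCDP(2)
	defer srv.Close()
	browser, err := pollVersion(srv.URL, 5*time.Second, 50*time.Millisecond)
	fmt.Println(browser, err)
}
```

Transient failures (connection refused, unparseable body) are retried, while a parsed but unexpected browser string ends the loop immediately, which matches the split between retryable and terminal outcomes in waitForCDP.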
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

106
cli/browser_runner.js Normal file

@ -0,0 +1,106 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});

54
cli/browser_stealth.js Normal file

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

117
cli/cmd_browser.go Normal file

@ -0,0 +1,117 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
return `homelab browser: drive the cluster's HEADFUL Chrome (anti-bot) over CDP

The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connectOverCDP,
injects the same stealth.js the in-cluster callers use, and runs your script.

USAGE
  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
  homelab browser open <url> [--shared-context] [--timeout S]

WHEN TO USE THIS: escalation only; DEFAULT to the headless/MCP browser
  Default to the Playwright MCP / headless browser for ALL routine browsing and
  automation: it's interactive (snapshot per step), fast to start, isolated.
  Reach for THIS command ONLY when headless is demonstrably blocked: a site
  LOADS fine but a gated action FAILS or HANGS (a submit/login/checkout spins
  forever), or ONE request errors while its siblings 200. That is the signature
  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
  disable-devtool traps). This command presents as a real Chrome and usually
  succeeds first try, but it's the shared cluster browser (slower startup, one
  batch run, no per-step feedback), so it's the escalation path, never the default.

ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
  ERR_FILE_NOT_FOUND (-6)      request intercepted/resolved locally by the
                               automation layer, NOT a network/egress problem.
                               (This is what silently broke the headless submit.)
  ERR_CONNECTION_REFUSED /     real egress failure (DNS/route/firewall). These also
  ERR_TIMED_OUT /              break the initial page load; if the page loaded,
  ERR_NAME_NOT_RESOLVED        egress is fine and the cause is elsewhere.
  one endpoint 500s while      server-side bot rejection of the automation, not
  its siblings 200             your payload.

HABITS
  - Inspect the network panel BEFORE retrying a deterministic form; a blind
    retry just repeats the same silent failure.
  - Don't park a half-filled multi-step form across a user pause: the session
    can expire; re-run the whole flow from this command in one shot.
  - Uploads stream over CDP via setInputFiles from THIS host, so no chmod/staging
    of $HOME is needed; just point setInputFiles at a local path.

CONTEXT
  Default: a FRESH incognito context, closed on exit; safe for the shared
  browser and concurrent callers (e.g. tripit). Your script does its own login.
  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.

SCRIPT CONTRACT (run mode)
  Your file's body runs with page, context, browser and log() already in scope
  (top-level await allowed). Return a value to print it. Example flow.js:
      await page.goto('https://portal.example.com/login');
      await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
      await page.click('button[type=submit]');
      await page.waitForURL('**/dashboard');
      return 'logged in: ' + page.url();
  Run it: homelab browser run flow.js

NOTES
  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
  - The port-forward is always torn down, on success and on error.
`
}

172
cli/cmd_browser_test.go Normal file

@ -0,0 +1,172 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}

69
cli/cmd_edges.go Normal file

@ -0,0 +1,69 @@
package main
import "fmt"
func edgesCommands() []Command {
return []Command{
{Path: []string{"edges"}, Tier: TierRead,
Summary: "who-talks-to-whom trail: edges [--ns|--src|--dst|--peers-of N] [--new-since 24h] [--denied] [--json] [--limit N]",
Run: edgesRun},
}
}
// edgesRun renders the filter flags to SQL and runs it read-only against the
// goldmane_edges CNPG DB via the dbaas primary pod (same exec path as `k8s db`).
func edgesRun(args []string) error {
for _, a := range args {
if a == "-h" || a == "--help" {
fmt.Print(edgesUsage())
return nil
}
}
o, err := parseEdgesArgs(args)
if err != nil {
return fmt.Errorf("%w\n\n%s", err, edgesUsage())
}
sql, err := buildEdgesQuery(o)
if err != nil {
return err
}
// pg-cluster-rw is a Service (not exec-able); resolve the primary POD.
pod, err := kubectlCapture("dbaas", "get", "pod", "-l", "cnpg.io/instanceRole=primary",
"-o", "jsonpath={.items[0].metadata.name}")
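if err != nil {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: %w", err)
}
// Report the empty-result case on its own so the message never ends in "<nil>".
if pod == "" {
return fmt.Errorf("could not resolve CNPG primary pod in dbaas: no pod matches cnpg.io/instanceRole=primary")
}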
exec := []string{"exec", pod, "-c", "postgres", "--", "psql", "-U", "postgres", "-d", "goldmane_edges"}
if o.asJSON {
exec = append(exec, "-tAc", sql) // raw tuple → the JSON array
} else {
exec = append(exec, "-P", "pager=off", "-c", sql) // aligned table for humans
}
return kubectlStream("dbaas", exec...)
}
func edgesUsage() string {
return `homelab edges: query the who-talks-to-whom trail (goldmane_edges, ADR-0014)

Usage: homelab edges [filters]

Filters (AND-combined; namespace values are validated to the k8s name charset):
  --ns NAME          edges touching NAME (either direction)
  --src NAME         edges where source namespace = NAME
  --dst NAME         edges where destination namespace = NAME
  --peers-of NAME    distinct peer namespaces of NAME (both directions)
  --new-since SPEC   first seen since SPEC: a duration (24h, 7d, 30m, 90s) or a date (YYYY-MM-DD)
  --denied           only denied (action='deny') edges: blocked / lateral-movement attempts
  --json             output a JSON array (for agents/pipelines)
  --limit N          cap rows (default 200)

Examples:
  homelab edges --ns immich                  # everything immich talks to / is talked to by
  homelab edges --peers-of authentik         # authentik's peer namespaces
  homelab edges --src recruiter-responder    # that namespace's egress peers
  homelab edges --new-since 24h              # edges first seen in the last day
  homelab edges --denied --json              # blocked flows, machine-readable

Read-only SELECT against CNPG DB goldmane_edges via the dbaas primary pod.
`
}
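
The usage text promises two things that parseEdgesArgs and buildEdgesQuery must enforce before any SQL is rendered, and neither function appears in this section. The sketch below shows one plausible shape for them, under stated assumptions: `validNamespace`, `parseSince` and the `cutoff` type are hypothetical names, and the namespace pattern is the RFC 1123 label form that Kubernetes uses for namespace names. A hand-rolled unit table is used because Go's `time.ParseDuration` has no day unit, so a spec such as `7d` would be rejected by it.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// nsRE matches an RFC 1123 DNS label: lowercase alphanumerics and '-',
// starting and ending with an alphanumeric, at most 63 characters.
var nsRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$`)

// validNamespace is a hypothetical stand-in for the validation the usage
// text mentions; rejecting anything else keeps quotes out of the SQL.
func validNamespace(s string) bool { return nsRE.MatchString(s) }

// durRE accepts an integer followed by one unit letter: s, m, h or d.
var durRE = regexp.MustCompile(`^([0-9]+)([smhd])$`)

var units = map[string]time.Duration{
	"s": time.Second, "m": time.Minute, "h": time.Hour, "d": 24 * time.Hour,
}

// cutoff is the parsed form of a --new-since SPEC. Exactly one field is set.
type cutoff struct {
	Age  time.Duration // relative specs such as 24h or 7d
	Date time.Time     // absolute specs in YYYY-MM-DD form (midnight UTC)
}

// parseSince (hypothetical) turns a SPEC into a cutoff or an error.
func parseSince(spec string) (cutoff, error) {
	if m := durRE.FindStringSubmatch(spec); m != nil {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return cutoff{}, fmt.Errorf("bad duration %q: %w", spec, err)
		}
		return cutoff{Age: time.Duration(n) * units[m[2]]}, nil
	}
	d, err := time.Parse("2006-01-02", spec)
	if err != nil {
		return cutoff{}, fmt.Errorf("bad --new-since %q: want 24h, 7d, 30m, 90s or YYYY-MM-DD", spec)
	}
	return cutoff{Date: d}, nil
}

func main() {
	for _, spec := range []string{"24h", "7d", "2026-06-01", "soon"} {
		c, err := parseSince(spec)
		fmt.Println(spec, c.Age, c.Date.Format("2006-01-02"), err)
	}
	fmt.Println(validNamespace("immich"), validNamespace("x'; DROP TABLE edges"))
}
```

Validating against a closed pattern, instead of escaping, means a rejected value never reaches the query string at all.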

View file

@ -2,7 +2,6 @@ package main
import (
"encoding/base64"
- "encoding/json"
"fmt"
"os"
"path/filepath"
@ -14,23 +13,24 @@ import (
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
- // The token lives in a k8s Secret (a JSON blob of several skill tokens), the
- // same place the openclaw agent reads it from. `ha token` resolves it on demand
- // via the ambient kubeconfig, so it never depends on a pre-set env var (the gap
- // that made agents re-derive the kubectl|base64|jq pipeline every session).
+ // The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
+ // instance), split out of openclaw-secrets so non-admin operators (emo / "Home
+ // Server Admins") can read JUST the HA token, not the full skill_secrets blob.
+ // `ha token` resolves it on demand via the ambient kubeconfig, so it never
+ // depends on a pre-set env var (the gap that made agents re-derive the
+ // kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
- secretKey string // key inside skill_secrets holding this instance's token
+ secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
- haSecretName = "openclaw-secrets"
- haSecretField = "skill_secrets" // a base64 JSON blob: {token-name: token}
+ haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
@ -38,8 +38,8 @@ const (
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
- "sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "home_assistant_sofia_token"},
- "london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "home_assistant_token"},
+ "sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
+ "london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
@ -63,22 +63,14 @@ func resolveHAInstance(name string) (haInstance, error) {
return inst, nil
}
- // parseSkillSecret decodes the base64 skill_secrets blob (as returned by kubectl
- // jsonpath, trailing whitespace tolerated) and returns the value for key.
- func parseSkillSecret(b64, key string) (string, error) {
+ // decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
+ // by kubectl jsonpath (trailing whitespace tolerated).
+ func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
- return "", fmt.Errorf("decode %s: %w", haSecretField, err)
+ return "", fmt.Errorf("base64-decode secret value: %w", err)
}
- var m map[string]string
- if err := json.Unmarshal(raw, &m); err != nil {
- return "", fmt.Errorf("parse %s json: %w", haSecretField, err)
- }
- v, ok := m[key]
- if !ok {
- return "", fmt.Errorf("key %q not present in %s", key, haSecretField)
- }
- return v, nil
+ return string(raw), nil
}
func haToken(args []string) error {
@ -95,14 +87,14 @@ func haToken(args []string) error {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
- "-o", "jsonpath={.data."+haSecretField+"}")
+ "-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
- return fmt.Errorf("read secret %s/%s (kubeconfig set?): %w", haSecretNamespace, haSecretName, err)
+ return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
- return fmt.Errorf("secret %s/%s has no %q field", haSecretNamespace, haSecretName, haSecretField)
+ return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
- tok, err := parseSkillSecret(b64, inst.secretKey)
+ tok, err := decodeSecretValue(b64)
if err != nil {
return err
}

View file

@ -12,10 +12,10 @@ func TestResolveHAInstance(t *testing.T) {
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
- if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "home_assistant_sofia_token" {
+ if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
- if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "home_assistant_token" || got.sshUser != "hassio" {
+ if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
@ -23,22 +23,19 @@ func TestResolveHAInstance(t *testing.T) {
}
}
- func TestParseSkillSecret(t *testing.T) {
- blob := base64.StdEncoding.EncodeToString([]byte(
- `{"home_assistant_sofia_token":"tok-sofia","home_assistant_token":"tok-london","slack_webhook":"https://x"}`))
- if got, err := parseSkillSecret(blob, "home_assistant_sofia_token"); err != nil || got != "tok-sofia" {
- t.Fatalf("parseSkillSecret sofia = %q, %v; want tok-sofia", got, err)
+ func TestDecodeSecretValue(t *testing.T) {
+ // k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
+ // returns that base64, which decodeSecretValue turns back into the raw token.
+ enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
+ if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
+ t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
- // kubectl jsonpath output can carry trailing whitespace/newline — must tolerate it
- if got, err := parseSkillSecret(blob+"\n", "home_assistant_token"); err != nil || got != "tok-london" {
- t.Fatalf("parseSkillSecret london (trailing ws) = %q, %v; want tok-london", got, err)
+ // trailing whitespace/newline from jsonpath output must be tolerated
+ if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
+ t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
- if _, err := parseSkillSecret(blob, "missing_key"); err == nil {
- t.Fatalf("parseSkillSecret should error on a key absent from the blob")
- }
- if _, err := parseSkillSecret("not-base64!!", "home_assistant_sofia_token"); err == nil {
- t.Fatalf("parseSkillSecret should error on undecodable base64")
+ if _, err := decodeSecretValue("not-base64!!"); err == nil {
+ t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
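
Review note: the hunk above pins the behaviour of decodeSecretValue, but its body is not part of this test file. The sketch below is one hypothetical implementation that would satisfy those three assertions, assuming it only trims surrounding whitespace and base64-decodes; the real function may differ.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeSecretValue is a HYPOTHETICAL body written to match the tests in the
// hunk above: trim the whitespace kubectl may append, then base64-decode.
func decodeSecretValue(enc string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(enc))
	if err != nil {
		return "", fmt.Errorf("decode secret value: %w", err)
	}
	return string(raw), nil
}

func main() {
	enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))

	// A trailing newline, as jsonpath output may carry, is tolerated.
	got, err := decodeSecretValue(enc + "\n")
	fmt.Printf("%q %v\n", got, err)

	// Input that is not valid base64 is reported as an error.
	_, err = decodeSecretValue("not-base64!!")
	fmt.Println("bad input rejected:", err != nil)
}
```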

@@ -54,10 +54,7 @@ func printMemories(raw []byte, jsonOut bool) error {
 return nil
 }
 for _, m := range r.Memories {
-c := strings.ReplaceAll(m.Content, "\n", " ")
-if len(c) > 240 {
-c = c[:240] + "…"
-}
+c := truncatePreview(strings.ReplaceAll(m.Content, "\n", " "), 240)
 fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
 if m.Tags != "" {
 fmt.Printf(" tags: %s\n", m.Tags)
@@ -66,6 +63,21 @@ func printMemories(raw []byte, jsonOut bool) error {
 return nil
 }
+// truncatePreview shortens s to at most maxRunes RUNES, appending "…" when it
+// trims. Counting runes (not bytes) is load-bearing: a byte slice like s[:240]
+// can cut through the middle of a multibyte UTF-8 character (e.g. 2-byte
+// Cyrillic), leaving a dangling lead byte = invalid UTF-8. That crashed strict
+// decoders downstream — notably the homelab-memory-recall.py UserPromptSubmit
+// hook (subprocess text=True), which surfaced as a recurring "UserPromptSubmit
+// hook error" for Cyrillic-language users.
+func truncatePreview(s string, maxRunes int) string {
+r := []rune(s)
+if len(r) <= maxRunes {
+return s
+}
+return string(r[:maxRunes]) + "…"
+}
 func memoryRecall(args []string) error {
 req := memRecallReq{}
 jsonOut := false
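
Review note: the byte-versus-rune difference that motivates truncatePreview can be reproduced in a few lines. This is a small sketch; the four-character Cyrillic string is a made-up input, and truncatePreview is copied from the hunk so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// truncatePreview is copied from the diff above.
func truncatePreview(s string, maxRunes int) string {
	r := []rune(s)
	if len(r) <= maxRunes {
		return s
	}
	return string(r[:maxRunes]) + "…"
}

func main() {
	s := strings.Repeat("я", 4) // 4 runes, 8 bytes (each "я" is 2 bytes)

	// Old behaviour: slicing by bytes at an odd offset splits a character.
	byteCut := s[:3] + "…"
	fmt.Println("byte slice valid UTF-8:", utf8.ValidString(byteCut))

	// New behaviour: slicing by runes always lands on a character boundary.
	runeCut := truncatePreview(s, 3)
	fmt.Println("rune slice valid UTF-8:", utf8.ValidString(runeCut))
	fmt.Println("runes kept (incl. ellipsis):", utf8.RuneCountInString(runeCut))
}
```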

944
cli/cmd_vault.go Normal file
@@ -0,0 +1,944 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/runbooks/homelab-vault-onboarding.md.
func vaultCommands() []Command {
cmds := []Command{
// Vaultwarden — your personal password manager (logins/passwords/TOTP).
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "[vaultwarden] one-time: store your master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "[vaultwarden] show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "[vaultwarden] list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "[vaultwarden] fetch one login: vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "[vaultwarden] search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "[vaultwarden] current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "[vaultwarden] lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "two stores: Vaultwarden (logins) + HashiCorp Vault/OpenBao kv (infra secrets) — run `homelab vault` for help",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
// HashiCorp Vault / OpenBao — homelab INFRA secrets (the secret/… KV store).
return append(cmds, vaultKVCommands()...)
}
// vaultHelp is shown for bare `homelab vault`. It LEADS with the distinction
// between the two unrelated "vaults" this command fronts, because the name
// collides: Vaultwarden (a password manager) vs HashiCorp Vault / OpenBao (the
// infra secrets store).
func vaultHelp() string {
return `homelab vault - two different secret stores under one command:
  Vaultwarden                your personal PASSWORD MANAGER (logins / passwords / TOTP)
  HashiCorp Vault / OpenBao  homelab INFRA secrets (the secret/ KV store) -> 'vault kv'

Vaultwarden (reads YOUR OWN vault; no-HITL after one-time setup)
  homelab vault setup                one-time: store your master password + API key in your Vault path
  homelab vault status               configured / unlocked / reachable (no secrets)
  homelab vault list [--search Q]    list your item names (no secrets)
  homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
                                     TTY -> clipboard (auto-clears); piped -> stdout
  homelab vault get <name> --all     all fields (incl. custom) as JSON; piped only.
                                     TOTP shown as presence flag; use 'vault code' for a code.
  homelab vault code <name>          current TOTP code
  homelab vault lock                 lock / log out the local bw session

HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC vault token)
  homelab vault kv get <path> [--field K]    read an infra KV secret
  homelab vault kv list <path>               list sub-paths
  homelab vault kv put <path> <key>          write one key (value via stdin)

Vaultwarden creds live only in your own Vault path; the admin never sees them.
Security model: docs/runbooks/homelab-vault-onboarding.md
(note: anything running as your user can decrypt your vault; that is the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
// exitStderr returns the stderr captured by cmd.Output() on a failed exec (it
// stows it on *exec.ExitError), or nil. The tools we shell out to (vault, bw)
// write the actionable message there — "connection refused", "permission
// denied" — which the caller would otherwise never see behind a bare
// "exit status N".
func exitStderr(err error) []byte {
var ee *exec.ExitError
if errors.As(err, &ee) {
return ee.Stderr
}
return nil
}
// augmentErr appends captured stderr to an error so failures are diagnosable
// (not just "exit status 2"). Returns nil when err is nil, and err unchanged
// when there's no stderr; preserves the wrapped error for errors.Is/As.
func augmentErr(err error, stderr []byte) error {
if err == nil {
return nil
}
if s := strings.TrimSpace(string(stderr)); s != "" {
return fmt.Errorf("%w: %s", err, s)
}
return err
}
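
Review note: the contract in the augmentErr comment (nil stays nil, no stderr leaves the error unchanged, the original error stays reachable through errors.Is) can be checked in isolation. A minimal sketch: errBase and the stderr text are made-up stand-ins for a real *exec.ExitError, and augmentErr is copied from the diff.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errBase stands in for the bare "exit status N" error exec would return.
var errBase = errors.New("exit status 2")

// augmentErr is copied from the diff above.
func augmentErr(err error, stderr []byte) error {
	if err == nil {
		return nil
	}
	if s := strings.TrimSpace(string(stderr)); s != "" {
		return fmt.Errorf("%w: %s", err, s)
	}
	return err
}

func main() {
	// Hypothetical stderr, of the kind the vault CLI prints on failure.
	wrapped := augmentErr(errBase, []byte("connection refused\n"))
	fmt.Println(wrapped)

	// %w keeps the original error in the chain for errors.Is / errors.As.
	fmt.Println("still matches base:", errors.Is(wrapped, errBase))

	// No stderr: the very same error value comes back.
	fmt.Println("unchanged without stderr:", augmentErr(errBase, nil) == errBase)

	// nil in, nil out, even if stderr had content.
	fmt.Println("nil stays nil:", augmentErr(nil, []byte("noise")) == nil)
}
```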
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), augmentErr(err, exitStderr(err))
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// scopedTokenPath is where claude-auth-sync keeps the user's scoped Vault token.
// MUST match CAS_VAULT_TOKEN_FILE in scripts/workstation/claude-auth-sync.sh.
func scopedTokenPath(home string) string {
return home + "/.config/claude-auth-sync/vault-token"
}
// vaultTokenSource decides which Vault token the `vault` child processes should
// use. Precedence: an explicit $VAULT_TOKEN (deliberate override), then the
// per-user scoped token claude-auth-sync maintains at scopedTokenPath(HOME)
// (policy workstation-claude-<user>, which grants exactly the create/read/update
// this tool needs on the user's own path), then a native ~/.vault-token.
//
// The scoped token MUST beat ~/.vault-token: this tool only ever touches the
// caller's own secret/workstation/claude-users/<user> path, and a power-user who
// ran `vault login -method=oidc` carries a read-only ~/.vault-token whose
// capability on that path is `deny` — letting it win shadows the scoped token
// and every op fails 403/deny (emo, 2026-06-28). ~/.vault-token is only the
// right credential when there is no scoped token (admins). Returns the token to
// export — "" when the vault CLI should read the ambient/native credential —
// plus a source tag for tests/logging.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
switch {
case envToken != "":
return "", "env"
case strings.TrimSpace(scopedToken) != "":
return strings.TrimSpace(scopedToken), "scoped"
case haveVaultTokenFile:
return "", "file"
default:
return "", "none"
}
}
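
Review note: the precedence spelled out in the comment (explicit $VAULT_TOKEN, then the scoped token, then ~/.vault-token) reads more easily as a table. A small sketch with vaultTokenSource copied from the diff; the token strings are placeholders, not real Vault tokens.

```go
package main

import (
	"fmt"
	"strings"
)

// vaultTokenSource is copied from the diff above.
func vaultTokenSource(envToken string, haveVaultTokenFile bool, scopedToken string) (token, source string) {
	switch {
	case envToken != "":
		return "", "env"
	case strings.TrimSpace(scopedToken) != "":
		return strings.TrimSpace(scopedToken), "scoped"
	case haveVaultTokenFile:
		return "", "file"
	default:
		return "", "none"
	}
}

func main() {
	cases := []struct {
		note   string
		env    string
		file   bool
		scoped string
	}{
		{"explicit override wins", "placeholder-env", true, "placeholder-scoped"},
		{"scoped beats ~/.vault-token", "", true, "placeholder-scoped\n"},
		{"admin with only ~/.vault-token", "", true, ""},
		{"nothing usable", "", false, "   "},
	}
	for _, c := range cases {
		tok, src := vaultTokenSource(c.env, c.file, c.scoped)
		// tok is non-empty only when the scoped token must be exported.
		fmt.Printf("%-32s source=%-6s export=%q\n", c.note, src, tok)
	}
}
```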
// vaultAddrDefault is the cluster Vault the workstation talks to. The bw server
// is likewise hardcoded (openSession), so a sane default here is consistent.
const vaultAddrDefault = "https://vault.viktorbarzin.me"
// vaultAddrToSet returns the VAULT_ADDR to export when the caller's environment
// doesn't already set one, else "". homelab vault is invoked by AFK agent
// sessions — frequently non-login shells (tmux panes, agent subprocesses) that
// never sourced /etc/environment — so, like claude-auth-sync, the CLI must NOT
// depend on an ambient VAULT_ADDR; otherwise every `vault` child falls back to
// the 127.0.0.1:8200 default and fails "connection refused" (exit 2).
func vaultAddrToSet(envAddr string) string {
if strings.TrimSpace(envAddr) == "" {
return vaultAddrDefault
}
return ""
}
// ensureVaultAddr exports the default VAULT_ADDR when none is set, so the vault
// child processes reach the cluster Vault regardless of the caller's shell. An
// explicit VAULT_ADDR (admins, CI) is left untouched.
func ensureVaultAddr() {
if a := vaultAddrToSet(os.Getenv("VAULT_ADDR")); a != "" {
os.Setenv("VAULT_ADDR", a)
}
}
// fileNonEmpty reports whether path exists and has content.
func fileNonEmpty(path string) bool {
fi, err := os.Stat(path)
return err == nil && fi.Size() > 0
}
// ensureVaultToken wires vaultTokenSource to the real environment: when a
// claude-auth-sync scoped token exists and no explicit $VAULT_TOKEN is set, it
// exports the scoped token so the `vault` child processes authenticate as
// workstation-claude-<user>, even if a ~/.vault-token is also present (see
// vaultTokenSource for why the scoped token must win). It is idempotent and
// safe for admins: an explicit $VAULT_TOKEN always takes precedence, and
// ~/.vault-token is used whenever there is no scoped token.
func ensureVaultToken() {
// Every vault verb funnels through here, so this is the one place that also
// guarantees VAULT_ADDR is set (see vaultAddrToSet for why it can't be
// assumed from the caller's shell).
ensureVaultAddr()
home := os.Getenv("HOME")
scoped, _ := os.ReadFile(scopedTokenPath(home))
tok, src := vaultTokenSource(os.Getenv("VAULT_TOKEN"), home != "" && fileNonEmpty(home+"/.vault-token"), string(scoped))
if src == "scoped" {
os.Setenv("VAULT_TOKEN", tok)
}
}
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwItemArgs(name string) []string { return []string{"get", "item", name} }
func bwStatusArgs() []string { return []string{"status"} }
func bwSyncArgs() []string { return []string{"sync"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
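
Review note: a quick way to see which inputs trigger a login. The payloads below are trimmed-down samples (real `bw status` output carries more keys); only the "status" key matters to the function, which is copied from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bwNeedsLogin is copied from the diff above.
func bwNeedsLogin(statusJSON string) bool {
	var s struct {
		Status string `json:"status"`
	}
	if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
		return true
	}
	return s.Status == "unauthenticated" || s.Status == ""
}

func main() {
	samples := []string{
		`{"status":"unauthenticated"}`, // never logged in
		`{"status":"locked"}`,          // logged in, needs unlock only
		`{"status":"unlocked"}`,        // logged in and unlocked
		`{}`,                           // status key missing
		``,                             // empty output (bw failed)
		`not json`,                     // garbage on stdout
	}
	for _, in := range samples {
		fmt.Printf("%-32q needsLogin=%v\n", in, bwNeedsLogin(in))
	}
}
```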
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// Anything not listed above, including plain xterm, is not trusted to honor
// OSC 52, so the caller refuses rather than risk dumping base64 into scrollback.
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort;
// never blocks or fails the command). Goes to syslog so it ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
sessEnv := bwSecretEnv(appdata, creds, sess)
// Pull the latest server-side state so reads reflect current values. `bw
// unlock` only decrypts the LOCAL cache, so a persisted (already-logged-in)
// session would otherwise serve stale data until the next login. Best-effort:
// a transient sync failure must not break a read — fall back to the cached
// vault and warn (status reports reachability separately).
if _, err := run("bw", bwSyncArgs(), sessEnv); err != nil {
fmt.Fprintln(os.Stderr, "homelab vault: warning: bw sync failed; using cached vault (values may be stale): "+err.Error())
}
return session{env: sessEnv}, nil
}
type getOpts struct {
name string
field string
json bool
all bool // dump every field (incl. custom) as normalized JSON
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
case a == "--all":
o.all = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
}
// --all dumps the whole item, so --field is irrelevant — skip its allowlist.
if !o.all && !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
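
Review note: a few argument vectors run through parseGetArgs, to make the --field / --all interaction concrete. A sketch only: the item name "github" and the field "pin" are made up, and the parser and its types are copied from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

type getOpts struct {
	name  string
	field string
	json  bool
	all   bool
}

var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}

// parseGetArgs is copied from the diff above.
func parseGetArgs(args []string) (getOpts, error) {
	o := getOpts{field: "password"}
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--json":
			o.json = true
		case a == "--all":
			o.all = true
		case a == "--field" && i+1 < len(args):
			o.field = args[i+1]
			i++
		case strings.HasPrefix(a, "--field="):
			o.field = strings.TrimPrefix(a, "--field=")
		case !strings.HasPrefix(a, "-") && o.name == "":
			o.name = a
		}
	}
	if o.name == "" {
		return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json] [--all]")
	}
	if !o.all && !validGetFields[o.field] {
		return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
	}
	return o, nil
}

func main() {
	for _, argv := range [][]string{
		{"github"},                               // defaults to the password field
		{"github", "--field", "totp"},            // space-separated form
		{"github", "--field=username", "--json"}, // equals form plus --json
		{"github", "--all", "--field=pin"},       // --all skips the field allowlist
		{"github", "--field", "pin"},             // rejected: not an allowed field
		{"--json"},                               // rejected: no item name
	} {
		o, err := parseGetArgs(argv)
		fmt.Printf("%v -> %+v err=%v\n", argv, o, err)
	}
}
```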
// getValue opens a session and fetches one field. Every external command goes
// through the injected runner, so it is unit-tested with a fake runner (note
// that openSession still creates the appdata dir and may warn on stderr).
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// getItem opens a session and returns the whole item as raw `bw get item` JSON.
// Used by `get --all`; normalization is a separate, pure step (normalizeItem).
func getItem(run cmdRunner, user, uid, name string) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return run("bw", bwItemArgs(name), s.env)
}
// normalizedItem is the browse-all-fields projection of a Vaultwarden item: the
// standard login fields that are present, notes, and a flat map of custom field
// name→value. bw internals (id, object, reprompt, passwordHistory) are dropped,
// and the TOTP *seed* is reduced to a presence flag — the only seed-derived path
// stays the specially-audited `vault code` (see the design §10/§16).
type normalizedItem struct {
Name string `json:"name"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
URIs []string `json:"uris,omitempty"`
TOTP bool `json:"totp,omitempty"` // presence only, never the seed
Notes string `json:"notes,omitempty"`
Fields map[string]string `json:"fields,omitempty"` // custom field name→value
}
// bwFieldLinked is the Bitwarden custom-field type for a "linked" field: it
// references another field and carries a null value, so it is not real data.
const bwFieldLinked = 3
// normalizeItem parses a `bw get item` payload into the browse projection. It is
// pure (no I/O), so it is the unit-tested heart of `get --all`.
func normalizeItem(raw string) (normalizedItem, error) {
var it struct {
Name string `json:"name"`
Notes string `json:"notes"`
Login *struct {
Username string `json:"username"`
Password string `json:"password"`
Totp string `json:"totp"`
URIs []struct {
URI string `json:"uri"`
} `json:"uris"`
} `json:"login"`
Fields []struct {
Name string `json:"name"`
Value string `json:"value"`
Type int `json:"type"`
} `json:"fields"`
}
if err := json.Unmarshal([]byte(raw), &it); err != nil {
return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
}
n := normalizedItem{Name: it.Name, Notes: it.Notes}
if it.Login != nil {
n.Username = it.Login.Username
n.Password = it.Login.Password
n.TOTP = it.Login.Totp != ""
for _, u := range it.Login.URIs {
if u.URI != "" {
n.URIs = append(n.URIs, u.URI)
}
}
}
for _, f := range it.Fields {
if f.Type == bwFieldLinked {
continue // references another field, no value of its own
}
if n.Fields == nil {
n.Fields = map[string]string{}
}
n.Fields[f.Name] = f.Value // duplicate names: last-wins (rare; documented)
}
return n, nil
}
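
Review note: an end-to-end run of the projection on a sample payload helps confirm that the TOTP seed and bw internals never reach the output. The JSON below is a made-up item shaped after the struct tags in normalizeItem, not captured `bw` output; the types and function are copied from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type normalizedItem struct {
	Name     string            `json:"name"`
	Username string            `json:"username,omitempty"`
	Password string            `json:"password,omitempty"`
	URIs     []string          `json:"uris,omitempty"`
	TOTP     bool              `json:"totp,omitempty"`
	Notes    string            `json:"notes,omitempty"`
	Fields   map[string]string `json:"fields,omitempty"`
}

const bwFieldLinked = 3

// normalizeItem is copied from the diff above.
func normalizeItem(raw string) (normalizedItem, error) {
	var it struct {
		Name  string `json:"name"`
		Notes string `json:"notes"`
		Login *struct {
			Username string `json:"username"`
			Password string `json:"password"`
			Totp     string `json:"totp"`
			URIs     []struct {
				URI string `json:"uri"`
			} `json:"uris"`
		} `json:"login"`
		Fields []struct {
			Name  string `json:"name"`
			Value string `json:"value"`
			Type  int    `json:"type"`
		} `json:"fields"`
	}
	if err := json.Unmarshal([]byte(raw), &it); err != nil {
		return normalizedItem{}, fmt.Errorf("parse bw item: %w", err)
	}
	n := normalizedItem{Name: it.Name, Notes: it.Notes}
	if it.Login != nil {
		n.Username = it.Login.Username
		n.Password = it.Login.Password
		n.TOTP = it.Login.Totp != ""
		for _, u := range it.Login.URIs {
			if u.URI != "" {
				n.URIs = append(n.URIs, u.URI)
			}
		}
	}
	for _, f := range it.Fields {
		if f.Type == bwFieldLinked {
			continue
		}
		if n.Fields == nil {
			n.Fields = map[string]string{}
		}
		n.Fields[f.Name] = f.Value
	}
	return n, nil
}

// sampleItem is an illustrative payload; all values are invented.
const sampleItem = `{"id":"abc","object":"item","name":"example","notes":"n",
 "login":{"username":"u","password":"p","totp":"SEED",
          "uris":[{"uri":"https://example.com"},{"uri":""}]},
 "fields":[{"name":"pin","value":"1234","type":1},
           {"name":"link","value":null,"type":3}]}`

func main() {
	n, err := normalizeItem(sampleItem)
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(n, "", "  ")
	fmt.Println(string(out)) // the seed "SEED" is reduced to "totp": true
}
```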
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
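
Review note: the three outcomes of clipboardDecision are easiest to verify as a truth table. A sketch with both functions copied from the diff; the TERM / TERM_PROGRAM values are examples, not an exhaustive list.

```go
package main

import (
	"fmt"
	"strings"
)

// terminalAllowed is copied from the diff above.
func terminalAllowed(term, termProgram string) bool {
	t := strings.ToLower(term)
	p := strings.ToLower(termProgram)
	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
		if strings.Contains(t, ok) || strings.Contains(p, ok) {
			return true
		}
	}
	return false
}

// clipboardDecision is copied from the diff above.
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
	if !stdoutTTY {
		return "stdout"
	}
	if terminalAllowed(term, termProgram) && stderrTTY {
		return "clipboard"
	}
	return "refuse"
}

func main() {
	cases := []struct {
		stdoutTTY, stderrTTY bool
		term, program        string
	}{
		{false, false, "dumb", ""},                // piped: print to stdout
		{true, true, "xterm-kitty", ""},           // allowed terminal: clipboard
		{true, true, "xterm-256color", "WezTerm"}, // allowed via TERM_PROGRAM
		{true, true, "xterm-256color", ""},        // unknown terminal: refuse
		{true, false, "xterm-kitty", ""},          // stderr redirected: refuse
	}
	for _, c := range cases {
		fmt.Printf("stdoutTTY=%-5v stderrTTY=%-5v TERM=%-15s TERM_PROGRAM=%-8s -> %s\n",
			c.stdoutTTY, c.stderrTTY, c.term, c.program,
			clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.program))
	}
}
```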
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
// secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
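// clearClipboardAfter spawns a background clear so the secret doesn't linger in
// the clipboard. Best-effort.
func clearClipboardAfter(seconds int) {
cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear()))
// exec connects a nil Stdout to the null device, which would swallow the
// clear escape. Send it to the terminal on stderr, where the copy was written.
cmd.Stdout = os.Stderr
_ = cmd.Start()
}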
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
ensureVaultToken()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
ensureVaultToken()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
// openSession already did a best-effort sync; status re-runs it explicitly so
// a reachability failure surfaces in this report rather than only on stderr.
if _, err := run("bw", bwSyncArgs(), s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
ensureVaultToken()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// kvWriteVerb selects the KV write semantics. merge=true → `kv patch -method=rw`
// (read-modify-write: needs only read+update, NOT the `patch` capability the
// scoped workstation-claude-<user> policy lacks, and preserves co-located keys
// such as claude-auth-sync's claude_ai_oauth_json). merge=false → `kv put`
// (creates the path on first use, before any sibling keys exist).
func kvWriteVerb(merge bool) []string {
if merge {
return []string{"kv", "patch", "-method=rw"}
}
return []string{"kv", "put"}
}
// vaultWritePublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user),
"vaultwarden_email="+email,
"vaultwarden_client_id="+clientID,
)
}
// vaultWriteSecretArgs writes ONE secret value via the `key=-` stdin form, so the
// value never appears in argv (ps / /proc/<pid>/cmdline). Fed on stdin by
// realRunnerStdin.
func vaultWriteSecretArgs(merge bool, user, key string) []string {
return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
}
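
Review note: printing the argv these helpers build makes the put-versus-patch split and the `key=-` stdin form visible without touching a Vault server. A sketch with the helpers copied from the diff; the user "alice", the email, and the client_id are invented.

```go
package main

import "fmt"

const vwUserPathPrefix = "secret/workstation/claude-users/"

func vwCredsPath(user string) string { return vwUserPathPrefix + user }

// kvWriteVerb, vaultWritePublicArgs and vaultWriteSecretArgs are copied from
// the diff above.
func kvWriteVerb(merge bool) []string {
	if merge {
		return []string{"kv", "patch", "-method=rw"}
	}
	return []string{"kv", "put"}
}

func vaultWritePublicArgs(merge bool, user, email, clientID string) []string {
	return append(kvWriteVerb(merge), vwCredsPath(user),
		"vaultwarden_email="+email,
		"vaultwarden_client_id="+clientID,
	)
}

func vaultWriteSecretArgs(merge bool, user, key string) []string {
	return append(kvWriteVerb(merge), vwCredsPath(user), key+"=-")
}

func main() {
	// First write on a path that does not exist yet: plain `kv put`.
	fmt.Println(vaultWritePublicArgs(false, "alice", "alice@example.com", "user.1234"))

	// Same write when the path already holds other keys: read-modify-write.
	fmt.Println(vaultWritePublicArgs(true, "alice", "alice@example.com", "user.1234"))

	// Secrets always merge, and argv only ever carries "key=-"; the value
	// itself would be fed on stdin.
	fmt.Println(vaultWriteSecretArgs(true, "alice", "vaultwarden_master_password"))
}
```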
// credsPathExists reports whether the user's KV path already holds data. Used to
// pick create (`kv put`) vs merge (`kv patch -method=rw`) for the first write:
// claude-auth-sync usually creates the path first (Claude OAuth backup), but a
// user could run `homelab vault setup` before that ever happens.
func credsPathExists(run cmdRunner, user string) bool {
_, err := run("vault", []string{"kv", "get", "-format=json", vwCredsPath(user)}, nil)
return err == nil
}
// cmdRunnerStdin is realRunnerStdin's shape, injected so writeCreds is testable.
type cmdRunnerStdin func(name string, argv, envv []string, stdin string) (string, error)
// writeCreds stores all four fields in the user's Vault path using only the
// capabilities the scoped policy grants (create/read/update — NOT `patch`). The
// first (public) write creates the path when absent; the two real secrets then
// merge in via read-modify-write so the public keys — and any claude-auth-sync
// keys already present — survive. Secret values travel on stdin, never argv.
func writeCreds(run cmdRunner, runStdin cmdRunnerStdin, user string, c vwCreds) error {
merge := credsPathExists(run, user)
if _, err := run("vault", vaultWritePublicArgs(merge, user, c.Email, c.ClientID), nil); err != nil {
return err
}
// The path now exists regardless of the branch above → merge the secrets in.
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := runStdin("vault", vaultWriteSecretArgs(true, user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
func promptNoEcho(prompt string) (string, error) {
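fmt.Fprint(os.Stderr, prompt)
// stty acts on the terminal attached to its stdin, and exec.Command leaves
// Stdin on the null device unless told otherwise, so pass the real stdin through.
stty := func(arg string) {
cmd := exec.Command("stty", arg)
cmd.Stdin = os.Stdin
cmd.Run()
}
stty("-echo")
defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()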
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator — a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
ensureVaultToken()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if email == "" || master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(realRunner, realRunnerStdin, vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
ensureVaultToken()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
if o.all {
return getAllFields(user, uid, o.name)
}
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
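// Encode with encoding/json rather than %q: Go quoting can emit escapes
// (for example \a or \x00) that are not valid JSON.
b, err := json.Marshal(map[string]string{o.field: val})
if err != nil {
return err
}
fmt.Println(string(b))
return nil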
}
emitSecret(val)
return nil
}
// getAllFields prints every field of one item as normalized JSON. Like
// `get --json`, the payload is all secret values, so it refuses a terminal
// (pipe it). The TOTP seed is never emitted — only a presence flag — so no extra
// TOTP audit is needed; the op-log uses a distinct verb so a bulk dump is
// distinguishable from a single-field get (the item name is still never logged).
func getAllFields(user, uid, name string) error {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all fields as JSON to a terminal; pipe it (e.g. | jq)")
}
raw, err := getItem(realRunner, user, uid, name)
if err != nil {
return err
}
item, err := normalizeItem(raw)
if err != nil {
return err
}
out, err := json.Marshal(item)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get-all", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
fmt.Println(string(out))
return nil
}
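
The "value on stdin, never in argv" rule used by `vaultWriteSecretArgs` can be sketched in isolation. This is a minimal illustration, not the project's code: `kvWriteVerb` is referenced in the diff but not shown, so the version below is a guess based on the comments (`kv put` to create, `kv patch -method=rw` to merge), and the path `secret/example/alice` is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// kvWriteVerb is an ASSUMED reconstruction of the helper the diff calls but
// does not show: merge=true selects a read-modify-write patch, merge=false a put.
func kvWriteVerb(merge bool) []string {
	if merge {
		return []string{"kv", "patch", "-method=rw"}
	}
	return []string{"kv", "put"}
}

// secretArgs builds the argv for one secret write. The value slot is "-",
// which tells the vault CLI to read the value from stdin.
func secretArgs(merge bool, path, key string) []string {
	return append(kvWriteVerb(merge), path, key+"=-")
}

func main() {
	secret := "s3cr3t" // would be piped to the child's stdin, not passed as an argument
	argv := secretArgs(true, "secret/example/alice", "vaultwarden_client_secret")
	fmt.Println(strings.Join(argv, " "))
	// The secret must not appear anywhere in the argument list.
	fmt.Println(strings.Contains(strings.Join(argv, " "), secret)) // prints "false"
}
```

Because the argument list only ever holds `<key>=-`, a process listing shows which key is being written but not its value.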

248
cli/cmd_vault_kv.go Normal file

@ -0,0 +1,248 @@
package main
import (
"encoding/json"
"fmt"
"io"
"os"
"strings"
)
// The `vault kv` verbs talk to HashiCorp Vault / OpenBao — the homelab INFRA
// secrets store (the `secret/…` KV-v2 mount at vault.viktorbarzin.me) — NOT
// Vaultwarden. They are a thin, TTY-aware wrapper over the `vault` CLI that adds
// the same conveniences as the Vaultwarden verbs: a self-defaulted VAULT_ADDR
// (so non-login agent shells work) and clipboard/refuse-on-TTY secret handling.
//
// CREDENTIALS DIFFER FROM THE VAULTWARDEN VERBS. Those use the per-user *scoped*
// token (bound only to secret/workstation/claude-users/<user>). A general kv read
// of e.g. secret/viktor must use the caller's OWN Vault token (the OIDC
// ~/.vault-token or an explicit $VAULT_TOKEN) — the scoped token has `deny`
// everywhere else and would 403. So the kv handlers call ensureVaultAddr() to
// guarantee VAULT_ADDR but deliberately do NOT call ensureVaultToken() (which
// injects the scoped token). Access is then whatever the caller's policy grants.
func vaultKVCommands() []Command {
return []Command{
{Path: []string{"vault", "kv", "get"}, Tier: TierRead,
Summary: "[hashicorp-vault] read an infra KV secret: vault kv get <path> [--field K]", Run: vaultKVGet},
{Path: []string{"vault", "kv", "list"}, Tier: TierRead,
Summary: "[hashicorp-vault] list infra KV sub-paths: vault kv list <path>", Run: vaultKVList},
{Path: []string{"vault", "kv", "put"}, Tier: TierWrite,
Summary: "[hashicorp-vault] write one KV key (value via stdin): vault kv put <path> <key>", Run: vaultKVPut},
{Path: []string{"vault", "kv"}, Tier: TierRead,
Summary: "[hashicorp-vault] infra secrets (run `homelab vault kv` for help)",
Run: func([]string) error { fmt.Print(vaultKVHelp()); return nil }},
}
}
func vaultKVHelp() string {
return `homelab vault kv: HashiCorp Vault / OpenBao (homelab INFRA secrets, the secret/ KV store)
  homelab vault kv get <path> [--field K]   read a secret
      --field K    one value (TTY: clipboard; piped: stdout)
      no --field   all fields as JSON (piped only)
  homelab vault kv list <path>              list sub-paths under <path> (no values)
  homelab vault kv put <path> <key>         write one key; value read from stdin
                                            (piped, or no-echo prompt); merges, never clobbers siblings
Uses YOUR Vault token (vault login -method=oidc writes ~/.vault-token); access is
whatever your policy grants. This is NOT Vaultwarden: for your personal logins
use 'homelab vault get' (see 'homelab vault').
`
}
// --- arg builders (pure; values never travel via argv) --------------------
func vaultKVGetFieldArgs(path, field string) []string {
return []string{"kv", "get", "-field=" + field, path}
}
func vaultKVGetJSONArgs(path string) []string { return []string{"kv", "get", "-format=json", path} }
func vaultKVListArgs(path string) []string { return []string{"kv", "list", "-format=json", path} }
// vaultKVPutArgs builds the write argv. merge=true → `kv patch -method=rw`
// (read-modify-write: merges, needs only read+update — not the `patch` capability
// — and preserves sibling keys); merge=false → `kv put` (creates the path on
// first write). The value is ALWAYS read from stdin via the `<key>=-` form, so it
// never appears in argv (visible via ps / /proc/<pid>/cmdline to same-UID procs).
func vaultKVPutArgs(merge bool, path, key string) []string {
return append(kvWriteVerb(merge), path, key+"=-")
}
// --- pure parsers ----------------------------------------------------------
// extractKVData returns the inner secret object from a `vault kv get -format=json`
// envelope (`{"data":{"data":{…},"metadata":{…}}}`), dropping the metadata/request
// wrapper so only the secret's own key→value data is emitted.
func extractKVData(jsonOut string) (string, error) {
var env struct {
Data struct {
Data json.RawMessage `json:"data"`
} `json:"data"`
}
if err := json.Unmarshal([]byte(jsonOut), &env); err != nil {
return "", fmt.Errorf("parse vault kv json: %w", err)
}
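// json.RawMessage keeps a literal JSON null as the bytes "null", so treat
// that as "no data" too.
if len(env.Data.Data) == 0 || string(env.Data.Data) == "null" {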
return "", fmt.Errorf("no secret data at that path")
}
return string(env.Data.Data), nil
}
// parseKVList parses the JSON array `vault kv list -format=json` prints.
func parseKVList(jsonOut string) ([]string, error) {
var keys []string
if err := json.Unmarshal([]byte(jsonOut), &keys); err != nil {
return nil, fmt.Errorf("parse vault kv list json: %w", err)
}
return keys, nil
}
// --- testable cores (injected cmdRunner) -----------------------------------
func kvGetField(run cmdRunner, path, field string) (string, error) {
return run("vault", vaultKVGetFieldArgs(path, field), nil)
}
func kvGetJSON(run cmdRunner, path string) (string, error) {
out, err := run("vault", vaultKVGetJSONArgs(path), nil)
if err != nil {
return "", err
}
return extractKVData(out)
}
func kvList(run cmdRunner, path string) ([]string, error) {
out, err := run("vault", vaultKVListArgs(path), nil)
if err != nil {
return nil, err
}
return parseKVList(out)
}
// kvPathExists reports whether the KV path already holds data, to pick create
// (`kv put`) vs merge (`kv patch -method=rw`) — so a write never clobbers
// sibling keys on an existing path.
func kvPathExists(run cmdRunner, path string) bool {
_, err := run("vault", vaultKVGetJSONArgs(path), nil)
return err == nil
}
// kvPut writes one key, creating the path when absent and merging when present.
// The value travels on stdin only (never argv).
func kvPut(run cmdRunner, runStdin cmdRunnerStdin, path, key, value string) error {
merge := kvPathExists(run, path)
_, err := runStdin("vault", vaultKVPutArgs(merge, path, key), nil, value)
return err
}
// --- handlers --------------------------------------------------------------
func vaultKVGet(args []string) error {
hardenProcess()
ensureVaultAddr() // own token, NOT the scoped one (see file header)
var path, field string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
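case a == "--field":
// A bare trailing --field must not silently select the whole-secret path.
if i+1 >= len(args) {
return fmt.Errorf("--field needs a value")
}
field = args[i+1]
i++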
case strings.HasPrefix(a, "--field="):
field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && path == "":
path = a
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv get <path> [--field <key>]")
}
if field != "" {
val, err := kvGetField(realRunner, path, field)
if err != nil {
return err
}
emitSecret(val) // TTY-aware: clipboard on a terminal, stdout when piped
return nil
}
// No --field → the whole secret. All values, so refuse a bare TTY (like
// `vault get --json`): pick a --field for the clipboard path, or pipe it.
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print all KV fields as JSON to a terminal; use --field <key>, or pipe it (e.g. | jq)")
}
out, err := kvGetJSON(realRunner, path)
if err != nil {
return err
}
fmt.Println(out)
return nil
}
func vaultKVList(args []string) error {
ensureVaultAddr()
var path string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
path = a
break
}
}
if path == "" {
return fmt.Errorf("usage: homelab vault kv list <path>")
}
keys, err := kvList(realRunner, path)
if err != nil {
return err
}
for _, k := range keys {
fmt.Println(k)
}
return nil
}
func vaultKVPut(args []string) error {
hardenProcess()
ensureVaultAddr()
var path, key string
for _, a := range args {
if strings.HasPrefix(a, "-") {
continue
}
switch {
case path == "":
path = a
case key == "":
key = a
}
}
if path == "" || key == "" {
return fmt.Errorf("usage: homelab vault kv put <path> <key> (value read from stdin)")
}
value, err := readSecretValue("Value for " + key + ": ")
if err != nil {
return err
}
if value == "" {
return fmt.Errorf("empty value; aborting (nothing written)")
}
if err := kvPut(realRunner, realRunnerStdin, path, key, value); err != nil {
return fmt.Errorf("writing %q to %s failed (does your token have write access? path correct?): %w", key, path, err)
}
fmt.Fprintln(os.Stderr, "wrote "+key+" to "+path)
return nil
}
// readSecretValue obtains a secret value WITHOUT putting it in argv: piped stdin
// is read verbatim (trailing newline trimmed, internal newlines preserved so
// multi-line values like PEM keys survive); an interactive TTY is prompted
// without echo.
func readSecretValue(prompt string) (string, error) {
fi, err := os.Stdin.Stat()
if err == nil && fi.Mode()&os.ModeCharDevice == 0 {
b, rerr := io.ReadAll(os.Stdin)
if rerr != nil {
return "", rerr
}
return strings.TrimRight(string(b), "\r\n"), nil
}
return promptNoEcho(prompt)
}
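
The envelope unwrapping described for `extractKVData` is easier to follow with a concrete input. The sketch below repeats the same decoding approach on a made-up KV-v2 response; the `request_id`, keys and values are illustrative only, and `innerData` is a stand-in name rather than the function in the file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// innerData pulls data.data out of a KV-v2 JSON envelope and returns it as
// raw JSON text, dropping the metadata and request wrapper.
func innerData(envelope string) (string, error) {
	var env struct {
		Data struct {
			Data json.RawMessage `json:"data"`
		} `json:"data"`
	}
	if err := json.Unmarshal([]byte(envelope), &env); err != nil {
		return "", fmt.Errorf("parse vault kv json: %w", err)
	}
	if len(env.Data.Data) == 0 {
		return "", fmt.Errorf("no secret data at that path")
	}
	return string(env.Data.Data), nil
}

func main() {
	// Illustrative envelope; real responses carry more wrapper fields.
	sample := `{"request_id":"example","data":{"data":{"user":"alice","token":"t0k"},"metadata":{"version":3}}}`
	out, err := innerData(sample)
	fmt.Println(out, err) // prints: {"user":"alice","token":"t0k"} <nil>
}
```

Using `json.RawMessage` means the inner object is passed through byte for byte, so key order and number formatting are not disturbed by a decode and re-encode round trip.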

1057
cli/cmd_vault_test.go Normal file

File diff suppressed because it is too large

164
cli/edges.go Normal file

@ -0,0 +1,164 @@
package main
import (
"fmt"
"regexp"
"strconv"
"strings"
)
// edgesOpts is the parsed filter set for `homelab edges` (the who-talks-to-whom
// investigation helper over the goldmane_edges trail; see ADR-0014).
type edgesOpts struct {
ns string // edges touching this namespace (either direction)
src string // edges where src_ns = this
dst string // edges where dst_ns = this
peersOf string // distinct peers of this namespace (both directions)
newSince string // first_seen >= duration (24h/7d/30m) or date (YYYY-MM-DD)
denied bool // action = 'deny' only
asJSON bool // wrap result as a JSON array
limit int // row cap (default 200)
}
// parseEdgesArgs parses the edges flag surface. Unknown flags error out so a
// typo surfaces instead of silently dumping the whole table.
func parseEdgesArgs(args []string) (edgesOpts, error) {
o := edgesOpts{limit: 200}
i := 0
for i < len(args) {
a := args[i]
key, inline, hasInline := a, "", false
if eq := strings.IndexByte(a, '='); eq >= 0 {
key, inline, hasInline = a[:eq], a[eq+1:], true
}
needVal := func() (string, error) {
if hasInline {
return inline, nil
}
if i+1 < len(args) {
i++
return args[i], nil
}
return "", fmt.Errorf("flag %s needs a value", key)
}
var err error
switch key {
case "--ns":
o.ns, err = needVal()
case "--src":
o.src, err = needVal()
case "--dst":
o.dst, err = needVal()
case "--peers-of":
o.peersOf, err = needVal()
case "--new-since":
o.newSince, err = needVal()
case "--denied":
o.denied = true
case "--json":
o.asJSON = true
case "--limit":
var v string
if v, err = needVal(); err == nil {
if o.limit, err = strconv.Atoi(v); err != nil {
err = fmt.Errorf("--limit must be an integer: %q", v)
}
}
default:
return o, fmt.Errorf("unknown flag: %s", a)
}
if err != nil {
return o, err
}
i++
}
return o, nil
}
// nsRE is the safe namespace-token charset (k8s names + "Global"). Used as the
// injection guard — anything else is rejected rather than quoted-and-hoped.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)
func validateNS(s string) error {
if s == "" || len(s) > 63 || !nsRE.MatchString(s) {
return fmt.Errorf("invalid namespace name: %q", s)
}
return nil
}
// sqlStr renders a SQL string literal (belt-and-suspenders on top of validateNS).
func sqlStr(s string) string { return "'" + strings.ReplaceAll(s, "'", "''") + "'" }
var (
durRE = regexp.MustCompile(`^(\d+)([smhd])$`)
dateRE = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}([ T]\d{2}:\d{2}(:\d{2})?)?$`)
)
// newSinceCond turns a duration (24h/7d/30m/90s) or a date (YYYY-MM-DD[ HH:MM])
// into a first_seen predicate.
func newSinceCond(v string) (string, error) {
if m := durRE.FindStringSubmatch(v); m != nil {
unit := map[string]string{"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}[m[2]]
return fmt.Sprintf("first_seen >= now() - interval '%s %s'", m[1], unit), nil
}
if dateRE.MatchString(v) {
return "first_seen >= " + sqlStr(v), nil
}
return "", fmt.Errorf("--new-since must be a duration (e.g. 24h, 7d, 30m) or a date (YYYY-MM-DD): %q", v)
}
// buildEdgesQuery renders the SQL for the given filters against the `edge` table.
func buildEdgesQuery(o edgesOpts) (string, error) {
limit := o.limit
if limit <= 0 {
limit = 200
}
// peers-of is a distinct-peer summary, a different shape from the row list.
if o.peersOf != "" {
if err := validateNS(o.peersOf); err != nil {
return "", err
}
p := sqlStr(o.peersOf)
return fmt.Sprintf("SELECT DISTINCT peer, action FROM ("+
"SELECT dst_ns AS peer, action FROM edge WHERE src_ns = %s "+
"UNION SELECT src_ns AS peer, action FROM edge WHERE dst_ns = %s"+
") t ORDER BY peer LIMIT %d", p, p, limit), nil
}
var conds []string
for _, f := range []struct{ val, tmpl string }{
{o.ns, "(src_ns = %[1]s OR dst_ns = %[1]s)"},
{o.src, "src_ns = %s"},
{o.dst, "dst_ns = %s"},
} {
if f.val == "" {
continue
}
if err := validateNS(f.val); err != nil {
return "", err
}
conds = append(conds, fmt.Sprintf(f.tmpl, sqlStr(f.val)))
}
if o.denied {
conds = append(conds, "action = 'deny'")
}
if o.newSince != "" {
c, err := newSinceCond(o.newSince)
if err != nil {
return "", err
}
conds = append(conds, c)
}
q := "SELECT src_ns, dst_ns, action, flow_count, first_seen, last_seen FROM edge"
if len(conds) > 0 {
q += " WHERE " + strings.Join(conds, " AND ")
}
q += fmt.Sprintf(" ORDER BY first_seen DESC LIMIT %d", limit)
if o.asJSON {
q = "SELECT coalesce(json_agg(row_to_json(t)), '[]') FROM (" + q + ") t"
}
return q, nil
}
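
The validate-then-quote approach behind `validateNS` and `sqlStr` can be seen end to end in a small sketch. `nsCond` is a hypothetical helper written for this illustration (the file builds the predicate inline in `buildEdgesQuery`); the regular expression and the 63-character cap are copied from the diff.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nsRE follows the namespace charset guard shown in the diff.
var nsRE = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]*$`)

// nsCond is a HYPOTHETICAL helper: it rejects anything outside the safe
// charset first, then renders the either-direction predicate with the value
// quoted as a SQL string literal.
func nsCond(ns string) (string, error) {
	if ns == "" || len(ns) > 63 || !nsRE.MatchString(ns) {
		return "", fmt.Errorf("invalid namespace name: %q", ns)
	}
	lit := "'" + strings.ReplaceAll(ns, "'", "''") + "'"
	// %[1]s reuses the first argument for both columns.
	return fmt.Sprintf("(src_ns = %[1]s OR dst_ns = %[1]s)", lit), nil
}

func main() {
	for _, ns := range []string{"immich", "a'; DROP TABLE edge;--"} {
		cond, err := nsCond(ns)
		fmt.Printf("%q -> %q, err=%v\n", ns, cond, err)
	}
}
```

The first input yields `(src_ns = 'immich' OR dst_ns = 'immich')`; the second never reaches the quoting step because validation rejects it, which is the point of checking the charset before building any SQL text.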

163
cli/edges_test.go Normal file

@ -0,0 +1,163 @@
package main
import (
"strings"
"testing"
)
func TestParseEdgesArgs(t *testing.T) {
cases := []struct {
name string
args []string
want edgesOpts
}{
{"defaults", nil, edgesOpts{limit: 200}},
{"ns", []string{"--ns", "immich"}, edgesOpts{ns: "immich", limit: 200}},
{"ns equals", []string{"--ns=immich"}, edgesOpts{ns: "immich", limit: 200}},
{"src dst", []string{"--src", "a", "--dst", "b"}, edgesOpts{src: "a", dst: "b", limit: 200}},
{"peers-of", []string{"--peers-of", "authentik"}, edgesOpts{peersOf: "authentik", limit: 200}},
{"denied json", []string{"--denied", "--json"}, edgesOpts{denied: true, asJSON: true, limit: 200}},
{"new-since", []string{"--new-since", "24h"}, edgesOpts{newSince: "24h", limit: 200}},
{"limit", []string{"--limit", "50"}, edgesOpts{limit: 50}},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got, err := parseEdgesArgs(c.args)
if err != nil {
t.Fatalf("parseEdgesArgs(%v) error: %v", c.args, err)
}
if got != c.want {
t.Fatalf("parseEdgesArgs(%v) = %+v, want %+v", c.args, got, c.want)
}
})
}
}
func TestParseEdgesArgsErrors(t *testing.T) {
for _, args := range [][]string{
{"--limit", "abc"},
{"--bogus"},
} {
if _, err := parseEdgesArgs(args); err == nil {
t.Errorf("parseEdgesArgs(%v) expected error, got nil", args)
}
}
}
func TestBuildEdgesQueryDefaults(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{limit: 200})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"FROM edge", "ORDER BY first_seen DESC", "LIMIT 200"} {
if !strings.Contains(q, want) {
t.Errorf("query %q missing %q", q, want)
}
}
if strings.Contains(q, "WHERE") {
t.Errorf("no-filter query should have no WHERE: %q", q)
}
}
func TestBuildEdgesQueryFilters(t *testing.T) {
cases := []struct {
name string
o edgesOpts
want string
}{
{"ns both directions", edgesOpts{ns: "immich", limit: 10}, "(src_ns = 'immich' OR dst_ns = 'immich')"},
{"src only", edgesOpts{src: "authentik", limit: 10}, "src_ns = 'authentik'"},
{"dst only", edgesOpts{dst: "dbaas", limit: 10}, "dst_ns = 'dbaas'"},
{"denied", edgesOpts{denied: true, limit: 10}, "action = 'deny'"},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
q, err := buildEdgesQuery(c.o)
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "WHERE") || !strings.Contains(q, c.want) {
t.Errorf("query %q missing WHERE/%q", q, c.want)
}
})
}
}
func TestBuildEdgesQueryCombinedFiltersAnded(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{src: "a", denied: true, limit: 5})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "src_ns = 'a' AND action = 'deny'") {
t.Errorf("combined filters not AND'd: %q", q)
}
}
func TestBuildEdgesQueryPeersOf(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{peersOf: "authentik", limit: 100})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{"DISTINCT", "src_ns = 'authentik'", "dst_ns = 'authentik'", "UNION"} {
if !strings.Contains(q, want) {
t.Errorf("peers-of query %q missing %q", q, want)
}
}
}
func TestBuildEdgesQueryJSON(t *testing.T) {
q, err := buildEdgesQuery(edgesOpts{asJSON: true, limit: 200})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(q, "json_agg") || !strings.Contains(q, "row_to_json") {
t.Errorf("json query missing json_agg wrapper: %q", q)
}
}
func TestBuildEdgesQueryRejectsInjection(t *testing.T) {
for _, bad := range []string{"a'; DROP TABLE edge;--", "a b", "a;b", "a\"b"} {
if _, err := buildEdgesQuery(edgesOpts{ns: bad, limit: 10}); err == nil {
t.Errorf("buildEdgesQuery(ns=%q) expected validation error, got nil", bad)
}
}
}
func TestNewSinceCond(t *testing.T) {
cases := []struct {
in string
want string
}{
{"24h", "first_seen >= now() - interval '24 hours'"},
{"7d", "first_seen >= now() - interval '7 days'"},
{"30m", "first_seen >= now() - interval '30 minutes'"},
{"2026-06-28", "first_seen >= '2026-06-28'"},
}
for _, c := range cases {
got, err := newSinceCond(c.in)
if err != nil {
t.Fatalf("newSinceCond(%q) error: %v", c.in, err)
}
if got != c.want {
t.Errorf("newSinceCond(%q) = %q, want %q", c.in, got, c.want)
}
}
for _, bad := range []string{"yesterday", "1y", "'; DROP", ""} {
if _, err := newSinceCond(bad); err == nil {
t.Errorf("newSinceCond(%q) expected error, got nil", bad)
}
}
}
func TestValidateNS(t *testing.T) {
for _, ok := range []string{"immich", "calico-system", "kube-system", "Global", "pg-cluster-rw"} {
if err := validateNS(ok); err != nil {
t.Errorf("validateNS(%q) unexpected error: %v", ok, err)
}
}
for _, bad := range []string{"", "a b", "a'b", "a;b", "../x", "a$b"} {
if err := validateNS(bad); err == nil {
t.Errorf("validateNS(%q) expected error, got nil", bad)
}
}
}


@ -20,8 +20,11 @@ func buildRegistry() []Command {
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, edgesCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}


@ -5,8 +5,31 @@ import (
"os"
"strings"
"testing"
"unicode/utf8"
)
func TestTruncatePreviewKeepsValidUTF8(t *testing.T) {
// Byte-slicing a long Cyrillic string at a fixed offset (240) can land inside a
// 2-byte rune and emit invalid UTF-8, the bug that crashed the recall hook.
// truncatePreview must cut on a rune boundary and always stay valid UTF-8.
long := strings.Repeat("я", 300) // 300 runes / 600 bytes
got := truncatePreview(long, 240)
if !utf8.ValidString(got) {
t.Fatalf("truncatePreview produced invalid UTF-8: %q", got)
}
if r := []rune(got); len(r) != 241 || string(r[:240]) != strings.Repeat("я", 240) || r[240] != '…' {
t.Fatalf("truncatePreview = %d runes, want 240 Cyrillic + ellipsis", len(r))
}
// Short multibyte strings pass through untouched (no ellipsis).
if got := truncatePreview("кратко", 240); got != "кратко" {
t.Fatalf("short string altered: %q", got)
}
// ASCII boundary still works.
if got := truncatePreview(strings.Repeat("a", 500), 240); got != strings.Repeat("a", 240)+"…" {
t.Fatalf("ascii truncation wrong: %q", got)
}
}
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()


@ -13,7 +13,7 @@ The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. `Plotting-Your-Dream-Book` (owned by Anca, dev in her org) keeps its GHA build in-place and pushes the image to **its own org's ghcr** (`ghcr.io/passionprojectsanca/book-plotter`, private) via the workflow's built-in `GITHUB_TOKEN` — no Forgejo mirror, no `viktorbarzin`-namespace push, no shared PAT in her repo (2026-06-27, migrated off DockerHub).
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."


@ -5,6 +5,14 @@ exists to answer the question that drove the whole CLI — *which verbs are wort
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
> **Update (2026-06-26) — the cross-user privacy *norm* below is superseded by
> [ADR-0015](0015-os-is-the-authorization-boundary.md).** The prohibition this
> ADR leaned on ("reading another user's `~/.claude` is off-limits even for an
> owner in-session") no longer holds: the managed-settings policy now **defers
> to OS/sudo authorization**. The `usage top` telemetry design itself is
> unchanged and still current — only the "never read homes" framing in the
> third decision below is overtaken.
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows


@ -19,12 +19,20 @@ gap for every user in every directory.
*resolution* and host *SSH*, neither of which an API-only MCP can provide. The
value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
k8s Secret `openclaw/openclaw-secrets`, field `skill_secrets` (a base64 JSON
blob of several tokens), and prints the per-instance key
(`home_assistant_sofia_token` / `home_assistant_token`) via the ambient
kubeconfig. This is robust to env drift — the precise failure that made agents
re-derive the pipeline. Read-tier, prints the bare token to stdout so it
composes in `$(…)`, mirroring `memory secret`.
the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
`london`) via the ambient kubeconfig. This is robust to env drift — the precise
failure that made agents re-derive the pipeline. Read-tier, prints the bare
token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
It was originally read from `openclaw-secrets`, field `skill_secrets` (a JSON blob
also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
admins can read — so the verb hung/failed for the non-admin operator it was
built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
— this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
`UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no


@ -0,0 +1,75 @@
# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
capability that already existed but was undiscoverable: driving the cluster's
**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
`svc/chrome-service:9222`) from the devvm, for sites that detect and block
headless automation.
## Motivating incident (2026-06-22)
Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
portal: the headless `@playwright/mcp` browser loaded the site and filled the
entire multi-step form, but the **final submit silently failed** — Fixflo's
pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
spinner hung, no issue was created. Root cause = headless-Chrome detection. The
fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
submitted first try (Fixflo ref IS22657587). That capability was documented
(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
it took ~40 min, three redundant full form re-runs, and a user hint. The agent
also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
of inspecting the network panel.
## Decisions
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
rejected: the CLI is run every session (so the verb is *discoverable*), is
versioned, multi-user, and test-covered. A private, untested skill is none of
those. The command owns only the deterministic *mechanics* (port-forward,
stealth injection, lifecycle) — the agent supplies the Playwright script, so
*judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
payload in `browser --help`: the *when-to-use* signature (a site loads but a
gated action fails/hangs, or one request 500s/aborts while siblings 200 →
suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
= request resolved/intercepted by the automation layer, **not** egress;
egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
and would break the page load too). A command the agent doesn't think to run is
useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
label. Readiness is asserted against `/json/version`: the endpoint must report
a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
**always** torn down (process-group kill + signal handler), on success and on
error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
profile.** chrome-service is a single shared browser with a persistent profile.
A fresh, always-closed context is safe for concurrent callers (tripit's fare
scrape connects per-quote) and is what production already does. The warmed
persistent profile (cookies from a manual noVNC login) is opt-in for flows that
need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
changes between Playwright minors — the devvm's ambient Python Playwright was
1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
regardless of local drift. `playwright-core` (not `playwright`) because no
browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
guarded) on first use, alongside the embedded runner + stealth files. node is
already fleet-wide; this avoids coupling the feature to a provisioner change
and keeps it self-contained and self-healing. The client runs on the devvm, so
`setInputFiles` streams local files to the remote browser over CDP — no
`chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
`go:embed` can't reach outside the package dir, hence the vendored copy rather
than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
via `usage top` (ADR-0011) before adding more.
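
The readiness assertion against `/json/version` could look roughly like the sketch below. It is an illustration only, not the CLI's actual code: the function name `assert_real_chrome` is hypothetical, and it assumes the endpoint returns the usual DevTools JSON object whose `Browser` field reads `Chrome/<version>` for real Chrome and `HeadlessChrome/<version>` for a headless build.

```python
import json


def assert_real_chrome(version_payload: str) -> str:
    """Return the Browser string if it names real Chrome, else raise.

    `version_payload` is the raw JSON body of GET /json/version.
    (Hypothetical helper; the name is not taken from the CLI.)
    """
    info = json.loads(version_payload)
    browser = info.get("Browser", "")
    # Headless builds report "HeadlessChrome/<version>", so a strict
    # prefix check on "Chrome/" rejects them as well as empty replies.
    if not browser.startswith("Chrome/"):
        raise RuntimeError(f"not a real Chrome endpoint: {browser!r}")
    return browser
```

Failing closed (raising rather than warning) matches the intent stated above: connecting to a headless browser would defeat the purpose of the escalation path.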

@ -0,0 +1,35 @@
---
status: accepted
date: 2026-06-24
---
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
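
The identity rule in the decision can be sketched as a small resolver. This is a minimal illustration under stated assumptions: the function name `service_identity` and the `namespace/label` output format are hypothetical, and the namespace set is copied from the decision text rather than from any live configuration.

```python
# Namespaces the decision names as genuinely multi-Service.
MULTI_SERVICE_NAMESPACES = {"monitoring", "kube-system", "dbaas"}


def service_identity(namespace: str, labels: dict) -> str:
    """Resolve a workload's identity: namespace first, label as refinement.

    The "namespace/label" return format is an assumption for illustration.
    """
    if namespace in MULTI_SERVICE_NAMESPACES:
        refined = labels.get("service-identity")
        if refined:
            return f"{namespace}/{refined}"
    # Single-Service namespaces (and unlabelled workloads) fall back to
    # the namespace itself, which costs nothing to maintain.
    return namespace
```

For example, `service_identity("monitoring", {"service-identity": "loki"})` yields `monitoring/loki`, while a workload in a single-Service namespace resolves to the bare namespace name.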
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57–#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30). It is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (all Slack consolidated to `#alerts`; the `#security` channel was abandoned 2026-06-25 because the shared webhook's Slack app isn't a member of it; see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity); re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.
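
The edge-set upsert described above can be sketched in memory as follows. This is not the aggregator's code: the `Edge` class and `upsert_flow` function are hypothetical, a plain dict stands in for the `goldmane_edges` table, and keying on `(src_ns, dst_ns, action)` is an assumption about what makes an edge unique.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    first_seen: float
    last_seen: float
    flow_count: int


def upsert_flow(edges: dict, src_ns: str, dst_ns: str, action: str, ts: float) -> bool:
    """Fold one flow into the edge set; return False if the flow is dropped."""
    # Self-edges and flows with an empty namespace are dropped.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    key = (src_ns, dst_ns, action)  # assumed uniqueness key
    edge = edges.get(key)
    if edge is None:
        edges[key] = Edge(first_seen=ts, last_seen=ts, flow_count=1)
    else:
        # Flows may arrive out of order, so widen the window both ways.
        edge.first_seen = min(edge.first_seen, ts)
        edge.last_seen = max(edge.last_seen, ts)
        edge.flow_count += 1
    return True
```

Because only one row exists per namespace pair and action, storage grows with the number of distinct edges, not with flow volume.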

@ -0,0 +1,57 @@
# OS is the authorization boundary: agents defer to Unix/sudo, not a stricter in-policy rule
Supersedes the cross-user privacy *norm* that the devvm managed-settings policy
carried and that ADR-0011 leaned on ("never read another user's home /
`~/.claude`, off-limits even for an owner in-session"). ADR-0011's actual
subject — `usage top` telemetry and its emit design — is unchanged and still
current; only the privacy prohibition it referenced is superseded here.
## Context
The devvm managed-settings policy (`/etc/claude-code/managed-settings.json`,
`claudeMd`) carried two rules that were, in practice, *stricter than the OS*:
"you are not the admin, do not escalate privileges" and "never read another
user's home directory, credentials, tokens, or `~/.claude`." The OS told a
different story: `wizard` holds `(ALL) NOPASSWD: ALL` — full passwordless root.
The kernel had already granted total read access; the policy was layering an
artificial refusal on top of an authorization the OS already permits, and the
"not the admin" framing was factually wrong for a NOPASSWD-root user.
Two honest ways to resolve the inconsistency: tighten sudo to match the policy,
or loosen the policy to match the OS. The owner chose the latter on 2026-06-26,
for analytics/debugging across the shared box.
## Decision
- **Authorization follows the OS, not this policy.** Agents may access whatever
their OS user can access — directly or via `sudo` where they hold sudo rights
— and must not impose restrictions stricter than the OS. On this box that
includes other users' home directories and `~/.claude` for users who hold
broad sudo.
- **No separate prompt or carve-out** for OS-authorized access. The Unix
permission model + sudoers is the single source of truth for who may read
what. Other homes are mode `0750`, so a cross-home read necessarily transits
`sudo` and is therefore captured in the sudo/auth audit log.
- **Cluster/infra RBAC tiering is unchanged.** kubectl / Vault / infra access
stays scoped to each user's RBAC tier; "defer to the OS" is about OS-level
file access, not a licence to exceed cluster RBAC.
- **Scope is symmetric and multi-user.** The rule lives in the *shared*
managed-settings, so every user's agents defer to that user's own sudo grant.
Any user with broad sudo gets the same cross-home read capability over other
users' files. Accepted by the owner with that understanding; emo's and
ancamilea's `~/.claude` is now agent-readable by sudo-holders.
- **Takes effect in a fresh session.** managed-settings loads at session start;
the session that made the change keeps running under the old policy.
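
Why a `0750` home forces the `sudo` path can be read off the mode bits. The sketch below is illustrative only: `others_can_enter` is a hypothetical helper, and it assumes the reading user is neither the directory's owner nor a member of its group.

```python
import stat


def others_can_enter(mode: int) -> bool:
    """True if users outside the owner and group may traverse the directory."""
    # Entering a directory needs the execute (search) bit for "other".
    return bool(stat.S_IMODE(mode) & stat.S_IXOTH)
```

With mode `0o750` the "other" bits are all zero, so the function returns `False` and an unprivileged cross-home read fails before any file inside is consulted.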
## Consequences
- The privacy-preserving telemetry rationale in ADR-0011 (`usage top` as the
"cross-user analytics without reading homes" answer) remains useful but is no
longer the *only* sanctioned path; direct reads via `sudo` are now permitted.
- Larger blast radius: if an agent session running as a sudo-holder is
prompt-injected or otherwise compromised, it can now read every user's secrets
with no in-agent friction (sudo here is passwordless). The sudo/auth audit log
is the remaining accountability control.
- Reversible: restore the prior `claudeMd` bullets (backup kept at
`/etc/claude-code/managed-settings.json.bak-2026-06-26`) and start a fresh
session.

@ -86,10 +86,56 @@ Signin latency is dominated by screen count and round trips, not server time
use the explicit-consent flow (it re-prompted every 4 weeks per app).
- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values
are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache,
15m policy cache, 60s persistent DB connections.
15m policy cache, gunicorn `max_requests=10000`/jitter=1000 (recycle
hardening — decorrelates the 9 workers' recycles from PG blips). **No
`CONN_MAX_AGE`** — persistent Django connections pin a PgBouncer server conn
1:1 and saturate the session-mode pool (reverted 2026-06-10).
- **Static assets cached immutable**: `/static` ingress carve-out adds
`Cache-Control: public, max-age=31536000, immutable` (assets are
version-fingerprinted; authentik itself sends no max-age).
- **Rate-limit carve-out** (2026-06-28): `/` and `/static` use a dedicated
`authentik-rate-limit` (100/1000) instead of the shared 10/50 default — the
login SPA cold-loads ~70 flow-executor chunks from `/static`; the default
burst 429'd the tail and a failed ES-module import left a blank login screen.
- **Readiness tolerance** (2026-06-28): server `readinessProbe.failureThreshold:8`
(~80s, was the chart-default ~30s). The probe (`/-/health/ready/`) queries the
DB; too-tight tolerance let a sub-60s PG/pgbouncer transient return 503 on all
3 server pods at once → Traefik had no healthy backend → 502/503/504 (episodic
blank login + 30s hangs). 80s absorbs a full CNPG failover reconnect. Sessions
+ cache are PostgreSQL-only since Redis was removed in 2026.2 (no external-cache
option), so request-serving is coupled to PG — this survives a short transient,
not a total CNPG outage.
- **Rolling-update strategy** (2026-06-28): the chart key is `deploymentStrategy`
(the repo's old `strategy:` key was silently inert → live ran the chart-default
25%/25% and dropped a server pod out of rotation on every roll). Now
`maxSurge:1/maxUnavailable:0` keeps all 3 ready throughout a roll.
- **Old-browser login (SFE)** (2026-06-28): authentik's modern flow SPA is ES2022
and renders a **blank login** on Safari/WebKit ≤16.3 (every iOS browser shares
the system WebKit, so it's not browser-choice — e.g. iPadOS ≤15). The overlay
image patches `flows/views/interface.py::compat_needs_sfe()` to also serve
authentik's built-in no-JS **Simplified Flow Executor** (SFE, ES5) to old Safari
**and any iOS browser** (Chrome/Firefox on iOS are WebKit skins) on iOS ≤16.3,
so those clients get the *real* authentik login (password + MFA + reputation —
no auth downgrade). The SFE can't render Identification-stage **sources**
(authentik limitation), so the patch also injects static social-login `<a>`
links into `flow-sfe.html` (→ `/source/oauth/login/<slug>/`, plain redirects) —
required for password-less accounts (e.g. Google-only users). A Traefik
basic-auth fallback was rejected: it would have put a single spoofable-UA
password in front of `vbarzin→wizard` (passwordless root on the devvm). See
`stacks/authentik/patch-compat-sfe.py`.
- **SFE + forced-WebAuthn MFA gotcha** (2026-06-28): the `default-authentication-flow`
MFA stage (`not_configured_action=configure`, `conf_stages=[webauthn]`) force-enrols
a WebAuthn passkey for any **password**-path user with no MFA device — but the SFE
**cannot render WebAuthn** (enrol *or* validate), so that user gets
`unsupported state: ak-stage-authenticator-webauthn`. Two escape hatches, **no MFA
downgrade**: (1) **social login** — sources run `default-source-authentication`
(UserLoginStage only, **no MFA stage**), so the SFE's "Continue with <provider>"
button always completes; (2) **enrol TOTP** — the SFE *can* validate TOTP codes, and
≥1 confirmed device flips the stage from force-enrol to validate. User MFA devices are
runtime data (not Terraform): enrol via `ak shell`
(`TOTPDevice.objects.create(user=…, confirmed=True)`) and store the secret in the
user's own Vaultwarden item. (Done for emo — the Google-only iPadOS-15 case: TOTP in
his `authentik.viktorbarzin.me` Bitwarden item; e2e-verified the BW code is accepted.)
- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`).
- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request
TCP setup on the forward-auth subrequest path.
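
The rate-limit failure mode can be reproduced with a small token-bucket model. This is a sketch under assumptions, not the ingress implementation: it treats the `10/50` and `100/1000` pairs as average-per-second and burst, and it models the SPA cold load as about 70 requests arriving effectively at once. The function name `rejected_in_burst` is hypothetical.

```python
def rejected_in_burst(n_requests: int, average_per_s: float, burst: int,
                      spacing_s: float = 0.0) -> int:
    """Count requests a token bucket rejects for a train of n_requests.

    The bucket starts full at `burst` tokens and refills at
    `average_per_s`; requests arrive `spacing_s` seconds apart.
    """
    tokens = float(burst)
    rejected = 0
    for i in range(n_requests):
        if i > 0:
            tokens = min(float(burst), tokens + average_per_s * spacing_s)
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1  # this request would receive a 429
    return rejected
```

Under these assumptions, `rejected_in_burst(70, 10, 50)` rejects the last 20 requests of the load, while `rejected_in_burst(70, 100, 1000)` rejects none, which is consistent with the tail of the chunk load being the part that failed.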

@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
- `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled`: `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed`: `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric), which is invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured).
- `K8sUpgradeChainJobFailed`: `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric), which is invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)

@ -112,17 +112,32 @@ External caller (dev box):
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
## Browser binary — real Google Chrome (for proprietary codecs)
The chrome-service container runs **real Google Chrome**, not the bundled
Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
the lib stripped) and Chrome-for-Testing is also codec-less — only
`google-chrome-stable` carries them.
## Image pin
Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
`stacks/chrome-service/main.tf`) and the Python client
(`playwright==1.48.0` in callers' `requirements.txt`) **must match
minor-versions**. Bump in lockstep — Playwright protocol changes between
minors and the client cannot connect to a mismatched server.
The harvester + snapshot-server sidecar use
`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
minor, with Python-side bindings pre-installed.
The Playwright base + the Python client (`playwright==1.48.0` in callers'
`requirements.txt`) and the snapshot sidecars
(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
minor-versions. The chrome-service browser is now real Google Chrome (a newer
milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
version-tolerant — verified working against this Chrome. If a future Chrome
milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
## Storage
@ -167,7 +182,66 @@ minor, with Python-side bindings pre-installed.
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated.
Authentik-gated. The bare host serves `vnc.html` (image symlinks
`index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
to skip the Connect button. The view is **black when no browser window is
open** (idle) — that is normal, not a failed connection. Chrome is launched
with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
(no window manager runs, so without it Chrome opens at its profile-persisted
size and the rest of the framebuffer shows as a black cut-off).
### noVNC fd-sweep gotcha (stuck "Connecting")
If the noVNC client hangs on **"Connecting" and eventually times out**, the cause
is almost always x11vnc's fd-table sweep: containerd grants pods
`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
every client connection, so the RFB handshake never completes (websockify
accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"`
healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
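
The handshake timing check can also be written as a small reusable probe instead of a one-liner. This is a sketch: `probe_rfb_banner` is a hypothetical helper, and the 2-second timeout is an arbitrary choice for illustration.

```python
import socket
import time


def probe_rfb_banner(host: str = "127.0.0.1", port: int = 5900,
                     timeout_s: float = 2.0):
    """Connect, read the 12-byte RFB banner, and time the exchange.

    Returns (banner_bytes, seconds_elapsed). Raises ConnectionRefusedError
    if nothing listens, or a timeout error if the server accepts the
    connection but never sends its banner (the fd-sweep symptom).
    """
    start = time.monotonic()
    # The timeout passed here also applies to the recv() below.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        banner = sock.recv(12)
    return banner, time.monotonic() - start
```

A timeout from this probe points at the fd-sweep problem, whereas an immediate refusal means x11vnc is not listening at all.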
### noVNC black after a browser-container restart (x11vnc supervision)
A **distinct** failure from the fd-sweep gotcha above: the noVNC client *connects*
but the view is **black**, and the novnc container logs spew
`connecting to: localhost:5900` → `Failed to connect ... [Errno 111] Connection
refused` (x11vnc is **down**, not slow). Cause: `x11vnc` and `websockify` both run
in the **novnc** container, but x11vnc attaches to the **chrome-service** (browser)
container's Xvfb over `localhost:6099` (shared pod network). When the browser
container restarts — Chrome exits cleanly (exit 0, "Completed") or crashes — its
Xvfb vanishes and x11vnc loses its X connection and exits.
`entrypoint.sh` **supervises** x11vnc: it launches x11vnc and websockify as
background children and `wait -n`s on them, exiting non-zero if **either** dies, so
the kubelet restarts the novnc container, which re-waits for Xvfb on `:6099` and
relaunches x11vnc — the bridge **self-heals** across browser-container restarts.
(Before 2026-06-27, x11vnc was an unsupervised background child of an `exec`ed
websockify; a dead x11vnc was never relaunched, leaving `:5900` dead — a
`<defunct>` zombie — and the view black until a manual pod restart. Same
supervision pattern as the android-emulator stack's entrypoint.)
**Diagnose:** `kubectl exec -n chrome-service deploy/chrome-service -c novnc -- ps aux | grep x11vnc`
(a `<defunct>`/Z
entry = the bug); or the RFB-banner probe from a sibling container (`python3 -c
"import socket;s=socket.socket();s.settimeout(2);s.connect(('127.0.0.1',5900));print(s.recv(12))"`
— healthy returns `b'RFB 003.008\n'`, broken = `ConnectionRefused`). **Immediate
recovery** (no image change): restart just the novnc container with `kubectl exec
-n chrome-service deploy/chrome-service -c novnc -- kill 1` — re-runs its entrypoint
and relaunches x11vnc **without** touching the browser session/in-flight CDP jobs.
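
The "exit if either child dies" supervision pattern is sketched below in Python for illustration. The real entrypoint is a shell script using `wait -n`; the `supervise` function here is a hypothetical stand-in that polls instead, and its always-non-zero return value mirrors the behaviour described above.

```python
import subprocess
import time


def supervise(commands, poll_s: float = 0.1) -> int:
    """Run all commands; when any one exits, stop the rest and return 1.

    A non-zero return lets the caller exit non-zero so the container
    runtime restarts the whole group together.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]
    try:
        # Keep looping only while every child is still alive.
        while all(child.poll() is None for child in children):
            time.sleep(poll_s)
    finally:
        for child in children:
            if child.poll() is None:
                child.terminate()
                child.wait()
    return 1
```

Restarting the pair as a unit is what makes the bridge recover: a lone surviving websockify with a dead x11vnc behind it is exactly the black-screen state.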
> **Deploying a rebuilt novnc entrypoint:** Keel is **off** for this deployment
> (`keel.sh/policy=never`, because the browser container's playwright image is
> version-pinned to f1-stream) and the image is `:latest`/`IfNotPresent`, so a
> rebuilt `:latest` will **not** redeploy on its own. After the
> `build-chrome-service-novnc.yml` GHA build pushes `:latest` + `:<sha>`,
> **SHA-pin** the novnc `image` in `main.tf` to the new `:<sha>` to force the pull
> and rollout (the novnc image is TF-managed — not in the deployment's
> `lifecycle.ignore_changes`).
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -180,6 +254,81 @@ minor, with Python-side bindings pre-installed.
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Driving from OUTSIDE the cluster (`homelab browser`)
Agents on the devvm reach this browser through the **`homelab browser`** CLI
(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
`connect_over_cdp` recipe. It is the **escalation path, not the default**:
agents default to the Playwright MCP / headless browser for all routine
automation, and reach for `homelab browser` ONLY when headless is blocked — a
site loads but a gated action (submit/login) silently fails or hangs, the
signature of headless / anti-bot detection. (Same tiered rule lives in
`~/code/CLAUDE.md` and `homelab browser --help`.)
```text
devvm: homelab browser run flow.js
│ kubectl port-forward svc/chrome-service :9222 (random local port)
http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP)
│ assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
│ node + playwright-core@1.48.2 → connectOverCDP
│ context.addInitScript(stealth.js) ← same vendored file as in-cluster
│ run the user's Playwright script with page/context/browser in scope
└─ port-forward always torn down (success or error)
```
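
The "always torn down" guarantee in the diagram can be sketched as a context manager. This is a POSIX-only illustration, not the CLI's code: `process_group` is a hypothetical name, and it covers only the process-group kill on normal exit or exception, not the separate signal handler.

```python
import contextlib
import os
import signal
import subprocess


@contextlib.contextmanager
def process_group(cmd):
    """Start cmd in its own process group; always kill the group on exit."""
    # start_new_session=True makes the child a session and group leader,
    # so its process-group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        yield proc
    finally:
        # Runs on success and on error; the group may already be gone.
        with contextlib.suppress(ProcessLookupError):
            os.killpg(proc.pid, signal.SIGTERM)
        proc.wait()
```

Wrapping the forward as `with process_group(["kubectl", "port-forward", "-n", "chrome-service", "svc/chrome-service", ":9222"]):` would tear the tunnel down even if the script inside the block raises; the exact command line the CLI uses is an assumption here.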
Key facts:
- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
label — unlike in-cluster callers.
- **Client pinned to the image minor.** The node client is
`playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
server image bumps (same rule as the in-cluster Python clients — see "Image
pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
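
The drift guard amounts to a byte-for-byte comparison of the two files. The sketch below shows the idea in Python; the actual guard is described as a unit test in the CLI, and `files_identical` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def files_identical(source: Path, vendored: Path) -> bool:
    """Compare two files byte-for-byte via their SHA-256 digests."""
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    return digest(source) == digest(vendored)
```

A check like `files_identical(Path("stacks/chrome-service/files/stealth.js"), Path("cli/browser_stealth.js"))` would fail as soon as either copy changes without the other.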
## Multi-user access (sharing the browser)
There is ONE chrome-service browser with ONE persistent profile, warmed with
**Viktor's** logged-in sessions. CDP has no per-context auth, so anyone who can
drive the browser — over the noVNC view OR the CDP/`homelab browser` path — can
reach the persistent profile (`browser.contexts[0]`) and therefore Viktor's
sessions. Access is gated accordingly, per user.
**Decision (2026-06-28):** emo (`emil.barzin` / `emil.barzin@gmail.com`) SHARES
Viktor's browser for form-filling + captcha solving, rather than getting an
isolated instance. The session-exposure trade-off above was explicitly accepted.
Two independent grants make up "browser access" for a user:
1. **noVNC (interactive view, `chrome.viktorbarzin.me`)** — gated by the Authentik
`admin-services-restriction` policy: the `CHROME_ALLOWED` set
(`stacks/authentik/admin-services-restriction.tf`) matches the user's Authentik
username OR email. Add the user there. No kubeconfig/RBAC needed.
2. **CLI (`homelab browser`, CDP over port-forward)** — needs `pods/portforward`
in `chrome-service` PLUS a non-interactive credential (a normal devvm user's
kubeconfig is interactive-OIDC-only and can't authenticate a headless agent
session). Provided by a per-user **ServiceAccount** with a long-lived token
(`stacks/chrome-service/rbac.tf`, e.g. `emo-browser`): `pods/portforward` in
this namespace + cluster read-only (`oidc-power-user-readonly`, so it can also
resolve the Service and doesn't regress the user's normal read). The devvm
provisioner (`scripts/t3-provision-users.sh` → `install_browser_kubeconfig`)
reads that token and installs it as the user's DEFAULT kubeconfig context
(`<user>-browser@homelab`), keeping their personal OIDC login as the
`oidc@homelab` named context. The SA's existence is the source of truth for who
gets the CLI — the provisioner no-ops for users without a `<user>-browser` SA.
**To grant another user:** add them to `CHROME_ALLOWED` (noVNC) and/or add a
`<user>-browser` SA + bindings mirroring `emo-browser` in `rbac.tf` (CLI), then run
the provisioner. To revoke: remove from `CHROME_ALLOWED` and delete the SA (rotate
a token by deleting its `<user>-browser-token` Secret).
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM

@ -115,8 +115,66 @@ claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search, council-complaints) now also land on ghcr.
k8s-portal, apple-health-data, audiblez-web, insta2spotify,
audiobook-search) now also land on ghcr.
**plotting-book** is a special case (a GitHub-first repo owned by Anca,
ADR-0003): the build runs in *her* GitHub repo
(`PassionProjectsAnca/Plotting-Your-Dream-Book`) and pushes to **private
`ghcr.io/passionprojectsanca/book-plotter`** — under her org's ghcr namespace,
not `viktorbarzin`, using the workflow's built-in `GITHUB_TOKEN` (no shared
PAT). The cluster pulls it via the Kyverno-synced `ghcr-credentials` secret (the
`plotting-book` namespace is on the allowlist; the shared `ghcr_pull_token` has
read access). Migrated off public DockerHub (`viktorbarzin/book-plotter`) on
2026-06-27. The Woodpecker deploy hook (repo 43, registered to Anca's repo) is
unchanged. Flow:
```text
DEVELOP ───────────────────────────────────────────────────────────────────────
Anca (Codex / t3 web agent)
│ git push → main
┌──────────────────────────────────────────────────────────────┐
│ GitHub: PassionProjectsAnca/Plotting-Your-Dream-Book (private)│ ← canonical
│ .github/workflows/build-and-deploy.yml on: push → main │
└───────────────────────────┬──────────────────────────────────┘
│ GitHub Actions runner (off-infra build · ADR-0002)
┌────────────────────┴─────────────────────────────────┐
▼ ▼
┌─────────────────────────────────────────────┐ ╔═══════════════════════════════════════╗
│ build job │ push ║ GHCR · PRIVATE package ║
│ • svu next --always → tag vX.Y.Z (→ repo) │═════▶║ ghcr.io/passionprojectsanca/ ║
│ • buildx linux/amd64, provenance:false │ tags ║ book-plotter :vX.Y.Z :latest ║
│ • login ghcr (GITHUB_TOKEN, packages:write)│ ╚═══════════════════╤═══════════════════╝
│ • delete-package-versions (keep newest 10) │ │
└───────────────────────┬─────────────────────┘ │ pull (private,
▼ deploy job [gate: repo var DEPLOY_ENABLED ≠ "false"] via secret)
POST ci.viktorbarzin.me/api/repos/43/pipelines {IMAGE_TAG, IMAGE_NAME} │
▼ │
┌─────────────────────────────────────────────────────────────┐ │
│ Woodpecker repo 43 · .woodpecker/deploy.yml (event: manual) │ │
│ kubectl set image deployment/plotting-book = <ghcr>:vX.Y.Z │ │
│ kubectl rollout status │ │
└───────────────────────────┬─────────────────────────────────┘ │
▼ │
═══════════════ Kubernetes · ns: plotting-book ════════════════════════════ │
┌─────────────────────────────────────────────────────────────┐ │
│ Deployment plotting-book (Recreate · image = ignore_changes)│ │
│ imagePullSecrets: ghcr-credentials ────────pull───────────┼─────────────────┘
│ Pod → Express :3001 + SQLite on PVC (proxmox-lvm) │
└─────────────────────────────────────────────────────────────┘
guards / supporting:
• Kyverno require-trusted-registries [Enforce] → ghcr.io/* ALLOWED (admission)
• Keel policy=patch @1h → watches GHCR via ghcr-credentials (backstop)
• ghcr-credentials ⇐ Kyverno generate-clone ⇐ Vault secret/viktor/ghcr_pull_token
═══════════════ Serving path (unchanged) ══════════════════════════════════
Browser ─▶ plotting-book.viktorbarzin.me (non-proxied DNS → Traefik .203)
─▶ Authentik forward-auth (gate) ─▶ Service :80 ─▶ Pod :3001
```
Governance: the Deployment + Kyverno allowlist are Terraform (`stacks/plotting-book`,
`stacks/kyverno`); the live image *tag* is CI-owned (`ignore_changes`).
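The deploy hand-off in the diagram (GitHub Actions POSTing to Woodpecker repo 43) can be sketched roughly as below. This is a minimal sketch, not the workflow's actual step: the helper names, the `WOODPECKER_TOKEN` variable, the `main` branch and the `{"branch", "variables"}` body shape are assumptions; only the URL path and the `IMAGE_TAG` / `IMAGE_NAME` variable names come from the diagram.

```bash
# Hypothetical helpers. Only the endpoint path and the two variable names
# (IMAGE_TAG, IMAGE_NAME) are taken from the diagram above.

# Build the JSON body for a manual pipeline run.
# $1 = branch (assumed), $2 = image tag, $3 = image name.
# Values are assumed to contain no characters that need JSON escaping.
build_pipeline_payload() {
  printf '{"branch":"%s","variables":{"IMAGE_TAG":"%s","IMAGE_NAME":"%s"}}' \
    "$1" "$2" "$3"
}

# POST the payload to the numeric-repo-ID pipelines endpoint.
# WOODPECKER_TOKEN is an assumed name for the API token secret.
trigger_deploy() {
  local repo_id="$1" tag="$2" image="$3"
  curl --fail --silent --show-error -X POST \
    -H "Authorization: Bearer ${WOODPECKER_TOKEN:?set WOODPECKER_TOKEN}" \
    -H 'Content-Type: application/json' \
    -d "$(build_pipeline_payload main "$tag" "$image")" \
    "https://ci.viktorbarzin.me/api/repos/${repo_id}/pipelines"
}
```

Under these assumptions a deploy would be `trigger_deploy 43 vX.Y.Z ghcr.io/passionprojectsanca/book-plotter`, and the `DEPLOY_ENABLED` gate would sit in the workflow around that call.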
### Infra-owned images (issues #29 / #30)
@ -163,9 +221,9 @@ Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`). **Skips Tier-0 `vault`** — it's human-applied via OIDC; the CI `ci` role lacks Vault-admin perms (`sys/mounts`, `sys/policies/acl`) so a CI apply 403s |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`). **Skips Tier-0 `vault`** (its `plan` 403s under the `ci` role and would fail the whole run) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
@ -176,6 +234,38 @@ Woodpecker is **deploy + cluster-touching steps only**:
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### `default.yml` apply: dual-registration de-dup + reliability (2026-06-28)
infra is registered in Woodpecker on **both** the canonical Forgejo repo (id 82)
and the legacy GitHub mirror (id 1), and **both fire `default.yml` on every
push**. Left unguarded, two `terragrunt apply` runs race each other for the
per-stack PG state lock — historically the #1 source of `Error acquiring the
state lock` failures and push-supersede "killed" runs.
- **Forge guard** (first command in the `apply` step): the push-apply runs **only
on the canonical Forgejo forge**; on the GitHub mirror it logs `[forge-guard]`
and `exit 0`s. Detection: `CI_REPO_URL`/`CI_FORGE_URL` contains `github.com` →
skip. Fail-open (unknown forge still applies). The mirror keeps running the
**crons** (drift-detection, renew-tls, …), which live on repo 1 — only its
duplicate push-apply no-ops. (Crons were NOT moved; deactivating repo 1 would
have killed them.)
- **Lock-skip matches both tiers**: a stack whose apply hits a lock is SKIPPED,
not failed. The grep now matches the Tier-0 Vault message (`is locked by`) **and**
the Tier-1 PG-backend message (`Error acquiring the state lock` / `already
locked`) — the PG case was previously miscounted as a hard failure.
- **Transient retry** (bounded, 3 attempts): only provider-registry download
timeouts (`Failed to install provider` / `Client.Timeout`) and Vault 5xx are
retried. Config errors (missing arg, invalid index) and helm `atomic` timeouts
are NOT retried — they fail fast.
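The three behaviours above (forge guard, lock-skip, bounded transient retry) can be sketched as follows. This is an illustrative sketch, not the contents of `default.yml`: the function names are hypothetical, and the Vault 5xx case is left out because the exact message text is not given here.

```bash
# Hypothetical helper names; the match strings are the ones quoted above.

# Forge guard: succeed (meaning "skip the apply") when either URL mentions
# github.com. Fail-open: an empty or unknown forge falls through to apply.
forge_guard_should_skip() {
  case "${1:-} ${2:-}" in
    *github.com*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify captured apply output as skip | retry | fail.
classify_apply_output() {
  case "$1" in
    # Tier-0 Vault lock message and Tier-1 PG-backend lock messages.
    *"is locked by"*|*"Error acquiring the state lock"*|*"already locked"*)
      echo skip ;;
    # Provider-registry download timeouts (Vault 5xx omitted, see lead-in).
    *"Failed to install provider"*|*"Client.Timeout"*)
      echo retry ;;
    # Config errors, helm atomic timeouts, anything else: fail fast.
    *)
      echo fail ;;
  esac
}

# Run a command at most 3 times; only "retry" verdicts get another attempt.
apply_with_retry() {
  local attempt out rc verdict
  for attempt in 1 2 3; do
    out="$("$@" 2>&1)" && rc=0 || rc=$?
    printf '%s\n' "$out"
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    verdict="$(classify_apply_output "$out")"
    case "$verdict" in
      skip)  echo "[lock-skip] stack is locked, skipping"; return 0 ;;
      retry) echo "[retry] transient error on attempt ${attempt}/3" ;;
      *)     return "$rc" ;;
    esac
  done
  return "$rc"
}
```

In a pipeline step this would be used along the lines of `forge_guard_should_skip "$CI_REPO_URL" "$CI_FORGE_URL" && { echo "[forge-guard] mirror, skipping"; exit 0; }` as the first command, followed by `apply_with_retry terragrunt apply` per changed stack.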
A pre-apply off-infra validate gate was evaluated and rejected: `terraform
validate` runs without state but catches ~0 of the observed failures (they are
provider-config-from-Vault-data, server-side-apply conflicts, helm installs, and
lock contention — all invisible to static validate), and `plan` cannot run
off-infra (no Vault/PG access). `terragrunt apply` already fails at its plan
phase without mutating on config errors, so a separate in-pipeline plan-gate was
also dropped as redundant.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths


@ -286,7 +286,7 @@ Uptime Kuma monitors: TCP SMTP (port 25) on `176.12.22.76` (external), IMAP (por
#### Security Alerts (Wave 1 — planned, beads `code-8ywc`)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as infra alerts. Single channel with severity labels inside (critical/warning/info), not three separate channels. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there). Same handling path as infra alerts; severity labels carried in the alert (critical/warning/info). The dedicated `#security` channel was abandoned 2026-06-25 — the shared `alertmanager_slack_api_url` webhook's Slack app isn't a member of it (a `#security` override 404s), so everything consolidated to `#alerts`. Detection sources: K8s API audit log (`job=kube-audit`), Vault audit log (`job=vault-audit`), PVE sshd journald (`job=sshd-pve`), Calico flow logs (`job=calico-flow`, W1.6 only).
| # | Source | Event | Severity |
|---|---|---|---|
@ -318,9 +318,20 @@ IOPS impact estimated ~1-2 GB/day additional disk writes after custom audit-poli
Detects the inverse of the K-series alerts: a service that **must work WITHOUT Authentik SSO** getting accidentally walled off. Services on `ingress_factory auth = "required"` put Authentik forward-auth on `/`, which 302-bounces native-client / public / webhook / WebSocket / SPA-XHR paths. We carve those out with path-scoped `auth = "none"` ingresses; a TF revert, a bad deploy, or `ingress_factory`'s fail-closed `auth` default flipping back to `"required"` can silently clobber a carve-out.
- **Mechanism**: `blackbox-exporter` (monitoring ns) probes a representative GET-able URL per carve-out with `no_follow_redirects: true`. The `http_no_authentik_redirect` module FAILS the probe (`fail_if_header_matches` on the `Location` header, regex `authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize`) iff the response redirects to Authentik. `valid_status_codes` enumerates all expected non-Authentik responses **including 301/302** (so a legitimate redirect, e.g. a short-link 302, or a 404 carve-out like meshcentral `/agent.ashx`, stays green). Scrape job: `blackbox-authentik-walloff` (1m).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → posts to **`#alerts`** via the `slack-security` receiver, which keeps its `[SECURITY]` styling (Slack-only, no paging; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
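The pre-add verification described in the last bullet can be wrapped in a small helper. This is a sketch under stated assumptions: `is_authentik_redirect` and `check_carveout` are hypothetical names, and the regex is the one quoted for the `http_no_authentik_redirect` module, applied here to curl's `%{redirect_url}` rather than to the raw `Location` header.

```bash
# Regex as quoted for the http_no_authentik_redirect module.
AUTHENTIK_REDIRECT_RE='authentik\.viktorbarzin\.me|/outpost\.goauthentik\.io|/application/o/authorize'

# Succeeds when the given redirect target points at Authentik.
# An empty argument (no redirect at all) never matches.
is_authentik_redirect() {
  printf '%s\n' "${1:-}" | grep -Eq "$AUTHENTIK_REDIRECT_RE"
}

# Hypothetical wrapper: fetch the URL without following redirects and report
# whether the carve-out is walled off. Returns 1 if walled off, 2 if curl fails.
check_carveout() {
  local url="$1" redirect
  redirect="$(curl -s -o /dev/null -w '%{redirect_url}' "$url")" || return 2
  if is_authentik_redirect "$redirect"; then
    echo "WALLED-OFF $url -> $redirect"
    return 1
  fi
  echo "ok $url"
}
```

Running `check_carveout '<url>'` before adding a line to `local.authentik_walloff_targets` gives the same yes/no answer the probe would, without waiting for the 10m alert window.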
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both route the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup


@ -541,7 +541,11 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **(2026-06-26: the managed `claudeMd` now defers OS-level file access to the OS/sudo — a user holding broad `sudo` may read other users' files incl. `~/.claude`; the mode-600 / no-symlink posture is unchanged but is no longer reinforced by an agent "never read other homes" rule. See [ADR-0015](../adr/0015-os-is-the-authorization-boundary.md).)** **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose` → `diagnosing-bugs`, `write-a-skill` → `writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
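The own-copy plus relative-symlink layout can be sketched as below. This is a simplified illustration, not the body of `install_skills`: the function name and arguments are hypothetical, and the ownership handling (`chown` of `~/.agents` when run as root) is deliberately left out.

```bash
# Hypothetical helper: install_skill_copy SRC_ROOT HOME_DIR NAME
# Copies one vendored skill into HOME_DIR/.agents/skills/NAME and points
# HOME_DIR/.claude/skills/NAME at it with a relative symlink.
install_skill_copy() {
  local src_root="$1" home_dir="$2" name="$3"
  local own="$home_dir/.agents/skills/$name"
  local link="$home_dir/.claude/skills/$name"

  mkdir -p "$home_dir/.agents/skills" "$home_dir/.claude/skills"

  # if-absent keys on the user's OWN copy: an existing directory is kept,
  # so a steady-state run copies nothing.
  if [ ! -d "$own" ]; then
    cp -R "$src_root/$name" "$own"
  fi

  # A real directory squatting the name under ~/.claude/skills is never clobbered.
  if [ -d "$link" ] && [ ! -L "$link" ]; then
    return 0
  fi

  # Create the link, or heal a stale / cross-user symlink to the own copy.
  # -n stops ln from descending into the directory an old symlink points at.
  ln -sfn "../../.agents/skills/$name" "$link"
}
```

The relative target resolves from `~/.claude/skills/` up two levels to the home directory and back down into `.agents/skills/<name>`, so the link stays valid if the home directory is moved.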
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
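The merge-only re-assertion can be sketched with `jq` as follows. This is an assumption-laden sketch rather than the code in `skel/start-claude.sh`: the function name is hypothetical, and the value written to `lastOnboardingVersion` is passed in by the caller because the source does not say where it comes from.

```bash
# Hypothetical helper: reassert_onboarding FILE [VERSION]
# Merge-only: adds/overwrites the onboarding keys, leaves every other key alone.
reassert_onboarding() {
  local file="$1" version="${2:-}" tmp

  command -v jq >/dev/null 2>&1 || return 0   # no-op if jq is missing
  [ -s "$file" ] || return 0                   # no-op if missing or empty
  # no-op if the file is not a parseable JSON object (corrupt)
  jq -e 'type == "object"' "$file" >/dev/null 2>&1 || return 0

  tmp="$(mktemp)" || return 0
  if jq --arg v "$version" '
        . + {hasCompletedOnboarding: true}
          + (if $v != "" then {lastOnboardingVersion: $v} else {} end)
      ' "$file" >"$tmp"; then
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" >"$file"
  fi
  rm -f "$tmp"
}
```

Calling `reassert_onboarding "$HOME/.claude.json"` right before launching `claude` narrows the window in which a stale writer's dropped key is visible, though it is itself another read-modify-write and does not remove the underlying race.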


@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS
## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram
@ -16,12 +16,14 @@ graph TB
Traefik[Traefik Ingress<br/>3 replicas + PDB]
subgraph "Middleware Chain"
CS[CrowdSec Bouncer<br/>fail-open]
AntiAI[Anti-AI bot-block<br/>fail-open]
Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
RL[Rate Limiter<br/>429 response]
Retry[Retry<br/>2 attempts, 100ms]
end
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware]
@ -53,8 +55,9 @@ graph TB
Internet -->|DNS query| CF
CF -->|CNAME to tunnel| CFD
CFD --> Traefik
Traefik --> CS
CS --> Auth
CSdrop -.->|banned IPs dropped before Traefik| Traefik
Traefik --> AntiAI
AntiAI --> Auth
Auth --> RL
RL --> Retry
Retry --> Service
@ -82,7 +85,7 @@ graph TB
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open |
| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
### Ingress Flow
CrowdSec is **not** a step in this chain — banned IPs are dropped before the
request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host
nftables on direct hosts). The flow below is for a request that survives that
out-of-band gate.
```mermaid
sequenceDiagram
participant Client
participant Cloudflare
participant CFedge as Cloudflare (edge WAF: crowdsec_ban block)
participant Cloudflared
participant Traefik
participant CrowdSec
participant AntiAI
participant Authentik
participant RateLimit
participant Retry
participant Service
participant Pod
Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me
Cloudflare->>Cloudflared: Forward via tunnel (QUIC)
Client->>CFedge: HTTPS request to blog.viktorbarzin.me
Note over CFedge: banned IP → blocked here (proxied hosts)
CFedge->>Cloudflared: Forward via tunnel (QUIC)
Cloudflared->>Traefik: HTTP to LoadBalancer IP
Traefik->>CrowdSec: Apply bouncer middleware
CrowdSec->>Authentik: If allowed, check auth (protected=true)
Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook)
Traefik->>AntiAI: anti-AI bot-block (fail-open)
AntiAI->>Authentik: If allowed, check auth (protected=true)
Authentik->>RateLimit: If authenticated, check rate limit
RateLimit->>Retry: If within limit, continue
Retry->>Service: Forward to Service
@ -234,24 +244,27 @@ sequenceDiagram
Service-->>Retry: Response
Retry-->>RateLimit: Response
RateLimit-->>Authentik: Response (strip auth headers)
Authentik-->>CrowdSec: Response
CrowdSec-->>Traefik: Response
Authentik-->>AntiAI: Response
AntiAI-->>Traefik: Response
Traefik-->>Cloudflared: Response
Cloudflared-->>Cloudflare: Response via tunnel
Cloudflare-->>Client: HTTPS response
Cloudflared-->>CFedge: Response via tunnel
CFedge-->>Client: HTTPS response
```
### Middleware Chain
Every ingress created by the `ingress_factory` module follows this chain:
CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
proxied hosts), so banned IPs never reach the chain and there is no per-request
CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
Traefik chain:
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages.
1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads), ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load), and authentik (`authentik-rate-limit`, 100/1000, on `/` and `/static` — the login SPA cold-loads ~70 flow-executor JS/CSS chunks from `/static`; the default burst 429'd the tail and a failed ES-module import left a blank login screen for cold/incognito/NAT-shared clients).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
- **HTTP/3 (QUIC)**: Enabled globally on Traefik.
### Entrypoint Transport Timeouts
@ -348,7 +361,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
| pfSense | `stacks/pfsense/` | VM + cloud-init config |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer |
| CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) |
| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
### Why Fail-Open on CrowdSec Bouncer?
### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)
**Alternatives considered**:
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic.
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages.
CrowdSec used to enforce inline as a Traefik middleware (the
`crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
never invoked, so it enforced nothing; the plugin was removed and enforcement
moved off the request path entirely (full history in
`docs/architecture/security.md`). It now runs on two surfaces:
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on.
- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
the load-bearing one: with Traefik on a dedicated LB IP at
`externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
and transit the node's `forward` chain (not `input`) — which is exactly why the
ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
client IP the firewall-bouncer (and the CF edge rule) would have nothing to
match on.
- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
by the `crowdsec-cf-sync` CronJob.
Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
receiving new decisions (existing drops persist) and the CF sync skips a run —
neither ever blocks legitimate traffic. Availability > strict bot blocking, and
out-of-band enforcement adds **zero per-request latency** (no Traefik hop).
### Why HTTP/3 (QUIC)?
@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.
**Diagnosis**: Middleware chain is blocking traffic. Check:
1. Authentik status: `kubectl get pod -n authentik`
2. CrowdSec LAPI status: `kubectl get pod -n crowdsec`
**Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the
chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check:
1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable)
2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down)
3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.
@ -519,7 +550,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Diagnosis**: Check Traefik middleware config for the affected IngressRoute.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300.
**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300, authentik 100/1000 (login SPA `/static` chunk burst → blank screen).
### Large Downloads or Uploads Truncate / Fail Partway


@ -2,40 +2,50 @@
## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram
CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The
Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry;
CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that
chain entirely.
```mermaid
graph LR
graph TB
Internet[Internet]
CF[Cloudflare WAF]
subgraph "Proxied hosts (orange-cloud)"
CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block]
end
subgraph "Direct hosts (grey-cloud / internal)"
NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward]
end
Tunnel[Cloudflared Tunnel]
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin]
AntiAI[Anti-AI Check<br/>poison-fountain]
ForwardAuth[Authentik ForwardAuth]
RateLimit[Rate Limit Middleware]
Retry[Retry Middleware<br/>2 attempts, 100ms]
Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry]
Backend[Backend Service]
LAPI[CrowdSec LAPI<br/>3 replicas]
Agent[CrowdSec Agent]
Agent[CrowdSec Agent<br/>parses Traefik logs]
FWB[cs-firewall-bouncer<br/>DaemonSet, every node]
CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min]
Internet -->|1| CF
CF -->|2| Tunnel
Tunnel -->|3| CrowdSec
CrowdSec -.->|Query| LAPI
Agent -.->|Report| LAPI
CrowdSec -->|4. Pass/Block| AntiAI
AntiAI -->|5. Human/Bot| ForwardAuth
ForwardAuth -->|6. Authenticated| RateLimit
RateLimit -->|7. Under Limit| Retry
Retry -->|8. Success/Retry| Backend
Internet -->|proxied| CFedge
Internet -->|direct| NFT
CFedge -->|allowed| Tunnel
Tunnel --> Traefik
NFT -->|allowed| Traefik
Traefik --> Backend
style CrowdSec fill:#f9f,stroke:#333
style AntiAI fill:#ff9,stroke:#333
style ForwardAuth fill:#9f9,stroke:#333
style RateLimit fill:#99f,stroke:#333
Agent -.->|report| LAPI
LAPI -.->|all decisions incl. CAPI| FWB
FWB -.->|program drop rules| NFT
LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync
CFsync -.->|push IP list| CFedge
style CFedge fill:#f9f,stroke:#333
style NFT fill:#f9f,stroke:#333
```
## Components
@ -44,7 +54,8 @@ graph LR
|-----------|---------|----------|---------|
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check |
| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` |
| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` |
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
@ -54,11 +65,15 @@ graph LR
### Request Security Layers
Every incoming request passes through 6 security layers:
CrowdSec IP-reputation enforcement happens **before** a request reaches the
Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at
the Cloudflare edge on proxied hosts — see CrowdSec Threat Intelligence below).
A request that survives that out-of-band gate then passes through the Traefik
middleware chain:
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts)
3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency)
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
@ -80,58 +95,78 @@ CrowdSec operates in a hub-and-agent model:
- Reports malicious IPs to LAPI
- Shares threat intel with CrowdSec community (anonymized)
**Traefik Bouncer Plugin** (`crowdsec-bouncer-traefik-plugin`, `stacks/traefik/modules/traefik/middleware.tf`):
- Integrated as Traefik middleware (in the default ingress chain)
- Queries LAPI for IP reputation on each request
- **Registered with LAPI** via `BOUNCER_KEY_traefik` env on the LAPI container
(`stacks/crowdsec/modules/crowdsec/values.yaml`), seeded from the same Vault key
the middleware presents (`ingress_crowdsec_api_key`). **Before 2026-06-19 the
bouncer was never registered → LAPI returned 403 → the plugin failed open and
enforced nothing (no bans, no captcha).** The seed re-registers automatically on
every LAPI start, so a DB wipe (e.g. the MySQL→PostgreSQL migration that lost the
original registration) can't silently disable enforcement again.
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
- **Only sees non-proxied (direct) apps' real client IPs** (ETP=Local). Proxied
apps arrive from cloudflared's pod IP (in `clientTrustedIPs`) and are bypassed —
extending enforcement to proxied apps needs `forwardedHeadersTrustedIPs` (future).
- Honours two LAPI remediation types (profiles in `stacks/crowdsec/modules/crowdsec/values.yaml`):
- **`ban`** → HTTP 403 (serious attacks: CVE exploits, scanners, brute force)
- **`captcha`** → **Cloudflare Turnstile challenge** so the flagged user can
self-unblock (lower-severity abuse: `http-429-abuse`, `http-403-abuse`,
`http-crawl-non_statics`, `http-sensitive-files`). The plugin is configured
with `captchaProvider=turnstile` + the widget keys; the `captcha.html`
template is mounted into the Traefik pod at `/captcha`. The widget is
Terraform-managed in `stacks/traefik/main.tf`
(`cloudflare_turnstile_widget.crowdsec_captcha`, scoped to `viktorbarzin.me`
so it covers every subdomain). **Before 2026-06-19 no captcha provider was
configured, so `captcha` decisions silently degraded to a 403 ban** — users
had no way to self-unblock; wiring Turnstile fixed that.
Enforcement is split across **two out-of-band surfaces**, neither of which adds
any per-request latency. (See "Why the Traefik bouncer plugin was removed" below
for the supersession history — there is no longer an inline Traefik bouncer.)
**Cloudflare Edge Enforcement for proxied hosts** (`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
- Proxied (orange-cloud) hosts terminate at the Cloudflare edge, so the in-cluster
bouncer above never decides on them. Edge enforcement instead syncs LAPI
decisions into **one Cloudflare account IP List (`crowdsec_ban`)** + a single
**zone-scoped WAF custom rule** blocking `(ip.src in $crowdsec_ban)` across every
proxied host. CronJob `crowdsec-cf-sync` (rybbit ns, every 2 min) reconciles it.
- **BAN-ONLY (2026-06-20):** only `type=ban` decisions sync to the edge. `captcha`
decisions are deliberately NOT pushed — the CF account allows only ONE Rules List
with a single block action, so folding captcha in would hard-block a soft
challenge on every proxied host. (Before 2026-06-20 captcha was downgraded to a
hard block at the edge.)
- **Auth carve-out (2026-06-20):** the WAF rule excludes `authentik.viktorbarzin.me`
+ `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`), and the
Authentik UI ingress sets `exclude_crowdsec = true` for the in-cluster bouncer. A
CrowdSec hit must never wall a user out of the login / WebAuthn flow they
authenticate through; auth keeps `traefik-rate-limit` for brute-force protection.
- **⚠️ Currently NON-FUNCTIONAL (known issue, pre-existing since the 2026-06-20
rollout):** `crowdsec-cf-sync` fails every run — `cf_list_items()` pagination
gets CF `HTTP 400 code 10027 "invalid or expired cursor"`, so the list never
populates (`num_items=0`) and the edge rule blocks nothing. LAPI also returns
~31k ban IPs, likely exceeding CF IP-List capacity even once pagination is fixed.
**Edge enforcement for proxied hosts is therefore inert pending a fix** (the
in-cluster bouncer still protects direct apps; the auth carve-out is correct
regardless). Fix needs: (1) correct CF cursor pagination, (2) a capacity strategy
for the ban set.
**Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop**
(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`):
- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip
crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND
the `forward` hooks**. The `forward` hook is required because Traefik is a
LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the
Traefik **pod** and transits the node's `forward` hook (not `input`) with the
real client IP preserved. Chains use `policy accept` (only set members drop —
it can never blackhole normal traffic).
- Pulls **all** decisions from LAPI, **including the CAPI community blocklist
(~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching
Traefik** → zero per-request hops, no Traefik involvement at all.
- **Packaging**: cs-firewall-bouncer publishes no container image, so the
**v0.0.34** static binary is fetched at runtime by an initContainer onto a
`debian:bookworm-slim` runtime container. Needs `hostNetwork` +
`NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key:
**`firewall`**.
- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions
(existing drop rules persist); it never blocks legitimate traffic.
**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block**
(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop
would never see them. Enforcement is instead a single Cloudflare Rules List
**`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)`
**block** action, which covers every proxied host in the zone.
- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min,
pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped**
decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI
community blocklist** — that set is far too large for a CF Rules List (the CF
account hard-limits to **one** list), and CAPI is already covered in-kernel on
direct hosts and by Cloudflare's own managed protections on proxied hosts.
Registered bouncer key: **`kvsync`**.
- **Rate-limit resilient (2026-06-27):** Cloudflare's Lists-API *write* endpoint
is throttled (~per-60s; `429 retry-after`). The CronJob runs `backoff_limit=0`
(one POST per cycle — the `*/2` schedule IS the retry cadence) and treats a CF
`429` as a soft-skip (exit 0, retry next cycle), the same fail-safe pattern it
uses for LAPI. An earlier `backoff_limit=2` fired 3 rapid POSTs/cycle and
escalated the throttle into a stuck state that left the list empty — a
self-inflicted DoS that this change prevents.
- **Block-only**: the single-list limit precludes a separate
captcha/managed-challenge list, so both ban and captcha decisions are enforced
as a plain block at the edge.
- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` +
`public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit
must never wall a user out of the login / WebAuthn flow they authenticate
through; auth keeps `traefik-rate-limit` for brute-force protection.
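The selection and soft-skip rules for the edge sync can be sketched as below. This is a minimal illustration, not the contents of `lapi_kv_sync.py`: the decision dict keys (`origin`, `scope`, `type`, `value`) and both function names are assumptions made for the example.

```python
from typing import Iterable

# Remediation types that are pushed to the edge list (both end up as a block).
EDGE_TYPES = {"ban", "captcha"}


def select_edge_ips(decisions: Iterable[dict]) -> list[str]:
    """Return the sorted, de-duplicated IPs destined for the crowdsec_ban list.

    Assumed decision shape: {"origin": ..., "scope": ..., "type": ..., "value": ...}.
    """
    ips = set()
    for d in decisions:
        if d.get("origin", "").upper() == "CAPI":
            continue  # community blocklist is excluded from the edge list
        if d.get("scope", "").lower() != "ip":
            continue  # only ip-scoped decisions are synced
        if d.get("type") not in EDGE_TYPES:
            continue
        ips.add(d["value"])
    return sorted(ips)


def exit_code_for(status: int) -> int:
    """Map the HTTP status of the Cloudflare write to a CronJob exit code."""
    if 200 <= status < 300:
        return 0
    if status == 429:
        return 0  # soft-skip: the next scheduled run is the retry
    return 1
```

Treating `429` as success keeps the Job from being marked failed, so with `backoff_limit=0` there is exactly one write attempt per schedule tick.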
**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers
RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so
internal users are never enforced. Internal access uses split-horizon DNS
straight to Traefik, and direct internal clients are RFC1918 — both whitelisted.
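As a rough illustration of what the whitelist guarantees, the check below tests an address against a CIDR list. The CIDRs are assumptions: the three RFC1918 blocks plus `100.64.0.0/10` as a stand-in for the tailnet range. The real list lives in `stacks/crowdsec/whitelist.yaml` and also covers internal CIDRs and one external IP that are not reproduced here.

```python
import ipaddress

# Assumed ranges for illustration only; see whitelist.yaml for the real set.
WHITELIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",       # RFC1918
        "172.16.0.0/12",    # RFC1918
        "192.168.0.0/16",   # RFC1918
        "100.64.0.0/10",    # assumed tailnet range (CGNAT space)
    )
]


def is_whitelisted(ip: str) -> bool:
    """True if the address falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHITELIST)
```

An address that matches is never enforced, regardless of what scenarios it triggers.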
#### Why the Traefik bouncer plugin was removed
Enforcement used to run as an inline Traefik middleware: the
`crowdsec-bouncer-traefik-plugin` (a Go plugin interpreted by Yaegi), which
queried LAPI on every request and could serve a Cloudflare Turnstile captcha
for soft remediations.
On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was
registered but enforced **nothing** despite appearing healthy. Rather than chase
the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin
static config + initContainer download, the `crowdsec` Middleware CRD, the
`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare
Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was
replaced by the two out-of-band surfaces above, which add zero per-request
latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination /
IP-List-capacity issues are also moot now that CAPI is excluded from the edge
list and dropped in-kernel instead.)
**Metabase** (disabled by default):
- Dashboard for CrowdSec analytics
@ -244,7 +279,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.**
The block below documents the locked design.
Response model: **(I) Slack-only, daily skim.** All security alerts land in a new `#security` Slack channel via Alertmanager. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
Response model: **(I) Slack-only, daily skim.** All security alerts post to **`#alerts`** via Alertmanager (the `slack-security` receiver keeps its distinct `[SECURITY/<sev>]` title styling so security-lane alerts still stand out). The dedicated `#security` channel was abandoned (2026-06-25) — the shared `alertmanager_slack_api_url` incoming webhook's Slack app isn't a member of it, so a channel override there returns HTTP `404 channel_not_found`; everything consolidated to `#alerts`. No paging. Mean detection time accepted as ~12-24h; the design weight sits on prevention (Kyverno enforce, NetworkPolicy default-deny egress) rather than runtime detection.
#### Detection sources
@ -257,7 +292,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne
#### Alert rules (16 total)
Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) inside the single `#security` channel.
Routed via **Loki ruler → Alertmanager → the `slack-security` receiver, which posts to `#alerts`** (it keeps its `[SECURITY/<sev>]` title styling so security-lane alerts stand out there; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's Slack app isn't a member of it). Same handling path as existing infra alerts — silenceable in Alertmanager UI, history queryable, severity labels (critical/warning/info) carried in the alert.
**K8s API audit (K2-K9, 8 rules — K1 cluster-admin-grant intentionally skipped):**
@ -336,6 +371,69 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
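The derivation query can be tried against a throwaway table. The sketch below uses an in-memory SQLite database as a stand-in for the CNPG (PostgreSQL) `goldmane_edges` DB, with the `edge` columns named in the flow-trail description; the rows and the `observed_destinations` helper are made up for illustration.

```python
import sqlite3


def observed_destinations(conn: sqlite3.Connection, src_ns: str) -> list[str]:
    """Namespaces that src_ns was observed reaching with action='allow'."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (src_ns,),
    )
    return [row[0] for row in rows]


# SQLite stand-in for the real table; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (src_ns TEXT, dst_ns TEXT, action TEXT, "
    "first_seen TEXT, last_seen TEXT, flow_count INTEGER)"
)
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("immich", "dbaas", "allow", "2026-06-01", "2026-06-20", 120),
        ("immich", "monitoring", "allow", "2026-06-02", "2026-06-20", 15),
        ("immich", "vault", "deny", "2026-06-03", "2026-06-03", 1),
        ("n8n", "dbaas", "allow", "2026-06-01", "2026-06-20", 40),
    ],
)
```

Calling `observed_destinations(conn, "immich")` on this sample yields the allow set that would seed that namespace's internal egress rules; denied edges and other sources are left out.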
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook). All Slack now consolidates to
`#alerts`; the `#security` channel was abandoned 2026-06-25 because that
webhook's Slack app isn't a member of it (a `#security` override 404s). See
runbook.
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
### TLS & HTTP/3
**Traefik** handles TLS termination:
@ -377,10 +475,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
| Path | Purpose |
|------|---------|
| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config |
| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` |
| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) |
| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) |
| `stacks/kyverno/` | Kyverno deployment + policies |
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions |
| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) |
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
### Vault Paths
@ -490,7 +590,11 @@ spec:
**Fix**:
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml`
— the in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct
hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the
`crowdsec_ban` CF list within ~2 min.
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet
+ internal CIDRs are already whitelisted, so internal clients are never banned).
### Kyverno Policy Blocking Deployment

@ -0,0 +1,243 @@
# External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc
> **Status:** **COMPLETE (2026-06-22).** ESO at chart/app **2.6.0**; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0), so the last k8s-1.35 blocker is cleared. Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332); reconciled via `init -upgrade` + committed.
> **Scope:** Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`.
> **Owner:** Viktor Barzin. **Author:** Claude (research + design only — no changes applied).
>
> **EXECUTION CORRECTION + STATUS (2026-06-21 — "let's do the ESO migration"):** The cluster is already on **k8s 1.34.9** (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative *tested* ranges, not hard limits). **The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34 — there is NO k8s interleave; IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3.** Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears). Exact hop sequence (latest patch per minor): **0.13.0 → 0.14.4 → 0.15.1 → 0.16.2** [rewrite all 104 CRs to `v1` here] → **0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0**. Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed).
>
> **EXECUTION LOG:**
> - **✅ Phase 1 DONE (2026-06-21):** ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → **0.16.2**, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead — `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 **both `v1beta1` and `v1` are served** (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`.
> - **⏳ Phase 2 PENDING — findings confirmed (decisive for execution):** (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 **forces a REPLACE** (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. (b) Our ExternalSecrets use **`creationPolicy=Owner`** (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can **cascade-GC the Secret** before ESO recreates it. → **Phase 2 must be done carefully, NOT a blind bulk apply:** (1) snapshot ALL target Secrets first (backstop); (2) **empirically validate on the FIRST live stack** — migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied).
> - **Phase 3 PENDING:** hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing **0.17 is the point of no return**.
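A quick way to sanity-check a hop list like the one in the execution log is to confirm no minor is skipped. The sketch below is an assumption-laden helper, not part of the repo: it accepts a hop if it is the next minor within the same major, or the `.0` minor of the next major, and it cannot know whether the previous minor really was the last one of its major.

```python
def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def skipped_hops(path: list[str]) -> list[tuple[str, str]]:
    """Return the (from, to) pairs that jump over at least one minor."""
    bad = []
    for prev, nxt in zip(path, path[1:]):
        (p_major, p_minor, _), (n_major, n_minor, _) = parse(prev), parse(nxt)
        next_minor = n_major == p_major and n_minor == p_minor + 1
        next_major = n_major == p_major + 1 and n_minor == 0
        if not (next_minor or next_major):
            bad.append((prev, nxt))
    return bad


# Starting version followed by the hop sequence from the execution log.
PLAN = [
    "0.12.1", "0.13.0", "0.14.4", "0.15.1", "0.16.2",
    "0.17.0", "0.18.2", "0.19.2", "0.20.4",
    "1.0.0", "1.1.1", "1.2.1", "1.3.2",
    "2.0.1", "2.1.0", "2.2.0", "2.3.0", "2.4.1", "2.5.0", "2.6.0",
]
```

An empty result from `skipped_hops(PLAN)` means every step is a single-minor move under the stated assumption.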
---
## 1. Goal & why
ESO is the **last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade** (Kyverno was cleared to 1.18.1 earlier today). The installed ESO `0.12.x` supports only Kubernetes **1.19 → 1.31** ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The `2.x` series supports **k8s 1.34–1.35**, which clears the gate.
The hard part is not the chart bump itself — it is that **ESO removed the `external-secrets.io/v1beta1` API**, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared `v1beta1`. If we upgrade past the removal version without first rewriting the manifests to `v1`, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break).
**Downtime tolerance:** brief, recoverable downtime of the ESO *controller* is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes `Secret` objects that apps mount (DB creds, API keys). Those must survive continuously.
---
## 2. Current state
### 2.1 Versions
| Component | Current | Target |
|---|---|---|
| Helm chart `external-secrets` | **0.12.1** | **2.6.0** |
| App / controller image | **v0.12.1** | **v2.6.0** |
| API version of all CRs | **`external-secrets.io/v1beta1`** | **`external-secrets.io/v1`** |
| Repo: `https://charts.external-secrets.io` | (unchanged) | (unchanged) |
ESO stack: `stacks/external-secrets/main.tf`. `helm_release.external_secrets` pins `version = "0.12.1"`, namespace `external-secrets` (separate `kubernetes_namespace` resource, not `create_namespace`), and the **only** chart value set is `installCRDs = true` (via `yamlencode({ installCRDs = true })`). No webhook/replica/resource overrides.
### 2.2 Inventory (live, from `stacks/`)
| Kind | Count | apiVersion | Where |
|---|---|---|---|
| **ExternalSecret** (`kubernetes_manifest`) | **104** | all `v1beta1` (0 mismatches) | 73 `.tf` files |
| **ClusterSecretStore** (definitions) | **2** | both `v1beta1` | `stacks/external-secrets/main.tf` |
| SecretStore | 0 | — | — |
| PushSecret | 0 | — | — |
| ClusterExternalSecret | 0 | — | — |
- **Only ONE apiVersion string exists in the whole tree:** `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite.
- **`secretStoreRef` split:** 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106 — only **2** are real store definitions.
- **22 files carry >1 ExternalSecret** (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2). The 104-vs-73 gap is these multi-secret files.
- **Nested-module ExternalSecrets** (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`.
- **Docs are STALE:** `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is **104 ExternalSecrets / 73 files / 26 db-refs**. Fix in the migration PR.
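The inventory numbers above can be re-derived by counting `apiVersion` strings instead of `kind` strings, which sidesteps the `secretStoreRef` false positives. The sketch below is illustrative only: the regex, function names, and sample HCL are assumptions, and a real run would read every `.tf` file under `stacks/`.

```python
import re
from collections import Counter

# Matches: apiVersion = "external-secrets.io/<version>"
API_VERSION_RE = re.compile(r'apiVersion\s*=\s*"external-secrets\.io/(v\w+)"')


def count_api_versions(hcl_text: str) -> Counter:
    """Count ESO custom resources per apiVersion (one apiVersion line per CR)."""
    return Counter(API_VERSION_RE.findall(hcl_text))


def rewrite_to_v1(hcl_text: str) -> str:
    """Rewrite only the ESO apiVersion strings from v1beta1 to v1."""
    return re.sub(
        r'(apiVersion\s*=\s*"external-secrets\.io/)v1beta1"', r'\1v1"', hcl_text
    )


# Hypothetical manifest, shaped like a kubernetes_manifest ExternalSecret.
SAMPLE = '''
resource "kubernetes_manifest" "example_secret" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    spec = {
      secretStoreRef = { name = "vault-kv", kind = "ClusterSecretStore" }
    }
  }
}
'''
```

The `secretStoreRef` line is left untouched by the rewrite, since only the `apiVersion` assignment matches.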
### 2.3 The two ClusterSecretStores (`stacks/external-secrets/main.tf`)
Both `kubernetes_manifest`, both `external-secrets.io/v1beta1`, both `depends_on = [helm_release.external_secrets]`:
- **`vault-kv`** → Vault KV **v2** at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`.
- **`vault-database`** → identical except `path = "database"`, **`version = "v1"`** (Vault DB engine, KV-v1-style).
ESO's Vault auth role `eso` (`stacks/vault/main.tf:486-511`): policy `eso-reader` (`secret/data/*` read+list, deny `secret/data/vault`, `database/static-creds/*` read), `token_ttl = token_period = 864000` (10d, periodic/auto-renew).
### 2.4 Tier-0 / state
ESO is **Tier-0 (bootstrap)** (`.claude/CLAUDE.md` "Terraform State — Two-Tier Backend"; root `terragrunt.hcl` `tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]`). Tier-0 ⇒ **local SOPS-encrypted state in git** (`state/stacks/external-secrets/terraform.tfstate`), NOT the PG backend. Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`; SOPS decrypt via Vault Transit (primary) → age fallback. **Tier-0 must apply before PG is reachable**, so the ESO upgrade cannot depend on PG.
### 2.5 Provider versions (`stacks/external-secrets/providers.tf`)
- `required_providers` declares **only** `vault = hashicorp/vault, ~> 4.0`.
- `provider "kubernetes"` and `provider "helm"` are declared **without version constraints** (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the **v3-style nested `kubernetes = {…}` argument** (not the legacy `kubernetes {}` block) ⇒ helm provider is **v3.x or v4.x** in the lockfile. **No `kubectl` provider** in this stack. No `required_version` pinned here.
- ⚠️ **Verify the resolved helm provider version** in `.terraform.lock.hcl` before starting — the prompt referenced `~> 4.0` for helm; the *stack* only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5).
### 2.6 Plan-time coupling (the cross-cutting risk)
**15 stacks read ESO-created Secrets at plan time** via `data "kubernetes_secret"` (avoids a Vault dependency at plan): `actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium)`.
The documented **first-apply gotcha** (`.claude/CLAUDE.md`, `docs/architecture/secrets.md:360`, `stacks/fire-planner/main.tf:574`): the Secret must exist before the `data "kubernetes_secret"` plans, so on first creation you must `terragrunt apply -target=kubernetes_manifest.<external_secret>` first, then full apply. **Why this matters for the migration:** the `kubernetes_manifest` provider treats `apiVersion` as part of resource identity, so bumping `v1beta1` → `v1` **forces a replace** of all 104 ExternalSecrets. During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's `data "kubernetes_secret"` plans → the two-phase `-target` apply is needed **fleet-wide for the v1 rewrite step, not just fire-planner.**
### 2.7 Vault DB rotation (rotation interplay)
`stacks/vault/main.tf`: **25 `vault_database_secret_backend_static_role`, every one `rotation_period = 604800` (7 days)** (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via `vault-database` → `remoteRef.key = "static-creds/<role>"`. Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. **Implication:** any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly.
### 2.8 git-crypt landmine (adjacent, not in ESO stack)
`.claude/CLAUDE.md:146` + `docs/architecture/ci-cd.md:108` + `stacks/kyverno/modules/kyverno/tls-secret-sync.tf`: on a **git-crypt-locked clone**, `kubernetes_secret.tls_secret` reads `secrets/fullchain.pem`/`privkey.pem` via `file()` which returns **ciphertext**, corrupting the wildcard TLS secret Kyverno clones cluster-wide. **The ESO stack itself has NO `file()` reads of git-crypt secrets** — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an **unlocked** checkout.
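The "unlocked checkout" guardrail could be enforced before any apply with a small check like the one below. This is a sketch only: the function names are hypothetical, and it assumes that git-crypt ciphertext begins with the 10-byte header `\0GITCRYPT\0`, which should be confirmed against the installed git-crypt version.

```sh
# Sketch: detect a file that is still git-crypt ciphertext (locked checkout).
# Assumption: encrypted blobs start with the 10-byte header \0GITCRYPT\0.
is_gitcrypt_ciphertext() {
    # od -c renders NUL as the two characters "\0"; strip od's padding.
    [ "$(head -c 10 "$1" | od -An -c | tr -d ' \n')" = '\0GITCRYPT\0' ]
}

# Hypothetical pre-apply guard: fail if any listed file is still encrypted.
require_unlocked_checkout() {
    for f in "$@"; do
        if is_gitcrypt_ciphertext "$f"; then
            echo "refusing to apply: $f is git-crypt ciphertext (checkout is locked)" >&2
            return 1
        fi
    done
}
```

A call such as `require_unlocked_checkout secrets/fullchain.pem secrets/privkey.pem` before each apply would make the check explicit rather than relying on memory.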
---
## 3. Target
- Helm chart **`external-secrets` 2.6.0** (app **v2.6.0**), repo `https://charts.external-secrets.io`.
- All ExternalSecret + ClusterSecretStore CRs on **`external-secrets.io/v1`**.
- Cluster ESO compatible with **k8s 1.34–1.35** ⇒ unblocks the autonomous 1.35 upgrade.
---
## 4. Key findings (the decisive facts)
> Sourced from ESO official docs + GitHub release notes; verbatim quotes below.
### 4.1 Chart version == app version (premise check)
The chart version and app version are released **in lockstep and are the same number**. `Chart.yaml`: `version: 0.12.1 / appVersion: v0.12.1`; `version: 2.6.0 / appVersion: v2.6.0`. The app series ran `…0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0`. **Crucially, the `v1.0.0` and `v2.0.0` APP releases are NOT the `external-secrets.io/v1` API**: `v1.0.0` is just "continuation after 0.20.4" (release diff `v0.20.4...v1.0.0`, no API change), and `v2.0.0`'s only breaking change is removing the unmaintained **Alibaba + Device42** providers (we use neither, only Vault). The API migration happened back at **0.16/0.17**. Source: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) · [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0).
### 4.2 Version path: **NO skipping minors — step one minor at a time**
Official policy, verbatim ([stability-support](https://external-secrets.io/latest/introduction/stability-support/)):
> "**Upgrade version by version** — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions."
Maintainer (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @gusfcarvalho): *"We are pre release… Every minor bump should be treated as a major bump until we go 1.0."* ⇒ **You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly.** You must step each minor: `0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x`.
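The one-minor-at-a-time rule can be checked mechanically before each chart bump. The helper below is a minimal sketch; `is_allowed_hop` is a hypothetical name, and it assumes that a major-version crossing is only attempted from the last minor of the previous major (for example from 0.20.x to 1.0.0), which the caller has to ensure.

```sh
# Sketch: succeed only when moving from version $1 to version $2 is a patch bump,
# the next minor, or the first minor of the next major. Hypothetical helper.
is_allowed_hop() {
    from=${1#v}; to=${2#v}
    from_major=${from%%.*}; from_rest=${from#*.}; from_minor=${from_rest%%.*}
    to_major=${to%%.*};     to_rest=${to#*.};     to_minor=${to_rest%%.*}
    if [ "$to_major" -eq "$from_major" ]; then
        # Same major: same minor (patch bump) or exactly the next minor.
        [ "$to_minor" -eq "$from_minor" ] || [ "$to_minor" -eq $((from_minor + 1)) ]
    elif [ "$to_major" -eq $((from_major + 1)) ]; then
        # Next major: only its .0 minor (assumes $1 is the last minor of its major).
        [ "$to_minor" -eq 0 ]
    else
        return 1
    fi
}
```

For example, `is_allowed_hop 0.12.1 0.13.0` succeeds while `is_allowed_hop 0.12.1 2.6.0` fails, which matches the policy quoted above. Downgrades are rejected too; rollback is a separate, deliberate path.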
### 4.3 k8s ↔ ESO must advance roughly in lockstep
Each ESO release targets a **narrow** k8s band ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)):
| ESO | k8s band |
|---|---|
| 0.12.x | 1.19 → 1.31 |
| 0.16.x | 1.32 |
| 0.17.x | 1.33 |
| 2.0 – 2.5 | 1.34 – 1.35 |
| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34–1.35; see Open Questions) |
**This is the single most important sequencing constraint.** ESO doesn't "support only ≤ its max k8s" in a wide range — older ESO may not run cleanly on a *much newer* k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be **interleaved**, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a **coordinated ESO+k8s climb**, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. (The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling — see Open Questions for who drives the interleave.)
### 4.4 API migration: **must rewrite manifests to `v1` FIRST — there is NO v1beta1→v1 conversion webhook**
- **`external-secrets.io/v1` promoted to STORAGE version: v0.16.0.** v0.16.0 release notes "BREAKING CHANGES": *"Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts"* and *"Removal of Conversion Webhooks and …/v1alpha1…"*. From 0.16, **etcd stores `v1`**. Source: [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0).
- **`external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0.** Verbatim ([v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0)):
> "v0.17.0 Stops serving `v1beta1` apis. You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth."
- **No v1beta1→v1 conversion webhook.** The only conversion webhook that ever existed was v1alpha1→v1beta1, **removed in 0.16**. Maintainer (issue [#5478](https://github.com/external-secrets/external-secrets/issues/5478), @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1 — *"This isn't really a conversion issue."* ⇒ **old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.**
- **Verdict: MUST-REWRITE-FIRST.** Rewrite all CRs to `v1` while on **0.16.x** (which serves *both* v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @Dutchy-): *"I was able to change v1beta1 to v1 on 0.16 without issues. After that I was able to upgrade to 0.17."*
- There is a deprecated escape hatch in chart 2.6.0 — `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers — but its own values comment says *"This flag will be removed on 2026.05.01"* (i.e. **already past**, do not rely on it).
- **Schema change is a PURE apiVersion string bump — ZERO field changes.** CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have **byte-identical** spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): *"Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do."* PushSecret only ever had `v1alpha1` (no v1beta1) — **unaffected** (we have 0 anyway).
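Because the change is a pure string bump, the rewrite and its gate can be scripted. The following is a sketch under stated assumptions: both function names are hypothetical, and the root directory passed in is expected to contain only files that should be rewritten (no provider caches or vendored docs).

```sh
# Sketch of the mechanical apiVersion bump; function names are hypothetical.
# Rewrites external-secrets.io/v1beta1 to external-secrets.io/v1 in every file under $1.
rewrite_eso_api_version() {
    root=$1
    grep -rl 'external-secrets\.io/v1beta1' "$root" 2>/dev/null | while IFS= read -r file; do
        # -i.bak works with both GNU and BSD sed; the backup is removed on success.
        sed -i.bak 's|external-secrets\.io/v1beta1|external-secrets.io/v1|g' "$file" \
            && rm -f "$file.bak"
    done
}

# Number of files under $1 that still mention the old apiVersion (the gate wants 0).
count_v1beta1_files() {
    grep -rl 'external-secrets\.io/v1beta1' "$1" 2>/dev/null | wc -l | tr -d ' '
}
```

Running `rewrite_eso_api_version stacks` in a worktree and then checking that `count_v1beta1_files stacks` prints `0` gives a reviewable diff consisting only of apiVersion lines.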
### 4.5 Helm chart values + CRD handling (0.12 → 2.6)
- **No top-level values removed or renamed.** `values.yaml` diff 0.12.1↔2.6.0 is **additive only** (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives.
- **`installCRDs` still works** in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are **templated into the single `external-secrets` chart** and **upgraded by `helm upgrade`** automatically — there is **no separate CRDs subchart**, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). We don't set it, so the new default is fine.
- **CRD storedVersions bookkeeping (the one real pre-flight check):** v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion.
- **Helm provider:** `Chart.yaml apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; **no minimum Helm version declared** (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. **The 2.x chart does NOT require a newer helm provider than 0.12 did** — the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.)
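The storedVersions pre-flight check mentioned above can be sketched as follows. The helper names are hypothetical, and the sketch assumes `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'` prints a JSON-style array such as `["v1beta1"]`. CRD names are passed explicitly because some ESO CRDs (PushSecret, per §4.4) only ever had `v1alpha1` and would otherwise be flagged.

```sh
# Sketch: succeed when a storedVersions value (e.g. ["v1alpha1","v1beta1"])
# still lists v1alpha1.
has_v1alpha1() {
    printf '%s\n' "$1" | grep -q '"v1alpha1"'
}

# Hypothetical pre-flight: check the CRDs named as arguments, report offenders,
# and return non-zero if any still lists v1alpha1 or cannot be read.
check_eso_stored_versions() {
    rc=0
    for crd in "$@"; do
        stored=$(kubectl get crd "$crd" -o jsonpath='{.status.storedVersions}') || return 1
        if has_v1alpha1 "$stored"; then
            echo "$crd still lists v1alpha1 in storedVersions: $stored" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A plausible invocation is `check_eso_stored_versions externalsecrets.external-secrets.io secretstores.external-secrets.io clustersecretstores.external-secrets.io clusterexternalsecrets.external-secrets.io`, run before the hop to 0.16.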
### 4.6 Data migration: **downstream Secrets survive**
The synced Kubernetes `Secret` objects are **not deleted or force-resynced** by these upgrades. The change is an apiVersion bump on the *custom resources*, whose `spec` is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal **reconcile (re-assert, not delete)**. Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. **Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step** (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-**generators** change — we use no generators, so N/A.
---
## 5. Migration strategy (ordered, do-this-then-that)
> **Pre-reqs every step:** run from an **unlocked** infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is **Tier-0** so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and **`git push`** after each apply (SOPS state). Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop.
### Phase 0 — Pre-flight (no changes)
1. Confirm cluster k8s version and the version-check chain's current target; **coordinate with the k8s climb** (see §4.3 / Open Questions). Decide who drives the interleave.
2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'` — confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this *before* reaching 0.16).
3. **Snapshot all ESO-managed Secrets** (rollback safety net):
`kubectl get externalsecrets -A` (record the 104) and `for ns/secret in <targets>: kubectl get secret -n <ns> <name> -o yaml > backup/<ns>-<name>.yaml`. Keep outside git-crypt or encrypt.
4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` — record resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first.
5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step.
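The Secret snapshot in step 3 is written above as pseudocode; a runnable form might look like the sketch below. `snapshot_secrets` is a hypothetical helper, and the list of `<namespace> <secret-name>` pairs is assumed to be prepared by the operator (for example from the recorded `kubectl get externalsecrets -A` output).

```sh
# Sketch of the Phase-0 Secret snapshot. Reads "<namespace> <secret-name>" pairs
# on stdin and writes one YAML file per Secret into the directory given as $1.
snapshot_secrets() {
    backup_dir=$1
    mkdir -p "$backup_dir" && chmod 700 "$backup_dir" || return 1
    while read -r ns name; do
        # Skip blank or incomplete lines rather than guessing.
        [ -n "$ns" ] && [ -n "$name" ] || continue
        if ! kubectl get secret -n "$ns" "$name" -o yaml \
                > "$backup_dir/$ns-$name.yaml" < /dev/null; then
            echo "snapshot failed for $ns/$name" >&2
            return 1
        fi
    done
}
```

The output files contain live credentials, so the backup directory should stay outside the repository or be encrypted, as step 3 already requires.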
### Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet)
ESO `0.16.x` is the **transition version** that serves *both* v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as `v1beta1`:
6. For `v` in `0.13.0, 0.14.0, 0.15.x, 0.16.2` (use latest patch of each minor): set `helm_release.external_secrets.version = "<v>"`, `scripts/tg plan` (expect: chart upgrade + CRD upgrade in place; **no `kubernetes_manifest` replacements** — apiVersion unchanged), `scripts/tg apply`, `git push`, wait for rollout, verify `kubectl get externalsecrets -A` all `SecretSynced=True`.
- **Interleave k8s as required:** before/at 0.16 the cluster should be on **k8s 1.32** (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point.
- Watch the **0.14.0** notes (generators) — N/A for us, but eyeball the plan diff anyway.
7. **Land on 0.16.2 and STOP.** Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work.
### Phase 2 — Rewrite all 104 CRs + 2 stores to `v1` (while on 0.16.2)
This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served.
8. **Mechanical rewrite** across `stacks/`: replace the apiVersion string `external-secrets.io/v1beta1` → `external-secrets.io/v1` in every ExternalSecret and ClusterSecretStore `kubernetes_manifest` (104 + 2 = 106 occurrences across 73 files, **including the 6 nested-module files** in §2.2). **No other field changes** (schema identical). Do this in a worktree, committed file-by-file.
- Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion — unaffected).
9. **Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:**
a. **Stores first:** `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`.
b. **ExternalSecrets, per stack:** for each of the 73 stacks, `scripts/tg apply -target=kubernetes_manifest.<external_secret_name>` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the 15 plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The **15 plan-time-coupled stacks** (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. `git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0).
- Because the spec is identical, the *replace* re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout.
10. **Verify the rewrite fully landed:** `grep -r 'external-secrets.io/v1beta1' stacks/` prints **nothing**; `kubectl get externalsecrets -A` with `-o jsonpath` confirms all are served as `v1`; all `SecretSynced=True`; spot-check that a rotated DB cred (e.g. `nextcloud-db-creds`) is still valid.
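The per-stack two-phase apply from step 9b lends itself to a small generator that prints the commands for review instead of running them. This is a sketch under assumptions: `two_phase_plan` is a hypothetical name, the resource addresses must be supplied by the caller, and invoking `scripts/tg` from inside `stacks/<stack>` via a relative path is an illustration, not confirmed repo behavior.

```sh
# Sketch: print (not run) the two-phase apply for one stack.
# Usage: two_phase_plan <stack> <resource-address>...
two_phase_plan() {
    stack=$1; shift
    [ "$#" -gt 0 ] || { echo "no -target addresses given for $stack" >&2; return 1; }
    targets=""
    for addr in "$@"; do
        targets="$targets -target='$addr'"
    done
    # Phase a: materialize the replaced v1 CR(s) and their Secrets first.
    echo "(cd stacks/$stack && ../../scripts/tg apply$targets)"
    # Phase b: full apply, so plan-time data sources read the now-existing Secret.
    echo "(cd stacks/$stack && ../../scripts/tg apply)"
}
```

Printing the commands keeps a human checkpoint between stacks; the `git push` after each apply and the synced-state check still happen outside this helper.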
### Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0
Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd):
11. Bump chart `0.16.2 → 0.17.x`. `scripts/tg plan` (expect chart/CRD upgrade; **no manifest replacements** — already v1), apply, push, rollout, verify all synced. **k8s should be 1.33** (0.17 band) around here.
12. Continue one minor at a time: `0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0`. At each: bump `version`, plan, apply, push, rollout, verify synced. **k8s reaches 1.34 then 1.35** across the 2.x steps.
- **At 2.0.0:** confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault — should be a no-op).
13. **Land on 2.6.0.** Verify: controller image `v2.6.0`, all 104 ExternalSecrets `SecretSynced=True`, both ClusterSecretStores `Valid=True`.
### Phase 4 — Close the gate + docs
14. Advance k8s to **1.35** via the version-check chain if not already; confirm the **compat-gate now lists ESO as compatible** and 1.35 is unblocked.
15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (**104 ExternalSecrets / 73 files / 26 db-refs**), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail).
---
## 6. Risks & mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| **Secret-sync outage → app DB/API auth failures** during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces **per stack** (small blast radius); the 15 plan-time stacks use `-target` first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. |
| **Crossing 0.17 with any CR still v1beta1** → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -r v1beta1 stacks/` **must print nothing** AND `kubectl get …v1beta1…` returns nothing live before Phase 3. Do not skip 0.16. |
| **CRD removal/replace by helm dropping data** | Low | Chart manages CRDs in-place via `installCRDs=true` (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD *upgrade*. Snapshot anyway. Never `helm uninstall` (that can GC CRDs). |
| **No conversion webhook safety net** (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated `unsafeServeV1Beta1` is already past its 2026-05-01 removal — do NOT rely on it. |
| **`kubernetes_manifest` forces replace on apiVersion bump** → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. |
| **Vault 7-day DB rotation lands mid-migration** → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. |
| **git-crypt tls-secret-sync landmine** | Low (not in ESO stack) | ESO stack has no `file()` git-crypt reads; run from an **unlocked** checkout; do **not** piggyback kyverno applies during this work. |
| **helm/k8s provider in lockfile too old for 2.x chart** | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). |
| **k8s/ESO band mismatch** (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. |
| **Many small applies = long, error-prone session** | Med | Script the per-stack `-target`-then-full loop; checkpoint with `kubectl get externalsecrets -A` after each; the rewrite itself is a single `sed`-class change so low semantic risk. |
---
## 7. Rollback plan (per hop)
- **During Phase 1 (chart climb, still v1beta1):** revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean.
- **During Phase 2 (v1 rewrite, on 0.16.2):** 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply — the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the **last point of easy rollback**.
- **After Phase 3 (≥0.17, v1beta1 no longer served):** **rollback is HARD** — once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back ([general guidance + maintainer position](https://github.com/external-secrets/external-secrets/issues/5478)). Treat **crossing 0.17 as the point of no return.** If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback — hence the Phase-2 gate must be airtight.
- **Always available:** the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO.
---
## 8. Verification
**Per hop:**
- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target.
- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`.
- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`.
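The per-hop "all synced" check can be reduced to a single number. The sketch below uses hypothetical function names and assumes that `kubectl`'s `custom-columns` output accepts a JSONPath filter on the `Ready` condition; a missing condition prints `<none>` and is deliberately counted as not ready.

```sh
# Sketch: count ExternalSecrets whose Ready condition is not "True".
# Input on stdin: "<namespace> <name> <ready-status>" per line.
count_not_ready() {
    awk '$3 != "True" { n++ } END { print n + 0 }'
}

# Hypothetical wrapper that feeds live cluster state into the counter.
eso_not_ready_count() {
    kubectl get externalsecrets -A --no-headers \
        -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status' \
        | count_not_ready
}
```

A hop would be considered healthy only when `eso_not_ready_count` prints `0`; any other value means at least one ExternalSecret needs a closer look before the next bump.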
**After Phase 2 (v1 rewrite):**
- `grep -r 'external-secrets.io/v1beta1' stacks/` → **no output**.
- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check.
- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates.
**Final (2.6.0):**
- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid.
- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof).
- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer): pods running, no auth errors in logs.
- **Compat-gate:** run the upgrade-state / k8s-version-check audit — ESO no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds.
---
## 9. Open questions
1. **k8s/ESO interleave ownership.** §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. **Who drives the interleave** — does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.)
2. **2.6.0 ↔ k8s 1.35 explicit support.** The support matrix table currently ends at **2.5** (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a *strong inference* not a quoted row. Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. ([matrix](https://external-secrets.io/latest/introduction/stability-support/))
3. **Resolved helm provider version.** The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. (Chart needs only Helm SDK v3 — likely a no-op, but verify.)
4. **Intermediate-minor patch selection.** Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.**2** specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch.
5. **Per-stack apply automation.** 73 stacks × (target + full) apply is large. Acceptable to script a loop, or prefer manual per-stack with checkpoints? Some stacks have other in-flight drift that a full apply would also push — needs a clean-plan check per stack first.
6. **Stateful generators / advanced features.** Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A — but re-confirm no generator usage crept in before Phase 3.
---
## 10. Sources (decisive facts)
- Skip-version policy + k8s support matrix: <https://external-secrets.io/latest/introduction/stability-support/>
- `v1` promoted to storage version (0.16.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0>
- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0>
- No conversion webhook / "not a conversion issue" (#5478): <https://github.com/external-secrets/external-secrets/issues/5478>
- v1beta1↔v1 schema identical / "nothing fancy" (#4785): <https://github.com/external-secrets/external-secrets/issues/4785>
- App v1.0.0 ≠ API v1: <https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0>
- v2.0.0 only removes Alibaba/Device42: <https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0>
- Chart 2.6.0 on ArtifactHub: <https://artifacthub.io/packages/helm/external-secrets-operator/external-secrets>

@ -0,0 +1,140 @@
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
- **Date:** 2026-06-21
- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending)
- **Owner:** Viktor (wizard)
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
## Goal
When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely; their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns.
## Background — why the skew persists (root cause, verified 2026-06-21)
- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6).
- Its idle check is coarse — `unit_busy()`:
```sh
pid=$(systemctl show -p MainPID --value "$unit")
pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode'
```
i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window.
- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then.
- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`.
## Decisions (from brainstorm 2026-06-21)
1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart.
2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing.
3. **Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic.
4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*.
## Constraints (load-bearing)
1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery.
2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL.
3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt.
4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today.
5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim.
## Design
### Components
Four new files in `scripts/` + a one-line addition to the existing job:
1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`:
pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`.
**Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical.
2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below).
3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.)
4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks:
```ini
[Timer]
# Fires 01:00, 01:20, …, 05:40; none at/after 06:00. System TZ (UTC); tune the window.
OnCalendar=*-*-* 01..05:00/20
# Never replay a missed migrate-restart at an unpredictable time.
Persistent=false
RandomizedDelaySec=120
```
5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral:
```sh
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW
deferred=$((deferred+1)); continue
```
where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction.
### Why a deferral marker (not version-introspection)
The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
### Control flow of `t3-migrate-idle` (per tick)
```
for marker in $DEFER_DIR/*: # nothing deferred → no-op
user = basename(marker); unit = t3-serve@<user>.service
[ unit is an active running service ] or { rm marker; continue } # gone
if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear
if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick
target = contents(marker)
if safe_restart_unit(unit, target): rm marker # success: verified on new binary
else: # helper already restored DB + rolled back binary + froze + alerted
break # frozen: stop draining; a human investigates
```
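The "already restarted" branch above boils down to one timestamp comparison. A minimal sketch, assuming GNU `stat`/`touch`; `marker_is_stale` is a hypothetical name, and in production the first argument would come from `systemctl show -p ActiveEnterTimestamp --value` converted to epoch seconds.

```bash
#!/usr/bin/env bash
# marker_is_stale <unit_started_epoch> <marker_file>   (hypothetical helper)
# Exit 0 when the unit entered "active" AFTER the marker was written, i.e. the
# unit was already restarted by someone else and the marker can just be cleared.
marker_is_stale() {
  local started="$1" marker="$2" written
  written="$(stat -c %Y "$marker" 2>/dev/null)" || return 1   # unreadable -> keep marker
  [ "$started" -gt "$written" ]
}

# Demo with a fixed mtime so the outcome does not depend on the clock.
demo="$(mktemp)"; touch -d @1750000000 "$demo"
marker_is_stale 1750000600 "$demo" && echo "clear marker: restarted after the deferral"
marker_is_stale 1749999000 "$demo" || echo "keep marker: restart still owed"
rm -f "$demo"
```

Treating an unreadable marker as "not stale" keeps the sketch conservative: the marker stays and the idle gate gets the final say.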
### `safe_to_restart(user)` — the gate
Single read-only query, run as the user:
```sh
runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" "
SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now')
- julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
```
- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.)
- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing.
- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3).
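As a worked example of the idle arithmetic: last activity at `10:00:00Z` and a clock reading of `10:15:00Z` give 15 × 60 = 900 idle seconds, which just meets the default `QUIET_SECONDS`. The sketch below is a shell-side cross-check of that calculation using GNU `date`, not the gate itself (the gate does this inside SQLite); `idle_seconds_since` is a hypothetical name.

```bash
#!/usr/bin/env bash
# idle_seconds_since <iso8601_utc> [now_epoch]   (hypothetical cross-check helper)
# Mirrors the SQL normalization: drop the 'T' separator and the trailing 'Z'
# (and any fractional seconds) before parsing the timestamp as UTC.
idle_seconds_since() {
  local ts="$1" now="${2:-$(date -u +%s)}" last
  ts="${ts/T/ }"; ts="${ts%Z}"; ts="${ts%.*}"   # 2026-06-21T10:00:00.000Z -> 2026-06-21 10:00:00
  last="$(date -u -d "$ts" +%s)" || return 1     # unparseable -> fail closed
  echo $(( now - last ))
}

now="$(date -u -d '2026-06-21 10:15:00' +%s)"
idle_seconds_since 2026-06-21T10:00:00.000Z "$now"
```

Comparing this figure with column 2 of the real query on the devvm is one way to check open question (b) during implementation.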
### Failure recovery
Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option.
### Observability
- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert.
- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped.
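The optional marker-age gauge could look roughly like this. Everything named here is an assumption for illustration: the metric name `t3_deferral_marker_age_seconds`, the helper `marker_age_metrics`, and the Pushgateway host and job name in the comment. An alert on a value above 259200 (3 × 86400 seconds) would correspond to "a marker survived more than 3 days".

```bash
#!/usr/bin/env bash
# marker_age_metrics <defer_dir> [now_epoch]   (hypothetical helper and metric name)
# Emits one gauge sample per marker in Prometheus text exposition format.
marker_age_metrics() {
  local dir="$1" now="${2:-$(date +%s)}" m written
  echo '# TYPE t3_deferral_marker_age_seconds gauge'
  for m in "$dir"/*; do
    [ -e "$m" ] || continue                      # empty directory: glob stays literal
    written="$(stat -c %Y "$m" 2>/dev/null)" || continue
    printf 't3_deferral_marker_age_seconds{user="%s"} %d\n' "$(basename "$m")" "$(( now - written ))"
  done
}

# Demo against a scratch directory with one marker of known age.
demo="$(mktemp -d)"; touch -d @1000 "$demo/wizard"
marker_age_metrics "$demo" 4000
rm -rf "$demo"

# Hypothetical push (placeholder host and job name):
#   marker_age_metrics /var/lib/t3-autoupdate/deferred \
#     | curl -s --data-binary @- http://pushgateway.example:9091/metrics/job/t3_migrate_idle
```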
### Delivery
- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units:
- `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh`
- `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle`
- add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`)
- add `t3-migrate-idle.timer` to the `systemctl enable --now` list
- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm.
- No Terraform (hand-managed VM 102).
## Testing
- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible.
- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live.
- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor.
## Out of scope (YAGNI)
- Daytime restarts / "around the clock" cadence (de-scoped: overnight only).
- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility).
- Per-user opt-out file (not needed — the job is self-limiting via markers).
- Any change to how `t3-autoupdate` *installs/gates* a build.
## Open questions
None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard).


@ -0,0 +1,729 @@
# t3 idle-migrate Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.
**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).
**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.
---
## File structure
- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
---
## Task 1: Shared library `t3-safe-restart.sh`
**Files:**
- Create: `scripts/t3-safe-restart.sh`
- [ ] **Step 1: Create the library**
```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}
```
- [ ] **Step 2: Syntax + lint check**
Run: `bash -n scripts/t3-safe-restart.sh && if command -v shellcheck >/dev/null; then shellcheck -x scripts/t3-safe-restart.sh; else echo "shellcheck absent, skipped"; fi`
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.)
- [ ] **Step 3: Source-and-define smoke test**
Run:
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate

Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
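To make the `authentik=os_user` map format concrete, the sketch below runs the library's `osusers` and `ak_for` one-liners (copied from Step 1) against a throwaway map file. The map entries are invented sample data, not real accounts.

```bash
#!/usr/bin/env bash
# Sample map (invented entries): comments and spacing around '=' are tolerated.
USER_MAP="$(mktemp)"
cat >"$USER_MAP" <<'EOF'
# authentik=os_user
alice = wizard
bob=bob
alice.alt=wizard
EOF

# Same definitions as in t3-safe-restart.sh.
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

osusers          # unique OS users, sorted
ak_for wizard    # first authentik name that maps to wizard
```

Two authentik names mapping to one OS user collapse to a single entry in `osusers`, and `ak_for` picks the first match, which is the name `verify_pairing` sends in the `X-authentik-username` header.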
---
## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals
**Files:**
- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225)
- [ ] **Step 1: Source the library; drop the now-shared helpers**
Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with the following, which keeps the autoupdate-only config and sources the lib for the shared bits:
```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
mkdir -p "$STATE_DIR" 2>/dev/null || true
```
(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**
Replace the `backup_all()` definition (lines 90–105) with:
```bash
ADMIN_SEED=""
backup_all() {
local u dst
for u in $(osusers); do
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```
Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165); they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).
- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**
Replace the step-6 loop body (lines 196–225) with:
```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done
```
- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**
Run:
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the only remaining defer/restart decisions are unchanged vs HEAD~1 logic:
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals

Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
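The retained `newer()` helper relies on `sort -V` (GNU version sort) rather than plain string comparison. A short sketch of why that matters, using made-up version strings:

```bash
#!/usr/bin/env bash
# Same definition as kept in t3-autoupdate.sh (Step 1 above).
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

# Version pairs are illustrative only.
for pair in "0.0.613 0.0.605" "0.0.605 0.0.613" "0.0.10 0.0.9" "0.0.613 0.0.613"; do
  read -r a b <<<"$pair"
  if newer "$a" "$b"; then echo "$a is newer than $b"; else echo "$a is NOT newer than $b"; fi
done
```

The `0.0.10` versus `0.0.9` pair is the interesting one: a lexical comparison would rank `0.0.9` higher, while version sort treats the last component numerically. Equal versions are rejected by the leading `!=` test, so "strictly newer" holds.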
---
## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`
**Files:**
- Create: `tests/t3-migrate-idle-gate.test.sh`
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)
- [ ] **Step 1: Write the failing test**
Create `tests/t3-migrate-idle-gate.test.sh`:
```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running
pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }
# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe
# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"
# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"
# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500
# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"
echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```
- [ ] **Step 2: Run it to verify it fails**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).
- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**
```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe.
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
: # drain loop added in Task 4
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```
- [ ] **Step 4: Run the test to verify it passes**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (exit 0).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD

The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
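The way `safe_to_restart` splits the query result (`${row%%|*}` and `${row##*|}`) is what lets `gate_is_safe` tell "no rows" apart from "no usable output". A minimal sketch of that splitting; `split_row` is a hypothetical helper used only to print the two fields.

```bash
#!/usr/bin/env bash
# split_row <raw_row>   (hypothetical): show the two fields handed to gate_is_safe.
split_row() {
  local row="$1"
  printf 'active=[%s] idle=[%s]\n' "${row%%|*}" "${row##*|}"
}

split_row "0|1234"   # normal row: both fields numeric
split_row "0|"       # empty table: count is 0, idle is empty (SQL NULL)
split_row ""         # no output at all: active is empty, so the numeric check rejects it
split_row "Error"    # no separator: both fields carry the junk, active is rejected
```

Only the second case reaches the "no threads at all, safe" branch, because it is the only one where `active` is a clean `0` and `idle` is empty.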
---
## Task 4: The marker-drain loop in `t3-migrate-idle.sh`
**Files:**
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)
- [ ] **Step 1: Implement `main()` (the drain loop)**
Replace the `main() { : ; }` skeleton with:
```bash
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
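last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
[ -n "$last_good" ] || last_good="$(ver)" # never hand rollback_binary an empty version (idle path: last_good == installed)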
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).
- [ ] **Step 3: Syntax + lint**
Run: `bash -n scripts/t3-migrate-idle.sh && if command -v shellcheck >/dev/null; then shellcheck -x scripts/t3-migrate-idle.sh; else echo "shellcheck absent, skipped"; fi`
Expected: no syntax errors.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe

For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 5: systemd units
**Files:**
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`
- [ ] **Step 1: Create the service unit**
`scripts/t3-migrate-idle.service`:
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Create the timer unit**
`scripts/t3-migrate-idle.timer`:
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target
```
- [ ] **Step 3: Validate unit syntax**
Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).
- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**
Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01–05).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
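For a quick sanity check without systemd at hand, the nominal slots implied by `01..05:00/20` (hours 01 through 05, minutes 00/20/40) can be enumerated in plain bash. This is an illustration of the schedule only; `slots` is a hypothetical helper, and `RandomizedDelaySec=120` shifts each real firing by up to two minutes.

```bash
#!/usr/bin/env bash
# slots: list the nominal HH:MM ticks of OnCalendar=*-*-* 01..05:00/20
slots() {
  local h m
  for h in 01 02 03 04 05; do
    for m in 00 20 40; do echo "$h:$m"; done
  done
}

slots | tr '\n' ' '; echo
echo "ticks per night: $(slots | wc -l | tr -d ' ')"
```

Five hours times three ticks gives 15 attempts per night, the last nominally at 05:40.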
---
## Task 6: Wire into `setup-devvm.sh`
**Files:**
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
- [ ] **Step 1: Install the lib + the new script (section 9a)**
After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
```bash
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Install the unit files (section 9d loop)**
Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
```bash
t3-migrate-idle.service t3-migrate-idle.timer \
```
- [ ] **Step 3: Enable the timer (section 9 enable line)**
Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
```bash
systemctl enable --now t3-dispatch.service \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
```
- [ ] **Step 4: Syntax check**
Run: `bash -n scripts/workstation/setup-devvm.sh`
Expected: no syntax errors.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 7: Deploy to the devvm + validate (dry-run first)
**Files:** none (operational). Presence-claimed, shared-host mutation.
- [ ] **Step 1: Claim the host**
Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).
- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**
Run:
```bash
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
sudo systemctl daemon-reload
```
Expected: no errors.
- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**
The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib:
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do"
```
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).
- [ ] **Step 3: DRY-RUN the idle migrator against live state**
Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
Expected: with wizard currently busy (mid-turn during the day), a `skipped` count — `idle-migrate pass complete (migrated=0 skipped=N)` — and NO restart. (If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act.)
- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**
The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt:
```bash
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
```
Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting.
- [ ] **Step 5: Enable the timer (live)**
Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
Expected: timer active, next elapse in the 01:00–05:40 window.
- [ ] **Step 6: Release the claim**
Run: `homelab release host:devvm`
> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).)
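
The idle gate that the dry-run exercises can be modelled as a pure decision. This is a minimal Python sketch, not the shipped bash: the names `gate_is_safe` and `QUIET_SECONDS` echo the plan, but the signature, the epoch-seconds inputs and the treatment of an empty `active_turn_id` are assumptions for illustration.

```python
from typing import Optional

# Assumption: 15 minutes of quiet, per the ">=15 min quiet" rule in the runbook text.
QUIET_SECONDS = 15 * 60


def gate_is_safe(
    active_turn_id: Optional[str],
    last_activity_epoch: float,
    now_epoch: float,
    quiet_seconds: int = QUIET_SECONDS,
) -> bool:
    """Return True only when a restart would not interrupt the user.

    Hypothetical inputs: `active_turn_id` is whatever the user's state DB
    reports for an in-flight turn (None or "" meaning none), and
    `last_activity_epoch` is the timestamp of the most recent activity.
    """
    if active_turn_id not in (None, ""):
        return False  # a turn is in flight: never restart
    return (now_epoch - last_activity_epoch) >= quiet_seconds
```

Keeping the decision free of I/O is what makes it testable in isolation; the caller stays responsible for reading the values out of `state.sqlite`.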
---
## Task 8: Docs
**Files:**
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
- Modify: `.claude/reference/service-catalog.md` (add the unit)
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)
- [ ] **Step 1: Runbook** — add a section after the autoupdate description:
```markdown
## Idle migrator (`t3-migrate-idle.timer`)
`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
`t3-migrate-idle` (overnight, every 20 min 01:00–05:40) drains those markers:
it restarts a deferred instance onto the current binary only when that user's
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
user's DB + freezes + alerts (the binary rollback is a no-op since the build was
already accepted); the user's server may crashloop on the restored DB until the
freeze is cleared. Investigate per the rollback section above.
```
- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).
- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 9: Land
- [ ] **Step 1: Merge latest master into the branch**
Run:
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" fetch forgejo
git "${GC[@]}" merge --no-edit forgejo/master
```
Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any.
- [ ] **Step 2: Re-run the gate tests post-merge**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0`.
- [ ] **Step 3: Push to master**
Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
Expected: accepted. Non-fast-forward → fetch/merge/retry.
- [ ] **Step 4: Watch CI to completion**
Run: `homelab ci watch`
Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it).
- [ ] **Step 5: Clean up the worktree**
Run (from the main checkout):
```bash
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
```
---
## Self-review
- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism).
- **Placeholders:** none — every file has complete content; every command has expected output.
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.


@ -0,0 +1,117 @@
# k8s-upgrade compat-gate: classify "actionable" vs "held" blocks
**Date:** 2026-06-28
**Status:** design → implementation
**Stack:** `stacks/k8s-version-upgrade` (+ `stacks/monitoring` alert rules)
## Problem
The cluster is on k8s 1.35.6. The nightly `k8s-version-check` chain detects the
next minor (1.36.2), runs the preflight compat-gate, and the gate **refuses**
it — because no released kyverno/ESO supports k8s 1.36 yet, and gpu-operator is
deliberately pinned (its 26.3 bump needs a newer NVIDIA driver image + Ubuntu
release we're not ready for). The result, **every single night**:
- a **Failed** preflight Job (`block()` exits 1), and
- `k8s_upgrade_blocked=1` → the **K8sUpgradeBlocked** alert.
But this block is **not actionable** — there's nothing we can upgrade to clear
it; we can only wait for upstream (kyverno/ESO) and, separately, do the
gpu-operator/Ubuntu work. The gate is crying wolf: a "blocked, needs attention"
signal that's indistinguishable from a block we could actually fix.
## Goal
Make the gate **classify** each blocker and behave accordingly:
| Class | Definition | Behaviour |
|-------|-----------|-----------|
| **actionable** | the compat matrix has a newer version of the addon whose `max_k8s >= target`, and the running version is older — upgrading it would clear the block | **alert** (`k8s_upgrade_blocked=1` → K8sUpgradeBlocked), with the specific "upgrade X → Y" remediation in the nightly report |
| **waiting-upstream** | **no** matrix version of the addon supports the target yet (kyverno/ESO for 1.36) | **quiet** (`k8s_upgrade_held=1`, no alert) — nightly report only |
| **pinned** | a supporting version exists but the addon carries `"pinned": true` in the matrix (gpu-operator) | **quiet** (held) |
Removed-API and containerd blocks are always **actionable**. **Held wins:** if
*any* blocker is waiting-or-pinned, the whole target is **HELD** (quiet) —
acting on the actionable blockers wouldn't unblock it yet. The nightly report
still lists everything so the full eventual scope is visible.
Also (scope decision: "tidy the block path"): deliberate gate decisions
(actionable-block **and** held) now make the preflight Job **Complete cleanly**
(exit 0) instead of Failing. Chain progression is gated on the verdict, not the
exit code. Real failures (unhealthy nodes, kubeadm errors, crashes) still exit
1 → `K8sUpgradeChainJobFailed`.
## Design
### `compat-gate.py`
- New exit codes: `0` safe · `2` actionable-block · `3` gate-error (fail-safe) · **`4` held**.
- Each stdout reason line is tagged `[ACTIONABLE]` / `[WAITING]` / `[PINNED]`.
- `check_addons`: when an addon blocks, decide its class:
- `pinned: true` in its matrix entry → `[PINNED]`.
- else a higher matrix version with `max_k8s >= target` exists → `[ACTIONABLE]` (`upgrade X to >= V`).
- else → `[WAITING]` (`no released X version supports k8s T yet`).
- unreadable image / below-matrix → `[ACTIONABLE]` (fail-safe — a human must look).
- `check_removed_apis`, `check_containerd`: tag `[ACTIONABLE]`.
- `exit_code(reasons)`: `0` if none; `4` if any `held_reason` (WAITING/PINNED); else `2`.
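
A minimal Python sketch of the classification and `exit_code` mapping described above. Only the tags, the `pinned` flag, `max_k8s` and the exit codes come from this design; the matrix shape (`versions` keyed by addon version), the helper names and the reason wording are assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple

SAFE, ACTIONABLE_BLOCK, GATE_ERROR, HELD = 0, 2, 3, 4


def parse_version(v: str) -> Tuple[int, ...]:
    """Turn '1.36.2' or 'v26.3' into a comparable tuple of ints."""
    return tuple(int(p) for p in v.lstrip("v").split("."))


def supports(max_k8s: str, target: str) -> bool:
    """Compare on the k8s minor only, so max_k8s '1.36' covers target '1.36.2'."""
    return parse_version(max_k8s)[:2] >= parse_version(target)[:2]


def classify_addon(name: str, entry: dict, running: str, target: str) -> Optional[str]:
    """Return a tagged reason line if the addon blocks `target`, else None.

    Hypothetical matrix entry shape:
    {"pinned": bool, "versions": {"<addon version>": {"max_k8s": "<k8s minor>"}}}
    """
    versions = entry.get("versions", {})
    if running not in versions:
        # Fail-safe: an unknown running version needs a human to look.
        return f"[ACTIONABLE] {name} {running} is not in the matrix"
    if supports(versions[running]["max_k8s"], target):
        return None  # the running version already supports the target
    if entry.get("pinned"):
        return f"[PINNED] {name} is pinned at {running}"
    newer = sorted(
        (v for v, row in versions.items()
         if parse_version(v) > parse_version(running) and supports(row["max_k8s"], target)),
        key=parse_version,
    )
    if newer:
        return f"[ACTIONABLE] upgrade {name} to >= {newer[0]}"
    return f"[WAITING] no released {name} version supports k8s {target} yet"


def exit_code(reasons: Iterable[str]) -> int:
    """0 if nothing blocks; 4 if any blocker is held (held wins); else 2."""
    reasons = list(reasons)
    if not reasons:
        return SAFE
    if any(r.startswith(("[WAITING]", "[PINNED]")) for r in reasons):
        return HELD
    return ACTIONABLE_BLOCK
```

`GATE_ERROR` (3) is deliberately never returned from `exit_code`: in this sketch it is reserved for the gate itself failing, which would be raised by the caller.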
### `upgrade-step.sh`
- New global `HALT_CHAIN=0`; `spawn_next()` returns early (no next Job) when set.
- Replace `block()` with `record_blocked()` / `record_held()` — push the gauge,
set `HALT_CHAIN=1`, **do not exit**.
- `phase_preflight` gate handling routes on the gate's exit code:
- `0` → push `blocked=0`+`held=0`, proceed.
- `2`/`3` → `record_blocked`, `return 0` (Job Completes, K8sUpgradeBlocked fires).
- `4` → `record_held`, `return 0` (Job Completes, **no alert**).
- Push the gauge **definitively once** per run (remove the pre-reset `blocked=0`
at gate start) so a standing block doesn't flap 1→0→1 and re-notify.
- postflight also clears `held=0` alongside the existing gauge resets.
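
The routing above is small enough to write down as a table. The real implementation is bash in `upgrade-step.sh`; this Python model only illustrates the intended outcomes. `GateOutcome` and `route_gate` are hypothetical names, and two behaviours are assumptions: that the gauge not being raised is pushed as 0, and that an unknown exit code is an error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateOutcome:
    blocked: int      # value pushed for k8s_upgrade_blocked
    held: int         # value pushed for k8s_upgrade_held
    halt_chain: bool  # mirrors HALT_CHAIN: spawn_next() returns early when True
    job_exit: int     # exit status of the preflight Job itself


def route_gate(rc: int) -> GateOutcome:
    """Map the compat-gate exit code to gauges, chain progression and Job status."""
    if rc == 0:
        return GateOutcome(blocked=0, held=0, halt_chain=False, job_exit=0)
    if rc in (2, 3):  # actionable block, or gate error treated fail-safe
        return GateOutcome(blocked=1, held=0, halt_chain=True, job_exit=0)
    if rc == 4:       # held: quiet, no alert
        return GateOutcome(blocked=0, held=1, halt_chain=True, job_exit=0)
    # Assumption: anything else is not a gate verdict and should surface as a failure.
    raise ValueError(f"unexpected compat-gate exit code: {rc}")
```

Every deliberate verdict ends with `job_exit=0`; chain progression depends on `halt_chain`, not on the Job's exit status.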
### detector (`main.tf`, the `k8s-version-check` CronJob)
- Consequence of the tidy change: refusals now **Complete** instead of Failing,
so the old "re-spawn only a *Failed* preflight" idempotency would skip a
refused-but-Complete preflight until its 7d TTL. Fix: re-spawn nightly when the
preflight is **Complete but no `k8s-upgrade-master-<target>` Job exists** (the
gate refused — chain never advanced) — **silently** (no Slack), so a standing
hold re-evaluates each night without noise.
- The per-night `slack "K8s upgrade available…"` becomes an `echo`; the spawn
Slack fires only for a genuinely new spawn or a Failed-respawn (`ANNOUNCE`
flag), not for silent re-evaluations — killing the last nightly-noise source.
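
A sketch of the detector's spawn decision under the new idempotency rule. The actual logic lives in the CronJob's shell script; the function name, the state strings and the `(spawn, announce)` return shape are assumptions chosen for illustration.

```python
from typing import Optional, Tuple


def detector_decision(preflight: Optional[str], master_job_exists: bool) -> Tuple[bool, bool]:
    """Return (spawn, announce) for tonight's run.

    `preflight` is the state of the existing preflight Job for the target:
    None (no Job yet), "Failed", "Complete", or anything else (e.g. "Active").
    """
    if preflight is None:
        return True, True    # genuinely new spawn: announce it
    if preflight == "Failed":
        return True, True    # respawn after a real failure: announce it
    if preflight == "Complete" and not master_job_exists:
        return True, False   # gate refused last time: re-evaluate silently
    return False, False      # chain already advanced, or a Job is still running
```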
### `addon-compat.json`
- Add `"pinned": true` + `"pin_reason"` to the gpu-operator entry (its
`26.3 → 1.36` row stays; `pinned` overrides classification to held). Document
the `pinned` flag in `_comment`. Unpinning later = delete two keys.
### `stacks/monitoring` alert rules (`prometheus_chart_values.tpl`)
- `K8sUpgradeBlocked` (`k8s_upgrade_blocked == 1`): unchanged trigger, now
actionable-only; reword annotation (reasons are in the nightly report, not a
per-run chain Slack).
- `K8sUpgradeChainJobFailed`: **drop** the `unless on() (k8s_upgrade_blocked == 1)`
clause — deliberate blocks no longer create Failed Jobs, so the alert again
means a genuine wedge.
- **No alert** for `k8s_upgrade_held` (intentional — nothing to action; the
nightly report surfaces it). Add a comment recording this.
### `nightly-report.py`
- Read `k8s_upgrade_held`. New `⏸️ HELD — <target> not yet upgradable` headline.
- Group reasons by tag: *Action needed* / *Waiting on upstream* / *Pinned (held by us)*
(fallback bullets for untagged lines, so older reason strings still render).
- Fetch reasons when avail AND (blocked OR held).
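
A minimal sketch of the reason grouping and the fetch condition. The three headings and the tags are taken from this design; the function names, the `Other` bucket name and the dict return shape are assumptions.

```python
import re
from typing import Dict, Iterable, List

HEADINGS = {
    "ACTIONABLE": "Action needed",
    "WAITING": "Waiting on upstream",
    "PINNED": "Pinned (held by us)",
}
TAG_RE = re.compile(r"^\[(ACTIONABLE|WAITING|PINNED)\]\s*(.*)$")


def group_reasons(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group tagged reason lines under their headings, in a stable order.

    Untagged lines (older reason strings) fall into a fallback 'Other' bucket.
    Empty groups are omitted.
    """
    groups: Dict[str, List[str]] = {h: [] for h in HEADINGS.values()}
    groups["Other"] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            groups[HEADINGS[match.group(1)]].append(match.group(2))
        else:
            groups["Other"].append(line)
    return {heading: items for heading, items in groups.items() if items}


def should_fetch_reasons(avail: int, blocked: int, held: int) -> bool:
    """Fetch reasons when an upgrade is available AND the gate blocked or held it."""
    return bool(avail) and (bool(blocked) or bool(held))
```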
## Net effect on 1.36 today
**HELD, quiet** — waiting on kyverno + ESO (upstream) + gpu-operator (pinned);
Calico listed as the lone actionable piece. No nightly Failed Job, no alert —
just the nightly report's ⏸️ line. Flips to actionable (→ alert) only once
kyverno/ESO ship support **and** gpu-operator is unpinned.
## Tests (TDD)
- `compat-gate`: waiting / actionable / pinned-is-held / mixed-held-wins,
removed-API & containerd are actionable, exit_code mapping, + existing
patch/safe cases stay green.
- `nightly-report`: held headline + grouped reasons; existing tests stay green.
- `upgrade-step.sh`: shellcheck; manual review of the HALT_CHAIN + gauge flow
(bash, not unit-tested).
## Out of scope (separate follow-up)
Auto-refreshing the matrix when upstream ships 1.36 support (a periodic
addon-readiness probe). This change only *consumes* the matrix.


@ -0,0 +1,128 @@
# Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken
| Field | Value |
|-------|-------|
| **Date** | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
| **Duration** | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
| **Severity** | SEV3 — Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual `scripts/tg apply` continued to work. No data loss, no app downtime. |
| **Affected Services** | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
| **Issue** | Beads `code-aoxk` (closed 2026-05-26). |
| **Status** | Closed |
## Summary
The failure surfaced in Woodpecker CI as `ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc` from `scripts/tg` whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:
1. **Vault was healthy.** A direct `vault read database/static-creds/pg-terraform-state` from inside a Woodpecker pipeline pod (using K8s SA JWT → `auth/kubernetes/login role=ci`) succeeded every time when run in isolation.
2. **The "Cannot read PG credentials" message in `scripts/tg` was a catch-all** that fired for *any* Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.
Actual root cause: the MetalLB `ServiceL2Status` CR for the `postgresql-lb` service (`dbaas` namespace, VIP `10.0.20.200`) had a stuck `status.node` field that the controller treated as immutable. The L2 speaker kept failing to update it with `Invalid value: "k8s-nodeX": Value is immutable`, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. `scripts/tg` surfaced this as the misleading "Cannot read PG credentials" message.
Manual `scripts/tg apply` from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
## Impact
- **CI degradation**: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual `scripts/tg apply` from DevVM after every push touching one of 28+ stacks.
- **Drift-detection broken**: The daily `drift-detection.yml` Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
- **No user-facing outage**: PG cluster itself, all apps that use PG, and all in-cluster traffic to `10.0.20.200` worked normally. Only the very specific `acquire-state-lock → run operation → release-state-lock` round-trip pattern from CI was unreliable.
## Timeline (UTC)
| Time | Event |
|------|-------|
| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. `code-aoxk` filed. Initial hypothesis: Vault auth/role mismatch. |
| 2026-04-22 — 2026-05-15 | Multiple investigation attempts. Verified Vault K8s `auth/kubernetes/role/ci` has correct policies (`terraform-state`, `ci`). Verified `database/static-creds/pg-terraform-state` exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated `vault read` from Woodpecker pods. |
| 2026-05-16 (~12:14 UTC) | `pg-cluster-3` came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing `ServiceL2Status` CR (was `l2-rgt9d`). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
| 2026-05-16 | RCA breakthrough: noticed `kubectl logs -n metallb-system -l component=speaker` was full of `Invalid value: "k8s-node…": Value is immutable` on the postgresql-lb ServiceL2Status. Correlated with `kubectl get servicel2status` returning multiple stale entries for the same service. |
| 2026-05-16 | **Mitigation**: `kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system`. Speaker recreated the CR cleanly (became `l2-zj9ss`). Flap stopped. PG connections stable. Manual CI re-runs of `monitoring` stack apply succeeded immediately. |
| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from `in_progress` → `open`. |
| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (`wp-01kshph6pa0w6ch0zf5x9bfqgr`), `vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)` succeeded. `vault read database/static-creds/pg-terraform-state` returned valid creds (`username=terraform_state`, last_vault_rotation 2026-05-21, TTL 58h). Live `default.yml` pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb — all `OK`. `postgresql-lb` ServiceL2Status currently single allocation (`l2-sv9vv` on k8s-node3, no flap). Beads task closed. |
## Root Cause
The `metallb-speaker` reconciler in the deployed MetalLB version treats `ServiceL2Status.status.node` as immutable after it is first set. When the L2 announcer's leader election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Without manual deletion, the reconciler will not progress.
Why it manifested as Vault credential errors:
1. CI's `scripts/tg` pre-flight runs `vault read database/static-creds/pg-terraform-state` (line 83 in current code) to get PG credentials. That call succeeds.
2. CI then runs `terragrunt apply` against the Tier 1 stack. Terragrunt connects to `10.0.20.200:5432` for state-lock acquire (via `pg_advisory_lock`). The TCP connection lands on whichever node MetalLB last announced the VIP from.
3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST.
4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
5. `scripts/tg` interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in `2>/dev/null` suppression (since fixed — see Fix #1 below).
## Detection
We did not have any of:
- A direct alert for "MetalLB ServiceL2Status reconciler errors".
- An alert for "PG LB VIP node changed N times in M minutes".
- An end-to-end probe for the CI state-lock pattern (terragrunt against `10.0.20.200`).
The detection mechanism was a human reading `kubectl logs -n metallb-system` for unrelated reasons. It took 25 days from the first observed symptom to reach the RCA.
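
The missing "VIP node changed N times in M minutes" check reduces to counting announcer changes inside a time window. A minimal sketch, assuming some collector already samples which node announces the VIP (for example from the `ServiceL2Status` CR); the function names, the `(epoch_seconds, node)` sample shape and the thresholds are hypothetical.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, str]  # (epoch seconds, announcing node name)


def count_vip_moves(samples: Iterable[Sample], window_seconds: float, now: float) -> int:
    """Count announcer changes among samples taken within the last `window_seconds`."""
    recent: List[Sample] = sorted(
        (t, node) for t, node in samples if now - window_seconds <= t <= now
    )
    return sum(1 for (_, a), (_, b) in zip(recent, recent[1:]) if a != b)


def is_flapping(samples: Iterable[Sample], window_seconds: float, now: float, max_moves: int) -> bool:
    """True when the VIP moved more than `max_moves` times inside the window."""
    return count_vip_moves(samples, window_seconds, now) > max_moves
```

A single move is normal (failover); the alert condition is many moves in a short window, which is what the flap in this incident looked like.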
## Fixes & Mitigations
### 1. Surface real error from `scripts/tg` (DONE)
The original `scripts/tg` swallowed the real `vault read` / terragrunt error behind `2>/dev/null` and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:
```sh
# scripts/tg lines 79-89 (current)
if ! command -v vault >/dev/null 2>&1; then
echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
exit 1
fi
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
echo "$VAULT_OUT" >&2
echo "" >&2
echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
exit 1
}
```
Comment in the code explicitly references this incident.
### 2. Stuck-CR cleanup procedure (DOCUMENTED)
Reproduction check for future sessions (also in `code-aoxk` beads notes):
```sh
kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
# If matches found → same root cause. Delete the stuck CR:
kubectl get servicel2status -n metallb-system
kubectl delete servicel2status.metallb.io <name> -n metallb-system
```
Speaker recreates the CR cleanly within seconds.
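
For a cluster-health check, the same grep can be expressed as a small parser that also reports which nodes the speaker is stuck on. A sketch only: the error text is the one quoted in this post-mortem, while the function name and the tolerance for JSON-escaped quotes are assumptions about how the log lines may be encoded.

```python
import re
from collections import Counter

# Matches: Invalid value: "k8s-node3": Value is immutable
# and the JSON-escaped form: Invalid value: \"k8s-node3\": Value is immutable
IMMUTABLE_RE = re.compile(
    r'Invalid value: \\?"([^"\\]+)\\?": Value is immutable',
    re.IGNORECASE,
)


def stuck_nodes(log_text: str) -> Counter:
    """Count 'Value is immutable' rejections per node name in speaker logs."""
    return Counter(match.group(1) for match in IMMUTABLE_RE.finditer(log_text))
```

Feed it the output of the `kubectl logs` command above; a non-empty result means the stuck-CR procedure applies.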
### 3. Long-term MetalLB controller fix (DEFERRED)
The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:
- **Upgrade MetalLB** to a version where this is fixed (needs research — check changelogs).
- **File upstream issue / patch** with reproducer.
Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual `delete servicel2status` workaround is the playbook, and is fast (<10s).
### 4. Alerting (DEFERRED)
Suggested but not implemented:
- Prometheus alert on `metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"}` rate.
- Synthetic probe: a CronJob that does `pg_advisory_lock` + release against the PG VIP every 5min from CI namespace, alert if it ever fails.
Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
## Lessons
1. **`2>/dev/null` is a time-bomb.** It hid the real error for weeks. Fix #1 already lands the principle; audit other places in `scripts/` for the same anti-pattern next time we touch them.
2. **CRD `status.*` immutability is a non-obvious failure mode.** When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for `immutable`, `cannot update`, and reconciler errors. Add to cluster-health checks.
3. **Misleading wrapper errors cost weeks.** `scripts/tg` claimed "Cannot read PG credentials" — that's what the operator believed. The actual `vault read` step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
4. **CNPG primary changes / endpoint churn can trigger L2 announcer flap.** The trigger (within the timeline) was likely the `pg-cluster-3` pod coming up. Worth flagging for any future CNPG topology changes.
## References
- Beads: `code-aoxk` — closed 2026-05-26.
- `scripts/tg` lines 65-95 — current pre-flight with explicit error surfacing.
- `kubectl get servicel2status -A` — current state, single allocation per service.
- This file: `infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md`.


@ -0,0 +1,131 @@
# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) became
unresponsive under combined memory + IO pressure and had to be **hard-killed +
rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for
wizard/emo/anca lost, in-flight agents killed.
- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM
22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible
IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES /
64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP
instances across three users on top.
## This is the "crawl" class, not the QEMU-stall class
The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a
*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI
controller. That fix shipped (verified 2026-06-22: the guest now boots on
`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem
explicitly deferred **this** class:
> The recurring *crawl* class (agent storms → swap-thrash; journald
> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux
> sessions remain memory-uncontained by **explicit decision (swap-only,
> 2026-06-10)**.
That explicit decision is the root cause closed here.
## Root cause
Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only
one was capped:
| Tree | cgroup | Cap before today |
|---|---|---|
| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ |
| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ |
The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`;
stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and
swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the
overload chain:
```
uncapped tmux growth → disk-swap thrash on a throttled spindle
→ IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
```
i.e. **memory pressure becomes the IO storm**. There was also **no global OOM
backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the
kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely
(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*.
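
The "caps don't sum safely" point is plain arithmetic, sketched here with the figures quoted above. The function name and the dict-of-caps input are hypothetical; this is a sanity check, not part of `setup-devvm.sh`.

```python
from typing import Dict, Tuple


def oversubscription(caps_gib: Dict[str, int], ram_gib: int) -> Tuple[int, int]:
    """Return (sum of hard caps, amount by which that sum exceeds RAM), in GiB.

    A positive second value means the caps alone cannot keep the box within RAM
    if every capped tree fills up at once.
    """
    total = sum(caps_gib.values())
    return total, total - ram_gib


# The pre-fix t3 caps from the table above: three users at MemoryMax=16G each.
t3_caps = {"wizard": 16, "emo": 16, "anca": 16}
total, excess = oversubscription(t3_caps, ram_gib=32)
print(f"caps sum to {total}G, {excess}G over RAM")
```

Because per-tree caps overcommit in aggregate, they bound any single runaway but not the whole box, which is why a global backstop is still needed.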
## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
single-user utilisation while making a box-wide wedge impossible. (The backstop
was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
proved inert with `swap=0` — see Verification + Lessons.)
| Layer | What |
|---|---|
| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
heavy user between 12–16G even with RAM free; bump to 16/20 if that bites.
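
To make the backstop thresholds concrete, here is a sketch of the free-RAM rule in Python. It is not earlyoom's code: it only reproduces the decision described in the table (SIGTERM at 5% `MemAvailable`, SIGKILL at 3%, swap ignored), and the function names are hypothetical.

```python
from typing import Optional


def mem_available_percent(meminfo_text: str) -> float:
    """Compute MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def backstop_action(avail_pct: float, term_pct: float = 5.0, kill_pct: float = 3.0) -> Optional[str]:
    """Return the signal the backstop would send at this free-RAM level, if any."""
    if avail_pct <= kill_pct:
        return "SIGKILL"
    if avail_pct <= term_pct:
        return "SIGTERM"
    return None
```

On the devvm itself the input would be the contents of `/proc/meminfo`; the decision depends on RAM alone, which is the property that made earlyoom usable with `swap=0`.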
## Verification (live, 2026-06-22)
- **Caps live on running cgroups**: all three `user-<uid>.slice` report
`memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
under `docker.slice`.
- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
**swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
slice (16G) and `docker.slice` (8G).
- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
memory.pressure, throttled to a crawl, making no progress and harming nothing —
a runaway is throttled, not just killed.
- **systemd-oomd disproven, then dropped**: a self-policed balloon held
`memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never
killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
reclaim, which a `swap=0` anon workload never does. oomd was purged.
- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
`low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
`SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
## Out of scope / follow-ups
- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
`-N /script` can push a metric). devvm node-exporter is already scraped
(`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
monitoring-stack Terraform change).
- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- **Per-user docker isolation**: containers share one `docker.slice` budget, not
per-user. Fine for current usage (krr + short-lived tools).
- **Host-side IO**: the 60/60 mbps cap + the shared `sdc` HDD IO domain are
host-level (bead `code-oflt`); unchanged here.
## Lessons
- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
the failure back into a contained, local kill.
- **Cap the box, not one surface.** t3 sessions were capped for months while the
same user's tmux was unbounded — and the caps that existed didn't sum to < RAM.
Containment has to reason about every tree and the aggregate.
- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
reachable to recover; only the agent/browser hogs are eligible victims.
- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.**
oomd's memory-pressure killer only fires on cgroups doing active reclaim
(`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.


@ -0,0 +1,97 @@
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
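The warning count above can be reproduced from a saved copy of the etcd container log. A minimal sketch, assuming the log is a plain text file and that the warning text matches the string quoted above; `count_slow_applies` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# count_slow_applies LOGFILE
# Count etcd "apply request took too long" warnings in a log file.
count_slow_applies() {
  # grep -c prints 0 and exits 1 when nothing matches; keep the 0, drop the status.
  grep -c 'apply request took too long' "$1" || true
}

# Example: count_slow_applies ./etcd-0.log
```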
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded`, the same failure mode that an etcd with multi-second apply latency produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
## Resolution
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
## Lessons
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
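
The scratch pruning described under Prevention can be sketched as follows. This is not the contents of `upgrade-step.sh`; it is a minimal illustration assuming GNU `find`, and the function name `prune_kubeadm_scratch` is hypothetical:

```bash
#!/usr/bin/env bash
# prune_kubeadm_scratch DIR [DAYS]
# Remove kubeadm-backup-* and kubeadm-upgraded-manifests* entries directly
# under DIR whose mtime is more than DAYS days old (default 3).
# Best-effort: errors are swallowed so a failed prune never aborts the caller.
prune_kubeadm_scratch() {
  local dir=$1 days=${2:-3}
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mmin "+$((days * 1440))" \
    -exec rm -rf -- {} + 2>/dev/null || true
}

# Example: prune_kubeadm_scratch /etc/kubernetes/tmp 3
```

`-mmin` is used instead of `-mtime` because `-mtime +3` rounds ages down to whole days and would only match entries at least four days old.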


@ -11,6 +11,11 @@ inference every six hours and backs up only the `claudeAiOauth` object to:
secret/workstation/claude-users/<os-user>
```
The backup **merges** into that path (`vault kv patch -method=rw`, falling back to
`kv put` only when the path does not exist yet), so keys that other tools
co-locate there — notably `homelab vault`'s `vaultwarden_*` credentials — survive.
A blind `kv put` here silently wiped them on every six-hourly run (fixed 2026-06-26).
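The merge-then-fallback write described above can be sketched as a small wrapper. This is an illustration, not the actual backup script; `kv_merge` is a hypothetical name, and using a failed `kv get` as the "path does not exist" signal is a simplifying assumption:

```bash
#!/usr/bin/env bash
# kv_merge PATH KEY=VALUE...
# Merge keys into a KV v2 secret without clobbering its sibling keys.
kv_merge() {
  local path=$1; shift
  if vault kv get "$path" >/dev/null 2>&1; then
    # Path exists: read-modify-write merge that keeps the other keys.
    vault kv patch -method=rw "$path" "$@"
  else
    # Assumption: a failed read means the path does not exist yet. A real
    # implementation should tell "not found" apart from other errors before
    # falling back, because put replaces every key at the path.
    vault kv put "$path" "$@"
  fi
}

# Example: kv_merge secret/workstation/claude-users/<os-user> some_key=some_value
```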
The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
@ -75,8 +80,64 @@ sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
```
Never copy another user's `.credentials.json` or scoped Vault token. Never restore
a **shared** `CLAUDE_CODE_OAUTH_TOKEN` across users; environment credentials
outrank per-user login and would silently collapse all users onto one identity.
(A **per-user**, non-rotating setup-token tied to the user's OWN Enterprise
identity is a different, sanctioned thing — see "Long-lived per-user token" below.)
## Long-lived per-user token (heavy concurrent-agent users)
The six-hourly renewal above assumes Claude owns refresh-token rotation in a
single `~/.claude/.credentials.json`. A user who runs **many concurrent Claude
sessions** (interactive tmux panes + their `t3-serve` instance + always-on
`start-claude.sh` agents) breaks that assumption: when the shared access token
expires, the processes refresh **simultaneously**, the OAuth server rotates the
refresh token, and the losing writer persists an **empty** refresh token —
logging the user out roughly every access-token lifetime (~8h). Re-issuing the
credential does not help; the race recurs.
The fix is a **per-user, long-lived setup-token** (`sk-ant-oat01-…`, ~1y,
**non-rotating**). With `CLAUDE_CODE_OAUTH_TOKEN` set, Claude uses it directly and
never touches `.credentials.json` — so there is nothing to race on. This is the
user's OWN Enterprise identity (scope `user:inference`; local MCP servers are
client-side and unaffected), stored only in their OWN Vault path — **NOT** the
forbidden shared token, and it never crosses OS users.
**Enable it (one-time, per user):**
1. The user mints their own token (interactive Enterprise SSO):
```bash
claude setup-token # opens an SSO URL; paste the code back -> prints sk-ant-oat01-…
```
2. An admin stores it in that user's Vault path (MERGE, never `kv put` — siblings
like `claude_ai_oauth_json` / `vaultwarden_*` must survive):
```bash
vault kv patch -method=rw secret/workstation/claude-users/<os-user> \
setup_token=sk-ant-oat01-…
```
3. Materialize + activate (or just wait ≤6h for the timer):
```bash
systemctl start claude-auth-sync@<os-user>.service
```
`claude-auth-sync` writes `~/.config/claude-auth-sync/claude-oauth.env`
(`CLAUDE_CODE_OAUTH_TOKEN=…`, mode 0600) and, while a token is present, **skips**
the rotating-credential validate/backup/restore (so no false
`WorkstationClaudeAuthInvalid`). `start-claude.sh` and `t3-serve@.service` load
that env file. **Sessions started before activation keep the old credential
until relaunched** — the user must restart their agents / `t3-serve` to cut over.
**Disable it:** clear the field (`vault kv patch -method=rw
secret/workstation/claude-users/<os-user> setup_token=""`) — the next sync removes
the env file and the user reverts to the per-user SSO credential flow.
**Rotate before expiry:** setup-tokens expire 1y after mint. Re-mint (step 1) and
re-store (step 2); the env file refreshes on the next sync.
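The env-file handling (write with mode 0600 while a token is present, remove when the field is cleared) can be sketched like this. It is a minimal illustration under the assumption of GNU coreutils, not the `claude-auth-sync` source; `sync_env_file` is a hypothetical name:

```bash
#!/usr/bin/env bash
# sync_env_file TOKEN FILE
# Non-empty TOKEN: write FILE (mode 0600) with the token. Empty: remove FILE.
sync_env_file() {
  local token=$1 file=$2 tmp
  if [ -z "$token" ]; then
    rm -f -- "$file"
    return 0
  fi
  mkdir -p -- "$(dirname -- "$file")"
  # Create the temp file beside the target so the final rename stays on one
  # filesystem and readers never see a half-written file.
  tmp=$(mktemp "$file.XXXXXX") || return 1
  chmod 600 "$tmp"
  printf 'CLAUDE_CODE_OAUTH_TOKEN=%s\n' "$token" > "$tmp"
  mv -f -- "$tmp" "$file"
}

# Example: sync_env_file "$token" "$HOME/.config/claude-auth-sync/claude-oauth.env"
```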
## Verification


@ -0,0 +1,346 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
## What the trail is
Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.
The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.
## Where the data lives
### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
login; `auth = "required"`). Shows the live flow stream + a service graph for
roughly the last hour. Use it for "what is talking right now"; it is **not**
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
- **DNS fix + self-heal:** whisker's egress to the kube-dns ClusterIP is allowed
by `whisker-allow-dns-clusterip` (`stacks/calico`) — without it the UI goes
empty after any gRPC-stream break (see Troubleshooting → "Whisker UI empty").
The `whisker-watchdog` CronJob (every 10 min) is a backstop that restarts
whisker if its backend ever wedges for another reason.
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
```
edge(src_ns text, dst_ns text, action text,
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
PRIMARY KEY (src_ns, dst_ns, action))
```
- `action` is one of `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
/ public-internet) are **dropped** — the trail is about in-cluster service
relationships only. (Egress to the public internet is therefore NOT in this
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
is created idempotently by the aggregator at startup (canonical DDL also in
the repo at `migrations/0001_edge.sql`).
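The dedup rules above (one row per `(src_ns, dst_ns, action)`, self-edges and empty-namespace flows dropped) can be illustrated offline with awk. This is only a sketch of the rule, not the aggregator's code: the tab-separated input format and the name `aggregate_edges` are assumptions, and it does not track `first_seen` / `last_seen`:

```bash
#!/usr/bin/env bash
# aggregate_edges
# Read tab-separated "src_ns<TAB>dst_ns<TAB>action" flow records on stdin and
# print one line per unique edge: src_ns, dst_ns, action, flow_count.
aggregate_edges() {
  awk -F '\t' -v OFS='\t' '
    $1 == "" || $2 == "" { next }   # host-endpoint / public-internet flows
    $1 == $2             { next }   # self-edges
    { count[$1 OFS $2 OFS $3]++ }
    END { for (edge in count) print edge, count[edge] }
  ' | sort
}

# Example: aggregate_edges < flows.tsv
```

However many flow records go in, the output has at most one line per namespace pair and action, which is why the table stays small.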
### Slack `#alerts` — daily digest
> **Channel note (2026-06-25):** posts to **`#alerts`**. The dedicated `#security` channel was abandoned — the shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of it, so a channel override there returns HTTP `404 channel_not_found`. Everything now posts to `#alerts` (this digest plus alertmanager's `slack-security` receiver, which keeps its `[SECURITY]` styling so security-lane alerts still stand out there).
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
— no new webhook was created.
## How to enable / disable
### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).
### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
This additive NP ORs in an allow for `namespaceSelector
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`) or pulls 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
## mTLS cert — the REUSE decision (cert-reuse gotcha)
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.
Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.
## How to query who-talks-to-whom
**Quickest — the `homelab edges` CLI** (the investigation helper; read-only
SELECT against the DB via the dbaas primary pod, no creds/SQL to remember):
```
homelab edges --ns <ns> # edges touching <ns> (either direction)
homelab edges --peers-of <ns> # <ns>'s distinct peer namespaces
homelab edges --src <ns> # <ns>'s egress peers (--dst <ns> for ingress)
homelab edges --new-since 24h # edges first seen in the last day (or a date)
homelab edges --denied # blocked / lateral-movement attempts
homelab edges --json [...] # machine-readable, for agents/pipelines
homelab edges --help # full flag list
```
For ad-hoc SQL, `psql` into the DB (creds: Vault static role
`static-creds/pg-goldmane-edges`, or exec a CNPG pod). All queries are against
the single `edge` table.
```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
FROM edge WHERE first_seen > now() - interval '24 hours'
ORDER BY first_seen DESC;
-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:
```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
FROM edge
WHERE src_ns = '<ns>' AND action = 'allow'
ORDER BY dst_ns;
```
```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
FROM edge
WHERE action = 'allow'
GROUP BY src_ns
ORDER BY src_ns;
```
```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.
## Monitoring & health (infra #61)
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
is the agreed floor.
## Troubleshooting
**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
**Whisker UI empty (but reachable — 302s to Authentik fine).** ROOT CAUSE (the
2026-06-28 incident): the operator's own `whisker` NetworkPolicy is
policyTypes:[Ingress,**Egress**], and its egress allows DNS only to the kube-dns
*pods* (podSelector `k8s-app=kube-dns`). But whisker-backend resolves
`goldmane.calico-system.svc` via the kube-dns **ClusterIP** (10.96.0.10), and
**Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule**.
Verified: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct
kube-dns *pod-IP* DNS = OK, and a pod with no egress policy resolves fine.
whisker-backend resolves goldmane ONCE in the brief startup window before the
policy programs, holds its long-lived gRPC stream, and only re-resolves when that
stream breaks (e.g. a node-reboot blip) — at which point the blocked ClusterIP
DNS wedges its Go resolver (`failed to stream flows` / `code = Unavailable: dns
... i/o timeout` forever) and the UI goes blank. The durable **aggregator is a
SEPARATE pod in its own (unrestricted) namespace** and is unaffected.
FIX (applied 2026-06-28): `kubernetes_network_policy_v1.whisker_allow_dns_clusterip`
(`stacks/calico`) — an additive egress NP allowing whisker → the kube-dns
ClusterIP (`10.96.0.10/32`) on 53/UDP+TCP; k8s egress policies are additive so
the operator NP is untouched. Backstop: the `whisker-watchdog` CronJob restarts
the pod if it ever wedges for another reason. Immediate manual heal:
`kubectl -n calico-system delete pod -l k8s-app=whisker`. Diagnose by comparing,
from the whisker pod's netns, `nslookup goldmane.calico-system.svc.cluster.local
10.96.0.10` (the ClusterIP — times out if the NP fix is missing) against the same
query aimed at a kube-dns *pod IP* (always works).
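The ClusterIP-versus-pod-IP comparison can be wrapped in a small helper. A minimal sketch, assuming it runs inside the whisker pod's network namespace and that `nslookup` exits non-zero on a timeout; `diagnose_dns` and its verdict strings are hypothetical:

```bash
#!/usr/bin/env bash
# diagnose_dns NAME CLUSTER_IP POD_IP
# Resolve NAME against the kube-dns ClusterIP and against one kube-dns pod IP,
# then print a one-word verdict.
diagnose_dns() {
  local name=$1 cluster_ip=$2 pod_ip=$3 via_cluster=fail via_pod=fail
  nslookup "$name" "$cluster_ip" >/dev/null 2>&1 && via_cluster=ok
  nslookup "$name" "$pod_ip" >/dev/null 2>&1 && via_pod=ok
  case "$via_cluster/$via_pod" in
    ok/ok)   echo healthy ;;
    fail/ok) echo clusterip-blocked ;;  # the NetworkPolicy symptom described above
    *)       echo dns-broken ;;         # pod-IP lookup failed too; look elsewhere
  esac
}

# Example: diagnose_dns goldmane.calico-system.svc.cluster.local 10.96.0.10 <kube-dns-pod-ip>
```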
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
the pod kept the old one. The Deployment carries
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
reconnects automatically and resumes upserting. No data loss in the DB
(only the sub-hour live window in Whisker is gone).
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. To smoke-test, run the image with `args:
["digest"]` + `DRY_RUN=1`, which prints the message instead of POSTing.
> Resolved (2026-06-28): the digest posts cleanly to `#alerts`
> (`lastSuccessfulTime` current, `DigestFailing` clear; e.g. the 2026-06-28 08:00
> London run reported "8 new edges in last 24h"). The 2026-06-25 failures were
> the `#security` channel override returning HTTP 404 — the shared
> `alertmanager_slack_api_url` webhook's Slack app isn't a member of `#security`;
> consolidating all Slack output to `#alerts` fixed it.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).
## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
`stacks/goldmane-edge-aggregator`, `stacks/calico`


@ -0,0 +1,164 @@
# `homelab vault` onboarding (Vaultwarden access + `vault kv` infra secrets)
## Scope
`homelab vault` fronts **two unrelated secret stores** — the name collides, so
the command keeps them clearly separated:
- **Vaultwarden** — your personal *password manager* (logins/passwords/TOTP).
The verbs below give each devvm roster user no-HITL access to **their own**
Vaultwarden vault (and any Organization Collection shared with their account).
It shells out to the official `bw` CLI; the user's Vaultwarden credentials live
only in their isolated Vault path `secret/workstation/claude-users/<os-user>`
and are decrypted as that OS user — the admin never sees them.
- **HashiCorp Vault / OpenBao** — the homelab *infra* secrets store (the
`secret/…` KV mount at `vault.viktorbarzin.me`), under `homelab vault kv`.
These use the caller's **own** Vault token (`vault login -method=oidc` →
`~/.vault-token`), **not** the scoped Vaultwarden token (which only reads the
`claude-users/<user>` path); access is whatever your Vault policy grants.
```text
# Vaultwarden (password manager)
homelab vault setup one-time: store VW email + master password + API key
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
homelab vault get <name> --all all fields (incl. custom) as JSON; pipe it (| jq)
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
# HashiCorp Vault / OpenBao (infra secrets; uses your own OIDC token)
homelab vault kv get <path> [--field K] read an infra KV secret
homelab vault kv list <path> list sub-paths
homelab vault kv put <path> <key> write one key (value via stdin; merges)
```
## How auth works (why a non-admin can use it)
`homelab vault` runs `vault` as the calling user. It resolves a Vault token in
this order (`ensureVaultToken`, `cli/cmd_vault.go`):
1. an explicit `$VAULT_TOKEN` (a deliberate override), then
2. the per-user **scoped token** that `claude-auth-sync` maintains at
`~/.config/claude-auth-sync/vault-token` (policy `workstation-claude-<user>`), then
3. a native `~/.vault-token` (admins who carry one; non-admins usually don't).
**The scoped token deliberately beats `~/.vault-token`.** This tool only touches
your own `secret/workstation/claude-users/<user>` path, and a power-user who ran
`vault login -method=oidc` carries a read-only `~/.vault-token` (capability
`deny` on that path); letting it win would shadow the scoped token and fail every
op with `403 permission denied` (this is exactly what bit emo, 2026-06-28). The
CLI also **self-defaults `VAULT_ADDR`** to `https://vault.viktorbarzin.me` when
unset, so it works from non-login shells (tmux panes, AFK agent subprocesses)
that never sourced `/etc/environment` — otherwise every `vault` child hits the
`127.0.0.1:8200` default and fails `connection refused` (exit 2).
That scoped policy grants exactly `create`/`read`/`update` on the user's own
`secret/workstation/claude-users/<user>` path — no `patch` capability — so the
tool writes with `vault kv patch -method=rw` (read-modify-write), falling back to
`kv put` only when the path does not exist yet. This preserves the
`claude_ai_oauth_json` key that [claude-auth-sync](claude-auth-renew-workstation.md)
co-locates there. (The admin-only bugs were fixed 2026-06-27; the
`VAULT_ADDR`/token-precedence bugs above were fixed 2026-06-28.)
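
The precedence described above can be sketched as follows. This is an illustration in Python, not the Go `ensureVaultToken` code in `cli/cmd_vault.go`; the function names, the `(source, token)` return shape, and the treatment of empty files are assumptions made for the example.

```python
from pathlib import Path

DEFAULT_VAULT_ADDR = "https://vault.viktorbarzin.me"


def resolve_vault_token(env, home):
    """Return (source, token) using the documented precedence, or None."""
    # 1. An explicit $VAULT_TOKEN is a deliberate override and wins outright.
    explicit = env.get("VAULT_TOKEN", "").strip()
    if explicit:
        return ("env", explicit)
    # 2. The scoped token is tried before 3. the native ~/.vault-token,
    #    so a read-only OIDC token cannot shadow it.
    candidates = [
        ("scoped", Path(home) / ".config" / "claude-auth-sync" / "vault-token"),
        ("native", Path(home) / ".vault-token"),
    ]
    for source, path in candidates:
        try:
            token = path.read_text().strip()
        except OSError:
            continue  # missing or unreadable file: try the next source
        if token:
            return (source, token)
    return None


def resolve_vault_addr(env):
    """Fall back to the homelab address when VAULT_ADDR is unset or empty."""
    return env.get("VAULT_ADDR") or DEFAULT_VAULT_ADDR
```

With both token files present and no `VAULT_TOKEN` set, the sketch returns the scoped token, which is the behaviour the 2026-06-28 fix is described as enforcing.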
## Prerequisites (per user)
- The user is in `scripts/workstation/roster.yaml` and the **vault** stack has
been applied → their `workstation-claude-<user>` policy exists.
- The user's workstation was provisioned (`setup-devvm.sh`) → their scoped Vault
token exists at `~/.config/claude-auth-sync/vault-token`.
- `bw` is installed **system-wide** at `/usr/bin/bw` (see below).
- The user has a Vaultwarden account at `https://vaultwarden.viktorbarzin.me`
(self-service signup is open; admin panel is disabled).
## One-time admin steps (devvm)
`bw` must be system-wide so every user resolves it (it is a Node script, and
`node` is already system-wide at `/usr/bin/node`). `setup-devvm.sh` installs it
to the npm `/usr` prefix; the guard checks the **system** path, not
`command -v bw` (an admin's own `~/.local/bin/bw` used to mask the system
install, leaving non-admins with no backend). To install on a running box:
```bash
sudo npm install -g --prefix /usr "@bitwarden/cli@^2024"
bw --version # confirm /usr/bin/bw resolves
```
After landing a `cli/` change, rebuild the binary so users pick it up:
```bash
# version is stamped from cli/VERSION, exactly as setup-devvm.sh does it
sudo bash -c 'cd /home/wizard/code/infra/cli && \
go build -ldflags "-X main.version=$(cat VERSION 2>/dev/null || echo dev)" \
-o /usr/local/bin/homelab .'
```
(or just re-run `scripts/workstation/setup-devvm.sh` as root, which rebuilds it.)
## User onboarding
The user runs these as themselves. The master password / API key are entered
interactively (never on the command line) and stored only in the user's Vault
path.
1. In the Vaultwarden web vault → **Settings → Security → Keys → View API key**,
copy the `client_id` (`user.xxxx`) and `client_secret`.
2. Configure:
```bash
homelab vault setup # prompts: VW email, API client_id/secret, master password
homelab vault status # → "vault: configured, unlocked, reachable ✓"
homelab vault list # item names (own vault + any shared Collections)
```
## Shared-Collection access (sharing passwords with a user)
`homelab vault` surfaces Organization Collection items automatically once the
user's Vaultwarden account is a confirmed member. These steps are done by the
vault owner in the **Vaultwarden web UI** (they need the owner's master
password — not an infra/Terraform operation):
1. Create or reuse an **Organization** and a **Collection** of shared logins.
2. **Invite** the user's Vaultwarden account to the Organization, granting
**"Can view"** on that Collection (least privilege).
3. The user accepts the email invite and confirms membership.
4. The user runs `homelab vault list` — the shared items now appear alongside
their own (a `homelab vault status` sync picks them up).
## Security model (the no-HITL trade)
Identity is the kernel UID. Anything running as the user can decrypt the user's
vault — this is the accepted trade for no-human-in-the-loop fetches. Secrets
never appear in `argv` (passed via env or stdin), core dumps are disabled, TOTP
fetches are logged to syslog/Loki, and on a TTY values go to the clipboard
(auto-clearing) rather than scrollback. The admin's Vault token is never used by
a non-admin: each user authenticates with their own scoped token.
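
As a rough illustration of the "never in `argv`" rule, the sketch below feeds a secret to a child process on stdin. It is not the CLI's implementation; the helper name and the stand-in child command are hypothetical, and the real tool shells out to `bw` and `vault` rather than to Python.

```python
import subprocess
import sys


def run_with_secret_on_stdin(argv, secret):
    """Run argv, feeding the secret on stdin so it never appears in argv."""
    # Refuse outright if the caller leaked the secret into the command line,
    # where it would be visible in the process list.
    if secret and any(secret in arg for arg in argv):
        raise ValueError("secret must not be passed on the command line")
    return subprocess.run(
        argv,
        input=secret,
        capture_output=True,
        text=True,
        check=True,
    )


# Stand-in child process: reports only the length of what it read from stdin.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
result = run_with_secret_on_stdin(child, "hunter2")
print(result.stdout.strip())  # prints 7
```

The child sees the value only on its stdin pipe, so a concurrent `ps` listing shows the command without the secret.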
## Verification
```bash
# the scoped token carries the right policy
VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" \
vault token lookup -format=json | jq '.data.display_name, .data.policies'
# → "token-devvm-claude-auth-<user>", [..., "workstation-claude-<user>"]
sudo -u <user> -i bw --version # /usr/bin/bw resolves for the user
sudo -u <user> -i homelab vault status
```
## Troubleshooting
**`homelab vault setup` (or any verb) fails with `exit status 2`** — older
binaries swallowed the underlying `vault` error; the message now includes it.
Two historical causes (both fixed in-CLI 2026-06-28, kept here for diagnosis):
- `... connection refused` to `127.0.0.1:8200` → `VAULT_ADDR` wasn't set in the
caller's shell. The CLI now self-defaults it, but if you see this on an old
binary: `export VAULT_ADDR=https://vault.viktorbarzin.me`.
- `403 permission denied` on `PUT .../secret/data/workstation/claude-users/<user>`
→ a stale read-only `~/.vault-token` (e.g. from `vault login -method=oidc`,
policy `default`, capability `deny` on that path) was shadowing the scoped
token. The CLI now prefers the scoped token; on an old binary, `rm
~/.vault-token` (or `unset VAULT_TOKEN`) and retry. Confirm with
`VAULT_TOKEN="$(sudo cat /home/<user>/.config/claude-auth-sync/vault-token)" vault token capabilities secret/data/workstation/claude-users/<user>`
→ must be `create, read, update`.


@ -36,11 +36,13 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
Job 0 — preflight (pinned: k8s-node1)
├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
├── compat-gate: addon/API/containerd support for target (else BLOCK-actionable+alert / HOLD-quiet)
├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers)
@ -112,18 +114,36 @@ inert for a patch (no API removal or containerd floor occurs inside a minor).
This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
**On a block**, the gate:
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
Prometheus alert),
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
this is not a failure). Because the block happens **before any mutation, no
rollback is involved**; nothing was changed.
**The gate classifies each refusal** (2026-06-28) so it only cries wolf when
there's something to do — `compat-gate.py` exit code + a `[TAG]` on every reason:
**To clear a block**: upgrade the named addon (or migrate the API caller off the
deprecated group/version, or bump containerd on the named node) so the offending
condition no longer holds. The **next nightly run then proceeds automatically**;
no manual chain restart needed.
- **`[ACTIONABLE]`** (exit 2) — a newer version of the lagging addon **exists in
the compat matrix** and upgrading it would clear the block (or an in-use
deprecated API must be migrated / a node's containerd bumped).
- **`[WAITING]`** (exit 4 = held) — **no released addon version supports the
target yet** (e.g. kyverno/ESO behind a brand-new k8s minor). Only an upstream
release can clear it.
- **`[PINNED]`** (exit 4 = held) — a supporting version exists but the addon is
**deliberately pinned** in the matrix (`"pinned": true`, e.g. gpu-operator,
whose bump is coupled to a newer NVIDIA driver image + Ubuntu/kernel).
- **Held wins on a mix**: if any blocker is waiting/pinned the whole target is
held — acting on the actionable ones wouldn't unblock it yet.
**On any refusal** the preflight pushes the verdict gauge (`k8s_upgrade_blocked=1`
for actionable, `k8s_upgrade_held=1` for held), sets `HALT_CHAIN` so the chain
doesn't advance, and **exits 0 — the Job Completes cleanly** (a refusal is a
decision, not a failure: no Failed Job, no `K8sUpgradeChainJobFailed`). It's
before any mutation, so no rollback. Reasons (grouped by class) appear in the
**morning nightly report**, not a per-run Slack.
- **Actionable** → `K8sUpgradeBlocked` fires (once, via alert-on-change). Clear
it by doing the named upgrade/migration; the next nightly run proceeds.
- **Held** → **deliberately NO alert** — only the nightly report's `⏸️ HELD`
line, because it can't be actioned now (a nightly alert would cry wolf). It
clears itself once upstream ships support (refresh `addon-compat.json`) or the
pin is lifted (delete `pinned`+`pin_reason`). The detector re-evaluates every
night, silently re-spawning the refused-but-Complete preflight (so a cleared
block is picked up next run, not after the 7d Job TTL).
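
The classification rules above, including "held wins on a mix", can be sketched as a small function. This is an illustration under stated assumptions, not the code in `compat-gate.py`: the `Verdict` tuple and `classify` name are hypothetical, and exit code 0 for a safe target is assumed (the text gives only 2 and 4).

```python
from collections import namedtuple

# name: verdict label; gate_exit: compat-gate exit code;
# gauge: Pushgateway gauge set to 1; alerts: whether an alert is expected.
Verdict = namedtuple("Verdict", "name gate_exit gauge alerts")

HELD_TAGS = {"WAITING", "PINNED"}
KNOWN_TAGS = HELD_TAGS | {"ACTIONABLE"}


def classify(reason_tags):
    """Collapse the per-reason tags into one verdict for the target."""
    tags = set(reason_tags)
    unknown = tags - KNOWN_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    if not tags:
        # Assumption: no blockers means exit 0 and no gauge set to 1.
        return Verdict("safe", 0, None, False)
    if tags & HELD_TAGS:
        # Held wins on a mix: acting on the actionable blockers alone
        # would not unblock the target yet, so nothing alerts.
        return Verdict("held", 4, "k8s_upgrade_held", False)
    return Verdict("actionable", 2, "k8s_upgrade_blocked", True)
```

For example, `classify(["ACTIONABLE", "PINNED"])` yields the held verdict, so only the nightly report would mention it.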
The **compat matrix** lives in
`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
@ -163,6 +183,8 @@ Pushed by upgrade-step.sh during phase execution; observed by the
| `k8s_upgrade_in_flight` (1/0) | preflight Job (set to 1) | postflight Job (set to 0) |
| `k8s_upgrade_started_timestamp` (epoch s) | preflight Job | postflight Job (set to 0) |
| `k8s_upgrade_snapshot_taken` (1/0) | preflight Job (set to 1 after Job=`pre-upgrade-etcd-*` completes with `Backup done:` log of ≥1 KiB) | postflight Job (0) |
| `k8s_upgrade_blocked` (1/0) | preflight Job — set 1 on an **actionable** compat refusal (→ `K8sUpgradeBlocked`) | preflight (definitive each run; 0 when safe) / postflight (0) |
| `k8s_upgrade_held` (1/0) | preflight Job — set 1 on a **held** (waiting-upstream/pinned) refusal; **no alert** | preflight (definitive each run; 0 when safe) / postflight (0) |
| `k8s_upgrade_available{kind,running,target}` | detection CronJob | next detection run (overwrite) |
| `k8s_version_check_last_run_timestamp` | detection CronJob | (cumulative) |
@ -171,10 +193,27 @@ Pushed by upgrade-step.sh during phase execution; observed by the
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL).
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The old `unless on() (k8s_upgrade_blocked == 1)` clause was **dropped 2026-06-28**: compat-gate refusals now Complete cleanly (exit 0) instead of Failing, so a terminally-Failed chain Job again means a genuine wedge with nothing to exclude.
- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). An **ACTIONABLE** compat-gate refusal — a newer version of the lagging addon exists and upgrading it would clear the block (or an in-use deprecated API must be migrated / a node's containerd bumped). Reasons (grouped by class) are in the **morning nightly report**; clear it by doing the named upgrade/migration, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. **There is deliberately NO companion alert for the held verdict** (`k8s_upgrade_held=1` — waiting-on-upstream / pinned): nothing can be actioned now, so it is surfaced only by the nightly report's `⏸️ HELD` line.
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Nightly upgrade report (Slack)
CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
alert-digest) posts ONE Slack summary each morning of the previous night's run:
running version, detector freshness, detected target + kind, the outcome
(⚪ no upgrade needed / 🔴 blocked-actionable + reasons / ⏸️ held = waiting-upstream/pinned /
🟢 upgraded / 🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
This is the day-to-day visibility layer (it does NOT replace the alerts above —
those fire on problems; this reports the outcome every night). Manual run:
`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test`
(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip
`K8sUpgradeChainJobFailed`).
### CoreDNS is NOT upgraded by kubeadm here
CoreDNS runs a **custom split-horizon Corefile** (owned by the technitium stack)
@ -205,22 +244,34 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
## Common Operations
### Post-upgrade: apiserver OIDC restore (AUTOMATED by the chain since 2026-06-19)
### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
and drops the `--authentication-config` flag**, silently disabling apiserver
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get
401). This used to require a manual re-apply after **every** control-plane bump.
from kubeadm-config**. apiserver auth uses a structured multi-issuer
`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
NOT crash on this — verified by isolated repro; it's recoverable via the restore
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
etcd IO starvation**, not this drift; post-mortem:
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
**Now automated:** the `rbac` stack publishes its OIDC restore script to the
`kube-system/apiserver-oidc-restore` ConfigMap, and the version-upgrade chain's
`phase_master` re-runs it on master immediately after `kubeadm upgrade apply`
(while tigera-operator is still quiesced, so the flag-add apiserver restart can't
crashloop the operator). It's idempotent, health-gates `/livez` with
auto-rollback, and is **non-fatal** — a failure only lags SSO until the next rbac
apply (the version upgrade itself already succeeded). So a chain-driven
control-plane bump no longer breaks SSO. The master phase self-skips when master
is already at target, so this only runs when master was actually upgraded.
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
image change. Zero live impact (the CM is read only during an upgrade).
**Backstops:**
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
NOT block — the drift only breaks SSO, which is recoverable) if
`--authentication-config` would still be dropped.
- The `rbac` stack still publishes its restore script to the
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
re-reconciles kubeadm-config. Self-skips when master is already at target.
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
chain logged `WARN: --authentication-config absent after re-apply`:


@ -0,0 +1,72 @@
# Runbook: pfSense WAN / egress outage
**Scope:** the cluster (and home) loses **internet egress** while pfSense is
otherwise alive — internal VLAN routing and DNS keep working. This is the
**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
probe existed; the cloudflared replica metric stayed green). The alerts +
probes below close that gap. Incident detail: memory ids #6715–#6723.
pfSense is a **single point of failure** (no HA): it is the k8s default gateway
(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
Archer AX6000). This is the sole IPv4 default gateway; there is no gateway group or failover.
## Alerts (all in `stacks/monitoring/modules/monitoring/`)
| Alert | Signal | Means |
|-------|--------|-------|
| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
alert pages, not a storm.
`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
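
To make the alert conditions and the inhibition concrete, here is a minimal sketch of how instantaneous probe results map to alerts. It ignores the `for` durations, is not the Prometheus/Alertmanager configuration itself, and the set of inhibited "downstream" alerts is an assumption (the runbook does not list it); all function and parameter names are hypothetical.

```python
ROOT_ALERTS = {"WANGatewayUnreachable", "InternetEgressDown"}
# Assumption: which symptom alerts the root alerts inhibit is not spelled out.
DOWNSTREAM_ALERTS = {"ExternalDNSResolutionDown", "EgressOnlyDivergence"}


def candidate_alerts(gateway_ok, internet_ok, dns_ok, t3_leg_up):
    """gateway_ok: bool; internet_ok / dns_ok: target -> bool; t3_leg_up: leg -> bool."""
    firing = set()
    if not gateway_ok:
        firing.add("WANGatewayUnreachable")
    # Both public ICMP targets must fail, so one flaky target does not page.
    if not any(internet_ok.values()):
        firing.add("InternetEgressDown")
    if not any(dns_ok.values()):
        firing.add("ExternalDNSResolutionDown")
    if not t3_leg_up["cloudflare"] and t3_leg_up["internal"]:
        firing.add("EgressOnlyDivergence")
    return firing


def paged_alerts(firing):
    """Drop downstream symptoms when a root alert is already firing."""
    if firing & ROOT_ALERTS:
        return firing - DOWNSTREAM_ALERTS
    return set(firing)
```

In an egress-only outage (gateway reachable, both public targets and both resolvers failing, internal leg up), three alerts are candidates but only `InternetEgressDown` would page under this assumed inhibition set.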
## Diagnose (read-only first)
1. **Confirm scope** — is it egress-only or total?
- `kubectl -n monitoring` Grafana → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
- Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
```
ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784)
clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms
clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete
clog /var/log/system.log | tail -200
netstat -rn | head # is the default route present?
ls -la /var/crash/ # panic/textdump?
```
(If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
config.xml — re-add the key via console or WebGUI; see id #6718.)
3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
fault is unlikely; a reboot fixing it points at **pfSense-side state**.
## Recover
- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms
dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes
the volatile evidence needed to find the real mechanism).
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
re-eval. Confirm `netstat -rn` shows the default route restored.
## Prevent / harden (deferred, needs a live-pfSense change)
Not done in this monitoring change — tracked for a follow-up with hands-on
pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
instead of an external IP + widen thresholds; disable `gw_down_kill_states` for
the single WAN; add a failover gateway group; a 60s auto-recovery watchdog;
ship pfSense system/gateway/routing syslog to the cluster so these logs become
centrally queryable.


@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen`
Alertmanager → Slack.
## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`)
Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00–05:40) drains those markers:
- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle**: `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity — then verify pairing and clear the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick.
- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too.
- **Force / preview:**
```bash
sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated)
sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing
```
- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
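
The idle gate and its fail-closed behaviour can be sketched as a pure function. This is an illustration, not the logic in `scripts/t3-migrate-idle.sh`: the function name and inputs are hypothetical, and the caller is assumed to have already queried `state.sqlite`, passing `None` when a query or parse failed.

```python
import time

DEFAULT_QUIET_SECONDS = 900  # documented T3_MIGRATE_QUIET_SECONDS default


def is_idle(active_turns, last_activity_epoch, now=None,
            quiet_seconds=DEFAULT_QUIET_SECONDS):
    """Return True only when the user is provably idle; any doubt means False."""
    if now is None:
        now = time.time()
    # Fail closed: a failed query is represented here as None.
    if active_turns is None or last_activity_epoch is None:
        return False
    try:
        turns = int(active_turns)
        quiet_for = float(now) - float(last_activity_epoch)
    except (TypeError, ValueError):
        return False  # unparseable values count as doubt
    if turns != 0:
        return False  # an in-flight turn exists
    if quiet_for < 0:
        return False  # activity "in the future": clock doubt, skip this tick
    return quiet_for >= quiet_seconds
```

Returning `False` on every uncertain input mirrors the "skip, retry next tick" rule: a wrongly skipped restart costs one 20-minute tick, while a wrongly allowed one interrupts a live agent.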
## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):**


@ -107,10 +107,6 @@ variable "custom_content_security_policy" {
type = string
default = null
}
variable "exclude_crowdsec" {
type = bool
default = false
}
variable "full_host" {
type = string
default = null
@ -310,7 +306,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
"traefik-error-pages@kubernetescrd",
var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null,
var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null,
local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null,
local.auth_middleware,


@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
[[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
KUBECTL=""
JSON_RESULTS=()
TOTAL_CHECKS=47
TOTAL_CHECKS=48
# Parallel execution settings. Each check function is self-contained — it
# only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
esac
}
# --- 48. Goldmane edge-aggregator availability ---
#
# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
# this check reads the Deployment's Available condition directly so the trail
# silently dying surfaces in the health board (mirrors the AggregatorDown
# Prometheus alert). Missing Deployment / not-Available -> FAIL.
check_goldmane_aggregator() {
section 48 "Goldmane Edge-Aggregator"
local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
local avail desired ready
# One get; absent Deployment is a hard fail (the trail isn't deployed).
if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
json_add "goldmane_aggregator" "FAIL" "deployment missing"
return 0
fi
avail=$($KUBECTL get deploy "$dep" -n "$ns" \
-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
ready=${ready:-0}
desired=${desired:-0}
if [[ "$avail" == "True" ]]; then
pass "Edge-aggregator Available ($ready/$desired ready)"
json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
else
[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
fi
}
# --- Summary ---
print_summary() {
if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
check_monitoring_prom_am check_monitoring_vault check_monitoring_css
check_external_replicas check_external_divergence check_pve_thermals
check_pve_load check_external_traefik_5xx check_ha_status_dashboard
check_immich_search check_csi_ghost_drift
check_immich_search check_csi_ghost_drift check_goldmane_aggregator
)
# Auto-fix mutates cluster state inside individual checks — keep that


@ -21,7 +21,7 @@
# - canary rollout: restart idle instances ONE AT A TIME, verifying pairing
# through the real dispatch after each, and roll back (binary + that user's DB)
# + self-freeze on the first failure — active-agent instances are deferred,
# never killed;
# never killed (deferred instances are recorded for t3-migrate-idle to drain);
# - rollback target is the recorded LAST-GOOD build, not "whatever was installed".
# Detection backstop (real-user pairing failure/fallback) lives in the dispatch
# logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*).
@ -29,24 +29,17 @@
# Full procedure + manual rollback: docs/runbooks/t3-version-bump.md.
set -uo pipefail
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}"
STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}"
LAST_GOOD_FILE="$STATE_DIR/last-good"
BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}"
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}"
USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
@ -86,27 +79,21 @@ LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_
# ---- helpers: backup, health-check, rollback, restart-verify --------------------
# Online consistent per-user snapshot (run AS the owner so WAL stays owned; never
# stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health
# check. Mirrors t3-backup-state.sh.
# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.)
ADMIN_SEED=""
backup_all() {
local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)"
local u dst
for u in $(osusers); do
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst"
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
# newest pre-bump backup taken THIS run for a user (for restore-on-rollback).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a
# real populated DB if given, so the forward migration runs on real data), then do
# the real mint -> credential-exchange -> t3_session pairing handshake with the
@@ -143,27 +130,12 @@ health_check() {
rm -rf "$dir"; return 1
}
# roll the GLOBAL binary back to last-good. Pre-restart failures need only this
# (no real DB migrated yet); post-restart failures also restore the user's DB.
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those.
unit_busy() {
local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)"
[ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# ---- 3. DRY RUN: preview only (install candidate to temp prefix, gate it) -------
if [ "$DRY_RUN" = "1" ]; then
LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)"
@@ -196,31 +168,15 @@ restarted=0; deferred=0
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
ok=0
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1))
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
exit 1
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done
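
The deferral marker written in the loop above is the only handoff between the daily tracker and the idle drainer: a file named after the OS user, whose content is the version that user still has to move to. The following is a minimal sketch of that round trip, not the scripts' own code. It assumes a throwaway directory in place of `$DEFER_DIR`; the helper names `write_marker` and `read_marker` and the version `1.4.2` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: a temp dir stands in for $DEFER_DIR.
# write_marker/read_marker are illustrative names, not part of the scripts.
DEFER_DIR="$(mktemp -d)"

# record that <user> still needs a restart onto <version>
write_marker() { mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"; }

# read the version back the way the drainer does (all whitespace stripped)
read_marker() { tr -d '[:space:]' <"$DEFER_DIR/$1" 2>/dev/null; }

write_marker emo 1.4.2                  # 1.4.2 is a made-up version
for marker in "$DEFER_DIR"/*; do
  [ -e "$marker" ] || continue          # guard against the empty-dir glob
  u="$(basename "$marker")"
  echo "deferred: $u -> $(read_marker "$u")"
  rm -f "$marker"                       # cleared once the restart is verified
done
```

Keeping the version inside the marker lets the drainer log and back up against the right target even if the file is days old.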

@@ -0,0 +1,8 @@
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle

@@ -0,0 +1,86 @@
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
# safe_to_restart <user>: run gate_query's SQL inline as the user (runuser cannot call a shell function) against a read-only URI of their DB, then apply gate_is_safe.
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
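
Because `gate_is_safe` is pure logic, its fail-closed behaviour can be checked without a database or systemd. The sketch below copies the function as it appears above, assumes `QUIET_SECONDS` is fixed at 900, and adds a hypothetical `verdict` wrapper that prints the outcome for a few representative inputs.

```shell
#!/usr/bin/env bash
# Standalone copy of gate_is_safe for illustration; QUIET_SECONDS fixed at 900.
QUIET_SECONDS=900

gate_is_safe() {
  local active="$1" idle="$2"
  case "$active" in ''|*[!0-9]*) return 1;; esac   # unparseable/empty -> unsafe
  [ "$active" -eq 0 ] || return 1                  # a turn is running -> unsafe
  [ -z "$idle" ] && return 0                       # no threads at all -> safe
  case "$idle" in ''|*[!0-9-]*) return 1;; esac    # non-numeric -> unsafe
  [ "$idle" -ge "$QUIET_SECONDS" ]                 # negative or < quiet -> unsafe
}

# verdict <active_turns> <idle_seconds>: hypothetical helper, prints safe/unsafe
verdict() { if gate_is_safe "$1" "$2"; then echo safe; else echo unsafe; fi; }

verdict 0 1200    # no active turn, quiet for 20 min
verdict 1 1200    # a turn is in flight
verdict 0 30      # idle, but not for long enough
verdict 0 ""      # no rows in projection_thread_sessions
verdict "" 1200   # query returned nothing parseable
verdict 0 -5      # clock skew: updated_at in the future
```

Only the first and fourth calls should come out safe; every malformed or borderline input falls through to unsafe, which is the property the drainer relies on.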

@@ -0,0 +1,10 @@
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target
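
`OnCalendar=*-*-* 01..05:00/20` reads as: every day, hours 01 through 05, at minutes 00, 20 and 40. That gives 15 nominal drain attempts per night, each shifted by up to 120 s of `RandomizedDelaySec` jitter. On a systemd host, `systemd-analyze calendar --iterations=5 '*-*-* 01..05:00/20'` should list the real upcoming elapses. The plain-shell sketch below only enumerates the nominal slots; the `slots` function is illustrative and not part of the unit.

```shell
#!/usr/bin/env bash
# Enumerate the nominal HH:MM slots matched by "01..05:00/20" (jitter ignored).
slots() {
  local h m
  for h in 01 02 03 04 05; do
    for m in 00 20 40; do printf '%s:%s\n' "$h" "$m"; done
  done
}

slots | head -3                                # first three slots of the night
echo "total: $(slots | wc -l | tr -d ' ')"     # number of attempts per night
```

With `Persistent=false`, slots missed while the machine was off are not replayed at boot.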

@@ -29,6 +29,9 @@ REPO_REMOTE_BASE="${REPO_REMOTE_BASE:-https://forgejo.viktorbarzin.me/viktor}"
# Per-user OIDC kubeconfig (kubelogin/PKCE; cluster server+CA copied from the admin kubeconfig).
OIDC_ISSUER="${OIDC_ISSUER:-https://authentik.viktorbarzin.me/application/o/kubernetes/}"
ADMIN_KUBECONFIG="${ADMIN_KUBECONFIG:-/home/wizard/.kube/config}"
# OS users (space-separated) that receive the vendored agent skills (scripts/workstation/claude-skills).
# Allowlist: install_skills no-ops for anyone not listed. Extend here to roll out to more users.
SKILL_USERS="${SKILL_USERS:-emo}"
log() { echo "[t3-provision] $*"; }
run() { if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] $*"; else "$@"; fi; }
@@ -237,6 +240,79 @@ EOF
log "wrote OIDC kubeconfig -> $user:~/.kube/config"
}
# Hands-off chrome-service browser credential. For a user who has a
# `<os_user>-browser` ServiceAccount in the chrome-service namespace (created in
# stacks/chrome-service/rbac.tf), install a DUAL-CONTEXT kubeconfig whose DEFAULT
# context authenticates with that SA's long-lived token — so `homelab browser`
# (which shells out to `kubectl port-forward -n chrome-service`) works
# non-interactively, even from a headless agent session (the user's interactive
# OIDC login can't authenticate a headless kubectl). The user's personal OIDC
# identity is retained as the `oidc@homelab` named context
# (`kubectl --context oidc@homelab`). TF (the SA's existence) is the source of
# truth for WHO gets this — there is no roster flag. Idempotent (cmp-guarded; SA
# tokens are stable) + best-effort (cluster/secret unreachable -> WARN, never aborts).
install_browser_kubeconfig() {
local user="$1" home kc sa secret token server ca tmp
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -z "$home" ]] && return 0
sa="${user}-browser"
secret="${sa}-token"
[[ -r "$ADMIN_KUBECONFIG" ]] || return 0
# Gate: only users with a chrome-service browser SA (TF-driven). Best-effort read.
KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get serviceaccount "$sa" >/dev/null 2>&1 || return 0
token="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl --request-timeout=10s -n chrome-service get secret "$secret" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d 2>/dev/null || true)"
[[ -n "$token" ]] || { log "WARN: browser SA token not ready for $user (secret chrome-service/$secret) — skipped"; return 0; }
server="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.server}')"
ca="$(KUBECONFIG="$ADMIN_KUBECONFIG" kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')"
[[ -n "$server" && -n "$ca" ]] || { log "WARN: could not read cluster server/CA -> skip browser kubeconfig for $user"; return 0; }
kc="$home/.kube/config"
tmp="$(mktemp)"
cat > "$tmp" <<EOF
apiVersion: v1
kind: Config
clusters:
- name: homelab
cluster:
server: $server
certificate-authority-data: $ca
contexts:
- name: ${sa}@homelab
context:
cluster: homelab
user: $sa
- name: oidc@homelab
context:
cluster: homelab
user: oidc
current-context: ${sa}@homelab
users:
- name: $sa
user:
token: $token
- name: oidc
user:
exec:
apiVersion: client.authentication.k8s.io/v1beta1
command: kubectl
args:
- oidc-login
- get-token
- --oidc-issuer-url=$OIDC_ISSUER
- --oidc-client-id=kubernetes
- --oidc-extra-scope=email
- --oidc-extra-scope=profile
- --oidc-extra-scope=groups
interactiveMode: IfAvailable
EOF
if cmp -s "$tmp" "$kc" 2>/dev/null; then rm -f "$tmp"; return 0; fi # already current -> no churn
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] dual-context (SA default + OIDC) browser kubeconfig -> $user:$kc"; rm -f "$tmp"; return 0; fi
install -d -o "$user" -g "$user" -m 0700 "$home/.kube"
install -o "$user" -g "$user" -m 0600 "$tmp" "$kc" || { log "WARN: failed to write browser kubeconfig for $user"; rm -f "$tmp"; return 0; }
rm -f "$tmp"
log "wrote dual-context browser kubeconfig (SA default + OIDC) -> $user:~/.kube/config"
return 0
}
# Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing
# T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600.
env_set() {
@@ -381,9 +457,133 @@ install_playwright() {
run systemctl enable --now "playwright-snapshot-refresh@$user.timer" >/dev/null 2>&1 || true
}
# Per-user homelab-memory setup — migrate off the claude-memory MCP/plugin to the
# homelab CLI hooks (auto-recall + auto-learn + compaction backup/recovery).
# Idempotent, if-absent, ADDITIVE: never clobbers `env` (the per-user
# MEMORY_API_KEY) or other MCP servers; removes ONLY the `claude_memory` MCP.
# Reuses the user's existing key — does NOT mint one (per-user isolation stays
# deferred, design 2026-06-08). The homelab CLI (/usr/local/bin/homelab) hits the
# same remote HTTP API the MCP used. Hook scripts: $WORKSTATION_DIR/claude-hooks.
install_memory() {
local user="$1" home
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -n "$home" && -d "$home" ]] || return 0
local src="$WORKSTATION_DIR/claude-hooks" hooks_dst="$home/.claude/hooks" settings="$home/.claude/settings.json"
[[ -d "$src" ]] || { log "WARN: $src missing -> skip memory setup for $user"; return 0; }
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] memory: hooks + settings wire + claude_memory MCP removal -> $user"; return 0; fi
# (1) (re)install the 4 hook scripts, owned by the user (refreshed each reconcile so fixes land)
install -d -o "$user" -g "$user" -m 0755 "$hooks_dst"
local h
for h in homelab-memory-recall.py auto-learn.py pre-compact-backup.sh post-compact-recovery.sh; do
install -o "$user" -g "$user" -m 0755 "$src/$h" "$hooks_dst/$h"
done
# (2) wire the hooks in settings.json, if-absent + additive. Run the helper as ROOT:
# it must read $src under the admin's hardened home (mode 700), which a
# runuser-as-$user CANNOT traverse — so chown the result back to the user and
# enforce 0600 (it holds the per-user MEMORY_API_KEY).
if python3 "$src/wire-memory-hooks.py" "$home" >/dev/null 2>&1; then
[[ -f "$settings" ]] && chown "$user:$user" "$settings" 2>/dev/null || true
log "memory hooks wired -> $user"
else
log "WARN: memory hook wiring failed for $user (retries next reconcile)"
fi
[[ -f "$settings" ]] && chmod 600 "$settings" || true
# (2b) reuse the user's existing key; warn (do NOT mint — needs an admin vault write) if absent.
if [[ -f "$settings" ]] && ! grep -q 'MEMORY_API_KEY' "$settings"; then
log "WARN: $user has no MEMORY_API_KEY in settings.json — homelab memory no-ops until an admin mints one"
fi
# (3) remove the now-superseded claude_memory MCP (AS the user, if-present) + the plugin dir.
if runuser -u "$user" -- bash -lc 'command -v claude >/dev/null 2>&1 && claude mcp get claude_memory >/dev/null 2>&1'; then
runuser -u "$user" -- bash -lc 'claude mcp remove claude_memory >/dev/null 2>&1' && log "removed claude_memory MCP -> $user" || true
fi
if [[ -d "$home/.claude/plugins/claude-memory" ]]; then
rm -rf "$home/.claude/plugins/claude-memory" && log "removed claude-memory plugin dir -> $user"
fi
return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the whole reconcile
}
# Per-user agent skills, vendored from the in-repo snapshot ($WORKSTATION_DIR/claude-skills) — the
# `npx skills` upstream drifted off this exact set, so we reproduce it offline + deterministically.
# if-absent + ADDITIVE: copies a skill dir into ~/.agents/skills/<name> (owned by the user) and
# symlinks ~/.claude/skills/<name> -> ../../.agents/skills/<name> (the layout `skills add -g`
# produces; Claude Code reads ~/.claude/skills/). Scoped to SKILL_USERS. if-absent keys on the
# user's OWN copy, so it heals a stale/cross-user ~/.claude/skills symlink but never clobbers a real
# skill dir. Best-effort tail: must return 0 or set -euo pipefail aborts the whole reconcile.
install_skills() {
local user="$1" home
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -n "$home" && -d "$home" ]] || return 0
case " $SKILL_USERS " in *" $user "*) ;; *) return 0 ;; esac
local src_root="$WORKSTATION_DIR/claude-skills"
[[ -d "$src_root" ]] || { log "WARN: $src_root missing -> skip skills for $user"; return 0; }
if [[ "$DRY_RUN" == 1 ]]; then
local d names=""
for d in "$src_root"/*/; do [[ -d "$d" ]] && names+="$(basename "$d") "; done
echo "[dry-run] vendor skills if-absent -> $user: ${names}"
return 0
fi
local agents_dir="$home/.agents/skills" claude_dir="$home/.claude/skills"
# own the parent ~/.agents too (install -d leaves created intermediates root-owned)
install -d -o "$user" -g "$user" -m 0755 "$home/.agents" "$agents_dir" "$claude_dir"
chown "$user:$user" "$home/.agents" || true
local skill name dst link n=0
for skill in "$src_root"/*/; do
[[ -d "$skill" ]] || continue
name="$(basename "$skill")"
dst="$agents_dir/$name"
link="$claude_dir/$name"
# if-absent keys on the user's OWN copy (a real dir under ~/.agents/skills), NOT on any
# pre-existing ~/.claude/skills entry — so a stale or cross-user symlink gets healed.
if [[ ! -d "$dst" ]]; then
cp -a "$src_root/$name" "$dst" || { log "WARN: copy skill $name -> $user failed"; continue; }
chown -R "$user:$user" "$dst" || true
n=$((n+1))
fi
# point ~/.claude/skills/<name> at the user's own copy (replacing a stale/cross-user symlink);
# never clobber a real dir/file squatting that name.
if [[ -d "$link" && ! -L "$link" ]]; then
log "WARN: $claude_dir/$name is a real dir (left as-is) for $user"
elif [[ "$(readlink "$link" 2>/dev/null)" != "../../.agents/skills/$name" ]]; then
ln -sfn "../../.agents/skills/$name" "$link" && chown -h "$user:$user" "$link" || log "WARN: link skill $name -> $user failed"
fi
done
if [[ "$n" -gt 0 ]]; then log "vendored/healed $n skill(s) -> $user"; fi
return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the reconcile
}
[[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
[[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
# 0) self-deploy: the repo is the authoring surface (like sync_managed_config /
# deploy_user_launcher below). Nothing else redeploys /usr/local/bin (only the
# manual setup-devvm.sh did) — so a committed edit silently never reached the
# hourly run until now (the homelab-memory rollout sat undeployed for a day).
# If the repo copy differs, install it and re-exec the fresh binary. Guarded:
# re-exec flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no
# mutation), cmp (no churn when unchanged).
SELF_SRC="$WORKSTATION_DIR/../t3-provision-users.sh"
SELF_DST=/usr/local/bin/t3-provision-users
if [[ -z "${T3_PROVISION_SELF_DEPLOYED:-}" && -r "$SELF_SRC" ]] && ! cmp -s "$SELF_SRC" "$SELF_DST"; then
if [[ "$DRY_RUN" == 1 ]]; then
echo "[dry-run] self-deploy $SELF_DST from repo (changed)"
elif bash -n "$SELF_SRC" 2>/dev/null; then
install -m 0755 "$SELF_SRC" "$SELF_DST"
log "self-deployed $SELF_DST from repo (changed) — re-exec"
exec env T3_PROVISION_SELF_DEPLOYED=1 "$SELF_DST" "$@"
else
log "WARN: repo t3-provision-users.sh fails 'bash -n' — keeping deployed copy"
fi
fi
install -d -m 0755 "$ENVDIR"
# 1) current sticky ports from existing .env files -> {os_user: port}
@@ -467,6 +667,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
refresh_user_clone "$os_user" code
fi
install_user_kubeconfig "$os_user"
install_browser_kubeconfig "$os_user" # hands-off chrome-service CLI cred (no-op unless the user has a browser SA)
deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts)
fi
refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd
@@ -494,6 +695,21 @@ while IFS=$'\t' read -r os_user pw_port; do
install_playwright "$os_user"
done < <(jq -r '.playwright_ports | to_entries[] | [.key, .value] | @tsv' "$desired_file")
# 5d) per-user homelab-memory (ALL users): replace the claude-memory MCP/plugin with the
# homelab CLI memory hooks. Idempotent + additive + if-absent; never touches the
# per-user MEMORY_API_KEY or other MCP servers (removes ONLY claude_memory).
while IFS=$'\t' read -r os_user; do
id "$os_user" >/dev/null 2>&1 || continue
install_memory "$os_user"
done < <(jq -r '.accounts[].os_user' "$desired_file")
# 5e) per-user agent skills (SKILL_USERS allowlist only): vendored snapshot -> ~/.agents/skills
# + ~/.claude/skills symlinks. if-absent + additive; best-effort (never aborts the reconcile).
while IFS=$'\t' read -r os_user; do
id "$os_user" >/dev/null 2>&1 || continue
install_skills "$os_user"
done < <(jq -r '.accounts[].os_user' "$desired_file")
# 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it
# follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md).
# NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing

@@ -0,0 +1,96 @@
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (honour the original T3_* override names) -----------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}}"
: "${STATE_DIR:=${T3_STATE_DIR:-/var/lib/t3-autoupdate}}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=${T3_BACKUP_DEST:-/var/backups/t3-state}}"
: "${DISPATCH:=${T3_DISPATCH:-127.0.0.1:3780}}"
: "${USER_MAP:=${T3_USER_MAP:-/etc/ttyd-user-map}}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}
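
Both `osusers` and `ak_for` in this library parse the same `authentik=os_user` map, tolerating comments and stray whitespace around the `=`. The sketch below copies the two one-liners and runs them against a throwaway map file. The file contents and the usernames `alice`, `bob` and `carol` are made up for illustration, and it assumes an awk that supports POSIX character classes such as `[[:space:]]`.

```shell
#!/usr/bin/env bash
# Sketch: a temp file stands in for /etc/ttyd-user-map; all names are invented.
USER_MAP="$(mktemp)"
cat >"$USER_MAP" <<'EOF'
# authentik=os_user
alice = wizard
bob=emo
carol=emo
EOF

# OS users on the right-hand side of each non-comment line, de-duplicated
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }

# first authentik username that maps to the given OS user
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }

osusers           # one OS user per line, sorted
ak_for emo        # first match wins when several authentik users share an OS user
```

An OS user with no mapping yields an empty string from `ak_for`, which is why `verify_pairing` skips the dispatch check in that case instead of failing.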

@@ -11,6 +11,12 @@ Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
Environment=NODE_ENV=production
EnvironmentFile=/etc/t3-serve/%i.env
# Optional per-user long-lived CLAUDE_CODE_OAUTH_TOKEN, materialized by
# claude-auth-sync from the user's own Vault path. Non-rotating, so t3's
# concurrent agent sessions can't race on OAuth refresh-token rotation and wipe
# the shared ~/.claude/.credentials.json. Leading '-' = optional (absent for
# users on the normal per-user Enterprise-SSO credential flow).
EnvironmentFile=-/home/%i/.config/claude-auth-sync/claude-oauth.env
WorkingDirectory=/home/%i
ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
Restart=on-failure

@@ -28,5 +28,61 @@ ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth
no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca
no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca
# --- Regression: cas_backup must MERGE into the shared Vault path, preserving
# sibling keys that other tools co-locate there (e.g. `homelab vault`'s
# vaultwarden_* creds) — NOT overwrite the whole KV document. A blind `kv put`
# wiped them every 6h (claude-auth-sync clobber, 2026-06-26).
fakebin="$tmp/bin"; mkdir -p "$fakebin"
store="$tmp/vault-store.json"
cat > "$fakebin/vault" <<'FAKE'
#!/usr/bin/env bash
# Minimal KV-v2 fake backed by $VAULT_FAKE_STORE (a flat JSON object).
[[ "$1" == kv ]] || { echo '{}'; exit 0; } # token lookup etc. -> ignore
op="$2"; shift 2
store="$VAULT_FAKE_STORE"
case "$op" in
get)
for a in "$@"; do [[ "$a" == -field=* ]] && field="${a#-field=}"; done
if [[ "$*" == *-format=json* ]]; then
[[ -f "$store" ]] || { echo "No value found"; exit 2; }
jq -n --argjson d "$(cat "$store")" '{data:{data:$d}}'; exit 0
fi
[[ -f "$store" ]] || exit 2 # bare get == existence check
if [[ -n "${field:-}" ]]; then
v="$(jq -r --arg k "$field" '.[$k] // empty' "$store")"; [[ -n "$v" ]] || exit 1
printf '%s' "$v"; exit 0
fi
exit 0 ;;
put) echo '{}' > "$store" ;; # full replace
patch) [[ -f "$store" ]] || { echo "No value found"; exit 2; } ;; # merge (rw)
*) exit 1 ;;
esac
for a in "$@"; do
case "$a" in
-*|secret/*) continue ;; # flags + the path arg
*=*) k="${a%%=*}"; v="${a#*=}"
t="$(mktemp)"; jq --arg k "$k" --arg v "$v" '.[$k]=$v' "$store" > "$t" && mv "$t" "$store" ;;
esac
done
exit 0
FAKE
chmod +x "$fakebin/vault"
CAS_VAULT_PATH="secret/workstation/claude-users/test"
CAS_CREDENTIALS="$tmp/credentials.json"
CAS_STATE_DIR="$tmp/state"
_oldpath="$PATH"; PATH="$fakebin:$PATH"; export VAULT_FAKE_STORE="$store"
printf '{"vaultwarden_master_password":"keep-me"}\n' > "$store" # pretend `homelab vault setup` ran
ok "backup succeeds (existing doc)" cas_backup
eq "merge preserves sibling key" keep-me "$(jq -r '.vaultwarden_master_password' "$store")"
eq "merge writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
rm -f "$store" # fresh user: no doc yet
ok "backup succeeds (creates doc)" cas_backup
eq "create writes claude oauth" access "$(jq -r '.claude_ai_oauth_json|fromjson|.accessToken' "$store")"
PATH="$_oldpath"; unset VAULT_FAKE_STORE
printf '\n%d passed, %d failed\n' "$pass" "$fail"
(( fail == 0 ))

@@ -0,0 +1,102 @@
#!/usr/bin/env python3
"""Tests for scripts/tg lock-timeout injection.
scripts/tg wraps terragrunt. Tier-1 stacks rely on terraform's pg-backend
state lock; without -lock-timeout an apply fails instantly ("Error acquiring
the state lock") whenever anything else holds the lock — a Woodpecker-killed
run whose PG advisory lock has not been reaped yet, a concurrent local apply,
or the daily drift `plan`. This was the single largest cause of infra CI
failures. These tests pin that tg injects -lock-timeout for state-locking
verbs (and still preserves -auto-approve for non-interactive applies), so a
contended lock waits rather than fails.
Hermetic: a stub `terragrunt` on PATH records the args tg forwards; PG_CONN_STR
is pre-set so the Tier-1 Vault credential fetch is skipped (no network/Vault).
"""
import os
import shutil
import subprocess
from pathlib import Path

import pytest

SCRIPTS_DIR = Path(__file__).resolve().parent
TG = SCRIPTS_DIR / "tg"
AUTH_CHECK = SCRIPTS_DIR / "check-ingress-auth-comments.py"


def _run(tmp_path, *tg_args, env_extra=None):
    """Run a copy of scripts/tg in an isolated fake repo; return forwarded args."""
    repo = tmp_path / "repo"
    (repo / "scripts").mkdir(parents=True)
    shutil.copy(TG, repo / "scripts" / "tg")
    shutil.copy(AUTH_CHECK, repo / "scripts" / "check-ingress-auth-comments.py")
    os.chmod(repo / "scripts" / "tg", 0o755)
    os.chmod(repo / "scripts" / "check-ingress-auth-comments.py", 0o755)

    # Fake Tier-1 stack ("faketest" is NOT in TIER0_STACKS), no ingress auth lines.
    stack = repo / "stacks" / "faketest"
    stack.mkdir(parents=True)
    (stack / "terragrunt.hcl").write_text("# fake\n")
    (stack / "main.tf").write_text("# no ingress_factory auth lines here\n")

    # Stub terragrunt: append every forwarded arg (one per line) to a capture file.
    bindir = tmp_path / "bin"
    bindir.mkdir()
    capture = tmp_path / "tg_args.txt"
    stub = bindir / "terragrunt"
    stub.write_text(
        "#!/usr/bin/env bash\n"
        f'for a in "$@"; do echo "$a" >> "{capture}"; done\n'
        "exit 0\n"
    )
    os.chmod(stub, 0o755)

    env = dict(os.environ)
    env["PATH"] = f"{bindir}:{env['PATH']}"
    env["PG_CONN_STR"] = "postgres://stub"  # skip the Tier-1 Vault cred fetch
    env["TF_PLUGIN_CACHE_DIR"] = str(tmp_path / "plugin-cache")
    if env_extra:
        env.update(env_extra)

    proc = subprocess.run(
        ["bash", str(repo / "scripts" / "tg"), *tg_args],
        cwd=str(stack),
        env=env,
        capture_output=True,
        text=True,
    )
    assert proc.returncode == 0, f"tg exited {proc.returncode}\nSTDERR:\n{proc.stderr}\nSTDOUT:\n{proc.stdout}"
    return capture.read_text().splitlines() if capture.exists() else []


def test_apply_non_interactive_has_lock_timeout_and_auto_approve(tmp_path):
    args = _run(tmp_path, "apply", "--non-interactive")
    assert "apply" in args
    assert "-auto-approve" in args, "non-interactive apply must keep -auto-approve"
    assert "-lock-timeout=5m" in args, "apply must wait for a contended state lock"


def test_plan_has_lock_timeout_but_not_auto_approve(tmp_path):
    args = _run(tmp_path, "plan")
    assert "plan" in args
    assert "-lock-timeout=5m" in args
    assert "-auto-approve" not in args, "plan must never get -auto-approve"


@pytest.mark.parametrize("verb", ["destroy", "refresh"])
def test_locking_verb_gets_lock_timeout(tmp_path, verb):
    args = _run(tmp_path, verb)
    assert "-lock-timeout=5m" in args, f"{verb} should carry -lock-timeout"


def test_non_locking_verb_has_no_lock_timeout(tmp_path):
    # validate does not take a state lock, so it must not carry -lock-timeout.
    args = _run(tmp_path, "validate")
    assert "validate" in args
    assert not any(a.startswith("-lock-timeout") for a in args)


def test_lock_timeout_is_env_overridable(tmp_path):
    args = _run(tmp_path, "plan", env_extra={"TG_LOCK_TIMEOUT": "2m"})
    assert "-lock-timeout=2m" in args
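
Review note: the contract these tests pin can be modelled in a few lines of Python. This is only a sketch of the expected behaviour, not the script itself. `build_tg_args` and `LOCKING_VERBS` are hypothetical names, the verb list is taken from the comment in `scripts/tg`, and how the real script derives its own `is_tf_op` flag is not visible in this diff.

```python
import os

# Assumption: verbs that take a state lock, per the comment in scripts/tg.
LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}


def build_tg_args(args, env=None):
    """Return the argument list a tg-like wrapper would forward to terragrunt."""
    env = os.environ if env is None else env
    # Mirrors bash ${TG_LOCK_TIMEOUT:-5m}: unset OR empty falls back to 5m.
    lock_timeout = env.get("TG_LOCK_TIMEOUT") or "5m"
    non_interactive = "--non-interactive" in args

    out = []
    for arg in args:
        out.append(arg)
        # -auto-approve goes right after `apply`, and only for CI-style runs.
        if arg == "apply" and non_interactive:
            out.append("-auto-approve")
    # Only state-locking verbs get the wait window.
    if any(a in LOCKING_VERBS for a in args):
        out.append(f"-lock-timeout={lock_timeout}")
    return out
```

Calling `build_tg_args(["validate"], env={})` returns the arguments unchanged, which is the property `test_non_locking_verb_has_no_lock_timeout` checks against the real script.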

@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac
export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1
mkdir -p "$TF_PLUGIN_CACHE_DIR"
# State-lock wait window. Tier-1 stacks lock their state via terraform's pg
# backend (pg_advisory_lock); with no timeout an apply fails instantly
# ("Error acquiring the state lock") the moment anything else holds the lock —
# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local
# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those
# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of
# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast).
LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}"
# Determine stack name from cwd (relative to stacks/)
STACK_NAME=""
cwd="$(pwd)"
@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then
fi
fi
# If running apply with --non-interactive, add -auto-approve for Terraform
# Build the terragrunt invocation:
# - add -auto-approve right after `apply` for --non-interactive runs (CI)
# - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so
# a contended state lock WAITS instead of failing instantly (see
# LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it.
args=("$@")
has_apply=false
has_non_interactive=false
for arg in "${args[@]}"; do
case "$arg" in
apply) has_apply=true ;;
--non-interactive) has_non_interactive=true ;;
esac
done
if $has_apply && $has_non_interactive; then
new_args=()
for arg in "${args[@]}"; do
new_args+=("$arg")
if [ "$arg" = "apply" ]; then
new_args+=("-auto-approve")
tg_args=()
for arg in "${args[@]}"; do
tg_args+=("$arg")
if [ "$arg" = "apply" ] && $has_non_interactive; then
tg_args+=("-auto-approve")
fi
done
terragrunt "${new_args[@]}"
else
terragrunt "$@"
done
if $is_tf_op; then
tg_args+=("-lock-timeout=$LOCK_TIMEOUT")
fi
terragrunt "${tg_args[@]}"
# After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative)
if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then

@ -13,6 +13,10 @@ CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-toke
CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}"
CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}"
CAS_LOG="$CAS_STATE_DIR/sync.log"
# Where a long-lived per-user setup-token is materialized as an env file
# (KEY=VALUE) for start-claude.sh + t3-serve@.service to load. Lives under the
# already-ReadWritePaths config dir so the sandboxed service may write it.
CAS_TOKEN_ENV_FILE="${CLAUDE_AUTH_TOKEN_ENV_FILE:-$CAS_CONFIG_DIR/claude-oauth.env}"
cas_log() {
mkdir -p "$CAS_STATE_DIR"
@ -82,7 +86,17 @@ cas_backup() {
return 1
}
expires="$(jq -r '.expiresAt' <<<"$oauth")"
vault kv put "$CAS_VAULT_PATH" \
# MERGE into the shared path so sibling keys other tools co-locate there
# (e.g. `homelab vault`'s vaultwarden_* creds) survive. `kv patch -method=rw`
# is read+update (needs no `patch` capability) but requires the secret to
# already exist, so create it with `kv put` on the very first backup only.
local -a write_cmd
if vault kv get "$CAS_VAULT_PATH" >/dev/null 2>&1; then
write_cmd=(vault kv patch -method=rw "$CAS_VAULT_PATH")
else
write_cmd=(vault kv put "$CAS_VAULT_PATH")
fi
"${write_cmd[@]}" \
claude_ai_oauth_json="$oauth" \
credential_expires_at_ms="$expires" \
backed_up_at="$(date -Is)" >/dev/null || {
@ -123,6 +137,41 @@ cas_restore() {
cas_log "RECOVERED restored Claude OAuth state from Vault"
}
# A user-scoped, long-lived setup-token (`sk-ant-oat01-…`, ~1y, NON-rotating) may
# be stored in this user's OWN Vault path (field `setup_token`). When present it
# is the authoritative credential: it bypasses the shared
# ~/.claude/.credentials.json OAuth refresh-token rotation entirely — the fix for
# users running many concurrent Claude sessions (interactive + t3-serve + always-on
# agents) that otherwise race on refresh and wipe each other's refresh token.
# We materialize it to a user-owned env file that start-claude.sh and
# t3-serve@.service load as CLAUDE_CODE_OAUTH_TOKEN. This is the user's OWN
# Enterprise identity, NOT the forbidden legacy SHARED token — it never crosses
# OS users. Returns 0 when a token is active, so the caller skips the
# rotating-credential validate/backup/restore (probing the now-vestigial
# credential would otherwise emit false WorkstationClaudeAuthInvalid alerts).
cas_sync_setup_token() {
local token desired tmp
token="$(vault kv get -field=setup_token "$CAS_VAULT_PATH" 2>/dev/null)" || token=""
if [[ "$token" != sk-ant-oat01-* ]]; then
if [[ -e "$CAS_TOKEN_ENV_FILE" ]]; then
rm -f "$CAS_TOKEN_ENV_FILE"
cas_log "removed stale CLAUDE_CODE_OAUTH_TOKEN env (no setup-token in Vault)"
fi
return 1
fi
desired="CLAUDE_CODE_OAUTH_TOKEN=$token"
if [[ -r "$CAS_TOKEN_ENV_FILE" && "$(<"$CAS_TOKEN_ENV_FILE")" == "$desired" ]]; then
cas_log "OK long-lived setup-token active (CLAUDE_CODE_OAUTH_TOKEN current); credential checks skipped"
return 0
fi
tmp="$(mktemp "${CAS_TOKEN_ENV_FILE}.XXXXXX")" || { cas_log "FAIL could not stage token env file"; return 1; }
printf '%s\n' "$desired" > "$tmp"
chmod 0600 "$tmp"
mv "$tmp" "$CAS_TOKEN_ENV_FILE"
cas_log "OK long-lived setup-token active; CLAUDE_CODE_OAUTH_TOKEN materialized; credential checks skipped"
return 0
}
cas_main() {
umask 077
for bin in jq vault claude timeout flock; do
@ -133,6 +182,11 @@ cas_main() {
flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; }
cas_prepare_vault || return 1
# A long-lived per-user setup-token, if provisioned, is authoritative and
# non-rotating — materialize it and skip the rotating-credential dance.
if cas_sync_setup_token; then
return 0
fi
if cas_live_auth_ok; then
cas_backup
return
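
Review note: the write-if-changed, stage-then-rename pattern used by `cas_sync_setup_token` can be sketched in Python as follows. This is an illustration under assumptions, not a port of the script: `materialize_env_file` is a hypothetical name, a POSIX filesystem is assumed, and the token value is passed in rather than read from Vault.

```python
import os
import tempfile


def materialize_env_file(path, token):
    """Write KEY=VALUE atomically with mode 0600; return True if the file changed."""
    desired = f"CLAUDE_CODE_OAUTH_TOKEN={token}\n"
    try:
        with open(path, encoding="utf-8") as fh:
            if fh.read() == desired:
                return False  # already current, nothing to do
    except FileNotFoundError:
        pass

    directory = os.path.dirname(os.path.abspath(path))
    # Stage in the same directory so os.replace is an atomic rename.
    fd, tmp = tempfile.mkstemp(prefix=os.path.basename(path) + ".", dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(desired)
        os.chmod(tmp, 0o600)
        os.replace(tmp, path)
    except BaseException:
        # Do not leave a half-written staging file behind.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

Readers of the env file never observe a partial write, because the target path only ever points at a fully written file.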

@ -0,0 +1,184 @@
#!/usr/bin/env python3
"""
Stop hook (async): automatic learning extraction via haiku-as-judge.

After each Claude response, sends the user message + assistant response to
haiku to detect corrections, preferences, decisions, or facts worth storing.
If learning events are detected, stores them via the `homelab memory` CLI, the
only sanctioned memory path on the devvm (no direct HTTP, no local SQLite).

Runs with async: true, so it does NOT block the user.
"""
import io
import json
import logging
import os
import shutil
import subprocess
import sys
logger = logging.getLogger(__name__)
JUDGE_PROMPT = """You are a memory extraction judge. Analyze this exchange between a user and an AI assistant.
USER MESSAGE:
{user_message}
ASSISTANT RESPONSE:
{assistant_response}
Your job: determine if any of these learning events occurred:
1. USER CORRECTION: user corrected the assistant's mistake or misunderstanding
2. PREFERENCE: user stated a preference, habit, or "I like/prefer/want" statement
3. DECISION: a decision was reached about how to do something
4. FACT: user shared a durable fact about themselves, their team, tools, or environment
If ANY learning event occurred, return JSON:
{{"events": [{{"type": "correction|preference|decision|fact", "content": "concise fact to remember (one sentence)", "importance": 0.7, "expanded_keywords": "space-separated semantically related search terms for recall (minimum 5 words)", "supersedes": null}}]}}
If NO learning event occurred, return:
{{"events": []}}
Rules:
- Only extract DURABLE facts, not transient task details
- Corrections are highest value (0.8-0.9)
- Be conservative: false negatives are better than false positives
- "expanded_keywords" should include synonyms, related concepts, and adjacent topics that would help find this memory later
- "supersedes" should be a search query to find the old outdated memory, or null
- Return ONLY valid JSON, no other text"""


def _store_via_homelab_cli(content, category, tags, importance, expanded_keywords):
    """Store one memory via the homelab CLI — the only sanctioned memory path on
    the devvm (no direct HTTP, no local SQLite). The CLI defaults the API URL and
    reads CLAUDE_MEMORY_API_KEY / MEMORY_API_KEY from the environment; if neither
    is set (e.g. a user without a minted key) it no-ops silently."""
    homelab = shutil.which("homelab") or "/usr/local/bin/homelab"
    if not os.path.exists(homelab):
        return
    if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")):
        return
    cmd = [
        homelab, "memory", "store", content,
        "--category", category,
        "--tags", tags,
        "--importance", str(importance),
    ]
    if expanded_keywords:
        # CLI wants comma-separated keywords; the judge emits space-separated terms.
        keywords = ",".join(expanded_keywords.replace(",", " ").split())
        if keywords:
            cmd += ["--keywords", keywords]
    subprocess.run(cmd, capture_output=True, text=True, timeout=15, env=os.environ)


def main() -> None:
    # Graceful exit if claude CLI is not available
    if not shutil.which("claude"):
        return

    try:
        hook_input = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        return

    if isinstance(hook_input, dict) and hook_input.get("stop_hook_active", False):
        return

    transcript_path = ""
    if isinstance(hook_input, dict):
        transcript_path = hook_input.get("transcript_path", "")
    if not transcript_path or not os.path.exists(transcript_path):
        return

    user_message = ""
    assistant_response = ""
    try:
        MAX_TAIL_BYTES = 50_000
        with open(transcript_path, "rb") as f:
            f.seek(0, io.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - MAX_TAIL_BYTES))
            tail = f.read().decode("utf-8", errors="replace")
        lines = tail.split("\n")
        for line in reversed(lines):
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            role = entry.get("role", "")
            content = entry.get("content", "")
            if isinstance(content, list):
                content = " ".join(
                    b.get("text", "") for b in content
                    if isinstance(b, dict) and b.get("type") == "text"
                )
            content = str(content)[:2000]
            if role == "assistant" and not assistant_response:
                assistant_response = content
            elif role == "user" and not user_message:
                user_message = content
            if user_message and assistant_response:
                break
    except Exception:
        return

    if not user_message or len(user_message.strip()) < 10:
        return

    prompt = JUDGE_PROMPT.format(
        user_message=user_message,
        assistant_response=assistant_response[:1000],
    )

    try:
        result = subprocess.run(
            ["claude", "-p", prompt, "--model", "haiku"],
            capture_output=True, text=True, timeout=30,
            env={**os.environ, "CLAUDECODE": ""},
        )
        if result.returncode != 0:
            return
        response_text = result.stdout.strip()
        if response_text.startswith("```"):
            lines = response_text.split("\n")
            lines = [l for l in lines if not l.strip().startswith("```")]
            response_text = "\n".join(lines).strip()
        judge_result = json.loads(response_text)
        events = judge_result.get("events", [])
        if not events:
            return
    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError):
        return

    category_map = {
        "correction": "preferences",
        "preference": "preferences",
        "decision": "decisions",
        "fact": "facts",
    }

    for event in events:
        content = event.get("content", "")
        if not content:
            continue
        event_type = event.get("type", "fact")
        importance = max(0.0, min(1.0, float(event.get("importance", 0.7))))
        category = category_map.get(event_type, "facts")
        tags = f"auto-learned,{event_type}"
        expanded_keywords = event.get("expanded_keywords", "")
        try:
            _store_via_homelab_cli(content, category, tags, importance, expanded_keywords)
        except Exception:
            pass  # Never crash the async hook


if __name__ == "__main__":
    main()
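
Review note: the judge-output handling in this hook (strip an optional Markdown fence, parse JSON, normalise keywords) can be isolated into small pure functions, which makes it easy to unit-test without calling the `claude` CLI. The sketch below is a hedged illustration; `parse_judge_output` and `normalize_keywords` are hypothetical helpers that do not exist in the hook.

```python
import json

FENCE = "`" * 3  # a Markdown code-fence marker, built without writing it literally


def parse_judge_output(text):
    """Return the list of events from a judge reply, or [] if it is unusable."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening and closing fence lines, keep the payload between them.
        kept = [ln for ln in text.split("\n") if not ln.strip().startswith(FENCE)]
        text = "\n".join(kept).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, dict):
        return []
    events = parsed.get("events", [])
    return events if isinstance(events, list) else []


def normalize_keywords(expanded_keywords):
    """Turn space- or comma-separated terms into one comma-separated string."""
    return ",".join(expanded_keywords.replace(",", " ").split())
```

Unlike the hook, this sketch also tolerates a reply that is valid JSON but not an object, which is an extra guard rather than behaviour taken from the source.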

@ -0,0 +1,76 @@
#!/usr/bin/env python3
"""UserPromptSubmit hook: inject relevant memories via `homelab memory recall`.
Replaces the claude-memory MCP recall path. Instead of instructing the model to
call the memory_recall MCP tool, this hook runs the homelab CLI (a direct client
to the same claude-memory HTTP API) and injects the ACTUAL results as context
so recall is automatic, needs no model tool-call, and works with the MCP
uninstalled. Best-effort: any failure exits 0 silently (recall just doesn't
happen that turn, exactly like the MCP being unavailable).
Wizard-only trial of the MCP deprecation (2026-06-20). Reversible: restore the
plugin command in ~/.claude/settings.json (backup: settings.json.bak-pre-homelab-memory).
"""
import json
import os
import shutil
import subprocess
import sys


def main() -> None:
    try:
        hook_input = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        return

    prompt = ""
    if isinstance(hook_input, dict):
        prompt = hook_input.get("prompt") or hook_input.get("user_prompt") or ""
        if not prompt and isinstance(hook_input.get("content"), str):
            prompt = hook_input["content"]
    prompt = (prompt or "").strip()

    # Same gates as the original recall hook: skip short prompts, code/JSON/XML blobs.
    if len(prompt) < 10 or prompt[0] in "`{<":
        return

    homelab = shutil.which("homelab") or "/usr/local/bin/homelab"
    if not os.path.exists(homelab):
        return
    if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")):
        return

    try:
        res = subprocess.run(
            [homelab, "memory", "recall", prompt, "--limit", "5"],
            capture_output=True, text=True, errors="replace", timeout=4,
            env=os.environ,
        )
    except Exception:
        # Best-effort: ANY failure — timeout, OSError, or a UnicodeDecodeError on
        # truncated multibyte (Cyrillic) output — must silently skip recall this
        # turn, exactly like the MCP being unavailable. errors="replace" above
        # also keeps a mid-rune-truncated payload from raising here at all. Never
        # let this hook surface a "UserPromptSubmit hook error".
        return

    out = (res.stdout or "").strip()
    if res.returncode != 0 or not out:
        return

    context = (
        "Relevant stored memories (via `homelab memory recall`) — incorporate "
        "naturally if useful; do NOT mention this lookup to the user:\n\n" + out
    )
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": context,
        }
    }))


if __name__ == "__main__":
    main()
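
Review note: the two decisions this hook makes (whether to recall at all, and what JSON to emit) are pure functions of their inputs and could be tested without the `homelab` binary. A minimal sketch follows, assuming the same gates and payload shape as the hook above; `should_recall` and `build_hook_output` are hypothetical names.

```python
import json


def should_recall(prompt):
    """Apply the hook's gates: skip short prompts and code/JSON/XML-looking blobs."""
    prompt = (prompt or "").strip()
    return len(prompt) >= 10 and prompt[0] not in "`{<"


def build_hook_output(recall_text):
    """Wrap recalled memories in the UserPromptSubmit additionalContext payload."""
    context = (
        "Relevant stored memories (via `homelab memory recall`):\n\n" + recall_text
    )
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": context,
        }
    })
```

The context wording here is shortened for the example; the hook's own string is the one that matters in production.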

@ -0,0 +1,64 @@
#!/bin/bash
# UserPromptSubmit hook: Inject recovery context after compaction
# This hook runs on each user prompt, but only injects context once after compaction.
# Read hook input from stdin
INPUT=$(cat)
# Extract session ID
SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"')
# Define marker path
MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}"
MARKER_DIR="${MEMORY_HOME}/state/compaction-markers"
MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json"
# Fast path: no marker means no recent compaction, exit immediately
if [ ! -f "$MARKER_FILE" ]; then
exit 0
fi
# Read marker contents
MARKER=$(cat "$MARKER_FILE")
# Validate JSON before processing
if ! echo "$MARKER" | jq -e . >/dev/null 2>&1; then
rm -f "$MARKER_FILE"
exit 0
fi
# Extract data from marker
COMPACTED_AT=$(echo "$MARKER" | jq -r '.compactedAt // "unknown"')
PERSONALITY=$(echo "$MARKER" | jq -r '.personalityReminder // ""')
# Build remembered facts summary (limit to ~500 chars)
FACTS_SUMMARY=$(echo "$MARKER" | jq -r '
.rememberedFacts[:10] |
map("- [\(.category // "fact")] \(.content)") |
join("\n")
' 2>/dev/null || echo "")
# Build recovery context (kept under 1000 tokens)
RECOVERY_CONTEXT="[Claude Memory Recovery - Context compacted at ${COMPACTED_AT}]
${PERSONALITY}
Key memories from before compaction:
${FACTS_SUMMARY}
Use the memory_recall MCP tool if you need more context about past conversations."
# Output JSON with additional context for injection
cat << EOF
{
"hookSpecificOutput": {
"hookEventName": "UserPromptSubmit",
"additionalContext": $(echo "$RECOVERY_CONTEXT" | jq -Rs .)
}
}
EOF
# Delete marker file (one-time injection)
rm -f "$MARKER_FILE"
exit 0
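
Review note: the one-shot marker contract between the PreCompact and UserPromptSubmit hooks (write a marker before compaction, consume and delete it on the next prompt) can be sketched in Python. This is an assumption-labelled illustration, not a replacement for the shell script: `consume_marker` is a hypothetical name and the context text is abbreviated.

```python
import json
import os


def consume_marker(marker_file):
    """Return recovery context once and delete the marker; None if absent or invalid."""
    if not os.path.isfile(marker_file):
        return None  # fast path: no recent compaction
    try:
        with open(marker_file, encoding="utf-8") as fh:
            marker = json.load(fh)
    except (OSError, ValueError):
        marker = None
    finally:
        # One-time injection: the marker goes away whether or not it parsed.
        try:
            os.remove(marker_file)
        except OSError:
            pass
    if not isinstance(marker, dict):
        return None
    facts = marker.get("rememberedFacts") or []
    summary = "\n".join(
        f"- [{fact.get('category') or 'fact'}] {fact.get('content')}"
        for fact in facts[:10]
        if isinstance(fact, dict)
    )
    compacted_at = marker.get("compactedAt") or "unknown"
    return (
        f"[Claude Memory Recovery - Context compacted at {compacted_at}]\n"
        f"Key memories from before compaction:\n{summary}"
    )
```

A second call for the same session returns `None`, which is what keeps the recovery text from being injected on every prompt.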

@ -0,0 +1,43 @@
#!/bin/bash
# PreCompact hook: Save key memories before compaction
set -e
INPUT=$(cat)
SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"')
MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}"
MARKER_DIR="${MEMORY_HOME}/state/compaction-markers"
MEMORY_DB="${MEMORY_HOME}/memory/memory.db"
MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json"
mkdir -p "$MARKER_DIR"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Try API first, fall back to SQLite
REMEMBERED_FACTS="[]"
if [ -n "${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" ]; then
API_KEY="${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}"
API_URL="${MEMORY_API_URL:-${CLAUDE_MEMORY_API_URL:-}}"
if [ -n "$API_URL" ]; then
REMEMBERED_FACTS=$(curl -sf -H "Authorization: Bearer ${API_KEY}" \
"${API_URL}/api/memories?limit=20" 2>/dev/null | \
jq '[.memories[] | {content, category, importance}]' 2>/dev/null || echo "[]")
fi
elif [ -f "$MEMORY_DB" ]; then
REMEMBERED_FACTS=$(sqlite3 -json "$MEMORY_DB" \
"SELECT content, category, importance FROM memories ORDER BY importance DESC, created_at DESC LIMIT 20" 2>/dev/null || echo "[]")
fi
if ! echo "$REMEMBERED_FACTS" | jq empty 2>/dev/null; then
REMEMBERED_FACTS="[]"
fi
jq -n \
--arg sid "$SESSION_ID" \
--arg ts "$TIMESTAMP" \
--argjson facts "$REMEMBERED_FACTS" \
'{sessionId: $sid, compactedAt: $ts, rememberedFacts: $facts}' \
> "$MARKER_FILE"
exit 0

@ -0,0 +1,90 @@
#!/usr/bin/env python3
"""Wire the homelab-memory hooks into a user's ~/.claude/settings.json.
Part of the claude-memory MCP -> homelab CLI migration (all-users rollout).
Two passes, idempotent, never touching `env` (the per-user MEMORY_API_KEY) or any
other setting:
(0) PRUNE any hook command still pointing at the retired claude-memory plugin
(`plugins/claude-memory/hooks/`). install_memory() rm -rf's that dir, so
those entries are dangling and a missing UserPromptSubmit hook exits 2,
a BLOCKING error that erases the prompt and freezes the session (devvm emo
incident 2026-06-22). Must run BEFORE the additive pass: the plugin shares
basenames with the homelab hooks, so without pruning, the "already present"
check below matches the dead plugin path and skips the real install.
(1) ADD each homelab hook group when no existing command references its script.
Usage: wire-memory-hooks.py <home_dir>
Exit 0 on success (changed or already-present); 1 only on an unreadable settings file.
"""
import json
import os
import sys

home = sys.argv[1]
settings = os.path.join(home, ".claude", "settings.json")
hooks_dir = os.path.join(home, ".claude", "hooks")

# (event, script-basename used for the if-absent check, full command, extra fields)
WANT = [
    ("PreCompact", "pre-compact-backup.sh", f"{hooks_dir}/pre-compact-backup.sh", {"timeout": 30}),
    ("UserPromptSubmit", "post-compact-recovery.sh", f"{hooks_dir}/post-compact-recovery.sh", {"timeout": 10}),
    ("UserPromptSubmit", "homelab-memory-recall.py", f"python3 {hooks_dir}/homelab-memory-recall.py", {"timeout": 8}),
    ("Stop", "auto-learn.py", f"python3 {hooks_dir}/auto-learn.py", {"async": True}),
]

try:
    if os.path.exists(settings) and os.path.getsize(settings) > 0:
        with open(settings) as fh:
            data = json.load(fh)
    else:
        data = {}
except (json.JSONDecodeError, OSError) as e:
    print(f"ERROR: cannot read {settings}: {e}", file=sys.stderr)
    sys.exit(1)

hooks = data.setdefault("hooks", {})
changed = False

# (0) Prune dead claude-memory plugin hooks (see module docstring). Must precede
# the additive pass so shared basenames don't mask a needed install.
DEAD_REF = "plugins/claude-memory/hooks/"
for event in list(hooks.keys()):
    new_groups = []
    removed_any = False
    for g in (hooks.get(event) or []):
        original = g.get("hooks") or []
        kept = [h for h in original if DEAD_REF not in (h.get("command", "") or "")]
        if len(kept) != len(original):
            removed_any = True
        if kept:
            new_groups.append({**g, "hooks": kept})
    if removed_any:
        changed = True
        if new_groups:
            hooks[event] = new_groups
        else:
            del hooks[event]

# (1) Additively wire each homelab hook, if no command already references it.
for event, basename, command, extra in WANT:
    groups = hooks.setdefault(event, [])
    already = any(
        basename in (h.get("command", "") or "")
        for g in groups
        for h in (g.get("hooks", []) or [])
    )
    if already:
        continue
    entry = {"type": "command", "command": command}
    entry.update(extra)
    groups.append({"hooks": [entry]})
    changed = True

if changed:
    tmp = settings + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(data, fh, indent=2)
    os.replace(tmp, settings)
    print(f"wired memory hooks -> {settings}")
else:
    print(f"memory hooks already present -> {settings} (no change)")
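
Review note: the ordering constraint in the docstring (prune before add) is easiest to see on a tiny in-memory example. The sketch below is a simplified model of the two passes, with hypothetical helper names `prune_dead` and `add_if_absent`; unlike the script, it rewrites every event and ignores the `changed` bookkeeping and the extra fields.

```python
DEAD_REF = "plugins/claude-memory/hooks/"


def prune_dead(hooks):
    """Drop hook commands that point at the retired plugin directory."""
    for event in list(hooks):
        groups = []
        for group in hooks[event] or []:
            kept = [h for h in group.get("hooks") or []
                    if DEAD_REF not in (h.get("command") or "")]
            if kept:
                groups.append({**group, "hooks": kept})
        if groups:
            hooks[event] = groups
        else:
            del hooks[event]


def add_if_absent(hooks, event, basename, command):
    """Append a command hook unless some command already mentions basename."""
    groups = hooks.setdefault(event, [])
    present = any(basename in (h.get("command") or "")
                  for g in groups for h in g.get("hooks") or [])
    if not present:
        groups.append({"hooks": [{"type": "command", "command": command}]})
    return not present
```

With a stale plugin entry for `auto-learn.py` still in place, `add_if_absent` returns `False` and installs nothing, because the basename matches the dead path; after `prune_dead` the same call installs the real hook.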

@ -0,0 +1,47 @@
# claude-skills — vendored agent-skill snapshot
Point-in-time snapshot of the admin's (`wizard`) Claude Code agent skills, deployed
per-user by `install_skills()` in `../../t3-provision-users.sh` (scoped to the
`SKILL_USERS` allowlist). Each subdirectory is one skill (`SKILL.md` + any bundled
references). The provisioner copies a skill into `~/.agents/skills/<name>/` (owned by
the user) and symlinks `~/.claude/skills/<name> -> ../../.agents/skills/<name>` — the
layout the `skills` CLI's `-g` install produces; Claude Code reads `~/.claude/skills/`.
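
As a rough illustration of that layout, the copy-then-symlink step might look like the Python sketch below. It is not the provisioner's code (that lives in `t3-provision-users.sh`): `install_skill` is a hypothetical name, ownership changes are omitted, and the if-absent behaviour is assumed from the "Refreshing" notes in this README.

```python
import os
import shutil
from pathlib import Path


def install_skill(src, home, name):
    """Copy one skill into ~/.agents/skills and link it from ~/.claude/skills."""
    dest = Path(home) / ".agents" / "skills" / name
    if not dest.exists():  # if-absent: never overwrite an existing copy
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dest)
    link = Path(home) / ".claude" / "skills" / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink() and not link.exists():
        # Relative target, resolved from ~/.claude/skills/: ../../.agents/skills/<name>
        os.symlink(os.path.join("..", "..", ".agents", "skills", name), link)
    return link
```

Using a relative symlink target keeps the link valid if the home directory is moved or mounted at a different path.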
## Why vendored (not `npx skills add` at provision time)
Upstream drifted from this set: on `mattpocock/skills` master, `diagnose` →
`diagnosing-bugs` and `write-a-skill` → `writing-great-skills` were renamed, and
`caveman` + `zoom-out` are no longer published — so `npx skills` cannot reproduce this
exact set. Vendoring is also offline/deterministic and keeps GitHub-clone +
unpinned-CLI dependencies out of the hourly **root** reconcile.
## Sources
- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills`
- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills`
- **homelab-local, emo-PERSONALIZED**: `cluster-health` here is an
**emo-specific variant**, not a copy of the canonical skill. It started as a
copy of this repo's `.claude/skills/cluster-health/` but was rewritten on
2026-06-26 to focus on ha-sofia + emo's Sofia devices (emo is the only entry
in `SKILL_USERS`, a read-only power-user). The canonical admin skill
(`.claude/skills/cluster-health/`) is the full 47-check version and is left
untouched. **Do NOT `cp -a` the canonical copy over this one** — that would
clobber the personalization. Maintain the two independently.
## Refreshing
Re-snapshot the upstream skills from a current install and commit the diff:
```sh
cp -a ~/.agents/skills/. scripts/workstation/claude-skills/
```
`cluster-health` is hand-maintained (emo variant) — it is **not** covered by the
`cp -a` above and must **not** be overwritten from `.claude/skills/`. Edit it in
place here when emo's needs change, then refresh his live copy (the provisioner's
`install_skills()` is if-absent, so it won't update an existing `~/.agents/skills`
copy — `cp` the new `SKILL.md` to `/home/emo/.agents/skills/cluster-health/` and
`chown emo:emo`, or remove emo's copy and re-run the reconcile).
Snapshot taken 2026-06-23 (upstream); `cluster-health` vendored 2026-06-26,
personalized for emo 2026-06-26.

@ -0,0 +1,49 @@
---
name: caveman
description: >
Ultra-compressed communication mode. Cuts token usage ~75% by dropping
filler, articles, and pleasantries while keeping full technical accuracy.
Use when user says "caveman mode", "talk like caveman", "use caveman",
"less tokens", "be brief", or invokes /caveman.
---
Respond terse like smart caveman. All technical substance stay. Only fluff die.
## Persistence
ACTIVE EVERY RESPONSE once triggered. No revert after many turns. No filler drift. Still active if unsure. Off only when user says "stop caveman" or "normal mode".
## Rules
Drop: articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries (sure/certainly/of course/happy to), hedging. Fragments OK. Short synonyms (big not extensive, fix not "implement a solution for"). Abbreviate common terms (DB/auth/config/req/res/fn/impl). Strip conjunctions. Use arrows for causality (X -> Y). One word when one word enough.
Technical terms stay exact. Code blocks unchanged. Errors quoted exact.
Pattern: `[thing] [action] [reason]. [next step].`
Not: "Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by..."
Yes: "Bug in auth middleware. Token expiry check use `<` not `<=`. Fix:"
### Examples
**"Why React component re-render?"**
> Inline obj prop -> new ref -> re-render. `useMemo`.
**"Explain database connection pooling."**
> Pool = reuse DB conn. Skip handshake -> fast under load.
## Auto-Clarity Exception
Drop caveman temporarily for: security warnings, irreversible action confirmations, multi-step sequences where fragment order risks misread, user asks to clarify or repeats question. Resume caveman after clear part done.
Example -- destructive op:
> **Warning:** This will permanently delete all rows in the `users` table and cannot be undone.
>
> ```sql
> DROP TABLE users;
> ```
>
> Caveman resume. Verify backup exist first.

@ -0,0 +1,146 @@
---
name: cluster-health
description: |
Personalized for emo. Check whether the homelab Kubernetes cluster is
affecting ha-sofia or the Sofia smart-home devices it runs (Tuya devices,
the MPPT ATS, lights, climate, security, irrigation). Use when:
(1) "is ha-sofia ok", "are my devices / the ATS / the lights down",
(2) "is the cluster affecting Sofia / my devices",
(3) "check the cluster", "cluster health", "is everything running",
(4) a device on the Барзини → Статус dashboard looks offline.
Runs the cluster-wide healthcheck read-only and triages it by what
ha-sofia actually depends on; the rest of the cluster is the admin's area.
author: Claude Code
version: 3.0.0-emo
date: 2026-06-26
---
# Cluster Health — personalized for emo (ha-sofia focus)
## What you actually care about
You care about **ha-sofia** and the **Sofia smart-home devices** it runs —
the Tuya devices, the **MPPT ATS**, and the lights / climate / security /
irrigation on your **Барзини → Статус** dashboard. The wider Kubernetes
cluster matters to you **only when it's breaking something ha-sofia or your
devices depend on.** Anything else is the admin's (wizard's) area — note it in
one line and move on; don't chase it.
You have **read-only** cluster access. You can SEE everything but change
nothing — so when something on your chain is broken, the job is to confirm it
and hand it off, not to repair it.
## How ha-sofia depends on the cluster
ha-sofia itself runs at the house (HAOS at https://ha-sofia.viktorbarzin.me) —
**not** in the cluster. The cluster reaches it through exactly two things:
1. **tuya-bridge** (namespace `tuya-bridge`) — the REST API ha-sofia calls for
every Tuya device **and the MPPT ATS**. If it's unhealthy, your Tuya devices
+ ATS stop responding. **This is the #1 thing to check.**
2. **The path that carries ha-sofia ⇄ tuya-bridge and keeps ha-sofia
reachable**: cloudflared (tunnel) → Traefik (LB) → the ingress + TLS cert
for `tuya-bridge.viktorbarzin.me` and `ha-sofia.viktorbarzin.me`, plus
Technitium DNS. If any of these break, ha-sofia can't reach tuya-bridge and
you can't reach ha-sofia remotely.
Everything else in the cluster is unrelated to you unless it's hosting one of
those pods.
## Step 1 — run the healthcheck (read-only, with your HA token)
Your account can't read Vault, so load your own ha-sofia token first (it was
minted for you and lives at `~/.config/cluster-health/haos_token`). Then run
the script from YOUR clone, read-only:
```bash
cd /home/emo/code
export HOME_ASSISTANT_SOFIA_TOKEN="$(cat ~/.config/cluster-health/haos_token)"
bash scripts/cluster_healthcheck.sh --no-fix --quiet
# machine-readable instead:
# bash scripts/cluster_healthcheck.sh --no-fix --quiet --json | tee /tmp/cluster-health.json
```
- **Never pass `--fix`** — it deletes pods (a write); you're read-only and it
will fail.
- Exit codes: `0` healthy, `1` warnings, `2` failures.
With the token exported, the **ha-sofia checks run for you**:
26 Entity Availability · 27 Integration Health · 28 Automation Status ·
29 System Resources · **45 Status Dashboard** — your Барзини → Статус view,
classifying every device tile as OK / ⚠️ / Offline across Сигурност, Мрежа &
IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна. Check 30 also
covers the **tuya** exporter.
## Step 2 — triage the output by relevance to YOU
Read the PASS/WARN/FAIL summary, then split the WARN/FAIL items in two:
- **On your chain → this is what matters.** Anything touching: `tuya-bridge`,
`cloudflared`, `traefik`, DNS (check 21), the TLS cert / ingress for your two
hosts (checks 12, 22, 31, 32), or a **node** hosting those pods — plus all the
**ha-sofia** checks (26-29, 45) and the **tuya** exporter (30).
- **Not on your chain → one line, then drop it.** Summarise as "N unrelated
cluster issues (admin's area)" and don't investigate.
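
If you want to apply that split mechanically, a sketch like the one below works on a list of `(check_number, message)` pairs. The input shape is an assumption (the healthcheck's `--json` format is not described here), and `triage`, `ON_CHAIN_CHECKS` and `ON_CHAIN_NAMES` are hypothetical names; the check numbers are the ones listed above.

```python
# Check numbers this skill places on the ha-sofia chain.
ON_CHAIN_CHECKS = {12, 21, 22, 31, 32, 26, 27, 28, 29, 30, 45}
# Component names that put a finding on the chain regardless of check number.
ON_CHAIN_NAMES = ("tuya-bridge", "cloudflared", "traefik")


def triage(findings):
    """Split findings into on-chain items and a one-line summary of the rest."""
    on_chain = [
        (num, msg) for num, msg in findings
        if num in ON_CHAIN_CHECKS or any(n in msg.lower() for n in ON_CHAIN_NAMES)
    ]
    unrelated = len(findings) - len(on_chain)
    return on_chain, f"{unrelated} unrelated cluster issues (admin's area)"
```

A node hosting one of those pods still needs a manual look, since this sketch only matches on check number and component name.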
## Step 3 — read-only checks for your chain
All of these work with your read-only access:
```bash
# tuya-bridge — your devices + the ATS
kubectl get pods -n tuya-bridge
kubectl rollout status deploy/tuya-bridge -n tuya-bridge
kubectl logs -n tuya-bridge deploy/tuya-bridge --tail=50
# the reachability path ha-sofia uses
kubectl get pods -n cloudflared
kubectl get pods -n traefik
kubectl get ingress -A | grep -Ei 'tuya-bridge|ha-sofia'
# whole external path in one shot (DNS + tunnel + Traefik + cert):
curl -sI --max-time 10 https://tuya-bridge.viktorbarzin.me | head -1
# reachable -> HTTP/2 200 / 401 / 403 (any HTTP response = path is up)
# broken -> curl: timeout / could not resolve host
```
The fastest **device-level** signal is your own dashboard: open
**https://ha-sofia.viktorbarzin.me → Барзини → Статус**. If devices show
Offline / Разкачен / ⚠️ **but tuya-bridge is healthy**, the problem is at the
house (device power / Wi-Fi / the Sofia TP-Link network) — **not** the cluster.
## Step 4 — if something on your chain is broken
You can't fix the cluster (read-only), so **capture + hand off**:
```bash
kubectl describe pod -n tuya-bridge <pod>
kubectl logs -n tuya-bridge <pod> --previous --tail=200
```
Then file it for the admin with the **`/file-issue`** skill — e.g. *"ha-sofia
Tuya devices + ATS unresponsive; tuya-bridge pod CrashLooping"* with the output
above. cloudflared / Traefik / DNS outages are cluster-wide — the admin's
alerting is already firing, but file it so it's tracked from your side too.
## What will skip for you (expected — not failures)
A few checks need access your account doesn't have. They warn/skip — that's
normal, and **none of them are on your ha-sofia chain**:
- **Uptime Kuma (14)** — needs an admin password from Vault.
- **PVE host checks** — 36 (LVM snapshots), 43 (host thermals), 44 (host load),
and the Proxmox CSI ghost-disk check — all need root SSH to the Proxmox host.
- **`--fix`** — pod deletion (a write); not available to you.
(The ha-sofia checks are **not** in this list — your token makes them work.)
## Your ha-sofia token
- Stored at `~/.config/cluster-health/haos_token` (yours, mode 600).
- It's a **dedicated** long-lived token, named `emo-cluster-health` under
ha-sofia → your profile → **Long-Lived Access Tokens**. Revoking it there
affects only you.
- It currently carries admin-level HA scope (Home Assistant only lets a token
be minted for the account that created it, and it was minted via the admin
account). If it ever stops working, tell wizard and a fresh one can be minted.


@ -0,0 +1,117 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---
# Diagnose
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching.
## Phase 1 — Build a feedback loop
**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
### Ways to construct one — try them in roughly this order
1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e.
2. **Curl / HTTP script** against a running dev server.
3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot.
4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network.
5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation.
6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call.
7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs.
10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you.
Build the right feedback loop, and the bug is 90% fixed.
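As one illustration, option 3 (CLI invocation diffed against a known-good snapshot) can be sketched in a few lines of shell. `snapshot_check` and the file names are hypothetical; the sketch assumes the command under test writes its result to stdout.

```bash
# snapshot_check SNAPSHOT CMD [ARGS...]
# Runs CMD and diffs its stdout against the known-good SNAPSHOT file.
# Exit status 0 = output matches; non-zero = output differs (diff is printed).
snapshot_check() (
  snapshot="$1"
  shift
  "$@" | diff -u "$snapshot" -
)

# Demo with a stand-in command (printf) instead of a real CLI.
workdir=$(mktemp -d)
printf 'total=3\n' > "$workdir/known-good.txt"
if snapshot_check "$workdir/known-good.txt" printf 'total=3\n'; then echo "PASS"; else echo "FAIL"; fi
if snapshot_check "$workdir/known-good.txt" printf 'total=4\n'; then echo "PASS"; else echo "FAIL"; fi
rm -rf "$workdir"
```

The exit status is the pass/fail signal, so the same function can be dropped into a bisection or repeat loop unchanged.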
### Iterate on the loop itself
Treat the loop as a product. Once you have _a_ loop, ask:
- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
### Non-deterministic bugs
The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
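To know whether the rate is actually rising, measure it. A minimal sketch, assuming the trigger is a command whose non-zero exit means "bug reproduced"; `repro_rate` is a hypothetical name:

```bash
# repro_rate RUNS CMD [ARGS...]
# Runs CMD RUNS times and prints "failures/runs".
repro_rate() (
  runs="$1"
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs"
)

# Stand-in triggers: `false` always fails, `true` never does.
repro_rate 20 false
repro_rate 20 true
```

Re-run it after each change to the trigger (added stress, injected sleeps) and keep the change only if the numerator goes up.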
### When you genuinely cannot build a loop
Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop.
Do not proceed to Phase 2 until you have a loop you believe in.
## Phase 2 — Reproduce
Run the loop. Watch the bug appear.
Confirm:
- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
Do not proceed until you reproduce the bug.
## Phase 3 — Hypothesise
Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be **falsifiable**: state the prediction it makes.
> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse."
If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
## Phase 4 — Instrument
Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
Tool preference:
1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".
**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
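The "single grep" can be sketched as below. `leftover_debug_logs` is a hypothetical helper, and the tag `DEBUG-a4f2` is just the example prefix from above.

```bash
# leftover_debug_logs DIR TAG
# Lists every file under DIR that still contains the tagged prefix, e.g. [DEBUG-a4f2].
# -F treats the bracketed tag as a fixed string rather than a regex character class.
leftover_debug_logs() {
  grep -rlF "[$2]" "$1" || true   # no matches is a clean result, not an error
}

# Demo on a throwaway directory.
demo=$(mktemp -d)
printf 'echo "[DEBUG-a4f2] cache key lookup"\n' > "$demo/handler.sh"
printf 'echo "regular log line"\n' > "$demo/other.sh"
leftover_debug_logs "$demo" "DEBUG-a4f2"
rm -rf "$demo"
```

An empty result means the instrumentation is gone; anything listed still needs cleaning up.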
**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second.
## Phase 5 — Fix + regression test
Write the regression test **before the fix** — but only if there is a **correct seam** for it.
A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
If a correct seam exists:
1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
## Phase 6 — Cleanup + post-mortem
Required before declaring done:
- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.


@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
# bash hitl-loop.template.sh
#
# Two helpers:
# step "<instruction>" → show instruction, wait for Enter
# capture VAR "<question>" → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.
set -euo pipefail
step() {
  printf '\n>>> %s\n' "$1"
  read -r -p " [Enter when done] " _
}
capture() {
  local var="$1" question="$2" answer
  printf '\n>>> %s\n' "$question"
  read -r -p " > " answer
  printf -v "$var" '%s' "$answer"
}
# --- edit below ---------------------------------------------------------
step "Open the app at http://localhost:3000 and sign in."
capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
capture ERROR_MSG "Paste the error message (or 'none'):"
# --- edit above ---------------------------------------------------------
printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"


@ -0,0 +1,142 @@
---
name: find-skills
description: Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.
---
# Find Skills
This skill helps you discover and install skills from the open agent skills ecosystem.
## When to Use This Skill
Use this skill when the user:
- Asks "how do I do X" where X might be a common task with an existing skill
- Says "find a skill for X" or "is there a skill for X"
- Asks "can you do X" where X is a specialized capability
- Expresses interest in extending agent capabilities
- Wants to search for tools, templates, or workflows
- Mentions they wish they had help with a specific domain (design, testing, deployment, etc.)
## What is the Skills CLI?
The Skills CLI (`npx skills`) is the package manager for the open agent skills ecosystem. Skills are modular packages that extend agent capabilities with specialized knowledge, workflows, and tools.
**Key commands:**
- `npx skills find [query]` - Search for skills interactively or by keyword
- `npx skills add <package>` - Install a skill from GitHub or other sources
- `npx skills check` - Check for skill updates
- `npx skills update` - Update all installed skills
**Browse skills at:** https://skills.sh/
## How to Help Users Find Skills
### Step 1: Understand What They Need
When a user asks for help with something, identify:
1. The domain (e.g., React, testing, design, deployment)
2. The specific task (e.g., writing tests, creating animations, reviewing PRs)
3. Whether this is a common enough task that a skill likely exists
### Step 2: Check the Leaderboard First
Before running a CLI search, check the [skills.sh leaderboard](https://skills.sh/) to see if a well-known skill already exists for the domain. The leaderboard ranks skills by total installs, surfacing the most popular and battle-tested options.
For example, top skills for web development include:
- `vercel-labs/agent-skills` — React, Next.js, web design (100K+ installs each)
- `anthropics/skills` — Frontend design, document processing (100K+ installs)
### Step 3: Search for Skills
If the leaderboard doesn't cover the user's need, run the find command:
```bash
npx skills find [query]
```
For example:
- User asks "how do I make my React app faster?" → `npx skills find react performance`
- User asks "can you help me with PR reviews?" → `npx skills find pr review`
- User asks "I need to create a changelog" → `npx skills find changelog`
### Step 4: Verify Quality Before Recommending
**Do not recommend a skill based solely on search results.** Always verify:
1. **Install count** — Prefer skills with 1K+ installs. Be cautious with anything under 100.
2. **Source reputation** — Official sources (`vercel-labs`, `anthropics`, `microsoft`) are more trustworthy than unknown authors.
3. **GitHub stars** — Check the source repository. A skill from a repo with <100 stars should be treated with skepticism.
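The install-count rule of thumb can be written down as a small check. This is a sketch only: `install_signal` is a hypothetical name, and the "neutral" band between 100 and 999 installs is an assumption, since the guidance above only names the two ends.

```bash
# install_signal COUNT -> prefer | neutral | caution
# Thresholds follow the guidance above: 1K+ installs preferred, under 100 treated with caution.
# The middle band ("neutral") is an assumption, not something the guidance defines.
install_signal() {
  if [ "$1" -ge 1000 ]; then
    echo "prefer"
  elif [ "$1" -lt 100 ]; then
    echo "caution"
  else
    echo "neutral"
  fi
}

install_signal 185000
install_signal 42
```

Install count is only one of the three signals, so treat the result as an input to your judgement alongside source reputation and repository stars.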
### Step 5: Present Options to the User
When you find relevant skills, present them to the user with:
1. The skill name and what it does
2. The install count and source
3. The install command they can run
4. A link to learn more at skills.sh
Example response:
```
I found a skill that might help! The "react-best-practices" skill provides
React and Next.js performance optimization guidelines from Vercel Engineering.
(185K installs)
To install it:
npx skills add vercel-labs/agent-skills@react-best-practices
Learn more: https://skills.sh/vercel-labs/agent-skills/react-best-practices
```
### Step 6: Offer to Install
If the user wants to proceed, you can install the skill for them:
```bash
npx skills add <owner/repo@skill> -g -y
```
The `-g` flag installs globally (user-level) and `-y` skips confirmation prompts.
## Common Skill Categories
When searching, consider these common categories:
| Category | Example Queries |
| --------------- | ---------------------------------------- |
| Web Development | react, nextjs, typescript, css, tailwind |
| Testing | testing, jest, playwright, e2e |
| DevOps | deploy, docker, kubernetes, ci-cd |
| Documentation | docs, readme, changelog, api-docs |
| Code Quality | review, lint, refactor, best-practices |
| Design | ui, ux, design-system, accessibility |
| Productivity | workflow, automation, git |
## Tips for Effective Searches
1. **Use specific keywords**: "react testing" is better than just "testing"
2. **Try alternative terms**: If "deploy" doesn't work, try "deployment" or "ci-cd"
3. **Check popular sources**: Many skills come from `vercel-labs/agent-skills` or `ComposioHQ/awesome-claude-skills`
## When No Skills Are Found
If no relevant skills exist:
1. Acknowledge that no existing skill was found
2. Offer to help with the task directly using your general capabilities
3. Suggest the user could create their own skill with `npx skills init`
Example:
```
I searched for skills related to "xyz" but didn't find any matches.
I can still help you with this task directly! Would you like me to proceed?
If this is something you do often, you could create your own skill:
npx skills init my-xyz-skill
```


@ -0,0 +1,10 @@
---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time.
If a question can be answered by exploring the codebase, explore the codebase instead.


@ -0,0 +1,47 @@
# ADR Format
ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc.
Create the `docs/adr/` directory lazily — only when the first ADR is needed.
## Template
```md
# {Short title of the decision}
{1-3 sentences: what's the context, what did we decide, and why.}
```
That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections.
## Optional sections
Only include these when they add genuine value. Most ADRs won't need them.
- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited
- **Considered Options** — only when the rejected alternatives are worth remembering
- **Consequences** — only when non-obvious downstream effects need to be called out
## Numbering
Scan `docs/adr/` for the highest existing number and increment by one.
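A minimal sketch of that scan, assuming file names follow the `NNNN-slug.md` pattern described above; `next_adr_number` is a hypothetical helper name.

```bash
# next_adr_number DIR
# Prints the next zero-padded ADR number for DIR (0001 when DIR is empty or missing).
next_adr_number() (
  dir="$1"
  highest=0
  for path in "$dir"/[0-9][0-9][0-9][0-9]-*.md; do
    [ -e "$path" ] || continue       # the glob matched nothing
    name=${path##*/}                 # strip the directory part
    number=${name%%-*}               # keep the NNNN prefix
    number=$(printf '%s' "$number" | sed 's/^0*//')  # drop leading zeros so 0008 is not parsed as octal
    [ -n "$number" ] || number=0
    if [ "$number" -gt "$highest" ]; then
      highest=$number
    fi
  done
  printf '%04d\n' "$((highest + 1))"
)

# Demo on a throwaway directory.
adr_dir=$(mktemp -d)
touch "$adr_dir/0001-event-sourced-orders.md" "$adr_dir/0002-postgres-for-write-model.md"
next_adr_number "$adr_dir"
rm -rf "$adr_dir"
```

Because a missing directory yields `0001`, the same call works before `docs/adr/` has been created.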
## When to offer an ADR
All three of these must be true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing."
### What qualifies
- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres."
- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP."
- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out.
- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit no-s are as valuable as the yes-s.
- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate.
- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract."
- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months.


@ -0,0 +1,60 @@
# CONTEXT.md Format
## Structure
```md
# {Context Name}
{One or two sentence description of what this context is and why it exists.}
## Language
**Order**:
{A one or two sentence description of the term}
_Avoid_: Purchase, transaction
**Invoice**:
A request for payment sent to a customer after delivery.
_Avoid_: Bill, payment request
**Customer**:
A person or organization that places orders.
_Avoid_: Client, buyer, account
```
## Rules
- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others under `_Avoid_`.
- **Keep definitions tight.** One or two sentences max. Define what it IS, not what it does.
- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs.
- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine.
## Single vs multi-context repos
**Single context (most repos):** One `CONTEXT.md` at the repo root.
**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other:
```md
# Context Map
## Contexts
- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders
- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments
- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping
## Relationships
- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking
- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices
- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money`
```
The skill infers which structure applies:
- If `CONTEXT-MAP.md` exists, read it to find contexts
- If only a root `CONTEXT.md` exists, single context
- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved
When multiple contexts exist, infer which one the current topic relates to. If unclear, ask.
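The inference rules above amount to two file checks. A sketch, with `context_layout` as a hypothetical helper name and the three return labels chosen for illustration:

```bash
# context_layout REPO_ROOT -> multi | single | none
# CONTEXT-MAP.md wins when both files exist, matching the order of the rules above.
context_layout() {
  if [ -f "$1/CONTEXT-MAP.md" ]; then
    echo "multi"    # read the map to find each context
  elif [ -f "$1/CONTEXT.md" ]; then
    echo "single"   # one glossary at the repo root
  else
    echo "none"     # create CONTEXT.md lazily when the first term is resolved
  fi
}

# Demo on a throwaway directory.
repo=$(mktemp -d)
context_layout "$repo"
touch "$repo/CONTEXT.md"
context_layout "$repo"
rm -rf "$repo"
```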


@ -0,0 +1,88 @@
---
name: grill-with-docs
description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions.
---
<what-to-do>
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
Ask the questions one at a time, waiting for feedback on each question before continuing.
If a question can be answered by exploring the codebase, explore the codebase instead.
</what-to-do>
<supporting-info>
## Domain awareness
During codebase exploration, also look for existing documentation:
### File structure
Most repos have a single context:
```
/
├── CONTEXT.md
├── docs/
│   └── adr/
│       ├── 0001-event-sourced-orders.md
│       └── 0002-postgres-for-write-model.md
└── src/
```
If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives:
```
/
├── CONTEXT-MAP.md
├── docs/
│   └── adr/                  ← system-wide decisions
├── src/
│   ├── ordering/
│   │   ├── CONTEXT.md
│   │   └── docs/adr/         ← context-specific decisions
│   └── billing/
│       ├── CONTEXT.md
│       └── docs/adr/
```
Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed.
## During the session
### Challenge against the glossary
When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?"
### Sharpen fuzzy language
When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things."
### Discuss concrete scenarios
When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts.
### Cross-reference with code
When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?"
### Update CONTEXT.md inline
When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md).
`CONTEXT.md` should be totally devoid of implementation details. Do not treat `CONTEXT.md` as a spec, a scratch pad, or a repository for implementation decisions. It is a glossary and nothing else.
### Offer ADRs sparingly
Only offer to create an ADR when all three are true:
1. **Hard to reverse** — the cost of changing your mind later is meaningful
2. **Surprising without context** — a future reader will wonder "why did they do it this way?"
3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons
If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md).
</supporting-info>


@ -0,0 +1,13 @@
---
name: handoff
description: Compact the current conversation into a handoff document for another agent to pick up.
argument-hint: "What will the next session be used for?"
---
Write a handoff document summarising the current conversation so a fresh agent can continue the work. Save it to a path produced by `mktemp -t handoff-XXXXXX.md` (read the file before you write to it).
Suggest the skills to be used, if any, by the next session.
Do not duplicate content already captured in other artifacts (PRDs, plans, ADRs, issues, commits, diffs). Reference them by path or URL instead.
If the user passed arguments, treat them as a description of what the next session will focus on and tailor the doc accordingly.


@ -0,0 +1,37 @@
# Deepening
How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**.
## Dependency categories
When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam.
### 1. In-process
Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed.
### 2. Local-substitutable
Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface.
### 3. Remote but owned (Ports & Adapters)
Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter.
Recommendation shape: *"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."*
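The shape can be sketched even in shell, where a "port" is simply a calling convention and each adapter is a function that honours it. Everything named here is hypothetical: the pricing example, the function names, and the `pricing.internal.example` URL are illustrations, not part of any real system.

```bash
# Port (a convention, not a language feature): any command that takes a SKU
# and prints its unit price in cents.

# Production adapter: HTTP transport to a hypothetical internal endpoint.
# Defined for illustration only; the demo below never calls it.
pricing_http() {
  curl -fsS "https://pricing.internal.example/price/$1"
}

# Test adapter: in-memory lookup, no network.
pricing_in_memory() {
  case "$1" in
    widget) echo 250 ;;
    gadget) echo 1000 ;;
    *)      return 1 ;;
  esac
}

# Deep module: owns the logic and receives the adapter as its first argument.
order_total() {
  adapter="$1" sku="$2" quantity="$3"
  unit_price=$("$adapter" "$sku") || return 1
  echo $((unit_price * quantity))
}

order_total pricing_in_memory widget 3
```

Two adapters exist (HTTP for production, in-memory for tests), so the seam is real rather than hypothetical, and the test exercises `order_total` only through its interface.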
### 4. True external (Mock)
Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter.
## Seam discipline
- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection.
- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them.
## Testing strategy: replace, don't layer
- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them.
- Write new tests at the deepened module's interface. The **interface is the test surface**.
- Tests assert on observable outcomes through the interface, not internal state.
- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface.


@ -0,0 +1,123 @@
# HTML Report Format
The architectural review is rendered as a single self-contained HTML file in the OS temp directory. Tailwind and Mermaid both come from CDNs. Mermaid handles graph-shaped diagrams reliably; hand-built divs and inline SVG handle the more editorial visuals (mass diagrams, cross-sections). Mix the two — don't lean on Mermaid for everything, it'll start to look generic.
## Scaffold
```html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Architecture review — {{repo name}}</title>
<script src="https://cdn.tailwindcss.com"></script>
<script type="module">
import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs";
mermaid.initialize({ startOnLoad: true, theme: "neutral", securityLevel: "loose" });
</script>
<style>
/* small custom layer for things Tailwind doesn't cover cleanly:
dashed seam lines, hand-drawn-feeling arrow heads, etc. */
.seam { stroke-dasharray: 4 4; }
.leak { stroke: #dc2626; }
.deep { background: linear-gradient(135deg, #0f172a, #1e293b); }
</style>
</head>
<body class="bg-stone-50 text-slate-900 font-sans">
<main class="max-w-5xl mx-auto px-6 py-12 space-y-12">
<header>...</header>
<section id="candidates" class="space-y-10">...</section>
<section id="top-recommendation">...</section>
</main>
</body>
</html>
```
## Header
Repo name, date, and a compact legend: solid box = module, dashed line = seam, red arrow = leakage, thick dark box = deep module. No introduction paragraph — straight into the candidates.
## Candidate card
The diagrams carry the weight. Prose is sparse, plain, and uses the glossary terms ([LANGUAGE.md](LANGUAGE.md)) without ceremony.
Each candidate is one `<article>`:
- **Title** — short, names the deepening (e.g. "Collapse the Order intake pipeline").
- **Badge row** — recommendation strength (`Strong` = emerald, `Worth exploring` = amber, `Speculative` = slate), plus a tag for the dependency category (`in-process`, `local-substitutable`, `ports & adapters`, `mock`).
- **Files** — monospaced list, `font-mono text-sm`.
- **Before / After diagram** — the centrepiece. Two columns, side by side. See patterns below.
- **Problem** — one sentence. What hurts.
- **Solution** — one sentence. What changes.
- **Wins** — bullets, ≤6 words each. e.g. "Tests hit one interface", "Pricing logic stops leaking", "Delete 4 shallow wrappers".
- **ADR callout** (if applicable) — one line in an amber-tinted box.
No paragraphs of explanation. If the diagram needs a paragraph to be understood, redraw the diagram.
## Diagram patterns
Pick the pattern that fits the candidate. Mix them. Don't make every diagram look the same — variety is part of the point.
### Mermaid graph (the workhorse for dependencies / call flow)
Use a Mermaid `flowchart` or `graph` when the point is "X calls Y calls Z, and look at the mess." Wrap it in a Tailwind-styled card so it doesn't feel parachuted in. Style with classDef to colour leakage edges red and the deep module dark. Sequence diagrams work well for "before: 6 round-trips; after: 1."
```html
<div class="rounded-lg border border-slate-200 bg-white p-4">
<pre class="mermaid">
flowchart LR
A[OrderHandler] --> B[OrderValidator]
B --> C[OrderRepo]
C -.leak.-> D[PricingClient]
classDef leak stroke:#dc2626,stroke-width:2px;
class C,D leak
</pre>
</div>
```
### Hand-built boxes-and-arrows (when Mermaid's layout fights you)
Modules as `<div>`s with borders and labels. Arrows as inline SVG `<line>` or `<path>` elements positioned absolutely over a relative container. Reach for this when you want the "after" diagram to feel like one thick-bordered deep module with greyed-out internals — Mermaid won't render that with the right weight.
### Cross-section (good for layered shallowness)
Stack horizontal bands (`h-12 border-l-4`) to show layers a call passes through. Before: 6 thin layers each doing nothing. After: 1 thick band labelled with the consolidated responsibility.
### Mass diagram (good for "interface as wide as implementation")
Two rectangles per module — one for interface surface area, one for implementation. Before: interface rectangle is nearly as tall as the implementation rectangle (shallow). After: interface rectangle is short, implementation rectangle is tall (deep).
### Call-graph collapse
Before: a tree of function calls rendered as nested boxes. After: the same tree collapsed into one box, with the now-internal calls shown faded inside it.
## Style guidance
- Lean editorial, not corporate-dashboard. Generous whitespace. Serif optional for headings (`font-serif` works well with stone/slate).
- Colour sparingly: one accent (emerald or indigo) plus red for leakage and amber for warnings.
- Keep diagrams ~320px tall so before/after sits comfortably side by side without scrolling.
- Use `text-xs uppercase tracking-wider` for module labels inside diagrams — they should read as schematic, not as UI.
- The only scripts are the Tailwind CDN and the Mermaid ESM import. The report is otherwise static — no app code, no interactivity beyond Mermaid's own rendering.
## Top recommendation section
One larger card. Candidate name, one sentence on why, anchor link to its card. That's it.
## Tone
Plain English, concise — but the architectural nouns and verbs come straight from [LANGUAGE.md](LANGUAGE.md). Concision is not an excuse to drift.
**Use exactly:** module, interface, implementation, depth, deep, shallow, seam, adapter, leverage, locality.
**Never substitute:** component, service, unit (for module) · API, signature (for interface) · boundary (for seam) · layer, wrapper (for module, when you mean module).
**Phrasings that fit the style:**
- "Order intake module is shallow — interface nearly matches the implementation."
- "Pricing leaks across the seam."
- "Deepen: one interface, one place to test."
- "Two adapters justify the seam: HTTP in prod, in-memory in tests."
**Wins bullets** name the gain in glossary terms: *"locality: bugs concentrate in one module"*, *"leverage: one interface, N call sites"*, *"interface shrinks; implementation absorbs the wrappers"*. Don't write *"easier to maintain"* or *"cleaner code"* — those terms aren't in the glossary and don't earn their place.
No hedging, no throat-clearing, no "it's worth noting that…". If a sentence could be a bullet, make it a bullet. If a bullet could be cut, cut it. If a term isn't in [LANGUAGE.md](LANGUAGE.md), reach for one that is before inventing a new one.

@ -0,0 +1,44 @@
# Interface Design
When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best.
Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**.
## Process
### 1. Frame the problem space
Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate:
- The constraints any new interface would need to satisfy
- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md))
- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete
Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel.
### 2. Spawn sub-agents
Spawn 3+ sub-agents in parallel using the Agent tool. Each must produce a **radically different** interface for the deepened module.
Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint:
- Agent 1: "Minimize the interface — aim for 1-3 entry points max. Maximise leverage per entry point."
- Agent 2: "Maximise flexibility — support many use cases and extension."
- Agent 3: "Optimise for the most common caller — make the default case trivial."
- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies."
Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and CONTEXT.md vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language.
Each sub-agent outputs:
1. Interface (types, methods, params — plus invariants, ordering, error modes)
2. Usage example showing how callers use it
3. What the implementation hides behind the seam
4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md))
5. Trade-offs — where leverage is high, where it's thin
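The brief assembly in Step 2 can be sketched as plain data. This is only an illustration: `SubAgentBrief`, `buildBriefs`, and the field names are hypothetical, and the constraint strings paraphrase the list above rather than quote it.

```ts
// Hypothetical shapes; the skill does not prescribe these names.
interface SubAgentBrief {
  agent: number;
  constraint: string; // the design constraint unique to this agent
  technicalBrief: string; // shared: file paths, coupling, dependency category
  vocabulary: string[]; // LANGUAGE.md terms plus CONTEXT.md domain terms
}

const BASE_CONSTRAINTS: string[] = [
  "Minimise the interface; maximise leverage per entry point.",
  "Maximise flexibility; support many use cases and extension.",
  "Optimise for the most common caller; make the default case trivial.",
];

const PORTS_AND_ADAPTERS =
  "Design around ports & adapters for cross-seam dependencies.";

// Every agent gets the same technical brief and vocabulary, but a different constraint.
function buildBriefs(
  technicalBrief: string,
  vocabulary: string[],
  crossSeamDependencies: boolean
): SubAgentBrief[] {
  const constraints = crossSeamDependencies
    ? BASE_CONSTRAINTS.concat([PORTS_AND_ADAPTERS])
    : BASE_CONSTRAINTS;
  return constraints.map((constraint, i) => ({
    agent: i + 1,
    constraint,
    technicalBrief,
    vocabulary,
  }));
}
```

The fourth brief only appears when the candidate has cross-seam dependencies, matching the "if applicable" agent.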
### 3. Present and compare
Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**.
After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu.

@ -0,0 +1,53 @@
# Language
Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point.
## Terms
**Module**
Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice.
_Avoid_: unit, component, service.
**Interface**
Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics.
_Avoid_: API, signature (too narrow — those refer only to the type-level surface).
**Implementation**
What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise.
**Depth**
Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation.
**Seam** _(from Michael Feathers)_
A place where you can alter behaviour without editing in that place. The *location* at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it.
_Avoid_: boundary (overloaded with DDD's bounded context).
**Adapter**
A concrete thing that satisfies an interface at a seam. Describes *role* (what slot it fills), not substance (what's inside).
**Leverage**
What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests.
**Locality**
What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere.
## Principles
- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface.
- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep.
- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it.
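A minimal TypeScript sketch of the vocabulary, assuming a made-up order-pricing domain. Every name here (`PricingPort`, `inMemoryPricing`, `flatRatePricing`, `quoteOrder`) is hypothetical and exists only to show a seam with two adapters and a deep module behind a one-function interface.

```ts
// Hypothetical names throughout; a sketch of the terms, not a prescribed design.

// The seam: the pricing dependency is reached only through this interface.
interface PricingPort {
  unitPrice(sku: string): number;
}

// Adapter 1: an in-memory table, the kind used in tests.
function inMemoryPricing(prices: Record<string, number>): PricingPort {
  return {
    unitPrice(sku: string): number {
      const price = prices[sku];
      if (price === undefined) throw new Error("unknown sku: " + sku);
      return price;
    },
  };
}

// Adapter 2: a second, different adapter. Two adapters make the seam real.
function flatRatePricing(rate: number): PricingPort {
  return {
    unitPrice(): number {
      return rate;
    },
  };
}

type OrderLine = { sku: string; quantity: number };

// Deep module: one entry point; validation, pricing and totalling sit behind it.
function quoteOrder(lines: OrderLine[], pricing: PricingPort): number {
  if (lines.length === 0) throw new Error("order has no lines");
  let total = 0;
  for (const line of lines) {
    if (line.quantity <= 0 || Math.floor(line.quantity) !== line.quantity) {
      throw new Error("invalid quantity for " + line.sku);
    }
    total += pricing.unitPrice(line.sku) * line.quantity;
  }
  return total;
}
```

Callers and tests both go through `quoteOrder`, so the interface is also the test surface. Deleting `quoteOrder` would push the validation and totalling loop back into every caller, which is the deletion test's sign of a module earning its keep.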
## Relationships
- A **Module** has exactly one **Interface** (the surface it presents to callers and tests).
- **Depth** is a property of a **Module**, measured against its **Interface**.
- A **Seam** is where a **Module**'s **Interface** lives.
- An **Adapter** sits at a **Seam** and satisfies the **Interface**.
- **Depth** produces **Leverage** for callers and **Locality** for maintainers.
## Rejected framings
- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead.
- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know.
- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**.

@ -0,0 +1,81 @@
---
name: improve-codebase-architecture
description: Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.
---
# Improve Codebase Architecture
Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability.
## Glossary
Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md).
- **Module** — anything with an interface and an implementation (function, class, package, slice).
- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature.
- **Implementation** — the code inside.
- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation.
- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.")
- **Adapter** — a concrete thing satisfying an interface at a seam.
- **Leverage** — what callers get from depth.
- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place.
Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list):
- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep.
- **The interface is the test surface.**
- **One adapter = hypothetical seam. Two adapters = real seam.**
This skill is _informed_ by the project's domain model. The domain language gives names to good seams; ADRs record decisions the skill should not re-litigate.
## Process
### 1. Explore
Read the project's domain glossary and any ADRs in the area you're touching first.
Then use the Agent tool with `subagent_type=Explore` to walk the codebase. Don't follow rigid heuristics — explore organically and note where you experience friction:
- Where does understanding one concept require bouncing between many small modules?
- Where are modules **shallow** — interface nearly as complex as the implementation?
- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)?
- Where do tightly-coupled modules leak across their seams?
- Which parts of the codebase are untested, or hard to test through their current interface?
Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want.
### 2. Present candidates as an HTML report
Write a self-contained HTML file to the OS temp directory so nothing lands in the repo. Resolve the temp dir from `$TMPDIR`, falling back to `/tmp` (or `%TEMP%` on Windows), and write to `<tmpdir>/architecture-review-<timestamp>.html` so each run gets a fresh file. Open it for the user — `xdg-open <path>` on Linux, `open <path>` on macOS, `start <path>` on Windows — and tell them the absolute path.
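The path resolution above can be sketched as a pure function. The names `resolveReportPath` and `openCommand` are hypothetical, the timestamp format is an assumption (any filesystem-safe stamp would do), and the platform strings follow Node's `process.platform` values.

```ts
// Hypothetical helpers; pure so they can be checked without touching the filesystem.
type Env = Record<string, string | undefined>;

function resolveReportPath(env: Env, platform: string, now: Date): string {
  const isWindows = platform === "win32";
  // $TMPDIR first, then %TEMP% on Windows or /tmp elsewhere.
  const fallback = isWindows ? env["TEMP"] : "/tmp";
  const tmpDir = env["TMPDIR"] || fallback;
  if (!tmpDir) throw new Error("no temp directory found");
  const separator = isWindows ? "\\" : "/";
  // Assumption: an ISO timestamp with ':' and '.' replaced keeps the name filesystem-safe.
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  const base = tmpDir.replace(/[\\\/]+$/, ""); // drop any trailing separator
  return base + separator + "architecture-review-" + stamp + ".html";
}

// Which command opens the report on each OS.
function openCommand(platform: string): string {
  if (platform === "darwin") return "open";
  if (platform === "win32") return "start";
  return "xdg-open";
}
```

In a Node script the inputs would typically be `process.env`, `process.platform`, and `new Date()`.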
The report uses **Tailwind via CDN** for layout and styling, and **Mermaid via CDN** for diagrams where a graph/flow/sequence reliably communicates the structure. Mix Mermaid with hand-crafted CSS/SVG visuals — use Mermaid when relationships are graph-shaped (call graphs, dependencies, sequences), and hand-built divs/SVG when you want something more editorial (mass diagrams, cross-sections, collapse animations). Each candidate gets a **before/after visualisation**. Be visual.
Render each candidate as a card with these fields:
- **Files** — which files/modules are involved
- **Problem** — why the current architecture is causing friction
- **Solution** — plain English description of what would change
- **Benefits** — explained in terms of locality and leverage, and how tests would improve
- **Before / After diagram** — side-by-side, custom-drawn, illustrating the shallowness and the deepening
- **Recommendation strength** — one of `Strong`, `Worth exploring`, `Speculative`, rendered as a badge
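One way to keep the cards consistent is to model the fields above as a type before rendering. This is a sketch: `Candidate`, `renderBadge`, and `renderCardHeader` are hypothetical names, and the Tailwind classes are an assumed palette, not the scaffold from HTML-REPORT.md.

```ts
// Hypothetical card model; field names mirror the bullets above.
type Strength = "Strong" | "Worth exploring" | "Speculative";

interface Candidate {
  name: string;
  files: string[];
  problem: string;
  solution: string;
  benefits: string[];
  beforeDiagram: string; // HTML/SVG or Mermaid fragment, generated by the report itself
  afterDiagram: string;
  strength: Strength;
}

// Assumed palette: one accent for the strongest badge, neutral tones for the rest.
const BADGE_CLASSES: Record<Strength, string> = {
  "Strong": "bg-emerald-100 text-emerald-800",
  "Worth exploring": "bg-slate-100 text-slate-700",
  "Speculative": "bg-white text-slate-500 border border-slate-200",
};

// Escape free text before it is placed in the HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderBadge(strength: Strength): string {
  return (
    '<span class="rounded-full px-2 py-0.5 text-xs ' +
    BADGE_CLASSES[strength] +
    '">' + strength + "</span>"
  );
}

function renderCardHeader(candidate: Candidate): string {
  return (
    '<header class="flex items-center justify-between">' +
    '<h2 class="font-serif text-xl">' + escapeHtml(candidate.name) + "</h2>" +
    renderBadge(candidate.strength) +
    "</header>"
  );
}
```

Restricting `strength` to a union type means a card cannot carry a badge outside the three allowed values.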
End the report with a **Top recommendation** section: which candidate you'd tackle first and why.
**Use CONTEXT.md vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If `CONTEXT.md` defines "Order," talk about "the Order intake module" — not "the FooBarHandler," and not "the Order service."
**ADR conflicts**: if a candidate contradicts an existing ADR, only surface it when the friction is real enough to warrant revisiting the ADR. Mark it clearly in the card (e.g. a warning callout: _"contradicts ADR-0007 — but worth reopening because…"_). Don't list every theoretical refactor an ADR forbids.
See [HTML-REPORT.md](HTML-REPORT.md) for the full HTML scaffold, diagram patterns, and styling guidance.
Do NOT propose interfaces yet. After the file is written, ask the user: "Which of these would you like to explore?"
### 3. Grilling loop
Once the user picks a candidate, drop into a grilling conversation. Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive.
Side effects happen inline as decisions crystallize:
- **Naming a deepened module after a concept not in `CONTEXT.md`?** Add the term to `CONTEXT.md` — same discipline as `/grill-with-docs` (see [CONTEXT-FORMAT.md](../grill-with-docs/CONTEXT-FORMAT.md)). Create the file lazily if it doesn't exist.
- **Sharpening a fuzzy term during the conversation?** Update `CONTEXT.md` right there.
- **User rejects the candidate with a load-bearing reason?** Offer an ADR, framed as: _"Want me to record this as an ADR so future architecture reviews don't re-suggest it?"_ Only offer when the reason would actually be needed by a future explorer to avoid re-suggesting the same thing — skip ephemeral reasons ("not worth it right now") and self-evident ones. See [ADR-FORMAT.md](../grill-with-docs/ADR-FORMAT.md).
- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md).

@ -0,0 +1,79 @@
# Logic Prototype
A tiny interactive terminal app that lets the user drive a state model by hand. Use this when the question is about **business logic, state transitions, or data shape** — the kind of thing that looks reasonable on paper but only feels wrong once you push it through real cases.
## When this is the right shape
- "I'm not sure if this state machine handles the edge case where X then Y."
- "Does this data model actually let me represent the case where..."
- "I want to feel out what the API should look like before writing it."
- Anything where the user wants to **press buttons and watch state change**.
If the question is "what should this look like" — wrong branch. Use [UI.md](UI.md).
## Process
### 1. State the question
Before writing code, write down what state model and what question you're prototyping. One paragraph, in the prototype's README or a comment at the top of the file. A logic prototype that answers the wrong question is pure waste — make the question explicit so it can be checked later, whether the user is watching now or returning to it AFK.
### 2. Pick the language
Use whatever the host project uses. If the project has no obvious runtime (e.g. a docs repo), ask.
Match the project's existing conventions for tooling — don't add a new package manager or runtime just for the prototype.
### 3. Isolate the logic in a portable module
Put the actual logic — the bit that's answering the question — behind a small, pure interface that could be lifted out and dropped into the real codebase later. The TUI around it is throwaway; the logic module shouldn't be.
The right shape depends on the question:
- **A pure reducer** `(state, action) => state`. Good when actions are discrete events and state is a single value.
- **A state machine** — explicit states and transitions. Good when "which actions are even legal right now" is part of the question.
- **A small set of pure functions** over a plain data type. Good when there's no implicit current state — just transformations.
- **A class or module with a clear method surface** when the logic genuinely owns ongoing internal state.
Pick whichever shape best fits the question being asked, *not* whichever is easiest to wire to a TUI. Keep it pure: no I/O, no terminal code, no `console.log` for control flow. The TUI imports it and calls into it; nothing flows the other direction.
This is what makes the prototype useful past its own lifetime. When the question's been answered, the validated reducer / machine / function set can be lifted into the real module — the TUI shell gets deleted.
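As an illustration of the first shape, here is a minimal pure reducer in TypeScript. The domain (a user list with a clock) and the names `State`, `Action`, `reduce`, and `initialState` are hypothetical; the real module's shape depends on the question being prototyped.

```ts
// Hypothetical example domain: a user list plus a clock.
type State = { users: string[]; clock: number };

type Action =
  | { type: "addUser"; name: string }
  | { type: "deleteUser"; name: string }
  | { type: "tick" };

const initialState: State = { users: [], clock: 0 };

// Pure: no I/O, no terminal code, and the input state is never modified.
function reduce(state: State, action: Action): State {
  switch (action.type) {
    case "addUser":
      // Adding an existing user is a no-op.
      return state.users.indexOf(action.name) !== -1
        ? state
        : { users: state.users.concat([action.name]), clock: state.clock };
    case "deleteUser":
      return {
        users: state.users.filter((user) => user !== action.name),
        clock: state.clock,
      };
    case "tick":
      return { users: state.users, clock: state.clock + 1 };
  }
}
```

Because `reduce` knows nothing about the terminal, it can be copied into the real module once the question is answered, and the TUI shell can be deleted without touching it.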
### 4. Build the smallest TUI that exposes the state
Build it as a **lightweight TUI** — on every tick, clear the screen (`console.clear()` / `print("\033[2J\033[H")` / equivalent) and re-render the whole frame. The user should always see one stable view, not an ever-growing scrollback.
Each frame has two parts, in this order:
1. **Current state**, pretty-printed and diff-friendly (one field per line, or formatted JSON). Use **bold** for field names or section headers and **dim** for less important context (timestamps, IDs, derived values). Native ANSI escape codes are fine — `\x1b[1m` bold, `\x1b[2m` dim, `\x1b[0m` reset. No need to pull in a styling library unless one is already in the project.
2. **Keyboard shortcuts**, listed at the bottom: `[a] add user [d] delete user [t] tick clock [q] quit`. Bold the key, dim the description, or vice-versa — whatever reads cleanly.
Behaviour:
1. **Initialise state** — a single in-memory object/struct. Render the first frame on start.
2. **Read one keystroke (or one line)** at a time and dispatch to a handler that updates state through the logic module.
3. **Re-render** the full frame after every action — don't append, replace.
4. **Loop until quit.**
The whole frame should fit on one screen.
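The frame layout can be sketched as a pure string builder. `renderFrame` and `Shortcut` are hypothetical names; the keystroke loop is left out because it depends on the host runtime.

```ts
// ANSI escape codes: bold, dim, reset, and clear-screen-and-home.
const BOLD = "\x1b[1m";
const DIM = "\x1b[2m";
const RESET = "\x1b[0m";
const CLEAR = "\x1b[2J\x1b[H";

type Shortcut = { key: string; label: string };

// Builds one full frame: state first (one field per line), shortcuts at the bottom.
function renderFrame(state: Record<string, unknown>, shortcuts: Shortcut[]): string {
  const stateLines = Object.keys(state).map(
    (field) => BOLD + field + RESET + ": " + JSON.stringify(state[field])
  );
  const shortcutBar = shortcuts
    .map((s) => BOLD + "[" + s.key + "]" + RESET + " " + DIM + s.label + RESET)
    .join("  ");
  // Leading CLEAR means writing the frame replaces the previous one.
  return CLEAR + stateLines.join("\n") + "\n\n" + shortcutBar + "\n";
}
```

Writing the returned string to stdout after every action gives the replace-don't-append behaviour. Keeping the renderer in the TUI shell, not in the logic module, keeps escape codes out of the portable part.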
### 5. Make it runnable in one command
Add a script to the project's existing task runner (`package.json` scripts, `Makefile`, `justfile`, `pyproject.toml`). The user should run `pnpm run <prototype-name>` or equivalent — never need to remember a path.
If the host project has no task runner, just put the command at the top of the prototype's README.
### 6. Hand it over
Give the user the run command. They'll drive it themselves; the interesting moments are when they say "wait, that shouldn't be possible" or "huh, I assumed X would be different" — those are the bugs in the _idea_, which is the whole point. If they want new actions added, add them. Prototypes evolve.
### 7. Capture the answer
When the prototype has done its job, the answer to the question is the only thing worth keeping. If the user is around, ask what it taught them. If not, leave a `NOTES.md` next to the prototype so the answer can be recorded, by the user or by you if you watched the session, before the prototype gets deleted.
## Anti-patterns
- **Don't add tests.** A prototype that needs tests is no longer a prototype.
- **Don't wire it to the real database.** Use an in-memory store unless the question is specifically about persistence.
- **Don't generalise.** No "what if we wanted to support X later." The prototype answers one question.
- **Don't blur the logic and the TUI together.** If the reducer / state machine references `console.log`, prompts, or terminal escape codes, it's no longer portable. Keep the TUI as a thin shell over a pure module.
- **Don't ship the TUI shell into production.** The shell is optimised for being driven by hand from a terminal. The logic module behind it is the bit worth keeping.

@ -0,0 +1,30 @@
---
name: prototype
description: Build a throwaway prototype to flesh out a design before committing to it. Routes between two branches — a runnable terminal app for state/business-logic questions, or several radically different UI variations toggleable from one route. Use when the user wants to prototype, sanity-check a data model or state machine, mock up a UI, explore design options, or says "prototype this", "let me play with it", "try a few designs".
---
# Prototype
A prototype is **throwaway code that answers a question**. The question decides the shape.
## Pick a branch
Identify which question is being answered — from the user's prompt, the surrounding code, or by asking if the user is around:
- **"Does this logic / state model feel right?"** → [LOGIC.md](LOGIC.md). Build a tiny interactive terminal app that pushes the state machine through cases that are hard to reason about on paper.
- **"What should this look like?"** → [UI.md](UI.md). Generate several radically different UI variations on a single route, switchable via a URL search param and a floating bottom bar.
The two branches produce very different artifacts — getting this wrong wastes the whole prototype. If the question is genuinely ambiguous and the user isn't reachable, default to whichever branch better matches the surrounding code (a backend module → logic; a page or component → UI) and state the assumption at the top of the prototype.
## Rules that apply to both
1. **Throwaway from day one, and clearly marked as such.** Locate the prototype code close to where it will actually be used (next to the module or page it's prototyping for) so context is obvious — but name it so a casual reader can see it's a prototype, not production. For throwaway UI routes, obey whatever routing convention the project already uses; don't invent a new top-level structure.
2. **One command to run.** Whatever the project's existing task runner supports — `pnpm <name>`, `python <path>`, `bun <path>`, etc. The user must be able to start it without thinking.
3. **No persistence by default.** State lives in memory. Persistence is the thing the prototype is _checking_, not something it should depend on. If the question explicitly involves a database, hit a scratch DB or a local file with a clear "PROTOTYPE — wipe me" name.
4. **Skip the polish.** No tests, no error handling beyond what makes the prototype _runnable_, no abstractions. The point is to learn something fast and then delete it.
5. **Surface the state.** After every action (logic) or on every variant switch (UI), print or render the full relevant state so the user can see what changed.
6. **Delete or absorb when done.** When the prototype has answered its question, either delete it or fold the validated decision into the real code — don't leave it rotting in the repo.
## When done
The _answer_ is the only thing worth keeping from a prototype. Capture it somewhere durable (commit message, ADR, issue, or a `NOTES.md` next to the prototype) along with the question it was answering. If the user is around, that capture is a quick conversation; if not, leave the placeholder so they (or you, on the next pass) can fill in the verdict before deleting the prototype.

@ -0,0 +1,112 @@
# UI Prototype
Generate **several radically different UI variations** on a single route, switchable from a floating bottom bar. The user flips between variants in the browser, picks one (or steals bits from each), then throws the rest away.
If the question is about logic/state rather than what something looks like — wrong branch. Use [LOGIC.md](LOGIC.md).
## When this is the right shape
- "What should this page look like?"
- "I want to see a few options for this dashboard before committing."
- "Try a different layout for the settings screen."
- Any time the user would otherwise spend a day picking between three vague mockups in their head.
## Two sub-shapes — strongly prefer sub-shape A
A UI prototype is much easier to judge when it's **butting up against the rest of the app** — real header, real sidebar, real data, real density. A throwaway route on its own is a vacuum: every variant looks fine in isolation. Default to sub-shape A whenever there's a plausible existing page to host the variants. Only reach for sub-shape B if the prototype genuinely has no nearby home.
### Sub-shape A — adjustment to an existing page (preferred)
The route already exists. Variants are rendered **on the same route**, gated by a `?variant=` URL search param. The existing data fetching, params, and auth all stay — only the rendering swaps. This is the default; pick it unless there's a specific reason not to.
If the prototype is for something that doesn't yet have a page but *would naturally live inside one* (a new section of the dashboard, a new card on the settings screen, a new step in an existing flow) — that's still sub-shape A. Mount the variants inside the host page.
### Sub-shape B — a new page (last resort)
Only use this when the thing being prototyped genuinely has no existing page to live inside — e.g. an entirely new top-level surface, or a flow that can't be embedded anywhere sensible.
Create a **throwaway route** following whatever routing convention the project already uses — don't invent a new top-level structure. Name it so it's obviously a prototype (e.g. include the word `prototype` in the path or filename). Same `?variant=` pattern.
Before committing to sub-shape B, sanity-check: is there really no existing page this could be embedded in? An empty route hides design problems that a populated one would expose.
In both sub-shapes the floating bottom bar is identical.
## Process
### 1. State the question and pick N
Default to **3 variants**. More than 5 stops being radically different and starts being noise — cap there.
Write down the plan in one line, in the prototype's location or a top-of-file comment:
> "Three variants of the settings page, switchable via `?variant=`, on the existing `/settings` route."
This works whether the user is here to push back or not.
### 2. Generate radically different variants
Draft each variant. Hold each one to:
- The page's purpose and the data it has access to.
- The project's component library / styling system (TailwindCSS, shadcn, MUI, plain CSS, whatever).
- A clear exported component name, e.g. `VariantA`, `VariantB`, `VariantC`.
Variants must be **structurally different** — different layout, different information hierarchy, different primary affordance, not just different colours. Three slightly-tweaked card grids isn't a UI prototype, it's wallpaper. If two drafts come out too similar, redo one with explicit "do not use a card grid" guidance.
### 3. Wire them together
Create a single switcher component on the route:
```tsx
// pseudo-code — adapt to the project's framework
const variant = searchParams.get('variant') ?? 'A';
return (
  <>
    {variant === 'A' && <VariantA {...data} />}
    {variant === 'B' && <VariantB {...data} />}
    {variant === 'C' && <VariantC {...data} />}
    <PrototypeSwitcher variants={['A','B','C']} current={variant} />
  </>
);
```
For sub-shape A (existing page): keep all the existing data fetching above the switcher; only the rendered subtree changes per variant.
For sub-shape B (new page): the throwaway route under `/prototype/<name>` mounts the same switcher.
### 4. Build the floating switcher
A small fixed-position bar at the bottom-centre of the screen with three pieces:
- **Left arrow** — cycles to the previous variant (wraps around).
- **Variant label** — shows the current variant key and, if the variant exports a name, that name too. e.g. `B — Sidebar layout`.
- **Right arrow** — cycles forward (wraps around).
Behaviour:
- Clicking an arrow updates the URL search param (use the framework's router — `router.replace` on Next, `navigate` on React Router, etc.) so the variant is shareable and reload-stable.
- Keyboard: `←` and `→` arrow keys also cycle. Don't intercept arrow keys when an `<input>`, `<textarea>`, or `[contenteditable]` is focused.
- Visually distinct from the page (e.g. high-contrast pill, subtle shadow) so it's obviously not part of the design being evaluated.
- Hidden in production builds — gate on `process.env.NODE_ENV !== 'production'` or an equivalent check, so a stray prototype merge can't ship the bar to users.
Put the switcher in a single shared component so both sub-shapes can reuse it. Locate it wherever shared UI lives in the project.
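The cycling and keyboard rules above reduce to two small pure functions. This is a framework-agnostic sketch; `cycleVariant` and `shouldHandleArrowKey` are hypothetical names, and the caller is assumed to pass in the focused element's tag name and contenteditable state.

```ts
// Returns the variant one step away from `current`, wrapping at both ends.
function cycleVariant(variants: string[], current: string, direction: 1 | -1): string {
  const count = variants.length;
  if (count === 0) throw new Error("no variants to cycle through");
  const index = variants.indexOf(current);
  // Assumption: an unknown key (e.g. a stale URL) falls back to the first variant.
  const nextIndex = index === -1 ? 0 : (index + direction + count) % count;
  return variants[nextIndex] as string;
}

// Arrow keys cycle variants unless the user is typing in a form field.
function shouldHandleArrowKey(focusedTagName: string | null, isContentEditable: boolean): boolean {
  if (isContentEditable) return false;
  const tag = (focusedTagName || "").toUpperCase();
  return tag !== "INPUT" && tag !== "TEXTAREA";
}
```

The switcher component would call `cycleVariant` from both the arrow buttons and the key handler, then write the result to the `?variant=` search param through the project's router.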
### 5. Hand it over
Surface the URL (and the `?variant=` keys). The user will flip through whenever they get to it. The interesting feedback is usually **"I want the header from B with the sidebar from C"** — that's the actual design they want.
### 6. Capture the answer and clean up
Once a variant has won, write down which one and why (commit message, ADR, issue, or a `NOTES.md` next to the prototype if running AFK and the user hasn't responded yet). Then:
- **Sub-shape A** — delete the losing variants and the switcher; fold the winner into the existing page.
- **Sub-shape B** — promote the winning variant to a real route, delete the throwaway route and the switcher.
Don't leave variant components or the switcher lying around. They rot fast and confuse the next reader.
## Anti-patterns
- **Variants that differ only in colour or copy.** That's a tweak, not a prototype. Real variants disagree about structure.
- **Sharing too much code between variants.** A shared `<Header>` is fine; a shared `<Layout>` defeats the point. Each variant should be free to throw out the layout.
- **Wiring variants to real mutations.** Read-only prototypes are fine. If a variant needs to mutate, point it at a stub — the question is "what should this look like", not "does the backend work".
- **Promoting the prototype directly to production.** The variant code was written under prototype constraints (no tests, minimal error handling). Rewrite it properly when you fold it in.

@ -0,0 +1,121 @@
---
name: setup-matt-pocock-skills
description: Sets up an `## Agent skills` block in AGENTS.md/CLAUDE.md and `docs/agents/` so the engineering skills know this repo's issue tracker (GitHub or local markdown), triage label vocabulary, and domain doc layout. Run before first use of `to-issues`, `to-prd`, `triage`, `diagnose`, `tdd`, `improve-codebase-architecture`, or `zoom-out` — or if those skills appear to be missing context about the issue tracker, triage labels, or domain docs.
disable-model-invocation: true
---
# Setup Matt Pocock's Skills
Scaffold the per-repo configuration that the engineering skills assume:
- **Issue tracker** — where issues live (GitHub by default; local markdown is also supported out of the box)
- **Triage labels** — the strings used for the five canonical triage roles
- **Domain docs** — where `CONTEXT.md` and ADRs live, and the consumer rules for reading them
This is a prompt-driven skill, not a deterministic script. Explore, present what you found, confirm with the user, then write.
## Process
### 1. Explore
Look at the current repo to understand its starting state. Read whatever exists; don't assume:
- `git remote -v` and `.git/config` — is this a GitHub repo? Which one?
- `AGENTS.md` and `CLAUDE.md` at the repo root — does either exist? Is there already an `## Agent skills` section in either?
- `CONTEXT.md` and `CONTEXT-MAP.md` at the repo root
- `docs/adr/` and any `src/*/docs/adr/` directories
- `docs/agents/` — does this skill's prior output already exist?
- `.scratch/` — sign that a local-markdown issue tracker convention is already in use
### 2. Present findings and ask
Summarise what's present and what's missing. Then walk the user through the three decisions **one at a time** — present a section, get the user's answer, then move to the next. Don't dump all three at once.
Assume the user does not know what these terms mean. Each section starts with a short explainer (what it is, why these skills need it, what changes if they pick differently). Then show the choices and the default.
**Section A — Issue tracker.**
> Explainer: The "issue tracker" is where issues live for this repo. Skills like `to-issues`, `triage`, `to-prd`, and `qa` read from and write to it — they need to know whether to call `gh issue create`, write a markdown file under `.scratch/`, or follow some other workflow you describe. Pick the place you actually track work for this repo.
Default posture: these skills were designed for GitHub. If a `git remote` points at GitHub, propose that. If a `git remote` points at GitLab (`gitlab.com` or a self-hosted host), propose GitLab. Otherwise (or if the user prefers), offer:
- **GitHub** — issues live in the repo's GitHub Issues (uses the `gh` CLI)
- **GitLab** — issues live in the repo's GitLab Issues (uses the [`glab`](https://gitlab.com/gitlab-org/cli) CLI)
- **Local markdown** — issues live as files under `.scratch/<feature>/` in this repo (good for solo projects or repos without a remote)
- **Other** (Jira, Linear, etc.) — ask the user to describe the workflow in one paragraph; the skill will record it as freeform prose
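The default posture above can be sketched as a small classifier over the remote URL. This is an illustration only, since the skill itself is prompt-driven rather than scripted. `proposeTracker` and `TrackerChoice` are hypothetical names, and treating any hostname that contains `gitlab` as a self-hosted GitLab instance is an assumption.

```typescript
type TrackerChoice = "github" | "gitlab" | "ask-user";

// Hypothetical helper: maps a `git remote` URL to the tracker to propose.
function proposeTracker(remoteUrl: string | undefined): TrackerChoice {
  if (!remoteUrl) return "ask-user"; // no remote: offer local markdown or "other"
  // Handles both https://host/owner/repo and git@host:owner/repo forms.
  const match = remoteUrl.match(/^(?:[a-z+]+:\/\/)?(?:[^@\/]+@)?([^:\/]+)/i);
  const host = (match?.[1] ?? "").toLowerCase();
  if (host === "github.com") return "github";
  // Assumption: self-hosted GitLab instances have "gitlab" in the hostname.
  if (host === "gitlab.com" || host.includes("gitlab")) return "gitlab";
  return "ask-user";
}
```

Whatever the function proposes is only a default; the user still confirms or overrides it.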
**Section B — Triage label vocabulary.**
> Explainer: When the `triage` skill processes an incoming issue, it moves it through a state machine — needs evaluation, waiting on reporter, ready for an AFK agent to pick up, ready for a human, or won't fix. To do that, it needs to apply labels (or the equivalent in your issue tracker) that match strings *you've actually configured*. If your repo already uses different label names (e.g. `bug:triage` instead of `needs-triage`), map them here so the skill applies the right ones instead of creating duplicates.
The five canonical roles:
- `needs-triage` — maintainer needs to evaluate
- `needs-info` — waiting on reporter
- `ready-for-agent` — fully specified, AFK-ready (an agent can pick it up with no human context)
- `ready-for-human` — needs human implementation
- `wontfix` — will not be actioned
Default: each role's string equals its name. Ask the user if they want to override any. If their issue tracker has no existing labels, the defaults are fine.
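A minimal sketch of that default-plus-override rule, assuming the overrides are held as a plain object keyed by role. `resolveLabel` and `TriageRole` are hypothetical names used for illustration.

```typescript
const ROLES = [
  "needs-triage",
  "needs-info",
  "ready-for-agent",
  "ready-for-human",
  "wontfix",
] as const;
type TriageRole = (typeof ROLES)[number];

// Hypothetical helper: a role maps to itself unless the user supplied an override.
function resolveLabel(
  role: TriageRole,
  overrides: Partial<Record<TriageRole, string>> = {},
): string {
  return overrides[role] ?? role;
}
```

For a repo that already uses `bug:triage`, `resolveLabel("needs-triage", { "needs-triage": "bug:triage" })` yields the existing label, while every other role keeps its canonical name.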
**Section C — Domain docs.**
> Explainer: Some skills (`improve-codebase-architecture`, `diagnose`, `tdd`) read a `CONTEXT.md` file to learn the project's domain language, and `docs/adr/` for past architectural decisions. They need to know whether the repo has one global context or multiple (e.g. a monorepo with separate frontend/backend contexts) so they look in the right place.
Confirm the layout:
- **Single-context** — one `CONTEXT.md` + `docs/adr/` at the repo root. Most repos are this.
- **Multi-context**: `CONTEXT-MAP.md` at the root pointing to per-context `CONTEXT.md` files (typically a monorepo).
### 3. Confirm and edit
Show the user a draft of:
- The `## Agent skills` block to add to whichever of `CLAUDE.md` / `AGENTS.md` is being edited (see step 4 for selection rules)
- The contents of `docs/agents/issue-tracker.md`, `docs/agents/triage-labels.md`, `docs/agents/domain.md`
Let them edit before writing.
### 4. Write
**Pick the file to edit:**
- If `CLAUDE.md` exists, edit it.
- Else if `AGENTS.md` exists, edit it.
- If neither exists, ask the user which one to create — don't pick for them.
Never create `AGENTS.md` when `CLAUDE.md` already exists (or vice versa) — always edit the one that's already there.
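The selection rules can be read as a three-way decision. The sketch below is illustrative; `pickAgentFile` and `FileDecision` are hypothetical names, and the set of existing file names is assumed to have been gathered during the Explore step.

```typescript
type AgentFile = "CLAUDE.md" | "AGENTS.md";
type FileDecision =
  | { action: "edit"; file: AgentFile }
  | { action: "ask-user" };

// Hypothetical helper mirroring the selection rules above.
function pickAgentFile(existing: ReadonlySet<string>): FileDecision {
  if (existing.has("CLAUDE.md")) return { action: "edit", file: "CLAUDE.md" };
  if (existing.has("AGENTS.md")) return { action: "edit", file: "AGENTS.md" };
  return { action: "ask-user" }; // neither exists: never pick for the user
}
```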
If an `## Agent skills` block already exists in the chosen file, update its contents in-place rather than appending a duplicate. Don't overwrite user edits to the surrounding sections.
The block:
```markdown
## Agent skills
### Issue tracker
[one-line summary of where issues are tracked]. See `docs/agents/issue-tracker.md`.
### Triage labels
[one-line summary of the label vocabulary]. See `docs/agents/triage-labels.md`.
### Domain docs
[one-line summary of layout — "single-context" or "multi-context"]. See `docs/agents/domain.md`.
```
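The rule about updating an existing `## Agent skills` block in place, rather than appending a duplicate, can be sketched as follows. `upsertAgentSkillsBlock` is a hypothetical helper. It assumes the section ends at the next level-1 or level-2 heading and that no fenced code block inside the file contains a line starting with `# ` or `## `.

```typescript
// Hypothetical helper: replaces an existing `## Agent skills` section in place,
// or appends the block if the heading is absent. Surrounding sections are kept.
function upsertAgentSkillsBlock(doc: string, block: string): string {
  const lines = doc.split("\n");
  const start = lines.findIndex((line) => line.trim() === "## Agent skills");
  if (start === -1) {
    const base = doc.trimEnd();
    return (base ? base + "\n\n" : "") + block.trimEnd() + "\n";
  }
  // The section ends at the next level-1 or level-2 heading, or at end of file.
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (/^#{1,2} /.test(lines[i] ?? "")) {
      end = i;
      break;
    }
  }
  const before = lines.slice(0, start);
  const after = lines.slice(end);
  return [...before, ...block.trimEnd().split("\n"), "", ...after].join("\n");
}
```

The `###` subsections of the block are treated as part of the section, so they are replaced together with the heading.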
Then write the three docs files using the seed templates in this skill folder as a starting point:
- [issue-tracker-github.md](./issue-tracker-github.md) — GitHub issue tracker
- [issue-tracker-gitlab.md](./issue-tracker-gitlab.md) — GitLab issue tracker
- [issue-tracker-local.md](./issue-tracker-local.md) — local-markdown issue tracker
- [triage-labels.md](./triage-labels.md) — label mapping
- [domain.md](./domain.md) — domain doc consumer rules + layout
For "other" issue trackers, write `docs/agents/issue-tracker.md` from scratch using the user's description.
### 5. Done
Tell the user the setup is complete and which engineering skills will now read from these files. Mention they can edit `docs/agents/*.md` directly later — re-running this skill is only necessary if they want to switch issue trackers or restart from scratch.


@ -0,0 +1,51 @@
# Domain Docs
How the engineering skills should consume this repo's domain documentation when exploring the codebase.
## Before exploring, read these
- **`CONTEXT.md`** at the repo root, or
- **`CONTEXT-MAP.md`** at the repo root if it exists — it points at one `CONTEXT.md` per context. Read each one relevant to the topic.
- **`docs/adr/`** — read ADRs that touch the area you're about to work in. In multi-context repos, also check `src/<context>/docs/adr/` for context-scoped decisions.
If any of these files don't exist, **proceed silently**. Don't flag their absence; don't suggest creating them upfront. The producer skill (`/grill-with-docs`) creates them lazily when terms or decisions actually get resolved.
## File structure
Single-context repo (most repos):
```
/
├── CONTEXT.md
├── docs/adr/
│   ├── 0001-event-sourced-orders.md
│   └── 0002-postgres-for-write-model.md
└── src/
```
Multi-context repo (presence of `CONTEXT-MAP.md` at the root):
```
/
├── CONTEXT-MAP.md
├── docs/adr/ ← system-wide decisions
└── src/
    ├── ordering/
    │   ├── CONTEXT.md
    │   └── docs/adr/ ← context-specific decisions
    └── billing/
        ├── CONTEXT.md
        └── docs/adr/
```
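The reading order from "Before exploring, read these" can be sketched against the two layouts above. `docsToRead` is a hypothetical helper. It assumes contexts live under `src/<context>/` as in the example tree, whereas in a real repo the locations come from `CONTEXT-MAP.md` itself. The `contexts` argument stands for the contexts relevant to the current topic.

```typescript
// Hypothetical helper: lists the doc locations to consult, in reading order.
function docsToRead(
  exists: (path: string) => boolean,
  contexts: string[] = [],
): string[] {
  const paths: string[] = [];
  if (exists("CONTEXT-MAP.md")) {
    // Multi-context layout: the map plus each relevant context's docs.
    paths.push("CONTEXT-MAP.md");
    for (const ctx of contexts) {
      paths.push(`src/${ctx}/CONTEXT.md`, `src/${ctx}/docs/adr/`);
    }
  } else {
    paths.push("CONTEXT.md");
  }
  paths.push("docs/adr/"); // system-wide decisions apply in both layouts
  return paths.filter(exists); // missing files are skipped silently
}
```

Filtering at the end reflects the "proceed silently" rule: absent files produce no warning and no suggestion to create them.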
## Use the glossary's vocabulary
When your output names a domain concept (in an issue title, a refactor proposal, a hypothesis, a test name), use the term as defined in `CONTEXT.md`. Don't drift to synonyms the glossary explicitly avoids.
If the concept you need isn't in the glossary yet, that's a signal — either you're inventing language the project doesn't use (reconsider) or there's a real gap (note it for `/grill-with-docs`).
## Flag ADR conflicts
If your output contradicts an existing ADR, surface it explicitly rather than silently overriding:
> _Contradicts ADR-0007 (event-sourced orders) — but worth reopening because…_


@ -0,0 +1,22 @@
# Issue tracker: GitHub
Issues and PRDs for this repo live as GitHub issues. Use the `gh` CLI for all operations.
## Conventions
- **Create an issue**: `gh issue create --title "..." --body "..."`. Use a heredoc for multi-line bodies.
- **Read an issue**: `gh issue view <number> --comments`. For machine-readable output, use `--json` with `--jq` to filter comments and fetch labels.
- **List issues**: `gh issue list --state open --json number,title,body,labels,comments --jq '[.[] | {number, title, body, labels: [.labels[].name], comments: [.comments[].body]}]'` with appropriate `--label` and `--state` filters.
- **Comment on an issue**: `gh issue comment <number> --body "..."`
- **Apply / remove labels**: `gh issue edit <number> --add-label "..."` / `--remove-label "..."`
- **Close**: `gh issue close <number> --comment "..."`
Infer the repo from `git remote -v`; `gh` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitHub issue.
## When a skill says "fetch the relevant ticket"
Run `gh issue view <number> --comments`.


@ -0,0 +1,23 @@
# Issue tracker: GitLab
Issues and PRDs for this repo live as GitLab issues. Use the [`glab`](https://gitlab.com/gitlab-org/cli) CLI for all operations.
## Conventions
- **Create an issue**: `glab issue create --title "..." --description "..."`. Use a heredoc for multi-line descriptions. Pass `--description -` to open an editor.
- **Read an issue**: `glab issue view <number> --comments`. Use `-F json` for machine-readable output.
- **List issues**: `glab issue list --output json` with appropriate `--label` filters.
- **Comment on an issue**: `glab issue note <number> --message "..."`. GitLab calls comments "notes".
- **Apply / remove labels**: `glab issue update <number> --label "..."` / `--unlabel "..."`. Multiple labels can be passed as a comma-separated list or by repeating the flag.
- **Close**: `glab issue close <number>`. `glab issue close` does not accept a closing comment, so post the explanation first with `glab issue note <number> --message "..."`, then close.
- **Merge requests**: GitLab calls PRs "merge requests". Use `glab mr create`, `glab mr view`, `glab mr note`, etc. — the same shape as `gh pr ...` with `mr` in place of `pr` and `note`/`--message` in place of `comment`/`--body`.
Infer the repo from `git remote -v`; `glab` does this automatically when run inside a clone.
## When a skill says "publish to the issue tracker"
Create a GitLab issue.
## When a skill says "fetch the relevant ticket"
Run `glab issue view <number> --comments`.


@ -0,0 +1,19 @@
# Issue tracker: Local Markdown
Issues and PRDs for this repo live as markdown files in `.scratch/`.
## Conventions
- One feature per directory: `.scratch/<feature-slug>/`
- The PRD is `.scratch/<feature-slug>/PRD.md`
- Implementation issues are `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from `01`
- Triage state is recorded as a `Status:` line near the top of each issue file (see `triage-labels.md` for the role strings)
- Comments and conversation history append to the bottom of the file under a `## Comments` heading
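A minimal sketch of the path conventions above. `slugify` and `issuePath` are hypothetical helpers, and the slug rule (lowercase ASCII letters and digits joined by hyphens) is an assumption, since the conventions do not define how a slug is derived.

```typescript
// Hypothetical helper. Assumption: slugs are lowercase ASCII joined by hyphens.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds `.scratch/<feature-slug>/issues/<NN>-<slug>.md`, numbered from 01.
function issuePath(featureSlug: string, index: number, title: string): string {
  const nn = String(index).padStart(2, "0");
  return `.scratch/${featureSlug}/issues/${nn}-${slugify(title)}.md`;
}

// The PRD sits next to the `issues/` directory.
function prdPath(featureSlug: string): string {
  return `.scratch/${featureSlug}/PRD.md`;
}
```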
## When a skill says "publish to the issue tracker"
Create a new file under `.scratch/<feature-slug>/` (creating the directory if needed).
## When a skill says "fetch the relevant ticket"
Read the file at the referenced path. The user will normally pass the path or the issue number directly.


@ -0,0 +1,15 @@
# Triage Labels
The skills speak in terms of five canonical triage roles. This file maps those roles to the actual label strings used in this repo's issue tracker.
| Label in mattpocock/skills | Label in our tracker | Meaning |
| -------------------------- | -------------------- | ---------------------------------------- |
| `needs-triage` | `needs-triage` | Maintainer needs to evaluate this issue |
| `needs-info` | `needs-info` | Waiting on reporter for more information |
| `ready-for-agent` | `ready-for-agent` | Fully specified, ready for an AFK agent |
| `ready-for-human` | `ready-for-human` | Requires human implementation |
| `wontfix` | `wontfix` | Will not be actioned |
When a skill mentions a role (e.g. "apply the AFK-ready triage label"), use the corresponding label string from this table.
Edit the right-hand column to match whatever vocabulary you actually use.


@ -0,0 +1,109 @@
---
name: tdd
description: Test-driven development with red-green-refactor loop. Use when user wants to build features or fix bugs using TDD, mentions "red-green-refactor", wants integration tests, or asks for test-first development.
---
# Test-Driven Development
## Philosophy
**Core principle**: Tests should verify behavior through public interfaces, not implementation details. Code can change entirely; tests shouldn't.
**Good tests** are integration-style: they exercise real code paths through public APIs. They describe _what_ the system does, not _how_ it does it. A good test reads like a specification - "user can checkout with valid cart" tells you exactly what capability exists. These tests survive refactors because they don't care about internal structure.
**Bad tests** are coupled to implementation. They mock internal collaborators, test private methods, or verify through external means (like querying a database directly instead of using the interface). The warning sign: your test breaks when you refactor, but behavior hasn't changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.
See [tests.md](tests.md) for examples and [mocking.md](mocking.md) for mocking guidelines.
## Anti-Pattern: Horizontal Slices
**DO NOT write all tests first, then all implementation.** This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."
This produces **crap tests**:
- Tests written in bulk test _imagined_ behavior, not _actual_ behavior
- You end up testing the _shape_ of things (data structures, function signatures) rather than user-facing behavior
- Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
- You outrun your headlights, committing to test structure before understanding the implementation
**Correct approach**: Vertical slices via tracer bullets. One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
```
WRONG (horizontal):
  RED:   test1, test2, test3, test4, test5
  GREEN: impl1, impl2, impl3, impl4, impl5

RIGHT (vertical):
  RED→GREEN: test1→impl1
  RED→GREEN: test2→impl2
  RED→GREEN: test3→impl3
  ...
```
## Workflow
### 1. Planning
When exploring the codebase, use the project's domain glossary so that test names and interface vocabulary match the project's language, and respect ADRs in the area you're touching.
Before writing any code:
- [ ] Confirm with user what interface changes are needed
- [ ] Confirm with user which behaviors to test (prioritize)
- [ ] Identify opportunities for [deep modules](deep-modules.md) (small interface, deep implementation)
- [ ] Design interfaces for [testability](interface-design.md)
- [ ] List the behaviors to test (not implementation steps)
- [ ] Get user approval on the plan
Ask: "What should the public interface look like? Which behaviors are most important to test?"
**You can't test everything.** Confirm with the user exactly which behaviors matter most. Focus testing effort on critical paths and complex logic, not every possible edge case.
### 2. Tracer Bullet
Write ONE test that confirms ONE thing about the system:
```
RED: Write test for first behavior → test fails
GREEN: Write minimal code to pass → test passes
```
This is your tracer bullet - proves the path works end-to-end.
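As an illustration of a single tracer-bullet cycle, here is a sketch built around the "user can checkout with valid cart" behavior mentioned earlier. The `checkout` function and the `CartItem` and `Receipt` types are hypothetical, and a plain `throw` stands in for whichever test framework the project actually uses.

```typescript
// Hypothetical domain types for illustration.
type CartItem = { name: string; priceCents: number };
type Receipt = { totalCents: number };

// GREEN: the minimal implementation that makes the first test pass.
function checkout(items: CartItem[]): Receipt {
  return { totalCents: items.reduce((sum, item) => sum + item.priceCents, 0) };
}

// RED came first: this test was written before `checkout` existed.
// It goes through the public function only and asserts on the returned result.
function testUserCanCheckoutWithValidCart(): void {
  const receipt = checkout([
    { name: "book", priceCents: 1200 },
    { name: "pen", priceCents: 300 },
  ]);
  if (receipt.totalCents !== 1500) {
    throw new Error(`expected 1500, got ${receipt.totalCents}`);
  }
}

testUserCanCheckoutWithValidCart();
```

The implementation deliberately does nothing beyond summing prices. Discounts, tax, or empty-cart handling would each arrive with their own test in a later cycle.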
### 3. Incremental Loop
For each remaining behavior:
```
RED: Write next test → fails
GREEN: Minimal code to pass → passes
```
Rules:
- One test at a time
- Only enough code to pass current test
- Don't anticipate future tests
- Keep tests focused on observable behavior
### 4. Refactor
After all tests pass, look for [refactor candidates](refactoring.md):
- [ ] Extract duplication
- [ ] Deepen modules (move complexity behind simple interfaces)
- [ ] Apply SOLID principles where natural
- [ ] Consider what new code reveals about existing code
- [ ] Run tests after each refactor step
**Never refactor while RED.** Get to GREEN first.
## Checklist Per Cycle
```
[ ] Test describes behavior, not implementation
[ ] Test uses public interface only
[ ] Test would survive internal refactor
[ ] Code is minimal for this test
[ ] No speculative features added
```


@ -0,0 +1,33 @@
# Deep Modules
From "A Philosophy of Software Design":
**Deep module** = small interface + lots of implementation
```
┌─────────────────────┐
│   Small Interface   │  ← Few methods, simple params
├─────────────────────┤
│                     │
│                     │
│  Deep Implementation│  ← Complex logic hidden
│                     │
│                     │
└─────────────────────┘
```
**Shallow module** = large interface + little implementation (avoid)
```
┌─────────────────────────────────┐
│         Large Interface         │  ← Many methods, complex params
├─────────────────────────────────┤
│       Thin Implementation       │  ← Just passes through
└─────────────────────────────────┘
```
When designing interfaces, ask:
- Can I reduce the number of methods?
- Can I simplify the parameters?
- Can I hide more complexity inside?
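To make the idea concrete, here is a sketch of a module with a two-method interface that hides tokenising, normalising, and indexing. `SearchIndex` is a hypothetical example chosen for illustration and is not taken from the book.

```typescript
// Deep: two public methods; the inverted index and tokenising stay private.
class SearchIndex {
  private readonly postings = new Map<string, Set<string>>();

  add(id: string, text: string): void {
    for (const token of SearchIndex.tokenize(text)) {
      const ids = this.postings.get(token) ?? new Set<string>();
      ids.add(id);
      this.postings.set(token, ids);
    }
  }

  // Returns the sorted ids of documents that contain every word in the query.
  search(query: string): string[] {
    const tokens = SearchIndex.tokenize(query);
    if (tokens.length === 0) return [];
    const sets = tokens.map((t) => this.postings.get(t) ?? new Set<string>());
    const first = sets[0] ?? new Set<string>();
    const rest = sets.slice(1);
    return Array.from(first)
      .filter((id) => rest.every((ids) => ids.has(id)))
      .sort();
  }

  private static tokenize(text: string): string[] {
    return text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0);
  }
}
```

Callers never see tokens or posting lists, so the internal representation can change without touching any test that goes through `add` and `search`.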


@ -0,0 +1,31 @@
# Interface Design for Testability
Good interfaces make testing natural:
1. **Accept dependencies, don't create them**
```typescript
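// Testable: the gateway is passed in, so a test can supply a fake
function processOrder(order, paymentGateway) {}

// Hard to test: the gateway is constructed inside the function
function processOrder(order) {
  const gateway = new StripeGateway();
}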
```
2. **Return results, don't produce side effects**
```typescript
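// Testable: the result is returned, so a test can assert on it directly
function calculateDiscount(cart): Discount {}

// Hard to test: mutates its argument, so a test must inspect `cart` afterwards
function applyDiscount(cart): void {
  cart.total -= discount;
}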
```
3. **Small surface area**
- Fewer methods = fewer tests needed
- Fewer params = simpler test setup
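Putting the three guidelines together, a test can hand in a fake dependency and assert on the returned value. This is a sketch under stated assumptions: `PaymentGateway`, `Order`, and `OrderResult` are hypothetical types, not part of any real payment SDK, and the gateway is synchronous only to keep the example short.

```typescript
// Hypothetical types for illustration.
interface PaymentGateway {
  charge(amountCents: number): boolean;
}
type Order = { id: string; totalCents: number };
type OrderResult = { orderId: string; paid: boolean };

// The gateway is injected, and the outcome is returned instead of mutated in place.
function processOrder(order: Order, gateway: PaymentGateway): OrderResult {
  return { orderId: order.id, paid: gateway.charge(order.totalCents) };
}

// In a test, a hand-written fake stands in for the real gateway.
const charges: number[] = [];
const fakeGateway: PaymentGateway = {
  charge(amountCents) {
    charges.push(amountCents);
    return true;
  },
};

const result = processOrder({ id: "o-1", totalCents: 2500 }, fakeGateway);
```

The fake sits at the system boundary (an external payment provider), which keeps the test on the public interface rather than on internal collaborators.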

Some files were not shown because too many files have changed in this diff.