Compare commits

213 commits

Author SHA1 Message Date
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
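
Item #62 in the commit above derives a per-namespace egress allowlist from the edge table. The actual SQL lives in the runbook and is not reproduced here. The Python sketch below only illustrates the grouping idea, assuming each edge is a (source namespace, destination namespace) pair as described for the aggregator. The function name, the sample namespaces, and the choice to ignore same-namespace edges are all assumptions made for illustration.

```python
from collections import defaultdict


def egress_allowlist(edges):
    """Group namespace-pair edges into {source_ns: sorted destination namespaces}."""
    allow = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # assumption: same-namespace traffic is not counted as egress
            allow[src].add(dst)
    return {src: sorted(dsts) for src, dsts in allow.items()}


# Hypothetical sample edges, not real cluster data.
sample = [("monitoring", "dbaas"), ("monitoring", "traefik"), ("dbaas", "dbaas")]
print(egress_allowlist(sample))
```

Each key would map to one candidate egress policy, which is why the enforce-flip can stay a separate, later step.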
Viktor Barzin
306cdd4cb3 state(dbaas): update encrypted state 2026-06-25 17:31:03 +00:00
Viktor Barzin
9c68d147e0 k8s-upgrade: reclaim+auto-prune kubeadm /etc/kubernetes/tmp leak; correct crash root cause to etcd IO (not OIDC)
Some checks failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline failed
Digging into "why did the apiserver crash" disproved the earlier OIDC
explanation. An isolated v1.35.6 apiserver repro with authentik reachable
initialises OIDC cleanly (oidc.go:313, no error) and runs fine — so the
--authentication-config -> --oidc-* revert is NOT what crashed it. etcd's
surviving crash-window log is the real cause: 1180 "apply request took too long"
warnings in 16 min, individual applies up to 4.3s (healthy <100ms) right as
kubeadm tried to bring up the new apiserver. That's etcd IO starvation on the
shared sdc HDD (beads code-oflt).

A big contributor + the reason master root fs sat at 73%: kubeadm dumps a full
~400MB etcd DB backup into /etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/ before
every etcd upgrade and never cleans it up — 145 dirs / 28GB had accumulated,
driving image-GC churn and extra write-IO onto etcd's spindle. Reclaimed live
(73% -> 23%) and added a preflight prune (>3 days) so it can't re-accumulate.

Also corrected the OIDC handling: the kubeadm-config drift is real but only
breaks dashboard/kubectl SSO AFTER a successful upgrade (recoverable via the
chain's restore.sh + the kubeadm-config reconciliation) — it does not crash the
apiserver. So the preflight check is now an ALERT, not a block (was added on the
wrong hypothesis). Post-mortem, runbook, and apiserver-oidc.tf header corrected.

Per Viktor: reclaim the disk and automate so the manual cleanup never recurs;
the durable IO fix remains code-oflt (etcd off the shared HDD).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:23:15 +00:00
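
The preflight prune described in 9c68d147e0 can be sketched as follows. This is not the repository's script (which is presumably shell); it is a minimal Python illustration that assumes age is judged by directory mtime and that backups match the `kubeadm-backup-etcd-*` name quoted in the message. The function name and the `dry_run` default are hypothetical.

```python
import shutil
import time
from pathlib import Path

MAX_AGE_SECONDS = 3 * 24 * 3600  # the ">3 days" threshold from the commit message


def prune_kubeadm_backups(tmp_dir, now=None, dry_run=True):
    """Return backup dirs older than MAX_AGE_SECONDS; delete them unless dry_run."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(Path(tmp_dir).glob("kubeadm-backup-etcd-*")):
        # Assumption: mtime is a good enough proxy for when the backup was written.
        if entry.is_dir() and now - entry.stat().st_mtime > MAX_AGE_SECONDS:
            stale.append(entry)
            if not dry_run:
                shutil.rmtree(entry)
    return stale
```

Calling `prune_kubeadm_backups("/etc/kubernetes/tmp")` first with the default `dry_run=True` lists what would be removed, which seems a sensible habit for anything that deletes under `/etc/kubernetes`.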
Viktor Barzin
60a1cb9a25 k8s-upgrade: reconcile kubeadm-config OIDC drift that crash-looped the v1.35 apiserver upgrade
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Last night's autonomous 1.34->1.35 run reached the master control-plane phase
for the first time (preflight passed, etcd snapshot taken, etcd upgraded), then
the kube-apiserver upgrade to v1.35.6 crash-looped and kubeadm auto-rolled-back
to 1.34.9. The cluster stayed healthy but the master was left cordoned and the
chain wedged on in_flight.

Root cause: kubeadm upgrade regenerates the apiserver static-pod manifest from
the kubeadm-config ConfigMap. apiserver auth was switched on 2026-06-19 to a
structured multi-issuer --authentication-config (kubectl + dashboard SSO), but
kubeadm-config still carried the legacy single-issuer --oidc-* extraArgs, so the
regenerated manifest reverted structured auth and the new apiserver crash-looped.
Proven via `kubeadm upgrade diff`. The existing post-upgrade OIDC restore step
never ran because the upgrade itself never succeeded.

Fix:
- rbac/apiserver-oidc.tf: the remote script now also reconciles kubeadm-config
  (kubeadm init phase upload-config: drop --oidc-*, add --authentication-config)
  so a future kubeadm upgrade regenerates a correct manifest. Delivered to the
  cluster via the apiserver-oidc-restore ConfigMap the chain re-runs (CI needs no
  ssh key); trigger deliberately not script-hashed since CI cannot ssh.
- k8s-version-upgrade/upgrade-step.sh: new preflight gate runs `kubeadm upgrade
  diff` and BLOCKS+alerts (never drains the master) if --authentication-config
  would still be dropped.
- Post-mortem + runbook updated.

The live kubeadm-config was reconciled directly on the master and verified
(`kubeadm upgrade diff` now shows only the control-plane image bump), so tonight's
run can complete the 1.34->1.35 upgrade.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 14:16:04 +00:00
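
The preflight gate added in 60a1cb9a25 inspects `kubeadm upgrade diff` for a dropped `--authentication-config` flag (the later commit 9c68d147e0 downgrades it from a block to an alert). A rough Python sketch of that check is below. It assumes the diff is unified-diff text in which removed lines start with `-` and added lines with `+`; the function name and the sample diff are hypothetical and the real gate lives in `upgrade-step.sh`.

```python
FLAG = "--authentication-config"


def drops_authentication_config(diff_text):
    """True if a unified diff removes FLAG without adding it back."""
    removed = added = False
    for line in diff_text.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not content changes
        if FLAG in line:
            if line.startswith("-"):
                removed = True
            elif line.startswith("+"):
                added = True
    return removed and not added


# Hypothetical diff excerpt for illustration only.
sample_diff = """--- current/kube-apiserver.yaml
+++ new/kube-apiserver.yaml
-    - --authentication-config=/path/to/config.yaml
+    - --oidc-issuer-url=https://issuer.example.invalid
"""
print(drops_authentication_config(sample_diff))
```

Treating a remove-plus-add as harmless keeps the check quiet when the flag only moves or changes value.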
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
b858561bd0 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-24 20:59:39 +00:00
Viktor Barzin
a7704f46a6 deploy goldmane-edge-aggregator: durable who-talks-to-whom edge trail (#58, ADR-0014)
Infra side of ADR-0014: an mTLS gRPC consumer of Calico Goldmane's Flows API
that records the namespace-pair edge-set in CNPG and posts a daily new-edge
digest to #security. Adds the goldmane-edge-aggregator stack, the
pg-goldmane-edges Vault rotation role (Tier-0 vault state updated here), and the
namespace in the ghcr-credentials allowlist.

Cert: REUSES the operator-minted, Tigera-CA-signed whisker-backend client cert
(Goldmane verifies only the CA chain, not identity) instead of minting from the
Tigera CA private key. This avoids putting the CA key in TF state AND the
hashicorp/tls provider, which is incompatible with this repo's global
generate-providers/lockfile pattern (it broke every stack's lockfile).

Verified live: aggregator streaming flows, 174 edges in Postgres across 50x54
namespaces, db+slack ExternalSecrets synced, digest dry-run formats correctly,
private image pulls via the Kyverno-synced ghcr-credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:59:39 +00:00
Viktor Barzin
aa510e3600 instagram-poster: force_conflicts on ESO manifests (fix apply)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The ESO v1 migration (2026-06-22) made the external-secrets controller own
.spec.refreshInterval via server-side apply, so terraform apply of the two
ExternalSecret manifests fails with a field-manager conflict (Woodpecker #348),
which blocked the replicas=0 scale-down from landing. Add force_conflicts=true
to both, matching the grafana/woodpecker/traefik fix applied to other stacks
the same day.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:49:53 +00:00
Viktor Barzin
53834deb24 instagram-poster: scale to 0 (unused, dead ExternalSecret)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor confirmed the Instagram Graph poster isn't used. Its ExternalSecret
has been dead on missing Vault keys (ig_graph_long_lived_token,
ig_business_account_id), so the deployment sat at 0/1 firing
DeploymentReplicasMismatch. Setting replicas=0 stops the alert and makes the
scale-down durable (a bare kubectl scale reverts on the next stack apply).
Re-set to 1 after minting a Meta long-lived token + populating the Vault keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 20:45:30 +00:00
Viktor Barzin
8dd9a3978d Merge remote-tracking branch 'forgejo/master' into wizard/homelab-vault
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:25:52 +00:00
Viktor Barzin
65b2df1222 fix(monitoring): force_conflicts on grafana_db_creds ExternalSecret
The external-secrets controller owns .spec.refreshInterval via SSA, so a plain
terraform apply of the monitoring stack conflicts. Latent until 2026-06-24 (the
homelab-vault loki-rules change was the first monitoring apply in a while and
surfaced it). force_conflicts lets TF win — same pattern as woodpecker/traefik/
k8s-version-upgrade stacks.
2026-06-24 12:25:36 +00:00
Viktor Barzin
1d0388da12 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-24 12:22:58 +00:00
Viktor Barzin
92361f36db calico: enable Goldmane + Whisker (Calico 3.30 OSS flow observability)
Turns on Calico 3.30's native east-west flow observability so we can see which
Service talks to which (ADR-0014, issue #57). Enabled via the operator CRs
directly (kubectl_manifest Goldmane + Whisker, name=default) rather than the
Helm goldmane/whisker flags, because the goldmanes/whiskers CRDs already exist
and this sidesteps the helm-upgrade CR-before-CRD ordering issue. Whisker
notifications=Disabled so the UI doesn't call the external Tigera endpoint.

Applied supervised: creating the Goldmane CR re-rendered calico-node with the
FELIX_FLOWLOGSGOLDMANESERVER env (operator auto-wires Felix — no manual
FelixConfiguration); calico-node rolled cleanly 7/7, tigerastatus healthy,
goldmane is receiving flows from all nodes, Whisker UI serves.

Durable Loki persistence is NOT included here: the Goldmane emitter is Calico
Cloud/Enterprise-gated with no OSS knob to aim it at Loki (the CR can override
only name+resources, not env), so a durable trail needs a small custom gRPC
consumer of goldmane:7443 — tracked in issue #58.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:22:48 +00:00
Viktor Barzin
e711b2f971 feat(monitoring): homelab vault traceability alerts (TOTP-fetch + volume)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Build infra CLI / build (push) Has been cancelled
Adds a Loki ruler group (lane=security -> #security) for the homelab vault
op-log: VaultwardenTOTPFetched (every 2nd-factor fetch is visible) and
VaultwardenFetchVolumeHigh (>100 fetches/10m backstop). The audit spine
(Vault audit device, reads of secret/data/workstation/claude-users/*) is
already captured. True CLI-bypass detection needs cross-stream correlation
(follow-up).
2026-06-24 10:31:32 +00:00
Viktor Barzin
64104e56e9 feat(devvm): install Bitwarden CLI for homelab vault 2026-06-24 10:29:57 +00:00
Viktor Barzin
15643d1f44 feat(cli): bare homelab vault help command 2026-06-24 10:29:32 +00:00
Viktor Barzin
772aed5370 fix(cli): vault security review fixes
C1 (critical): setup wrote the master password + API client_secret as
`vault kv patch key=value` argv, leaking them via /proc/<pid>/cmdline to
same-UID processes. Now written via stdin (key=- form); only email +
client_id (non-credentials) remain in argv.
I1: `get --json` is now refused on a TTY (it was dumping the secret to scrollback).
M1: vaultLock now holds the per-user flock (it mutates bw state).
M4: bw login-detection parses status JSON instead of substring matching.
M5: clipboard path refuses when stderr is not a TTY (was silently failing).
M6: realRunner trims only trailing newline, preserving secret whitespace;
    secret prompts likewise.
Adds security-property tests: no secret in argv across the get flow,
clipboard decision matrix, --json TTY gate, bw status parsing.
2026-06-24 10:28:31 +00:00
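
Three of the fixes in 772aed5370 (C1, M4, M6) are easy to show in miniature. The sketch below is Python, not the CLI's own code, and every function name is hypothetical. It assumes the Vault CLI's `key=-` form reads the value from stdin (as the commit states) and that `bw status` prints a JSON object with a `status` field. Reading "trims only trailing newline" as "one newline" is also an assumption.

```python
import json
import subprocess


def build_kv_patch(path, key, secret):
    """Return (argv, stdin_text); the secret travels on stdin, never in argv."""
    argv = ["vault", "kv", "patch", path, f"{key}=-"]
    return argv, secret


def run_kv_patch(path, key, secret):
    """Invoke the vault CLI with the secret supplied on stdin."""
    argv, stdin_text = build_kv_patch(path, key, secret)
    # input= feeds the child's stdin, so /proc/<pid>/cmdline shows only "key=-".
    return subprocess.run(argv, input=stdin_text, text=True, check=True)


def trim_trailing_newline(raw):
    """Drop a single trailing newline and keep all other whitespace (M6)."""
    return raw[:-1] if raw.endswith("\n") else raw


def bw_logged_in(status_json):
    """Parse the status JSON instead of substring matching (M4)."""
    try:
        status = json.loads(status_json).get("status")
    except (ValueError, AttributeError):
        return False  # unparseable output is treated as "not logged in"
    return status in ("locked", "unlocked")
```

Splitting `build_kv_patch` from `run_kv_patch` is what makes a "no secret in argv" property testable without a real Vault.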
Viktor Barzin
5a864cf19c feat(cli): homelab vault setup onboarding (one-time, self-service) 2026-06-24 10:21:57 +00:00
Viktor Barzin
e20033855d feat(cli): vault list/search/code/status/lock 2026-06-24 10:21:07 +00:00
Viktor Barzin
365340b37d feat(cli): homelab vault get with TTY-aware return 2026-06-24 10:20:05 +00:00
Viktor Barzin
2dd12fc6be feat(cli): vault session bootstrap with per-user flock + no-coredump 2026-06-24 10:18:36 +00:00
Viktor Barzin
5bae2a3907 feat(cli): privacy-aware vault op-log (process, never the secret) 2026-06-24 10:17:50 +00:00
Viktor Barzin
81122f8607 feat(cli): TTY-aware return + OSC52 clipboard with terminal gating 2026-06-24 10:17:13 +00:00
Viktor Barzin
06f4b87af1 feat(cli): vault bw engine env/arg builders + unlock 2026-06-24 10:16:19 +00:00
Viktor Barzin
cd44ca5921 feat(cli): vault creds loading from per-user Vault path 2026-06-24 10:15:32 +00:00
Viktor Barzin
6c53ee10b1 feat(cli): register homelab vault command group skeleton 2026-06-24 10:14:24 +00:00
Viktor Barzin
ae0d7984c4 docs: ADR-0014 + glossary — service identity (namespace+label) & Calico Goldmane observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Records the design reached in a /grill-with-docs session: how to track which
Service talks to which as more Services are added, using k8s-native options.

Decision: service identity = the workload's namespace (primary) plus a
`service-identity` label only in the few multi-Service namespaces; east-west
observability = Calico 3.30 Goldmane/Whisker (already in our Calico v3.30.7,
currently disabled) emitting to Loki for a durable trail; enforcement reuses the
existing Wave 1 egress track. Dedicated per-Service ServiceAccounts deferred and
a service mesh / mTLS / SPIFFE rejected — the trust model needs attribution-grade
forensics on a trusted, etcd-constrained cluster, not cryptographic
non-repudiation. This is the service-mesh evaluation the 2026-04-20 infra audit
flagged as missing; rejected alternatives (Retina, Hubble, Kiali, a custom Alloy
enricher) are recorded with rationale.

Adds glossary terms (Service identity, Goldmane / Whisker) to CONTEXT.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:00:36 +00:00
Viktor Barzin
0293b5c634 android-emulator: fix idle-sleeper dying with SIGPIPE before it could sleep
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Caught live-testing the previous commit: every sleeper run exited 141
(SIGPIPE) in ~1s with no output, never reaching the scale-down. Cause:
`set -o pipefail` + `dumpsys power | awk '...; exit'` — awk closes the pipe
after the first match while `kubectl exec` is still streaming dumpsys, so
the exec gets SIGPIPE, pipefail makes the pipeline 141, and set -e kills the
script before any echo. (My earlier dry-run missed it because it didn't run
under `set -euo pipefail`.)

Fix: drop pipefail; capture each exec to a var (`|| true`) then parse with
awk reading to END (no early `exit`), so nothing can SIGPIPE mid-stream and
a failed/booting exec falls through to the fail-safe "do not sleep" branch.
Also fetch the pod name via jsonpath instead of `-o name | head -1` (no pipe
to SIGPIPE, no `pod/` prefix to strip), and exec `adb` directly without the
`sh -c` wrapper.

Verified live: ran the corrected script as the gate ServiceAccount against
the stuck emulator (idle ~120h) — it logged "idle >= 6h ... scaling to zero"
and patched the deployment to replicas=0. The 6+ day pod is now asleep.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:57:36 +00:00
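
The fix in 0293b5c634 boils down to "capture the whole output first, then parse it, and never exit the parser early". The real script is shell; the Python sketch below shows the same shape under stated assumptions. The helper names and the `key=value` line format are hypothetical placeholders, not the actual `dumpsys power` layout.

```python
import subprocess
import sys


def capture(cmd, timeout=30):
    """Run cmd to completion and return its stdout, or "" on any failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return ""  # a failed or hanging exec falls through to the fail-safe path
    return done.stdout if done.returncode == 0 else ""


def first_value(text, key):
    """Scan every line (no early exit) and return the first 'key=value' value."""
    found = None
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name == key and found is None:
            found = value
    return found


# Stand-in producer so the sketch runs anywhere; the key name is a placeholder.
demo = capture([sys.executable, "-c", "print('a=1'); print('last_activity=42')"])
print(first_value(demo, "last_activity"))
```

Because the producer has already exited by the time parsing starts, there is no open pipe left to close on it, so nothing can receive SIGPIPE mid-stream.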
Viktor Barzin
839fdb33c2 android-emulator: sleep after 6h idle (activity-based), fix never-sleeping
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The emulator was meant to scale to zero when idle but had been up 6+ days
straight despite ~5 days with no real use. Two bugs:

1. The idle check counted ESTABLISHED TCP connections to the adb/noVNC
   ports. A forgotten `adb connect` (no disconnect) holds that transport
   open forever, so every 15-min run saw "active" and reset the counter --
   it never reached the sleep branch. (Right now: 4 such stale transports
   from pods on k8s-node3/node4.)
2. Even when it did reach the sleep branch, `kubectl scale --replicas=0`
   failed Forbidden -- the gate ServiceAccount can patch `deployments` but
   not `deployments/scale`.

Switch the sleeper to measure actual use: time since last user activity
(taps/keys/app-launches, incl. noVNC clicks) from `dumpsys power` vs guest
uptime. No interaction for 6h -> sleep. This ignores idle/forgotten
connections entirely. Scale down with a direct replicas patch on the named
deployment (same path the wake gate scales up), so it needs only the
existing `deployments` patch grant -- no `deployments/scale`. Now stateless
(drops the idle-counter annotation; gate.py no longer sets it) and lighter
on etcd. Fail-safe: any read error (e.g. mid-boot) does not sleep.

Requested by Viktor: turn the dev-only emulator off when it hasn't been
used for 6h.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 08:49:23 +00:00
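
The activity-based decision from 839fdb33c2 can be written as a small pure function. This is an illustrative Python sketch, not the sleeper script itself. It assumes both readings are millisecond values on the same guest-uptime clock, and it treats an inconsistent pair as a read error; the names are hypothetical.

```python
IDLE_LIMIT_MS = 6 * 60 * 60 * 1000  # the 6h threshold from the commit message


def should_sleep(uptime_ms, last_activity_ms):
    """Sleep only when both readings are valid and idle time is at least 6h."""
    if uptime_ms is None or last_activity_ms is None:
        return False  # fail-safe: an unreadable state (e.g. mid-boot) never sleeps
    idle_ms = uptime_ms - last_activity_ms
    if idle_ms < 0:
        return False  # assumption: inconsistent readings count as a read error
    return idle_ms >= IDLE_LIMIT_MS


hour = 60 * 60 * 1000
print(should_sleep(uptime_ms=130 * hour, last_activity_ms=10 * hour))
```

Since the function takes no connection counts at all, a forgotten `adb connect` cannot influence the outcome.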
Viktor Barzin
566447a698 k8s-upgrade: preflight kubeadm-plan gate must pass explicit target (minor-upgrade fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Last night's 1.34.9->1.35.6 run passed the ESO/kyverno compat gate (the migration
worked!) but ABORTED at the kubeadm-plan-target gate: it ran `kubeadm upgrade plan`
with NO version, so master's old 1.34.9 kubeadm auto-proposed only the current
minor (Loki: "falling back to stable-1.34") and plan_target != 1.35.6 -> abort.
That gate worked for patch upgrades but never for minors. Fix: pass the explicit
`v$TARGET_VERSION` (verified on master: `kubeadm upgrade plan v1.35.6` emits
"kubeadm upgrade apply v1.35.6"). Works for patches too. Applied live to the
ConfigMap before tonight's run; deleted the failed preflight-1-35-6 job.

Also: ESO 2.x took SSA ownership of .spec.refreshInterval, so terraform's apply of
the k8s-upgrade-creds ExternalSecret hit a field-manager conflict. Added
field_manager.force_conflicts=true (benign — interval is semantically identical).
This pattern affects all 104 migrated ESs fleet-wide (follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 06:06:14 +00:00
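
The plan-target gate fixed in 566447a698 compares what `kubeadm upgrade plan` proposes with the wanted version. A minimal Python sketch of that comparison follows, assuming the plan output contains a line of the form quoted in the commit ("kubeadm upgrade apply v1.35.6"). The function names are hypothetical and the real gate is a shell step.

```python
import re

# Matches the "kubeadm upgrade apply vX.Y.Z" hint quoted in the commit message.
APPLY_RE = re.compile(r"kubeadm upgrade apply (v\d+\.\d+\.\d+)")


def plan_targets(plan_output):
    """Return every version that the plan output proposes applying."""
    return APPLY_RE.findall(plan_output)


def gate_passes(plan_output, target_version):
    """True if the plan proposes exactly v<target_version>."""
    return f"v{target_version}" in plan_targets(plan_output)


print(gate_passes("You can now apply: kubeadm upgrade apply v1.35.6", "1.35.6"))
```

With no explicit version on the command line, the old kubeadm proposed only its own minor, so a check of this shape would fail for any minor upgrade, which matches the abort described above.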
Viktor Barzin
98d2b89614 calico: bump tigera-operator mem limit 256Mi -> 512Mi (OOM crashloop fix)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The operator OOM-crashlooped on 2026-06-23: it idles at ~246Mi with a ~266Mi
startup spike (re-listing resources to build informer caches), both at/over the
256Mi limit, so the first time the pod restarted it could never finish startup
(exit 137 OOMKilled, leader-elect, OOM, repeat). A latent landmine — the limit
was always too tight; it only bit once the pod restarted. Data plane was never
affected (calico-node 7/7, tigerastatus green throughout). 512Mi gives headroom
(now ~246Mi steady, verified stable 0 restarts). NOT caused by the ESO migration
(which never touched calico); cluster churn was at most the trigger that exposed
the tight limit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 12:46:28 +00:00
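
The arithmetic behind the limit bump in 98d2b89614 is simple enough to check mechanically. The sketch below uses the figures from the commit (about 246Mi steady, about 266Mi startup spike); the 10% headroom margin and the function name are assumptions for illustration, not a rule the repository states.

```python
def limit_has_headroom(limit_mi, steady_mi, spike_mi, headroom=0.10):
    """True if peak usage leaves at least `headroom` of the limit unused."""
    peak = max(steady_mi, spike_mi)
    return peak <= limit_mi * (1 - headroom)


print(limit_has_headroom(256, 246, 266))  # old limit
print(limit_has_headroom(512, 246, 266))  # new limit
```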
Viktor Barzin
68c240b8de Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-23 09:56:25 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
ff4b01a674 state(external-secrets): update encrypted state 2026-06-23 09:53:36 +00:00
Viktor Barzin
e1a85dd727 state(external-secrets): update encrypted state 2026-06-23 09:52:30 +00:00
Viktor Barzin
af22416d6f state(external-secrets): update encrypted state 2026-06-23 09:51:21 +00:00
Viktor Barzin
c75982f408 state(external-secrets): update encrypted state 2026-06-23 09:50:11 +00:00
Viktor Barzin
0407e3c578 state(external-secrets): update encrypted state 2026-06-23 09:48:33 +00:00
Viktor Barzin
dab8f9446f state(external-secrets): update encrypted state 2026-06-23 09:47:24 +00:00
Viktor Barzin
e815bb0295 state(external-secrets): update encrypted state 2026-06-23 09:46:17 +00:00
Viktor Barzin
8412cd7d54 state(external-secrets): update encrypted state 2026-06-23 09:45:04 +00:00
Viktor Barzin
f2956e1e62 state(external-secrets): update encrypted state 2026-06-23 09:43:57 +00:00
Viktor Barzin
bf2f865eee state(external-secrets): update encrypted state 2026-06-23 09:42:52 +00:00
Viktor Barzin
6f3cfb18c7 state(external-secrets): update encrypted state 2026-06-23 09:41:46 +00:00
Viktor Barzin
6e8e066215 state(external-secrets): update encrypted state 2026-06-23 09:40:14 +00:00
Viktor Barzin
de1fb04d9f state(external-secrets): update encrypted state 2026-06-23 09:39:12 +00:00
Viktor Barzin
606cfdb544 state(external-secrets): update encrypted state 2026-06-23 09:38:12 +00:00
Viktor Barzin
72464e7880 state(external-secrets): update encrypted state 2026-06-23 09:37:11 +00:00
Viktor Barzin
e88ea50304 docs(multi-tenancy): document install_skills (vendored per-user agent skills)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:30:27 +00:00
Viktor Barzin
1c8dc6bd6c t3-provision-users: install_skills heals stale symlinks + owns ~/.agents
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Follow-up to the vendored-skills change, from verifying the emo rollout:

- The if-absent guard treated ANY pre-existing ~/.claude/skills/<name> entry
  as "installed", so a manual cross-user symlink emo already had (grill-me ->
  /home/wizard/.claude/skills/grill-me) was skipped — leaving the requested
  skill depending on the admin's home instead of emo's own copy. The guard now
  keys on the user's OWN copy (a real dir under ~/.agents/skills) and (re)points
  the ~/.claude/skills symlink at it, healing a stale/cross-user link while
  still never clobbering a real dir.
- install -d left the intermediate ~/.agents owned by root; now owned by the user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:27:31 +00:00
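
The corrected guard in 1c8dc6bd6c keys on the user's own copy and then repoints the symlink. The following Python sketch mirrors that logic under stated assumptions: it is not the provisioning script, the function name and return strings are hypothetical, and ownership handling (chown to the user) is omitted.

```python
import shutil
from pathlib import Path


def install_skill(home, name, vendored_src):
    """Copy a vendored skill if the user lacks their own copy, then (re)link it."""
    home = Path(home)
    own = home / ".agents" / "skills" / name
    link = home / ".claude" / "skills" / name
    if not own.is_dir():  # if-absent keys on the user's OWN copy, not on the link
        own.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(vendored_src, own)
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # heal a stale or cross-user symlink
    elif link.exists():
        return "kept-real-dir"  # never clobber a real directory
    # Relative target, as in the commit: ../../.agents/skills/<name>
    link.symlink_to(Path("..") / ".." / ".agents" / "skills" / name)
    return "linked"
```

Checking `is_symlink()` before `exists()` matters here: a dangling cross-user link reports `exists()` as false but still needs replacing.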
Viktor Barzin
987fdd16db t3-provision-users: vendor agent skills + per-user install_skills (emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Make the admin's Claude Code agent skills available to the `emo` devvm user.
Viktor asked to install Matt Pocock's skills for emo, starting with grill-me
but covering the full set the admin already uses.

The `npx skills` upstream has drifted off that set (diagnose -> diagnosing-bugs
and write-a-skill -> writing-great-skills were renamed; caveman + zoom-out are
no longer published), so reproducing it via npx is impossible and would also
spray ~70 agent dirs into the user's home + add a GitHub-clone + unpinned-CLI
dependency to the hourly root reconcile. Instead vendor a point-in-time
snapshot of the 16 skills (scripts/workstation/claude-skills/) and copy them
per-user, mirroring install_memory: install_skills() copies each skill into
~/.agents/skills/<name> (owned by the user) and symlinks
~/.claude/skills/<name> -> ../../.agents/skills/<name>. if-absent, additive,
best-effort, scoped to the SKILL_USERS allowlist (emo).

find-skills is from vercel-labs/skills (not Matt Pocock) but included since it
is part of the admin's current set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:23:37 +00:00
Viktor Barzin
59f2beda21 chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:15:36 +00:00
Viktor Barzin
df1ec1879d chrome-service: build a real-Chrome browser image (H.264/AAC codecs)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-browser / build (push) Has been cancelled
Add an infra-owned image (Playwright base + google-chrome-stable) + its GHA
build workflow. The bundled Chromium ships with proprietary codecs compiled out, so
H.264/AAC video (Instagram Reels, X, most .mp4) fails in the noVNC view with
MEDIA_ERR_SRC_NOT_SUPPORTED; only real Google Chrome carries those codecs
(libffmpeg swap + Chrome-for-Testing both ruled out). This commit only builds
the image (→ ghcr.io/viktorbarzin/chrome-service-browser); a follow-up flips
main.tf's launch to it once the image exists + is public.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:01:17 +00:00
Viktor Barzin
7061b1dfc6 state(external-secrets): update encrypted state 2026-06-22 20:55:27 +00:00
Viktor Barzin
e2f328ff4a state(external-secrets): update encrypted state 2026-06-22 20:45:24 +00:00
Viktor Barzin
a735be9ba4 state(external-secrets): update encrypted state 2026-06-22 20:45:08 +00:00
Viktor Barzin
c670cb7118 eso: Phase 2 — migrate all 104 ExternalSecrets + 2 ClusterSecretStores to v1
Some checks failed
ci/woodpecker/push/default Pipeline failed
The API rewrite half of the ESO 0.12->2.6 migration (last k8s-1.35 compat-gate
blocker). Done on chart 0.16.2, which serves BOTH external-secrets.io/v1beta1
and v1, so this is the safe window — MUST land before 0.17 removes v1beta1
(there is no conversion webhook). Pure apiVersion bump, schema is byte-identical:
106 occurrences (104 ExternalSecrets + 2 ClusterSecretStores vault-kv/vault-database)
across 73 .tf files, v1beta1 -> v1, no other field changes.

Validated live first on tandoor (single, non-coupled, synced ES): the
kubernetes_manifest apiVersion bump forces a REPLACE; the target Secret is
cascade-GC'd for ONE ~0.3s poll then ESO recreates it (identical value re-synced
from Vault, new UID) and the ES returns SecretSynced=True on v1. Running pods
keep their mounted copy through the sub-second blip. All 110 target Secrets were
snapshotted to /tmp first as a backstop.

CI applies the changed stacks serially (staged rollout); watching aggregate ES
sync back to 108 synced (2 pre-existing dead: instagram-poster, payslip-ingest).
Next: Phase 3 climb 0.16.2 -> 2.6.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 19:13:04 +00:00
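
The Phase 2 rewrite in c670cb7118 is described as a pure `apiVersion` bump across `.tf` files. A minimal Python sketch of such a bulk rewrite is shown below. How the change was actually produced is not stated in the commit, so the function name, the dry-run default, and the plain string replacement are all assumptions.

```python
from pathlib import Path

OLD = "external-secrets.io/v1beta1"
NEW = "external-secrets.io/v1"


def bump_api_version(root, dry_run=True):
    """Rewrite OLD to NEW in every .tf file under root; return {path: count}."""
    changed = {}
    for tf in sorted(Path(root).rglob("*.tf")):
        text = tf.read_text()
        count = text.count(OLD)
        if count:
            changed[str(tf)] = count
            if not dry_run:
                tf.write_text(text.replace(OLD, NEW))
    return changed
```

Summing the returned counts gives a quick cross-check against the expected number of occurrences (106 in the commit) before anything is written.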
Viktor Barzin
98cd535b97 authentik: lock chrome.viktorbarzin.me noVNC to Viktor only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chrome-service noVNC exposes Viktor's live logged-in browser sessions
(Instagram etc. — he'll sign in there for homelab browser to reuse). It was
auth="required" = any authenticated user, and "Home Server Admins" includes emo
(emil.barzin@gmail.com), so the admin group is not a sufficient gate. Add a
host-specific case to the domain-wide forward-auth restriction allowing only
Viktor's accounts (vbarzin@gmail.com + akadmin break-glass); everyone else,
incl. emo, is denied at the noVNC. emo's AGENT already can't reach the browser
(read-only RBAC blocks port-forward); this closes the human noVNC path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:09:27 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
20ca5ee624 tripit: REEL_PROVIDER=anonymous — actually fetch reels (was fake canned caption)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
REEL_PROVIDER was unset, so the reel pipeline used FakeReelExtractor, which returns
a CANNED caption — every pasted (tripit #120) or forwarded reel produced a DUMMY
Saved Place instead of reading the real reel. Set REEL_PROVIDER=anonymous in app_env
(covers the web Deployment + the ingest CronJob) so AnonymousReelExtractor does the
real anonymous read. Verified live from the cluster: yt-dlp fetched a real IG /p/
caption (no IG_GRAPHQL_DOC_ID needed — the internal-API path is an optional
optimisation; yt-dlp fallback works). LLM extraction + Nominatim POI geocoding were
already real (prior commits); this was the last fake link in the chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:30:47 +00:00
Viktor Barzin
f46b69f372 tripit: enable real LLM + Nominatim on the web Deployment (in-app reel paste #120)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The web Deployment ran LLM_MODE=fake with no reel geocoder — only the ingest-plans
CronJob had real providers. The in-app reel-URL paste feature (tripit #120) runs
ingest_reel IN the web pod (BackgroundTask), so the Deployment now needs real
extraction: LLM_MODE=llamacpp (qwen3vl-8b; qwen3-8b segfaults on the current
llama-swap image) with the ADR-0033 claude-agent-service fallback, plus
REEL_GEOCODER_PROVIDER=nominatim for venue->city/country POI geocoding. Set in
app_env (feeds the Deployment; the CronJobs already had these via extra_env). Bonus:
this also un-fakes the in-app booking *share* import, which used the same fake LLM.
MAIL_INGEST_ENABLED stays false on the Deployment (only the CronJob polls mail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 16:50:04 +00:00
Viktor Barzin
59f2070e56 tripit: switch mail-ingest LLM_MODEL qwen3-8b -> qwen3vl-8b (qwen3-8b segfaults)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The qwen3-8b GGUF segfaults on load on the current llama-swap :cuda image
("common_init_from_params: failed to create context"; llama-swap returns 502),
which broke ALL tripit mail ingest text extraction — booking emails AND forwarded
reels (status=failed, "no place could be read"). The GGUF isn't corrupt (valid
header, full size, worked for weeks) — it's a llama.cpp/image regression. Rather
than pin the SHARED llama-swap image (cross-user blast radius), repoint the
ingest-plans CronJob at qwen3vl-8b, an already-provisioned 8B model that loads
fine and extracts flight numbers + places reliably. Restores the auto-path
(reels resolve via the Nominatim geocoder; bookings parse again). The broken
qwen3-8b GGUF is a separate, non-urgent llama-cpp cleanup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:52:09 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
f96cde35bd tripit: enable Nominatim POI geocoding for reel→Wishlist ingest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Forwarded reels (tripit ADR-0031) geocode their venue to map a Saved Place to a
country + city, but the reel route was wired to the global geocoder, which here is
GEOCODER_PROVIDER=openmeteo (city-level, name-based). OpenMeteo returns nothing for
a venue query like "Time Out Market, Lisbon" so reels never resolved and no Saved
Place was created. The app fix (tripit 3c62d596) gave the reel route its own
geocoder behind REEL_GEOCODER_PROVIDER; set it to nominatim on the ingest-plans
CronJob (the only one running the reel route) so forwarded reels resolve to real
venue coords + city + country. Isolated from the global geocoder, which stays
openmeteo for weather/tours. Verified Nominatim resolves the venue from the cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 14:59:37 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
de163aa6af workstation: switch devvm OOM backstop from systemd-oomd to earlyoom
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The systemd-oomd backstop added in the previous commit is INERT on this box.
oomd's memory-pressure kill only acts on cgroups doing active reclaim (pgscan
rising); with MemorySwapMax=0 + anonymous agent memory there is nothing to
reclaim, so pgscan stays 0 and oomd never fires. Proven live: a cgroup held at
96-99% memory.pressure for >70s with pgscan=0 was never killed (oomctl + balloon).
The very swap=0 that kills the IO storm also neuters oomd.

Replace it with earlyoom, which watches free RAM (MemAvailable%) and is
swap-independent: SIGTERM the biggest task at 5%, SIGKILL at 3%, swap ignored
(-s 100). It --avoids sshd/systemd/dockerd/containerd/t3-dispatch/tmux (the
admin's way in always survives) and --prefers the agent/browser hogs. Verified
via --dryrun: fires on the RAM threshold and selects a chrome process, not a
protected daemon.
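
The threshold rule described above can be sketched as follows. This is not earlyoom's implementation; it only illustrates a swap-independent decision driven by MemAvailable as a share of MemTotal, with the 5% and 3% thresholds from this commit. The function names and the use of "at or below" for the comparison are assumptions.

```python
import re


def mem_available_percent(meminfo: str) -> float:
    """Return MemAvailable as a percentage of MemTotal, given /proc/meminfo text."""
    fields = dict(re.findall(r"^(\w+):\s+(\d+)", meminfo, flags=re.MULTILINE))
    return 100.0 * int(fields["MemAvailable"]) / int(fields["MemTotal"])


def oom_action(available_pct: float, term_at: float = 5.0, kill_at: float = 3.0) -> str:
    """Pick the signal for the chosen victim; swap usage is deliberately not consulted."""
    if available_pct <= kill_at:
        return "SIGKILL"
    if available_pct <= term_at:
        return "SIGTERM"
    return "none"


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        pct = mem_available_percent(fh.read())
    print(f"MemAvailable {pct:.1f}% -> {oom_action(pct)}")
```

Because the input is free RAM rather than reclaim activity, the rule still fires with MemorySwapMax=0, which is the case where the pressure-based backstop stayed silent.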

The per-cgroup caps (MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0 per user,
docker.slice 8G) are unchanged and remain the PRIMARY guard — earlyoom is the
aggregate net for the rare all-users-maxed case. systemd-oomd purged; its config
+ ManagedOOM drop-ins removed. Post-mortem updated with the finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:39:16 +00:00
Viktor Barzin
3a59f4a8bf workstation: per-user memory caps + systemd-oomd backstop on devvm
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
The shared devvm keeps overloading and had to be hard-killed again today
(2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus
stacked max-effort agents) grew unbounded, spilled into the disk swap, and
swap-thrashed the throttled virtual disk into an IO storm until the box wedged.

Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained
by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree
was capped. So one user could starve everyone.

This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G,
MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap),
adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under
box-wide pressure while leaving system.slice (sshd/services/your way in)
protected, gives system.slice a fair-share CPU/IO priority edge, and routes
docker containers into a capped, oomd-policed docker.slice so they can't dodge
the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild
reproduces them; systemd-oomd added to packages.txt.

Applied live and verified: oomctl shows the backstop armed (not dry-run) on the
work slices with system.slice protected; a capped-balloon stress test OOM-killed
locally at the ceiling with swap flat (no thrash).

Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 10:25:09 +00:00
Viktor Barzin
2169e0de5f workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store
All checks were successful
ci/woodpecker/push/default Pipeline was successful
wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the
retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive
pass. install_memory() rm -rf's that dir, so those entries are dangling — and a
missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt
and froze emo's sessions (2026-06-22). The plugin shares basenames with the
homelab hooks, so the old additive-only logic saw the dead plugin path as
"already present" and skipped installing the real ~/.claude/hooks/ copy; pruning
first fixes that. Verified against emo's exact original config: yields the
correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun.
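
A minimal sketch of the prune pass, assuming the settings.json layout of event name -> list of groups -> list of `{"type", "command"}` hooks. The function name, the layout, and the example paths are assumptions for illustration; the real logic lives in wire-memory-hooks.py.

```python
import copy

# Path fragment of the retired plugin, as named in this commit.
DEAD_PATH = "plugins/claude-memory/hooks/"


def prune_dead_hooks(settings: dict, dead_path: str = DEAD_PATH) -> dict:
    """Return a copy of settings with every hook command under dead_path removed.

    Groups and events left empty by the prune are dropped as well.
    """
    result = copy.deepcopy(settings)
    hooks = result.get("hooks", {})
    for event in list(hooks):
        kept_groups = []
        for group in hooks[event]:
            live = [h for h in group.get("hooks", [])
                    if dead_path not in h.get("command", "")]
            if live:
                kept_groups.append({**group, "hooks": live})
        if kept_groups:
            hooks[event] = kept_groups
        else:
            del hooks[event]
    return result
```

Running the prune before the additive pass means a dead entry can no longer be mistaken for an already-installed hook with the same basename, and a second run changes nothing.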

auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct
HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor;
never local files). No-ops silently when no API key is in env (e.g. ancamilea).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 09:24:42 +00:00
Viktor Barzin
aeed461591 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 1595bddfc2.
2026-06-22 08:31:17 +00:00
Viktor Barzin
1595bddfc2 feat(monitoring): Tempo + OTel Collector for tripit tracing, hardened (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Re-land Phase 2 after the first attempt's two failure modes, both fixed:
- tempo.resources set under the correct single-binary chart key (was OOMKilled on
  the namespace LimitRange default when mis-placed at top level).
- atomic=true + cleanup_on_fail=true on BOTH helm releases — a failed install
  auto-rolls-back instead of leaving a stuck/orphaned release (memory #6479).

Tempo (single-binary, proxmox-lvm 20Gi, 30d) + OTel Collector (contrib; otlp ->
redaction -> batch -> tempo) + Tempo datasource + additive trace_id->Tempo
derivedField on Loki + tripit LOG_FORMAT=json/OTEL_EXPORTER_OTLP_ENDPOINT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:17:59 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
92f35550f2 workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip]
Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users
except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e,
Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the
pre-memory binary and never wired the new memory hooks for emo/anca.

Close the gap the same way the script already treats managed-settings.json and
start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the
authoring surface. At the top of the run, if the repo copy differs from the deployed
binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop),
bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when
unchanged). Verified across all four paths in isolation.
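
The four guards can be read as one decision function. The sketch below is a Python analogue of that decision, not the bash in t3-provision-users; the function name, the returned labels, and the order in which the guards are checked are assumptions.

```python
def self_deploy_action(repo: bytes, deployed: bytes, *, reexeced: bool,
                       dry_run: bool, syntax_ok: bool) -> str:
    """Decide what the top-of-run self-deploy step should do.

    repo / deployed: contents of the repo copy and the installed binary.
    reexeced: the re-exec env flag is set (we are already the fresh copy).
    syntax_ok: result of a `bash -n` check on the repo copy.
    """
    if reexeced:
        return "continue"            # no loop: never re-exec twice
    if repo == deployed:
        return "continue"            # cmp: no churn when unchanged
    if not syntax_ok:
        return "skip-broken"         # never deploy a script that fails bash -n
    if dry_run:
        return "would-deploy"        # DRY_RUN: report only, no mutation
    return "deploy-and-reexec"
```

Each return value corresponds to one of the paths verified in isolation: unchanged, broken script, dry run, and the real deploy followed by a re-exec.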

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:02:31 +00:00
Viktor Barzin
0b11a28d66 workstation: stop install_memory aborting the reconcile under set -e
install_memory (added in 44562535) ended with `[[ -d <plugin-dir> ]] && rm && log`
and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir
or settings file is absent — the normal case for users who never had the
claude-memory plugin — those return non-zero, and under `set -euo pipefail` the
function returns non-zero and kills the whole hourly reconcile after the FIRST
user, before the rest are processed.

It never fired before because the rollout was committed but the deployed
/usr/local/bin/t3-provision-users was never updated, so install_memory had never
run. On first real run it aborted right after ancamilea, so emo (and wizard)
never got their memory hooks wired — the reason emo's sessions lost memory. Wrap
the cleanup in an if-block, guard the chmod, and end the function with return 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 07:59:47 +00:00
Viktor Barzin
464e0bfb97 Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)"
All checks were successful
ci/woodpecker/push/default Pipeline was successful
This reverts commit 7513468a2d.
2026-06-22 06:46:56 +00:00
Viktor Barzin
72dcb125d5 Revert "fix(monitoring): tempo OOMKilled — move resources under tempo.resources"
This reverts commit a02782d11f.
2026-06-22 06:46:56 +00:00
Viktor Barzin
a02782d11f fix(monitoring): tempo OOMKilled — move resources under tempo.resources
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline #315 failed: tempo-0 CrashLoopBackOff / OOMKilled (exit 137). The
single-binary grafana/tempo chart (v1.24.4) takes container resources at
tempo.resources, not a top-level resources: — so my block was ignored and the pod
fell to the namespace LimitRange default and OOMed. Set tempo.resources explicitly
(req 256Mi / limit 2Gi). tripit + existing monitoring were unaffected throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:44:31 +00:00
Viktor Barzin
7513468a2d feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry
spans (Phase 1, already live in prod) export and correlate with logs:
- Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d)
- OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo)
- Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the
  Loki datasource (no uid change, so existing dashboards are unaffected)
- tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector

Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline
'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a
local plan as non-admin).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 06:31:11 +00:00
Viktor Barzin
1a32c07ffe docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2
(both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces
a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret
ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets
+ empirically validate GC-survival on the first live ES + per-stack two-phase
-target apply (fallback: state rm + import). Corrected the doc's k8s assumption
(cluster is on 1.34; whole climb stays on 1.34, no interleave).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:44:50 +00:00
Viktor Barzin
ac27e41fde Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 20:41:35 +00:00
Viktor Barzin
296deda3b4 eso: Phase 1 — climb chart 0.12.1 -> 0.16.2 (transition version) + atomic
First half of the ESO 0.12->2.6 migration (docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md),
clearing the LAST k8s-1.35 compat-gate blocker. Stepped one minor at a time on
k8s 1.34 (no k8s interleave — cluster already on 1.34, ESO bands are conservative
tested ranges not hard limits): 0.12.1 -> 0.13.0 -> 0.14.4 -> 0.15.1 -> 0.16.2.
Each hop applied + verified: controller healthy, all 108 live ExternalSecrets
stayed SecretSynced (2 pre-existing dead — instagram-poster, payslip-ingest —
missing Vault data, untouched). Added atomic=true + timeout=600 (ESO had no
rollback safety net). 0.16.2 serves BOTH v1beta1 AND v1 (storedVersions now
["v1beta1","v1"]) — the safe window to rewrite all 104 CRs to v1 (Phase 2) before
0.17 removes v1beta1. State auto-committed per hop by scripts/tg (Tier-0 SOPS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:41:30 +00:00
Viktor Barzin
0cd59d2c55 state(external-secrets): update encrypted state 2026-06-21 20:41:10 +00:00
Viktor Barzin
b8612e788d state(external-secrets): update encrypted state 2026-06-21 20:39:45 +00:00
Viktor Barzin
877e5c73b2 state(external-secrets): update encrypted state 2026-06-21 20:38:34 +00:00
Viktor Barzin
de2250f667 immich-frame: set photo date format to dd/MM/yyyy
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The photo date overlay was showing US-style MM/dd/yyyy — ImmichFrame's built-in default when PhotoDateFormat is unset. Viktor wants UK day/month/year ordering instead. Pin PhotoDateFormat to the date-fns pattern "dd/MM/yyyy" (uppercase MM = month; lowercase mm would render minutes). The config map carries reloader.stakater.com/match, so Reloader restarts the immich-frame pod automatically on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:36:43 +00:00
Viktor Barzin
8e6eff03dd state(external-secrets): update encrypted state 2026-06-21 20:36:37 +00:00
Viktor Barzin
0bae025b9b wealth dashboard: spend-down figures in today's money (inflation-adjusted)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked whether the spend-down numbers were inflation-adjusted —
they were not (all nominal). He chose to switch the card to today's
money, so every row now shows constant purchasing power for life.

Each row is a die-with-zero annuity at the REAL rate (1+g)/1.03−1
(3% inflation), spending a constant inflation-adjusted amount (the
actual pounds withdrawn rise with inflation) until net worth hits £0
at age 100:
  • No growth (0%)  → £12/day, £370/mo,   £4,446/yr   (negative real: loses to inflation)
  • Inflation (3%)  → £43/day, £1,315/mo, £15,776/yr  (0% real: holds value)
  • Market (7%)     → £130/day, £3,942/mo, £47,300/yr (~3.9% real)
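
For reference, the real-rate conversion behind the three rows can be checked with a few lines of Python. This is a sketch of the formula (1+g)/1.03−1 only; the dashboard itself computes it in SQL, and `real_rate` is a hypothetical name.

```python
def real_rate(nominal_growth: float, inflation: float = 0.03) -> float:
    """Convert a nominal growth rate g into a real rate: (1 + g) / (1 + inflation) - 1."""
    return (1 + nominal_growth) / (1 + inflation) - 1


if __name__ == "__main__":
    for label, g in [("No growth", 0.00), ("Inflation", 0.03), ("Market", 0.07)]:
        print(f"{label:10s} nominal {g:.0%} -> real {real_rate(g):+.2%}")
```

The no-growth row comes out negative, the inflation row at zero, and the market row just under 3.9%, matching the annotations on the three rows.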

Title now flags "(today's £)". Same panel/layout; only the SQL, title,
and tooltip changed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:13:59 +00:00
Viktor Barzin
3fb6284e2b immich-frame: use 24-hour clock (ClockFormat HH:mm)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to switch the Immich photo-frame shown on the Portal
kitchen appliance to a 24-hour clock. ImmichFrame defaults ClockFormat

to 'hh:mm' (12-hour) and we never overrode it, so the frame was showing
12-hour time. Set ClockFormat: "HH:mm" (date-fns 24h token) in the
frame Settings.yml ConfigMap; Reloader restarts the pod on apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:10:51 +00:00
Viktor Barzin
e89de86af0 wealth dashboard: spend-down table → three growth scenarios
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the spend-down card to compare three portfolio-growth
scenarios rather than the previous floor-vs-4%-real pair.

The table now has three rows, each a die-with-zero annuity (drain net
worth to £0 by age 100) spending a constant number of ACTUAL (nominal)
pounds, differing only by the assumed nominal growth rate:
  • No growth (0%)      → £43/day,  £1,315/mo, £15,776/yr  (= NW ÷ years)
  • Inflation (3%)      → £106/day, £3,233/mo, £38,792/yr  (NEW)
  • Avg market (7%)     → £220/day, £6,703/mo, £80,435/yr

This keeps the £43 no-growth floor he anchored on. The old third row
was "4% real" (£133) expressed in today's money; it's replaced by the
7%-nominal market row (£220, actual pounds) so all three rows share one
basis (nominal pounds) and are directly comparable. 3%/7% are hardcoded
(one-line SQL edit). Table height 4→5 for the extra row; panels below
shifted down 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 20:06:29 +00:00
Viktor Barzin
85d42f2c13 wealth dashboard: merge spend-down tiles into one compact table
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted the six separate spend-down stat tiles consolidated into a
single, more compact card with the figures laid out as rows.

Replaces stat panels 9220-9225 with one table panel (id 9220) in the
Overview row: 2 rows (Floor / 4% real) × 3 columns (per day / month /
year). Same underlying math and live values (£43/£1,315/£15,776 floor;
£133/£4,039/£48,463 at 4% real). w=9 instead of the full-width tile row,
so it takes ~a third of the width.

Note: this intentionally overrides the "table panels live at the bottom"
layout convention — Viktor chose to keep this headline KPI glanceable at
the top of the dashboard rather than scroll for it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:55:57 +00:00
Viktor Barzin
63add2a126 feat(tripit): finalize ADR-0028 auth env — AUTH_MODE=normal, trips@ sender, trust XFF
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Now that the native-auth rollout is complete:
- AUTH_MODE hybrid->normal: the legacy Authentik OIDC-bearer + forward-auth arms
  were removed in #96, and 'hybrid' already resolved to 'normal' via
  backward-compat parsing; this makes it explicit and corrects the now-false
  comment.
- SMTP_FROM plans@->trips@: the dedicated native-auth sender; the trips@->spam@
  send-as alias is live + verified (RCPT 250).
- TRUST_FORWARDED_FOR=true: so #95's per-IP signup rate-limit keys on the real
  client behind Traefik, not the shared ingress pod IP.

Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this
does NOT touch the running image. Reloader restarts the pod to pick up the new env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:50:20 +00:00
Viktor Barzin
166a2bcab4 wealth dashboard: add "spend-down to £0 at 100" stat tiles
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.

Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):

  • FLOOR  — net worth ÷ time remaining to age 100. Treats the money as
    cash (no growth, no inflation): a conservative lower bound.
    ≈ £43/day, £1.3k/mo, £15.8k/yr.
  • 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
    spend that drains the balance to £0 at 100 while it keeps earning
    4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.

Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.
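
The two equations can be sketched in Python as below. This is an illustration of the PMT formula quoted above, not the dashboard's SQL; the function names and the 365.25-day year are assumptions, and no real net-worth figure is used.

```python
from datetime import date


def years_until(target: date, today: date) -> float:
    """Horizon in years, using a 365.25-day year (an assumption of this sketch)."""
    return (target - today).days / 365.25


def annuity_payment(net_worth: float, rate: float, years: float) -> float:
    """Constant yearly spend that drains net_worth to zero after `years`.

    PMT = NW * r / (1 - (1 + r) ** -n); at r == 0 this reduces to the floor, NW / n.
    """
    if years <= 0:
        raise ValueError("horizon must be positive")
    if abs(rate) < 1e-12:
        return net_worth / years
    return net_worth * rate / (1 - (1 + rate) ** -years)


if __name__ == "__main__":
    n = years_until(date(2098, 10, 4), date.today())
    nw = 1_000_000.0  # placeholder figure, not the real net worth
    print(f"floor:   {annuity_payment(nw, 0.00, n):,.0f}/yr")
    print(f"4% real: {annuity_payment(nw, 0.04, n):,.0f}/yr")
```

Dividing the yearly figure by 12 or by 365.25 gives the monthly and daily cadences; the same function also accepts a negative rate.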

Layout: tiles inserted at y=9; all sections below shifted down 4 rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:48:30 +00:00
c830f9f462 Merge pull request 'workstation: wire-memory-hooks as root (fix non-admin wiring)' (#14) from wizard/mem-fix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:45:39 +00:00
Viktor Barzin
9aa2438e75 workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring)
install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper
lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users
can't traverse -> wiring silently failed for emo/anca (hooks copied but never
wired into settings.json). Run the helper as root (it reads both the repo helper
and the user's home) and chown the result back to the user. Verified by the live
all-users rollout: emo + anca now wired correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:45:36 +00:00
f318773cb0 Merge pull request 'workstation: homelab-memory for all users (retire claude-memory MCP)' (#13) from wizard/memory-allusers into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:42:51 +00:00
Viktor Barzin
44562535a2 workstation: provision homelab-memory hooks for all users (retire claude-memory MCP)
Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds
install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user,
idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into
~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py
never touches env / the per-user MEMORY_API_KEY), and removes ONLY the
claude_memory MCP + plugin if present. Reuses each user's existing key (no
minting; per-user isolation stays deferred per the 2026-06-07 design). The
homelab CLI hits the same remote HTTP API the MCP used; recall runs via the
homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills
symlinked from base; root+infra CLAUDE.md) already cover all users.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:42:42 +00:00
Viktor Barzin
79749d7324 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:27:42 +00:00
Viktor Barzin
5e3fe2e8e2 docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker
Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now
the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared
to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all
104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten
to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE
crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at
a time (no skipping); chart==app version; downstream Secrets survive. 5-phase
ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target
gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:27:37 +00:00
3f81b20fa6 Merge pull request 'docs: memory via homelab CLI (retire memory-tool/MCP refs)' (#12) from wizard/memory-cli-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 17:24:10 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
51838a4ec7 kyverno: 3.6.1 -> 3.8.1 (app 1.16 -> 1.18.1) — clears the k8s-1.35 compat-gate block
All checks were successful
ci/woodpecker/push/default Pipeline was successful
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.

Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).

Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:21:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.
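
A minimal sketch of the regex scoping described above. The two patterns are quoted from this message; the Job names are hypothetical (only their prefixes appear in the commit), and `re.fullmatch` stands in for Prometheus's fully anchored regex matchers.

```python
import re

# Patterns quoted from the commit message. Prometheus anchors regex matchers
# at both ends, which re.fullmatch mirrors here.
OLD = re.compile(r"k8s-upgrade-.*")
NEW = re.compile(r"k8s-upgrade-(preflight|master|worker|postflight)-.*")

# Hypothetical Job names: only the prefixes come from the commit message.
chain_job = "k8s-upgrade-preflight-29170000"
report_job = "k8s-upgrade-nightly-report-29170001"


def matches(pattern, job_name):
    """Return True when the whole Job name matches the pattern."""
    return pattern.fullmatch(job_name) is not None


for name in (chain_job, report_job):
    print(name, "old:", matches(OLD, name), "new:", matches(NEW, name))
```

Under these assumptions the old pattern matches both names, while the scoped pattern matches only the chain phase Job.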

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
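
The suppression can be modelled in a few lines. This is only a sketch of the `unless on()` semantics, not the alert rule itself; the function name and the list-of-names input are illustrative assumptions.

```python
def chain_job_failed_fires(failed_jobs, blocked_gauge):
    """Model `<failed chain jobs> unless on() k8s_upgrade_blocked == 1`.

    failed_jobs: names of chain Jobs currently in the Failed state.
    blocked_gauge: current value of k8s_upgrade_blocked, or None if absent.
    Returns the Job names the alert would still fire for.
    """
    # `k8s_upgrade_blocked == 1` yields a series only when the gauge is 1.
    right_side_present = blocked_gauge == 1
    # `unless on()` matches on no labels, so any right-hand series
    # suppresses every left-hand series.
    if right_side_present:
        return []
    return list(failed_jobs)


# Deliberate compat-gate block: preflight failed AND the gauge is 1.
print(chain_job_failed_fires(["preflight"], 1))     # suppressed
# Genuine wedge: preflight failed without setting the gauge.
print(chain_job_failed_fires(["preflight"], None))  # still fires
```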

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
b0ccaf1c65 state(vault): update encrypted state
2026-06-21 15:07:01 +00:00
Viktor Barzin
f84e6818b2 state(vault): update encrypted state
2026-06-21 15:07:01 +00:00
Viktor Barzin
cc4bb8ffe8 wealth dashboard: show price freshness for all 3 holdings, not just worst
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:49:33 +00:00
6c2c56ab3b Merge pull request 'docs: CrowdSec enforcement = firewall-bouncer + CF WAF (plugin removed)' (#11) from wizard/crowdsec-docs into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:40:41 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
4df741f6de Merge pull request 'traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)' (#10) from wizard/cs-deplugin-crd into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:36:03 +00:00
Viktor Barzin
c23b03864e traefik/crowdsec: delete dead Yaegi plugin + middleware CRD + captcha (PR2/2)
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:35:13 +00:00
df86075c3d Merge pull request 'cleanup: fully remove orphaned council-complaints app' (#9) from wizard/council-cleanup into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 13:33:23 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
6dc3ce139f wealth dashboard: expand all rows by default + inline the freshness stat
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two follow-ups Viktor asked for on the Price freshness panel:

- Expand every section by default. Grafana's collapsed rows hide their
  child panels; just flipping collapsed=false leaves a non-canonical shape
  (confirmed via the Grafana API that it keeps the panels nested rather
  than hoisting them), so each row is now collapsed=false + panels=[] with
  its children hoisted to top-level -- the exact form Grafana writes when
  you expand-and-save. Row headers revert to their original y (the child
  y-coords were already expanded-layout coordinates).

- Stop the freshness stat from taking its own line. It's now the 6th tile
  in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
  at y=5; the collapsed-row y-shift from the previous commit is undone.

No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:29:25 +00:00
Viktor Barzin
92ff0b92f1 Merge remote-tracking branch 'forgejo/master' into wizard/t3-idle-migrate
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 12:41:33 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
334d8fee5d setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:36:13 +00:00
Viktor Barzin
3cf09a0fe3 t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:35:19 +00:00
Viktor Barzin
af9f7be297 t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
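
The per-marker decision described above can be sketched as a pure function. The function name, argument names, and returned labels are hypothetical; the real job is a shell script and may structure this differently.

```python
def drain_decision(unit_exists, restarted_after_deferral, idle_gate_ok):
    """Decide what to do with one deferral marker.

    Returns one of three illustrative labels:
      "clear-marker"       - nothing left to migrate, drop the marker
      "backup-and-restart" - safe now: back up, restart, then clear on success
      "keep-marker"        - not idle yet, try again on the next timer tick
    """
    if not unit_exists or restarted_after_deferral:
        return "clear-marker"
    if idle_gate_ok:
        return "backup-and-restart"
    return "keep-marker"


def act(decision, dry_run):
    """In DRY_RUN mode only log the decision; otherwise report the action."""
    prefix = "[dry-run] would" if dry_run else "will"
    return f"{prefix} {decision}"


print(act(drain_decision(True, False, True), dry_run=True))
```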

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:44 +00:00
Viktor Barzin
06e400522f t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
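
A rough Python sketch of the gate's logic (the actual implementation is bash). The schema is an assumption: a `threads` table with a nullable `active_turn_id` and an `updated_at` column holding epoch seconds. Only `active_turn_id` is named in this message; the table name, the `updated_at` column, and the empty-table behaviour are guesses made for illustration.

```python
import sqlite3
import time

QUIET_BUFFER_SECONDS = 15 * 60  # default quiet buffer from the commit message


def safe_to_restart(db_path, now=None, quiet=QUIET_BUFFER_SECONDS):
    """Return True only when no turn is in flight and the DB has been quiet."""
    now = time.time() if now is None else now
    try:
        # Read-only open: the gate must never modify state.sqlite.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            active = conn.execute(
                "SELECT COUNT(*) FROM threads WHERE active_turn_id IS NOT NULL"
            ).fetchone()[0]
            last = conn.execute(
                "SELECT MAX(updated_at) FROM threads"
            ).fetchone()[0]
        finally:
            conn.close()
        if active > 0:
            return False
        if last is None:
            # Assumption: an empty table means there is nothing to interrupt.
            return True
        return (now - float(last)) > quiet
    except (sqlite3.Error, TypeError, ValueError):
        # Fail closed: any open/query/parse problem means "do not restart".
        return False
```

A missing file, a missing table, or an unparsable timestamp all land in the `except` branch, which is the fail-closed behaviour the message describes.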

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:34:11 +00:00
Viktor Barzin
de97696ff0 t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:32:57 +00:00
Viktor Barzin
2ab5b94748 t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:28:53 +00:00
Viktor Barzin
0cebeeb0ee t3-idle-migrate: implementation plan
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:26:05 +00:00
Viktor Barzin
ddbdbca7e9 wealth dashboard: add "Price freshness" stat for stalest held quote
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.

Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.
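
The colour bands can be written out as a small function. This is a sketch of the thresholds quoted above, not the dashboard's actual SQL or Grafana threshold config; treating a missing quote as red is an inference from "sorts as max-stale".

```python
from datetime import date


def freshness_colour(age_days):
    """Map a held position's quote age in days to a tile colour."""
    if age_days is None:
        # Assumption: a held symbol with no quote at all is shown as red.
        return "red"
    if age_days <= 4:
        return "green"   # weekend/bank-holiday tolerant
    if age_days <= 9:
        return "amber"
    return "red"


# META's quote date and this commit's date, both taken from the message.
meta_age = (date(2026, 6, 21) - date(2026, 4, 17)).days
print(meta_age, freshness_colour(meta_age))
```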

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:23:45 +00:00
Viktor Barzin
9503bed589 t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances
Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days.

This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:04:22 +00:00
Viktor Barzin
b1bbe42821 homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).

Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.

- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 10:45:32 +00:00
a091689603 Merge pull request 'traefik/crowdsec: remove dead plugin middleware reference (PR1/2)' (#8) from wizard/cs-deplugin-refs into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-21 00:17:51 +00:00
Viktor Barzin
71d0af084e traefik/crowdsec: remove 6 hard-coded middleware refs the variable sweep missed (PR1/2)
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:17:40 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
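
The argument injection can be illustrated as follows. This is a hedged sketch, not the contents of scripts/tg: it assumes the verb is the first argument and that a caller-supplied `-lock-timeout` should be left alone. `-lock-timeout` itself is the standard Terraform flag and `TG_LOCK_TIMEOUT` is the override named above.

```python
import os

# The state-locking verbs listed in the commit message.
LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}


def inject_lock_timeout(args, env=None):
    """Return args with -lock-timeout added for state-locking verbs."""
    env = os.environ if env is None else env
    timeout = env.get("TG_LOCK_TIMEOUT", "5m")
    # Assumption: the verb is the first argument.
    if not args or args[0] not in LOCKING_VERBS:
        return list(args)
    # Assumption: an explicit caller-supplied flag wins.
    if any(a.startswith("-lock-timeout") for a in args):
        return list(args)
    return [args[0], f"-lock-timeout={timeout}", *args[1:]]


print(inject_lock_timeout(["apply", "-auto-approve"], env={}))
print(inject_lock_timeout(["output"], env={}))
```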

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
84a18a5529 traefik/crowdsec: remove dead Yaegi-plugin middleware reference (PR1/2)
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.

This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:12 +00:00
9774ae3d19 Merge pull request 'crowdsec: firewall-bouncer cluster-wide (remove node2 pin)' (#7) from wizard/cs-fw-allnodes into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:08:15 +00:00
Viktor Barzin
c92590ae85 crowdsec: roll firewall-bouncer cluster-wide (remove node2 validation pin)
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:07:45 +00:00
4f1c998468 Merge pull request 'rybbit sync: exclude CAPI + per_page=500 fix' (#6) from wizard/crowdsec-syncfix into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-21 00:05:50 +00:00
Viktor Barzin
f55bb6c422 rybbit: sync excludes CAPI blocklist + fix CF items per_page (500)
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.
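
A small sketch of the selection logic, assuming each LAPI decision is a dict with `origin` and `value` keys (the field names CrowdSec's decisions API returns). The function and constant names are hypothetical and the real sync script may differ.

```python
SAFETY_CAP = 9000      # safety cap quoted in the commit message
CF_MAX_PER_PAGE = 500  # list-items endpoint maximum noted in the commit message


def select_for_edge(decisions, cap=SAFETY_CAP):
    """Pick the IPs to push to the edge list: skip CAPI, dedupe, cap."""
    selected = []
    seen = set()
    for decision in decisions:
        if decision.get("origin") == "CAPI":
            continue  # community blocklist is enforced in-kernel instead
        ip = decision.get("value")
        if not ip or ip in seen:
            continue
        seen.add(ip)
        selected.append(ip)
        if len(selected) >= cap:
            break
    return selected


def clamp_per_page(requested):
    """Keep a requested page size within the endpoint's stated maximum."""
    return min(requested, CF_MAX_PER_PAGE)


sample = [
    {"origin": "CAPI", "value": "198.51.100.7"},
    {"origin": "crowdsec", "value": "203.0.113.9"},
    {"origin": "cscli", "value": "203.0.113.9"},
]
print(select_for_edge(sample), clamp_per_page(1000))
```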

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:05:05 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
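
One way to guard a cleanup script against this class of bug is to require an exact, unique username match instead of trusting `users[0]` from a fuzzy search. The sketch below is illustrative only and is not part of this commit; it assumes the search results are dicts with a `username` key, and the helper name and sample usernames are hypothetical.

```python
def pick_exact_user(results, username):
    """Return the single search result whose username matches exactly."""
    exact = [u for u in results if u.get("username") == username]
    if len(exact) != 1:
        # Refuse to act rather than fall back to the first fuzzy hit.
        raise LookupError(
            f"expected exactly one user named {username!r}, found {len(exact)}"
        )
    return exact[0]


# A fuzzy search for the demo account that also returns a real account first.
results = [
    {"pk": 1, "username": "real-user"},
    {"pk": 2, "username": "demo-user"},
]
print(pick_exact_user(results, "demo-user")["pk"])
```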

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
600f1f933c Create Claude auth state directories
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.
2026-06-20 20:25:55 +00:00
Viktor Barzin
7f1788a106 Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 20:22:20 +00:00
Viktor Barzin
ff67e9d422 Fix workstation package manifest parsing
The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.
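
A minimal sketch of the comment stripping, assuming a manifest with whitespace-separated package names and `#` comments. The manifest format and the function name are assumptions, and the real setup-devvm script is not Python.

```python
def parse_manifest(text):
    """Return package names with full-line and inline # comments removed."""
    packages = []
    for line in text.splitlines():
        # Everything after the first '#' is a comment, inline or full-line.
        content = line.split("#", 1)[0]
        packages.extend(content.split())
    return packages


manifest = """\
# base tools
git            # version control
jq  curl
sqlite3        # used by the idle gate
"""
print(parse_manifest(manifest))
```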
2026-06-20 20:22:05 +00:00
Viktor Barzin
524b874036 state(vault): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-06-20 20:14:53 +00:00
Viktor Barzin
7050b0441e Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 20:11:09 +00:00
Viktor Barzin
bc2fbc712c Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew
2026-06-20 20:10:48 +00:00
Viktor Barzin
02d14796cc feat(mailserver): add trips@ send-as alias for TripIt native auth email (ADR-0028)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:10:47 +00:00
Viktor Barzin
5549fc3672 Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
2026-06-20 20:10:40 +00:00
Viktor Barzin
3278588325 chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:04:24 +00:00
834c5e6a2a Merge pull request 'CrowdSec proxied: single CF list (block-only) + firewall-bouncer re-apply' (#5) from wizard/crowdsec-1list into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:31:01 +00:00
Viktor Barzin
7cf93a0587 crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply
CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:29:43 +00:00
1406d8a391 Merge pull request 'Fix CF ruleset import id + depends_on' (#4) from wizard/crowdsec-fix2 into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:13:03 +00:00
Viktor Barzin
f2b089e267 rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:12:29 +00:00
58fc6d5061 Merge pull request 'Fix CrowdSec firewall-bouncer tar + CF WAF ruleset import' (#3) from wizard/crowdsec-fixes into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:06:15 +00:00
Viktor Barzin
a351a66843 crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:04:30 +00:00
70e8ce1021 Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 09:42:41 +00:00
Viktor Barzin
ca8d617e72 rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4)
tg plan showed that the agent's guess, 'Account Filter Lists Edit/Read', is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:41:41 +00:00
Viktor Barzin
0c56290af0 chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. This commit is a trivial comment touch that
forces a clean apply of the stack so the durable push-mirror fix actually takes effect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:19:53 +00:00
Viktor Barzin
cc4bfb593b rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.
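
The full-reconcile and fail-safe behaviour can be sketched as a set difference. The function name and return shape are hypothetical; `None` stands in for "the LAPI fetch failed", and the real lapi_kv_sync.py may be structured differently.

```python
def plan_reconcile(current_items, lapi_ips):
    """Compute the changes needed to make one CF list match LAPI.

    current_items: IPs currently in the CF list.
    lapi_ips: IPs LAPI wants in the list, or None if the fetch failed.
    Returns None to signal "skip this run and keep the last-known-good set".
    """
    if lapi_ips is None:
        return None  # fail-safe: never shrink the block set on a LAPI blip
    desired = set(lapi_ips)
    current = set(current_items)
    return {
        "add": sorted(desired - current),
        "remove": sorted(current - desired),
    }


print(plan_reconcile(["203.0.113.9", "198.51.100.7"], ["203.0.113.9", "192.0.2.44"]))
print(plan_reconcile(["203.0.113.9"], None))
```

Because CF allows one pending bulk operation per account, a caller would apply the resulting "add" and "remove" batches one after another rather than in parallel.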

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:18:33 +00:00
Viktor Barzin
7e646e1c7c crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement)
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.

No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:11:08 +00:00
Viktor Barzin
53117b193a portal-realtime: deploy the v2 full-duplex voice agent (Pipecat)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.

Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:23:17 +00:00
Viktor Barzin
44cac6f4e2 gitignore: ignore Python test artifacts (__pycache__, *.pyc, .pytest_cache)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Introduced the first pytest file in the tree
(stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an
untracked __pycache__/ dir. Ignore the standard Python build artifacts so test
runs don't show up as working-tree noise or get committed by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:17:03 +00:00
Viktor Barzin
b58fe8cb1a docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
  until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
  now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
  running minor are never addon-blocked (target_minor <= running_minor returns
  early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
  a 1.34.x patch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:16:20 +00:00
Viktor Barzin
e5250f417e k8s-version-upgrade: compat gate must not false-block patch upgrades
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).

Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.

Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).
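
The early return described above can be sketched roughly as follows. This is an illustration, not the actual compat-gate.py: `check_addons` is named in the commit, but its signature, the `parse_minor` helper, the shape of the ceiling map, and the kyverno ceiling value are all assumptions (only the ESO ceiling of 1.31 comes from the message).

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Hypothetical helper: '1.34.10' or 'v1.34' -> (1, 34)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def check_addons(target: str, running: str, ceilings: dict[str, str]) -> list[str]:
    """Return block reasons; an empty list means no addon blocks the upgrade."""
    target_minor = parse_minor(target)
    running_minor = parse_minor(running)
    # A target at or below the running minor crosses into no new k8s minor,
    # so every installed addon is already proven on it: nothing to check.
    if target_minor <= running_minor:
        return []
    reasons = []
    for addon, ceiling in sorted(ceilings.items()):
        if target_minor > parse_minor(ceiling):
            reasons.append(
                f"{addon} supports k8s <= {ceiling}; "
                f"target {target_minor[0]}.{target_minor[1]} exceeds it"
            )
    return reasons


# Illustrative ceilings: ESO's 1.31 is from the commit, kyverno's is assumed.
CEILINGS = {"external-secrets": "1.31", "kyverno": "1.34"}

patch_reasons = check_addons("1.34.10", "1.34.9", CEILINGS)  # patch within the running minor
minor_reasons = check_addons("1.35.6", "1.34.9", CEILINGS)   # minor jump
```

With these inputs the patch target yields no reasons, while the minor jump yields one reason per addon whose ceiling is below 1.35.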

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:14:50 +00:00
Viktor Barzin
38675b7922 crowdsec: register kvsync + firewall bouncer keys in LAPI
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:12:38 +00:00
Viktor Barzin
a9384a4067 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 08:09:16 +00:00
Viktor Barzin
44a98d408e k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL)
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:09:08 +00:00
Viktor Barzin
910d589205 fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:08:50 +00:00
Viktor Barzin
45bed1c133 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 08:07:23 +00:00
Viktor Barzin
e1736d2e5c calico: hop 3.28.5->3.30.7 (operator v1.38.13); restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight.
2026-06-20 08:07:08 +00:00
Viktor Barzin
4d9fdbc7f7 rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane)
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.
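
A minimal sketch of the full-reconcile and fail-safe behaviour described above, assuming decisions are held as a simple `ip -> remediation` map. `plan_reconcile`, `run_once`, and the `fetch_decisions` callable are hypothetical names for illustration; the real script's structure and its LAPI/KV calls are not shown in this log.

```python
def plan_reconcile(desired: dict[str, str], current: dict[str, str]):
    """Diff the LAPI view (desired) against the edge store (current).

    Both maps are ip -> remediation ('ban' or 'captcha').
    """
    # Write anything new or whose remediation changed.
    upserts = {ip: r for ip, r in desired.items() if current.get(ip) != r}
    # Delete anything no longer decided, so an un-ban clears within one run.
    deletes = sorted(ip for ip in current if ip not in desired)
    return upserts, deletes


def run_once(fetch_decisions, current: dict[str, str]):
    """One sync interval; returns None when the run is skipped."""
    try:
        desired = fetch_decisions()
    except OSError:
        # LAPI read error: skip the run and touch nothing, rather than
        # treating an empty read as "no decisions" or as "block everything".
        return None
    return plan_reconcile(desired, current)


def _lapi_down():
    raise OSError("LAPI unreachable")


edge = {"203.0.113.7": "captcha", "198.51.100.9": "ban"}
plan = run_once(lambda: {"203.0.113.7": "ban"}, edge)
skipped = run_once(_lapi_down, edge)
```

Keeping the diff as a pure function makes the "skip on read error" path easy to test separately from any network code.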

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:05:11 +00:00
Viktor Barzin
0ac176da01 crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:03:46 +00:00
Viktor Barzin
3e3fdb34f0 homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).
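
For illustration, the shape of one such usage line in Loki's JSON push format might look like the sketch below. It is written in Python rather than the CLI's own language, `build_usage_line` is a hypothetical name, and only the labels, the line format, and the `HOMELAB_TELEMETRY=0` opt-out are taken from the message above. The function receives no arguments or paths from the command, so it cannot leak them.

```python
import json
import os
import time


def build_usage_line(user, verb, exit_code, version, now_ns=None):
    """Build a Loki push body for one verb invocation, or None if opted out."""
    if os.environ.get("HOMELAB_TELEMETRY") == "0":
        return None
    ts = str(now_ns if now_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {
                # Labels: only job, user and the verb path.
                "stream": {"job": "homelab-usage", "user": user, "verb": verb},
                # Line: exit code and CLI version, nothing else.
                "values": [[ts, f"exit={exit_code} ver={version}"]],
            }
        ]
    }
    return json.dumps(payload)


os.environ.pop("HOMELAB_TELEMETRY", None)
body = build_usage_line("alice", "metrics query", 0, "0.6.0", now_ns=1)
os.environ["HOMELAB_TELEMETRY"] = "0"
opted_out = build_usage_line("alice", "metrics query", 0, "0.6.0", now_ns=1)
os.environ.pop("HOMELAB_TELEMETRY", None)
```

A caller would POST the body with a short timeout and ignore any error, matching the best-effort behaviour described above.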

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 22:29:01 +00:00
Viktor Barzin
666fefd22b calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-19 22:09:23 +00:00
Viktor Barzin
8ed5368be9 calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
2026-06-19 21:50:34 +00:00
Viktor Barzin
dd029ca7fb traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
0cc48d83ac traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory)
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same
interpreter-incompat class as the stream goroutine itself. With
redisCacheUnreachableBlock=false the bouncer then failed open and enforced nothing.

Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
531efb218d traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling)
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.

Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
78095aa273 docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:37:46 +00:00
7d99203fc6 forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.

Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:34:17 +00:00
ef530b7d38 forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:24:29 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
56dadda453 traefik: pin helm chart to 40.2.0 (deployed version)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.

Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:58:33 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
21dbd79ae4 Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-19 11:27:44 +00:00
Viktor Barzin
e91e1612dd homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203): no port-forward, no kubectl.
In-cluster-only endpoints (Alertmanager v2) are deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:31 +00:00
Viktor Barzin
6cb823e431 k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
Viktor Barzin
cecd9fe247 k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
  running version doesn't support the target k8s minor, (b) an in-use deprecated
  API (apiserver_requested_deprecated_apis) is removed at/before the target, or
  (c) a node's containerd is below the target's floor. Validated against the live
  cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
  1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
  kyverno, gpu-operator + containerd floor), sourced from each project's compat
  docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
  block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io
  resolves to 200 — minors were never being detected). Gated behind the compat
  gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight +
runbook (next commit) so the detector fix only goes live with the full net.
2026-06-19 11:23:30 +00:00
Viktor Barzin
9189560ac3 homelab: v0.4.0 — ci/deploy verbs (watch what you trigger)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the verb group that removes the single biggest reasoning sink in agent
sessions: watching a build/deploy to completion (proven in the session that
built it: hours of hand-rolling Woodpecker polling + DB-schema spelunking for
one CI incident).

- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
  reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
  so the cert verifies — the Go form of the house `curl --resolve` pattern),
  token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
  retries that ride Woodpecker's intermittent empty responses. watch matches the
  HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
  success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
  skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
  reliable; status/watch use the working list endpoint).

Live-verified ci status/watch against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:59:14 +00:00
Viktor Barzin
787ce4edfa homelab: v0.3.1 — fix k8s db PG target (resolve CNPG primary pod, not the Service)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.

Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:09:34 +00:00
Viktor Barzin
90c944a265 woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra pipelines were failing intermittently across all authors (e.g. #241-244,
#247) with the git clone step exiting 128:

  git fetch --depth=1 --filter=tree:0 ...   (partial/treeless clone)
  git reset --hard <sha>
  fatal: could not fetch <tree-sha> from promisor remote
  remote: 404 page not found

The plugin-git clone defaulted to a partial (treeless) clone. The initial ref
fetch carries credentials, but the lazy *promisor* object fetch triggered by
`git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit
128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures
fleet-wide (not specific to any commit).

Fix: set `partial: false` on every clone block so all objects for the (still
shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch.

Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the
Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin"
log lines were an unrelated cross-forge red herring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:06:44 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).
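
Why the edge-trigger ignores a clean reboot can be sketched with a simplified model of counter growth. This is an illustration only: the function below is a hypothetical stand-in that sums sample-to-sample growth and treats any drop as a counter reset, and it omits the window extrapolation that the real PromQL `increase()` performs.

```go
package main

import "fmt"

// increase sums the growth of a counter-like series across samples.
// A drop is treated as a counter reset: the series is assumed to have
// restarted from zero, so the new value counts as growth in full.
// Simplified sketch; real PromQL increase() also extrapolates to the
// window boundaries.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i] // counter reset
		}
		total += delta
	}
	return total
}

func main() {
	latched := []float64{1, 1, 1, 1}  // sticky bit still set from an old brown-out
	reboot := []float64{1, 1, 0, 0}   // clean reboot clears the bit
	brownOut := []float64{0, 0, 1, 1} // new brown-out latches the bit

	// Only the new latch produces a positive increase.
	fmt.Println(increase(latched) > 0, increase(reboot) > 0, increase(brownOut) > 0)
}
```

Under this model a latched bit and a reboot both yield zero growth, so only a fresh 0 to 1 transition inside the window would satisfy the `> 0` condition.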

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
fbf6f11038 feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:27:39 +00:00
Viktor Barzin
8559c4574a fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:10:25 +00:00
Viktor Barzin
e5bb16e02a feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:43 +00:00
Viktor Barzin
077ac97df5 k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps
Some checks failed
ci/woodpecker/push/default Pipeline failed
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applied the rbac stack. That manual step
ran after every control-plane upgrade — the one thing keeping autonomous patch
upgrades from being truly hands-off (it bit us this cycle: an earlier master bump
left SSO broken until we noticed).

Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:04:30 +00:00
Viktor Barzin
48b63ffa6f homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.

Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.

The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 05:56:25 +00:00
Viktor Barzin
3594485f77 homelab: v0.2.0 — docs + version for the k8s verb-group
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:30:41 +00:00
Viktor Barzin
1f7438bb18 homelab: add k8s verb-group (v0.2) — the biggest remaining surface
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).
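
The defaulting described above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: the `Opts` struct, the `logsArgs` function, and the precedence between `--pod` and `-l` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Opts holds the override flags named in the commit message.
// The field names are hypothetical.
type Opts struct {
	Namespace string // -n
	Pod       string // --pod
	Container string // -c
	Selector  string // -l
}

// logsArgs builds a `kubectl logs` argument list for app.
// Defaults: namespace = app and target = deploy/<app>, so kubectl
// resolves the pod itself. Assumed precedence when several overrides
// are set: --pod first, then -l.
func logsArgs(app string, o Opts) []string {
	ns := app
	if o.Namespace != "" {
		ns = o.Namespace
	}
	args := []string{"-n", ns, "logs"}
	switch {
	case o.Pod != "":
		args = append(args, "pod/"+o.Pod)
	case o.Selector != "":
		args = append(args, "-l", o.Selector)
	default:
		args = append(args, "deploy/"+app)
	}
	if o.Container != "" {
		args = append(args, "-c", o.Container)
	}
	return args
}

func main() {
	// One-app namespace: everything is derived from the app name.
	fmt.Println("kubectl", strings.Join(logsArgs("grafana", Opts{}), " "))
	// Multi-app namespace: the caller overrides namespace and container.
	fmt.Println("kubectl", strings.Join(logsArgs("grafana", Opts{Namespace: "monitoring", Container: "grafana"}), " "))
}
```

The point of the design is that the common case needs only the app name, while the flags cover namespaces that hold more than one app.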

Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.

Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:29:51 +00:00
Viktor Barzin
66caa0bf7f homelab: v0.1 docs, distribution wiring, and version
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Completes v0.1: documentation, build/install path, and version stamping.

- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
  build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
  separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
  work/tf behaviour (native worktree entry, verification-gated auto-land,
  presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
  (t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:25:51 +00:00
Viktor Barzin
087b415f73 homelab: add work verbs (start/land/clean) with a land verification gate
Completes the infra-loop verb surface. work start creates .worktrees/<topic>
on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is
ignored) and prints the path for native EnterWorktree entry. work land fetches,
merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and
falls back to pushing the feature branch for a PR when the direct push is
rejected (branch protection). work clean removes the worktree + branch.

Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no
auto-detected suite) unless --no-verify is passed. This was added after an
accidental smoke-test invocation pushed unverified WIP to master (benign — the
infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an
unverified land a deliberate choice, not the default).

Known v0.1 limitation: land does not yet wait for CI to go green; that arrives with
the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:24:08 +00:00
Viktor Barzin
36d562c15c homelab: add tf verbs + stack/git-crypt substrate
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the tf verb-group and the resolver substrate beneath it, continuing the
v0.1 infra-loop build.

- substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir
  resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin,
  hasGitCryptAttr, gitCryptFlags) — the last is for `work` next.
- tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and
  delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock,
  and the ingress auth-comment check) rather than calling terragrunt directly.
- tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit
  (normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing
  the documented ~200-claim leak — and prints an out-of-band reminder since CI
  applies canonically on push.
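
The always-release behaviour could look something like the sketch below, assuming the claim and release steps are plain functions. `withClaim`, `claim`, `release`, and `errApply` are hypothetical names for illustration; the real CLI wraps the presence script and may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// errApply is a demo error standing in for a failed apply.
var errApply = errors.New("apply failed")

// withClaim runs fn while holding a claim and guarantees release runs
// exactly once, whether fn returns normally, returns an error, or the
// process receives SIGINT/SIGTERM. claim and release are hypothetical
// stand-ins for the presence script wrappers.
func withClaim(label string, claim, release func(string) error, fn func() error) error {
	if err := claim(label); err != nil {
		return fmt.Errorf("claim %s: %w", label, err)
	}

	var once sync.Once
	releaseOnce := func() {
		once.Do(func() {
			if err := release(label); err != nil {
				fmt.Fprintln(os.Stderr, "release failed:", err)
			}
		})
	}
	defer releaseOnce() // covers normal and error returns

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigs)

	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-sigs:
			releaseOnce() // covers interruption
			os.Exit(1)
		case <-done:
		}
	}()

	return fn()
}

func main() {
	logStep := func(verb string) func(string) error {
		return func(label string) error {
			fmt.Println(verb, label)
			return nil
		}
	}
	err := withClaim("stack:dbaas", logStep("claim"), logStep("release"), func() error {
		fmt.Println("applying...")
		return errApply
	})
	fmt.Println("err:", err)
}
```

`sync.Once` is what makes it safe for the deferred path and the signal path to both call the release function: whichever gets there first does the work and the other becomes a no-op.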

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:16:33 +00:00
Viktor Barzin
ed6f22fd53 homelab: scaffold unified CLI (registry, manifest, claim/release) in infra/cli
Begin evolving the existing infra/cli into the agent-facing "homelab" CLI
decided in the design/grilling session: one composable, JSON-capable surface
for the operations agents run over and over (mined from 51k commands across
2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that
loop — work/tf/claim — and ships here, in place, in infra/cli.

This first slice:
- command registry + dispatcher (longest-prefix verb matching) and a
  `manifest`/`manifest --json` progressive-discovery entrypoint; every verb
  declares a read|write tier so write-gating can be added later (everything is
  allowed for now).
- claim/release verbs wrapping the existing presence script (not reimplemented),
  with label-taxonomy validation.
- main() front-dispatches the homelab verb surface but falls through to the
  legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is
  unaffected.
- fix a pre-existing vet error (glog.Infof missing format directive) that
  blocked `go test`.
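
Longest-prefix verb matching with a per-verb tier might be sketched as below. The `Command` struct, the `match` function, and the sample registry entries are illustrative assumptions, not the code that shipped in this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// Tier records whether a verb only reads state or may change it.
type Tier string

const (
	Read  Tier = "read"
	Write Tier = "write"
)

// Command is a hypothetical registry entry; the real struct may differ.
type Command struct {
	Verb string // space-separated verb path, e.g. "tf apply"
	Tier Tier
}

// match returns the registered command whose verb path is the longest
// prefix of args, plus the arguments left over after that prefix.
func match(registry []Command, args []string) (Command, []string, bool) {
	var best Command
	bestLen := 0
	for _, c := range registry {
		parts := strings.Fields(c.Verb)
		if len(parts) > len(args) || len(parts) <= bestLen {
			continue // too long for args, or no better than the current best
		}
		ok := true
		for i, p := range parts {
			if args[i] != p {
				ok = false
				break
			}
		}
		if ok {
			best, bestLen = c, len(parts)
		}
	}
	if bestLen == 0 {
		return Command{}, nil, false
	}
	return best, args[bestLen:], true
}

func main() {
	// Sample entries for illustration only.
	registry := []Command{
		{Verb: "manifest", Tier: Read},
		{Verb: "claim", Tier: Write},
		{Verb: "tf", Tier: Read},
		{Verb: "tf apply", Tier: Write},
	}
	cmd, rest, ok := match(registry, []string{"tf", "apply", "--stack", "dbaas"})
	fmt.Println(cmd.Verb, cmd.Tier, rest, ok)
}
```

Because each entry carries its tier, a later write gate would only need to inspect the matched `Command` before dispatching.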

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:12:57 +00:00
Viktor Barzin
70e217db24 k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The autonomous 1.34.9 version-upgrade chain has been failing its preflight every
night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on
1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an
already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line,
so the parsed target came back empty and the `!= requested` check aborted the
whole chain before any worker was touched. Deterministic — it self-cleaned and
re-failed identically each night, so it would have failed again tonight, leaving
node2-6 stuck on the old patch.

Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION
— the same at-target self-skip that phase_master and phase_worker already do.
The remaining workers are still validated by their own per-node phases, and the
detector already confirmed the target is installable via apt-cache. This lets
tonight's unattended chain resume and finish node2-6 -> 1.34.9.

Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents
writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:17:46 +00:00
Viktor Barzin
8787d361dc claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.

The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.

Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG
Postgres and already has hostname anti-affinity) + restore the PDB at
minAvailable=1 (safe now — the drain deadlock that justified removing it
only existed at 1 replica) + descheduler evict=false to stop the needless
5-min churn. All five disruption sources become zero-downtime rolling events.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:13:36 +00:00
Viktor Barzin
48b7be3b14 feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to turn lodging prices on and stop using the fake provider.
Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging
scraper at the shared chrome-service browser over CDP (the namespace is already
admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging
code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after
that rollout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:53:19 +00:00
Viktor Barzin
d709d338c6 service-catalog: add paperless-ai (RAG semantic search + auto-tagging)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:44:00 +00:00
Viktor Barzin
4977153dfb paperless-ai: make the PVC .env the single source of config truth
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Auto-tagging silently no-op'd: the container env vars set in the deployment
shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader
does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with
no TAGS) made the scan select zero documents.

Strip the wizard-owned behavioural config (Paperless URL, AI provider, model,
scan interval, tagging flags) from the container env, keeping only
infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced
secret refs. The app's setup-written .env on the PVC is now authoritative,
so processing runs and tags all documents. Qwen3 thinking is disabled via
SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:41:29 +00:00
Viktor Barzin
aeee0d02e2 paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor wanted real semantic search over his ~300 Paperless documents and
preferred a ready-made solution over building one. paperless-ai provides
local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus
LLM-driven auto-analysis/tagging.

Wiring:
- LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b
  (OpenAI-compatible); embeddings + vector store are local on the PVC.
- Reads Paperless over the internal service via a dedicated `paperless-ai`
  superuser token (Vault secret/paperless-ai); app-admin creds also in Vault.
- Encrypted PVC for /app/data (SQLite + ChromaDB + model cache).
- Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required).
- Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel.

Runtime config persists to the PVC .env via the app's one-time setup; the
deployment env vars are pre-fill/documentation only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:23:00 +00:00
Viktor Barzin
605cf99a1b portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-tts apply was blocked by the require-trusted-registries Kyverno policy —
a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-
trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission
with no policy change. Verified live: bg synth round-trips through Whisper
verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian
reply ("Добър ден! Добре съм, благодаря!...").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 21:24:34 +00:00
Viktor Barzin
ab55cb5dcd portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The landed portal-stt source still declared the setup_tls_secret module +
tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this
stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the
sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live
deployment never had it (removed during the original apply); this aligns the
source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt
apply failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:29:31 +00:00
Viktor Barzin
e7b9a74756 portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.

Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.

portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.

The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:25:29 +00:00
Viktor Barzin
677a181d49 reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 19:53:47 +00:00
Viktor Barzin
9565ff1ce5 state(infra): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-17 19:50:30 +00:00
Viktor Barzin
6518e54154 create-template-vm: add k8s-upgrade pipeline SSH key to node cloud-init
Some checks failed
ci/woodpecker/push/default Pipeline failed
New k8s nodes were only getting the personal `wizard` key in authorized_keys —
not the automated k8s-version-upgrade pipeline's key (Vault
secret/k8s-upgrade/ssh_key_pub). So a freshly provisioned node is invisible to
the upgrade chain (it SSHes in as `wizard` to drain+upgrade): node4/5/6 all hit
"Permission denied (publickey)" on 2026-06-17 and had to have the key pushed by
hand. Bake the public key into the cloud-init template so every new node gets it
on first boot.

(unattended-upgrades is already in this template — node4/node5 missed it only
because the LIVE PVE cloud-init snippet lagged this source: it deploys via a
Tier-0 `stacks/infra` apply that hadn't run since before their 2026-05-26
provision. Same lesson applies to THIS change — it reaches new nodes only after
`stacks/infra` is applied to refresh the snippet on the PVE host.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:59:59 +00:00
341 changed files with 28403 additions and 9760 deletions


@ -24,8 +24,8 @@
Violations cause state drift, which causes future applies to break or silently revert changes.
## Instructions
- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete <id>`. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec. - **"remember X"**: store to the remote claude-memory store via the **`homelab memory` CLI**: `homelab memory store "content" --category facts --tags "tag1,tag2"` (also `recall "query"` / `update <id>` / `list` / `delete <id>`). For shared knowledge, also update the relevant CLAUDE.md / `AGENTS.md`. (Supersedes the old `memory-tool` CLI **and** the claude-memory MCP — both retired 2026-06-21; the homelab CLI hits the same remote HTTP API. Recall also runs automatically each turn via a UserPromptSubmit hook.)
- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies. - **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build. - **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
- **New service**: Use `setup-project` skill for full workflow - **New service**: Use `setup-project` skill for full workflow
- **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?": - **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
@ -47,7 +47,7 @@ Violations cause state drift, which causes future applies to break or silently r
## Terraform State — Two-Tier Backend ## Terraform State — Two-Tier Backend
- **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable. - **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. - **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
- **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`). - **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
- **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
- **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
@ -63,7 +63,7 @@ Violations cause state drift, which causes future applies to break or silently r
- **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`. - **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
- **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider. - **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
- **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`. - **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
- **ESO (External Secrets Operator)**: `stacks/external-secrets/`43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`. - **ESO (External Secrets Operator)**: `stacks/external-secrets/`chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`. (2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.)
- **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts. - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
 - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules.
 - **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: <secret>`) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances.
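The Reloader requirement in the database-rotation note above can be sketched as a small helper that builds the annotation for a workload. This is a minimal sketch, not code from the repo: the function name `reloader_patch` and the Secret name in the usage line are hypothetical, and it assumes the annotation sits on the workload's own `metadata.annotations` and accepts a comma-separated list of Secret names.

```python
import json

# Annotation key quoted in the rotation note above.
RELOADER_SECRET_KEY = "secret.reloader.stakater.com/reload"


def reloader_patch(secret_names):
    """Build a merge-patch body that sets the Reloader annotation on a workload.

    Assumption: the annotation lives on the workload's metadata.annotations and
    takes a comma-separated list of Secret names.
    """
    names = [name.strip() for name in secret_names if name and name.strip()]
    if not names:
        raise ValueError("at least one Secret name is required")
    return {"metadata": {"annotations": {RELOADER_SECRET_KEY: ",".join(names)}}}


if __name__ == "__main__":
    # "linkwarden-db-creds" is a hypothetical Secret name used only for illustration.
    print(json.dumps(reloader_patch(["linkwarden-db-creds"]), indent=2))
```

The resulting dictionary could be fed to whatever applies patches in a given stack; how the annotation is actually set in Terraform is not shown in the diff.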
@ -130,7 +130,7 @@ ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest,
 broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder,
 x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website,
 apple-health-data, audiblez-web, plotting-book, insta2spotify,
-audiobook-search, council-complaints) now also land on ghcr.
+audiobook-search) now also land on ghcr.
 - **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service,
 claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway,
 chrome-service-novnc, android-emulator.
@ -202,7 +202,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
-- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
+- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
 - **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
 - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts".
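To make the "default 10/s burst 50" rate-limit figures above concrete, here is a minimal token-bucket sketch. It is illustrative only and is not the middleware's actual implementation; the class and function names are hypothetical, and the caller passes the clock in so the behaviour is deterministic.

```python
class TokenBucket:
    """Token bucket: refills at `rate_per_s`, holds at most `burst` tokens."""

    def __init__(self, rate_per_s, burst):
        self.rate = float(rate_per_s)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full, so an initial burst is allowed
        self.last = 0.0             # timestamp of the previous request, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket, now):
    """Map the decision to the status code named in the note: 429, not 503."""
    return 200 if bucket.allow(now) else 429


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=10, burst=50)
    # ~70 parallel requests at once, as in the app-boot case mentioned above.
    codes = [status_for(bucket, now=0.0) for _ in range(70)]
    print(codes.count(200), "passed,", codes.count(429), "rejected")
```

Under these assumptions a burst of about 70 simultaneous requests overruns a 50-token bucket, which is consistent with the note giving ActualBudget its own 50/300 limit.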
@ -216,7 +216,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 |---------|--------------------------|
 | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe |
 | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77-264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10-13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `<assetId>.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. |
-| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob |
+| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. **Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
@ -243,7 +243,8 @@ Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/se
 - **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.)
 - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging.
 - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred.
-- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred.
+- **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. **The internal (ns-to-ns) half of each allowlist now derives faster from the east-west flow trail** (below): `SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow'`. External egress is NOT in that table (empty-ns flows dropped) — those still come from the Calico flow-log W1.6 snapshot. Enforce-flips remain out of scope of the trail (observe-and-derive only; beads `code-8ywc`).
+- **East-west flow trail (who-talks-to-whom, ADR-0014)**: Calico **Goldmane** (`goldmane.calico-system:7443`, gRPC/mTLS, ~60-min in-memory ring buffer — no etcd writes) + **Whisker** live UI (`whisker.viktorbarzin.me`, Authentik-gated) → **`goldmane-edge-aggregator`** streams Goldmane's `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + public-internet flows dropped) into **CNPG DB `goldmane_edges`** → daily **`goldmane-edges-digest`** CronJob posts first-seen edges to `#alerts` (the shared webhook's Slack app isn't in `#security` → 404 channel_not_found; flip `SLACK_CHANNEL` back once invited — see runbook). **CERT-REUSE GOTCHA**: the aggregator's mTLS client cert reuses the operator's Tigera-CA-signed `whisker-backend-key-pair` Secret (Goldmane verifies CA-chain only) — **re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it** (symptom: no `last_seen` updates, `AggregatorDown`). Service identity = namespace, + `service-identity` label only in `monitoring`/`kube-system`/`dbaas`. Health: `AggregatorDown` + `DigestFailing` alerts + cluster-health #48. Runbook: `docs/runbooks/goldmane-flow-trail.md`. (Goldmane is OSS tech-preview — reversible operator-CR toggle in `stacks/calico/main.tf`.)
 - **What's NOT in scope**: canary tokens (rejected — self-trigger risk with Viktor's normal `vault kv list secret/viktor` and `kubectl get secret -A` workflows), Falco/Tetragon (too noisy for Slack-only daily check), Cloudflare/GitHub audit polling (deferred to wave 2).
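The edge-set upsert and the allowlist query described in the two flow-trail bullets above can be sketched end to end. This is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the CNPG Postgres DB, the primary key on `(src_ns, dst_ns, action)` and the integer timestamps are assumptions, and the function names are hypothetical. Only the column names and the `SELECT DISTINCT` query come from the text.

```python
import sqlite3


def open_db():
    """Create an in-memory stand-in for the `edge` table (schema partly assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE edge (
               src_ns TEXT NOT NULL,
               dst_ns TEXT NOT NULL,
               action TEXT NOT NULL,
               first_seen INTEGER NOT NULL,
               last_seen INTEGER NOT NULL,
               flow_count INTEGER NOT NULL,
               PRIMARY KEY (src_ns, dst_ns, action)  -- assumed uniqueness key
           )"""
    )
    return conn


def record_flow(conn, src_ns, dst_ns, action, ts):
    """Upsert one observed flow; return False if the flow is dropped.

    Mirrors the filtering described above: self-edges and flows with an
    empty namespace (external traffic) are not stored.
    """
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    conn.execute(
        """INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
           VALUES (?, ?, ?, ?, ?, 1)
           ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
               last_seen = excluded.last_seen,
               flow_count = flow_count + 1""",
        (src_ns, dst_ns, action, ts, ts),
    )
    return True


def egress_allowlist(conn, ns):
    """Internal (ns-to-ns) allowlist for `ns`, using the query quoted above."""
    rows = conn.execute(
        "SELECT DISTINCT dst_ns FROM edge "
        "WHERE src_ns = ? AND action = 'allow' ORDER BY dst_ns",
        (ns,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    record_flow(conn, "recruiter-responder", "dbaas", "allow", 1)
    record_flow(conn, "recruiter-responder", "monitoring", "allow", 2)
    record_flow(conn, "recruiter-responder", "", "allow", 3)  # external: dropped
    print(egress_allowlist(conn, "recruiter-responder"))
```

Because external flows never reach the table, the list this produces covers only namespace-to-namespace egress, matching the caveat in the bullet.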
 ## Storage & Backup Architecture


@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
+import subprocess
 import sys
 from urllib.parse import urljoin
@ -17,13 +18,29 @@ except ImportError:
     print(" pip install requests")
     sys.exit(1)
 
-# Configuration from environment variables (ha-sofia specific)
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
-
-if not HA_URL or not HA_TOKEN:
-    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
-    print("These should be set when activating the Claude venv (~/.venvs/claude)")
+
+def _token_from_homelab():
+    """Resolve the token via the homelab CLI when the env var isn't set, so the
+    script works from any directory / unprovisioned session (see ADR-0012)."""
+    try:
+        out = subprocess.run(
+            ["homelab", "ha", "token", "--instance", "sofia"],
+            capture_output=True, text=True, timeout=30)
+        if out.returncode == 0 and out.stdout.strip():
+            return out.stdout.strip()
+    except Exception:
+        pass
+    return None
+
+
+# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
+# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
+
+if not HA_TOKEN:
+    print("ERROR: no ha-sofia API token available.")
+    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
     sys.exit(1)
 
 HEADERS = {


@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:
 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
+| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
@ -177,6 +178,13 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
+## WebAuthn / Passkeys (2026-06-20)
+- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
+- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
+- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
+- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` → `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
 - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.

File diff suppressed because one or more lines are too long

@ -11,8 +11,8 @@ description: |
  There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
  Always use Home Assistant for smart home control.
author: Claude Code
version: 2.0.0 version: 2.1.0
date: 2026-02-07 date: 2026-06-24
---
# Home Assistant Control
@ -44,6 +44,12 @@ There are **two** Home Assistant instances:
- Environment variables for each instance:
  - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
  - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
- If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory):
## homelab CLI (preferred — works from any directory)
- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.)
- **Host shell** (ha-sofia): `homelab ha ssh -- <cmd>` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations.
- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query "<promql>"` / `homelab logs query "<logql>"` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly.
## API Control
@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr
## ha-london Knowledge Map
### Overview
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) - **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied).
- **Location**: London, UK
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) - **Platform**: Raspberry Pi 4, HA OS
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) - **Access from the Sofia devvm**: london is **remote**, so `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs.
- **Config path**: `/config/` (requires `sudo` for file access) - **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
- **Config path**: `/config/`
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
- **Zone**: London (home)
### Dashboards (redesigned 2026-06-24)
**Glossary** (HA terms — keep distinct):
- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config.
- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config).
- **Card** = a widget inside a view.
- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card.
- **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night).
- **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*.
- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed` → `detailed` path typo fixed 2026-06-24.)
- Built via the WS `lovelace/config/save` API (london is remote — no SSH path).
### Key Systems
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
@ -418,10 +437,15 @@ Named plugs with power/energy tracking:
- PM1.0/2.5/4.0/10 particulate sensors
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
#### 3. Cowboy E-Bike #### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`)
- `sensor.bike_state_of_charge`: Battery % Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration).
- `sensor.bike_total_distance`: Total km - `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`)
- `sensor.bike_total_co2_saved`: CO2 saved (grams) - `sensor.classic_performance_remaining_range`: Range km
- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`)
- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`)
- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc.
- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless.
- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`).
#### 4. Uptime Monitoring (UptimeRobot)
- `sensor.blog`: blog uptime
@ -440,12 +464,17 @@ Named plugs with power/energy tracking:
- Scripts: `script.start_netflix`, `script.start_stremio`
- Scene: `scene.night` (turns off Livia + Michelle plugs)
### Custom Components ### Custom Components (HACS integrations)
- **cowboy**: Cowboy e-bike integration (HACS) - **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it.
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) - **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken.
### HACS frontend cards (plugins)
- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode.
### Integrations
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB.
- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy).
- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is.
### AI / Voice Assistants
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL
- Anca arrival/departure notifications
- Night scene: turns off Livia + Michelle
### Docker Setup ### Platform (HAOS — ignore any legacy `docker run` snippet)
```bash ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~12 min and resets `sensor.uptime` (use that as the "back up" marker).
docker run -d --name homeassistant --privileged \
-e TZ=Europe/London \
-v /home/pi/docker/homeAssistant:/config \
-v /run/dbus:/run/dbus:ro \
--network=host --restart=unless-stopped \
homeassistant/home-assistant:2025.9
```
### SSH Access
```bash

@ -0,0 +1,39 @@
name: Build chrome-service-browser
# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
# the pod pulls it without credentials.
on:
  push:
    branches: [master]
    paths:
      - 'stacks/chrome-service/files/chrome/**'
  workflow_dispatch: {}
permissions:
  contents: read
  packages: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/chrome-service/files/chrome
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/chrome-service-browser:latest
            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}

.gitignore

@ -110,3 +110,9 @@ terraform.tfstate.backup
# Timestamped terraform state backups (terraform.tfstate.<ts>.backup) — plaintext Tier-0
# secrets; created by terraform state ops. The patterns above miss the timestamped form.
terraform.tfstate.*.backup
# Python test artifacts (pytest bytecode cache) — e.g. from
# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
__pycache__/
*.pyc
.pytest_cache/

@ -19,6 +19,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 2
      attempts: 5
      backoff: 10s

@ -9,6 +9,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 1
      attempts: 3

@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 2
steps:

@ -11,6 +11,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 5
steps:

@ -5,6 +5,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      attempts: 5
      backoff: 10s

@ -23,6 +23,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 1
      attempts: 3

@ -38,6 +38,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      depth: 1
      attempts: 3

@ -6,6 +6,7 @@ clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      partial: false
      attempts: 5
      backoff: 10s

@ -9,7 +9,7 @@
- **Ask before `git push`** — always confirm with the user first
## Execution
- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets) - **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
- **Legacy apply**: `cd stacks/<service> && terragrunt apply --non-interactive` (uses terraform.tfvars)
- **kubectl**: `kubectl --kubeconfig $(pwd)/config`
- **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@ -289,6 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
```
## Common Operations
- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. 
Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`.
- **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service.
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts.
- **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf.

@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se
_Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP.
**Calico**:
The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred). The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred).
_Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers.
**Service identity**:
How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh).
_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object.
**Goldmane / Whisker**:
Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). The in-memory buffer alone is not an audit trail — durable history is the **`goldmane-edge-aggregator`** (the implemented trail; ADR-0014 originally framed this as a Loki emitter), which streams Goldmane's gRPC `Flows.Stream` over mTLS and upserts the namespace-pair **edge set** into CNPG DB `goldmane_edges` + a daily `#security` digest. As-built: `docs/runbooks/goldmane-flow-trail.md`.
_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh).
### Storage
**proxmox-lvm-encrypted**:

@ -1,2 +1,224 @@
# What is this? # homelab
This is a CLI to manipulate files in the terraform repo and commit and push them
`homelab` is the unified, agent-facing CLI for operating this homelab — one
composable, JSON-capable surface for the operations agents run over and over,
discovered progressively at runtime. It is grown **in place** from this
directory (the former `infra-cli`), and the legacy webhook use-cases still work
(see below).
It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
## Usage
```
homelab <command> [args]
homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
homelab version
```
### v0.1 verbs — the infra inner-loop
| Command | Tier | What it does |
|---|---|---|
| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
| `release <kind>:<name>` | write | release a presence claim |
| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) |
| `tf validate <stack>` | read | `scripts/tg validate` |
| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack |
| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock |
| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band |
| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` |
| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) |
### v0.2 verbs — Kubernetes
Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace
(most namespaces hold one app); the target defaults to `deploy/<app>` and lets
kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
ambient kubeconfig.
| Command | Tier | What it does |
|---|---|---|
| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough |
| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) |
| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events |
| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) |
| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` |
| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod |
| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status |
| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** |
Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
`tf` resolves the stack dir by walking up from cwd to the infra root and
delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
the ingress auth-comment check). git-crypt filter flags are auto-injected on git
operations in the encrypted infra repo.
**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
auto-detected suite) unless you pass `--no-verify` — landing to master unverified
must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
reads / prompt writes; v0.1 allows everything and relies on existing gates
(permission mode, presence claims, plan approval).
### v0.3 verbs — memory
A thin HTTP client over the **claude-memory** service (the same backend the
memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
ingress). Because it hits the HTTP API directly, it **works even when the MCP
frontend is down**.
| Command | Tier | What it does |
|---|---|---|
| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
| `memory list [--category --tag --limit]` | read | recent memories |
| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
| `memory secret <id>` | read | reveal a sensitive memory's content |
| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
| `memory update <id> [--content --tags --importance]` | write | edit a memory |
| `memory delete <id>` | write | delete a memory |
All read/write paths are validated against the live API (incl. a
store→recall→delete round-trip). This gives full data-plane parity with the MCP;
the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up**;
see `docs/adr/0008`.
### v0.4 verbs — ci / deploy
Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
remote, with retries that ride Woodpecker's intermittent empty responses.
| Command | Tier | What it does |
|---|---|---|
| `ci status [commit]` | read | pipeline status for HEAD (or a commit) |
| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure |
| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) |
`work land` now calls `ci watch` on the landed commit automatically (skip with
`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing
step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were
the least reliable; `status`/`watch` use the list endpoint that works.
### v0.5 verbs — net / dns / metrics / logs
Reachability + observability probes. Their value is *endpoint resolution* — the
non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd
otherwise re-derive every time — not the HTTP call itself. All reach internal
ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`).
| Command | Tier | What it does |
|---|---|---|
| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) |
| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps |
| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` |
| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) |
| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` |
Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward,
no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the
firing set is reachable via `ALERTS` instead.)
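
As a rough illustration of the `metrics` path, the sketch below builds an instant-query URL and renders a response in the `value {labels}` shape mentioned in the table. It assumes the standard Prometheus HTTP API (`/api/v1/query`) behind the listed host; the helper names and the `alertstate="firing"` selector are illustrative assumptions, not the CLI's actual Go implementation.

```python
import json
from urllib.parse import urlencode

# Host taken from the table above; the /api/v1/query path is the standard
# Prometheus instant-query endpoint (assumed to be what the ingress exposes).
PROM_HOST = "prometheus-query.viktorbarzin.lan"


def instant_query_url(promql: str, host: str = PROM_HOST) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    query_string = urlencode({"query": promql})
    return f"https://{host}/api/v1/query?{query_string}"


def format_vector(body: str) -> list:
    """Render an instant-vector JSON response as `value {labels}` lines."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    lines = []
    for sample in payload["data"]["result"]:
        labels = ",".join(f'{k}="{v}"' for k, v in sorted(sample["metric"].items()))
        _timestamp, value = sample["value"]
        lines.append(f"{value} {{{labels}}}")
    return lines


# Hypothetical equivalent of `metrics alerts`: query the synthetic ALERTS series.
firing_url = instant_query_url('ALERTS{alertstate="firing"}')
```

Fetching `firing_url` (through the Traefik LB) and passing the body to `format_vector` would list one line per firing alert.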
### v0.6 — usage telemetry (`usage top`)
Makes "which verbs are actually used, by everyone" a query instead of a guess —
so adding the *next* verb is evidence-driven, not shaped by one person's habits.
Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}`
labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths,
flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never
affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is
the shared Loki, aggregate usage is queryable **without reading anyone's home**,
the privacy-preserving answer to "what does the team use."
| Command | Tier | What it does |
|---|---|---|
| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` |
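
A minimal sketch of what one telemetry line and the ranking query could look like, assuming the standard Loki push-API JSON shape (`streams` / `stream` / `values`). The function names are hypothetical; only the label set, the `exit=N ver=X` line format, and the `homelab-usage` job name come from the text above.

```python
import time
from typing import Optional


def usage_payload(user: str, verb: str, exit_code: int, version: str,
                  now_ns: Optional[int] = None) -> dict:
    """Build one Loki push entry: labels plus an `exit=N ver=X` line, no args."""
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [{
            "stream": {"job": "homelab-usage", "user": user, "verb": verb},
            "values": [[ts, f"exit={exit_code} ver={version}"]],
        }]
    }


def usage_top_query(since: str = "30d", user: Optional[str] = None) -> str:
    """Build the LogQL that ranks verbs by invocation count."""
    selector = '{job="homelab-usage"' + (f', user="{user}"' if user else "") + "}"
    return f"sum by (verb) (count_over_time({selector}[{since}]))"
```

Keeping the payload builder free of any argument or path fields makes the "never args or secrets" property easy to check by inspection.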
### v0.7 verbs — Home Assistant
Cover exactly the two things the `ha` **MCP server can't**: resolving the
long-lived API token out of the cluster, and SSH to the HA host for host-level
work (config files, docker, add-ons). Entity state and control (`turn_on`,
`get_state`, services) stay with the MCP — *actions an MCP already encodes are
out of scope* (see top of this doc). The value here is the same as `net`/`dns`:
the non-obvious *which secret, which host, which key, which flags* you'd
otherwise re-derive every session — agents were hand-rolling a
`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on
every run because the existing `home-assistant-sofia.py` needs an env var set
and a cwd-relative path, neither of which holds in an arbitrary session.
| Command | Tier | What it does |
|---|---|---|
| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) |
| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote |
`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token`
prints the bare token to stdout so it composes in `$(…)`; it's read-tier like
`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user,
not tied to whoever first wrote the workflow (the user's key must be enrolled on
the HA host).
### v0.8 verbs — browser (headful anti-bot automation)
Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb)
from the devvm over CDP, for sites that detect and block headless automation. The
headless `@playwright/mcp` browser can *load* such a site and fill its forms, but
the gated action (submit/login) silently fails — the motivating case was the
Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned
`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`,
injects the same `stealth.js` the in-cluster callers use, and submits first try.
The command owns only the *mechanics* (port-forward, stealth, lifecycle); the
agent supplies the Playwright script — judgment stays out of the CLI.
| Command | Tier | What it does |
|---|---|---|
| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. |
| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. |
| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). |
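
The cheat-sheet in the last row can be read as a small classifier. The sketch below is illustrative only: `diagnose` and its verdict strings are hypothetical and are not part of the CLI. It matches on the `_TIMED_OUT` suffix so that both `ERR_TIMED_OUT` and `ERR_CONNECTION_TIMED_OUT` count as egress failures.

```go
package main

import (
	"fmt"
	"strings"
)

// diagnose maps a Chromium net error string to the cheat-sheet verdict.
func diagnose(netErr string) string {
	switch {
	case strings.Contains(netErr, "ERR_FILE_NOT_FOUND"):
		// Resolved locally by the automation layer; the network was never hit.
		return "automation-layer intercept (not egress)"
	case strings.Contains(netErr, "ERR_CONNECTION_REFUSED"),
		strings.Contains(netErr, "_TIMED_OUT"),
		strings.Contains(netErr, "ERR_NAME_NOT_RESOLVED"):
		return "real egress failure"
	default:
		return "unclassified"
	}
}

func main() {
	for _, e := range []string{
		"net::ERR_FILE_NOT_FOUND",
		"net::ERR_NAME_NOT_RESOLVED",
		"net::ERR_ABORTED",
	} {
		fmt.Printf("%s: %s\n", e, diagnose(e))
	}
}
```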
Default context is a **fresh incognito** one (closed on exit) — safe for the
shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context`
reuses the warmed persistent profile when a pre-logged-in session is needed.
`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy
that gates in-cluster callers — no namespace label needed. The node CDP client is
pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor
(Chromium 130; protocol changes between minors) and is installed once, lazily,
into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client
runs on the devvm, `setInputFiles` streams local files to the remote browser over
CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md`
and `docs/adr/0013`.
## Build / install
Built from source to `/usr/local/bin/homelab` during devvm provisioning
(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is
stamped from `cli/VERSION` via ldflags. Manual build:
```
cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab .
go test ./...
```
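
For readers unfamiliar with ldflags stamping, a minimal sketch of the receiving side: `-X main.version=...` overwrites a package-level string variable at link time. The default value `"dev"` and the `versionString` helper are assumptions for illustration, not the CLI's actual code.

```go
package main

import "fmt"

// version is replaced at link time by: go build -ldflags "-X main.version=<value>".
// It must be a package-level string variable; "dev" is the unstamped fallback.
var version = "dev"

// versionString (hypothetical) formats the stamped version for display.
func versionString() string { return "homelab " + version }

func main() {
	fmt.Println(versionString())
}
```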
## Legacy webhook use-cases (preserved)
This binary is also the in-cluster `infra-cli` image. Invocations starting with
`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the
original flag-based path unchanged, so the webhook handler is unaffected.
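
The fall-through can be sketched as a single prefix check ahead of verb dispatch. `isLegacyInvocation` is a hypothetical name, and the check covers only the `-use-case=<value>` form named above; how the real binary routes other spellings is not stated here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isLegacyInvocation reports whether argv (without the program name) should be
// handed to the original flag-based webhook path instead of the verb dispatcher.
func isLegacyInvocation(args []string) bool {
	return len(args) > 0 && strings.HasPrefix(args[0], "-use-case=")
}

func main() {
	if isLegacyInvocation(os.Args[1:]) {
		fmt.Println("legacy flag-based path")
		return
	}
	fmt.Println("verb dispatcher")
}
```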
## Design
See `infra/docs/adr/0004` through `0013` for the architecture decisions.

1
cli/VERSION Normal file

@ -0,0 +1 @@
v0.8.1

388
cli/browser.go Normal file

@ -0,0 +1,388 @@
package main
import (
_ "embed"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"os/signal"
"path/filepath"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
// playwrightVersion pins the node CDP client to the chrome-service image minor
// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp
// speaks the browser's CDP, so the client minor must track the server minor;
// see docs/architecture/chrome-service.md "Image pin".
const playwrightVersion = "1.48.2"
// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP
// endpoint to become ready before giving up.
const defaultBrowserTimeout = 60
const (
chromeServiceNamespace = "chrome-service"
chromeServiceName = "chrome-service"
chromeServiceCDPPort = 9222
)
// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the
// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical
// guards against drift.
//
//go:embed browser_stealth.js
var stealthJS string
// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint,
// installs the stealth init script, and runs the user's Playwright script.
//
//go:embed browser_runner.js
var runnerJS string
// browserOpts is the parsed form of `homelab browser run|open` arguments.
type browserOpts struct {
mode string // "run" | "open"
script string // path to the user Playwright script (run mode)
url string // initial URL (run: optional; open: required positional)
sharedCtx bool // use the warmed persistent profile instead of a fresh context
keepOpen bool // leave the created context/pages open on exit
port int // explicit local port for the forward (0 = auto)
timeout int // CDP readiness timeout, seconds
help bool
}
// parseBrowserArgs parses the args after `browser run` / `browser open`.
// Value flags accept both `--flag value` and `--flag=value`; a value flag with
// no value is reported as an error instead of being silently ignored.
func parseBrowserArgs(mode string, args []string) (browserOpts, error) {
	o := browserOpts{mode: mode, timeout: defaultBrowserTimeout}
	var positionals []string
	atoi := func(s, flag string) (int, error) {
		n, err := strconv.Atoi(s)
		if err != nil {
			return 0, fmt.Errorf("%s expects an integer, got %q", flag, s)
		}
		return n, nil
	}
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "-h" || a == "--help":
			o.help = true
		case a == "--shared-context":
			o.sharedCtx = true
		case a == "--keep-open":
			o.keepOpen = true
		case a == "--url":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--url expects a value")
			}
			o.url = args[i+1]
			i++
		case strings.HasPrefix(a, "--url="):
			o.url = strings.TrimPrefix(a, "--url=")
		case a == "--port":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--port expects a value")
			}
			n, err := atoi(args[i+1], "--port")
			if err != nil {
				return o, err
			}
			o.port = n
			i++
		case strings.HasPrefix(a, "--port="):
			n, err := atoi(strings.TrimPrefix(a, "--port="), "--port")
			if err != nil {
				return o, err
			}
			o.port = n
		case a == "--timeout":
			if i+1 >= len(args) {
				return o, fmt.Errorf("--timeout expects a value")
			}
			n, err := atoi(args[i+1], "--timeout")
			if err != nil {
				return o, err
			}
			o.timeout = n
			i++
		case strings.HasPrefix(a, "--timeout="):
			n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout")
			if err != nil {
				return o, err
			}
			o.timeout = n
		case strings.HasPrefix(a, "-"):
			return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a)
		default:
			positionals = append(positionals, a)
		}
	}
	// --help short-circuits validation so `browser run --help` works without a script.
	if o.help {
		return o, nil
	}
	switch mode {
	case "run":
		if len(positionals) == 0 {
			return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]")
		}
		o.script = positionals[0]
	case "open":
		if len(positionals) == 0 {
			return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]")
		}
		o.url = positionals[0]
	}
	return o, nil
}
// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is
// a real (non-headless) Chrome — the entire reason chrome-service exists.
func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) {
var v struct {
Browser string `json:"Browser"`
UserAgent string `json:"User-Agent"`
}
if e := json.Unmarshal(jsonBody, &v); e != nil {
return "", false, fmt.Errorf("parse /json/version: %w", e)
}
if v.Browser == "" {
return "", false, fmt.Errorf("/json/version had no Browser field")
}
healthy = strings.HasPrefix(v.Browser, "Chrome/") &&
!strings.Contains(v.Browser, "Headless") &&
!strings.Contains(v.UserAgent, "Headless")
return v.Browser, healthy, nil
}
// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's
// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222
// NetworkPolicy that gates in-cluster callers.
func buildPortForwardArgs(localPort int) []string {
return []string{"-n", chromeServiceNamespace, "port-forward",
"svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)}
}
// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP
// client kept under the user cache dir.
func browserClientPackageJSON() string {
return fmt.Sprintf(`{
"name": "homelab-browser-client",
"private": true,
"description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.",
"dependencies": {
"playwright-core": "%s"
}
}
`, playwrightVersion)
}
// freePort asks the kernel for an unused ephemeral TCP port.
func freePort() (int, error) {
l, err := net.Listen("tcp", "127.0.0.1:0")
if err != nil {
return 0, err
}
defer l.Close()
return l.Addr().(*net.TCPAddr).Port, nil
}
// browserClientDir is where the pinned node client + managed runner files live.
func browserClientDir() (string, error) {
cache, err := os.UserCacheDir()
if err != nil || cache == "" {
home, herr := os.UserHomeDir()
if herr != nil {
return "", fmt.Errorf("locate cache dir: %v / %v", err, herr)
}
cache = filepath.Join(home, ".cache")
}
return filepath.Join(cache, "homelab", "browser-client"), nil
}
// installedPlaywrightVersion reads the version of the playwright-core already
// installed in dir, or "" if absent/unreadable.
func installedPlaywrightVersion(dir string) string {
b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json"))
if err != nil {
return ""
}
var v struct {
Version string `json:"version"`
}
if json.Unmarshal(b, &v) != nil {
return ""
}
return v.Version
}
// ensureBrowserClient writes the managed runner/stealth/package files into dir
// and lazily installs the pinned playwright-core (only when missing/mismatched),
// so no per-user setup is needed and the client tracks the binary version.
func ensureBrowserClient(dir string) error {
if err := os.MkdirAll(dir, 0o755); err != nil {
return err
}
files := map[string]string{
"package.json": browserClientPackageJSON(),
"browser_runner.js": runnerJS,
"stealth.js": stealthJS,
}
for name, content := range files {
if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
return err
}
}
if installedPlaywrightVersion(dir) == playwrightVersion {
return nil
}
fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, takes a few seconds)…\n", playwrightVersion)
cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent")
cmd.Dir = dir
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err)
}
if got := installedPlaywrightVersion(dir); got != playwrightVersion {
return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got)
}
return nil
}
// waitForCDP polls the local CDP endpoint until it answers as a healthy
// (non-headless) Chrome, or the timeout elapses. On timeout the last probe
// error is wrapped so the caller sees both the deadline and the cause.
func waitForCDP(cdpURL string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	client := &http.Client{Timeout: 3 * time.Second}
	var lastErr error
	for time.Now().Before(deadline) {
		resp, err := client.Get(cdpURL + "/json/version")
		if err != nil {
			lastErr = err
			time.Sleep(300 * time.Millisecond)
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr != nil {
			lastErr = readErr
			time.Sleep(300 * time.Millisecond)
			continue
		}
		browser, healthy, herr := cdpHealthy(body)
		if herr != nil {
			lastErr = herr
			time.Sleep(300 * time.Millisecond)
			continue
		}
		// A parseable but headless (or non-Chrome) answer will not fix itself: fail fast.
		if !healthy {
			return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser)
		}
		return browser, nil
	}
	if lastErr == nil {
		return "", fmt.Errorf("timed out after %s", timeout)
	}
	return "", fmt.Errorf("timed out after %s: %w", timeout, lastErr)
}
// runBrowser is the orchestration: pick a port, ensure the pinned client, start
// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node.
func runBrowser(o browserOpts) error {
port := o.port
if port == 0 {
p, err := freePort()
if err != nil {
return fmt.Errorf("pick local port: %w", err)
}
port = p
}
dir, err := browserClientDir()
if err != nil {
return err
}
if err := ensureBrowserClient(dir); err != nil {
return err
}
// Start the forward in its own process group so the whole tree dies on cleanup.
pf := exec.Command("kubectl", buildPortForwardArgs(port)...)
pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
var pfLog strings.Builder
pf.Stdout = &pfLog
pf.Stderr = &pfLog
if err := pf.Start(); err != nil {
return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err)
}
var once sync.Once
teardown := func() {
once.Do(func() {
if pf.Process != nil {
_ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL)
}
_ = pf.Wait()
})
}
defer teardown()
// Tear down on Ctrl-C / SIGTERM too, then exit non-zero.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
defer signal.Stop(sigCh)
go func() {
if _, ok := <-sigCh; ok {
teardown()
os.Exit(130)
}
}()
cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port)
browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second)
if err != nil {
return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String())
}
fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL)
return runBrowserNode(dir, cdpURL, o)
}
// runBrowserNode invokes the managed node runner with inputs passed via env.
func runBrowserNode(dir, cdpURL string, o browserOpts) error {
env := append(os.Environ(),
"HOMELAB_CDP_URL="+cdpURL,
"HOMELAB_BROWSER_MODE="+o.mode,
"HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"),
"NODE_PATH="+filepath.Join(dir, "node_modules"),
)
if o.url != "" {
env = append(env, "HOMELAB_BROWSER_URL="+o.url)
}
if o.script != "" {
abs, err := filepath.Abs(o.script)
if err != nil {
return err
}
if _, err := os.Stat(abs); err != nil {
return fmt.Errorf("script %s: %w", o.script, err)
}
env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs)
}
if o.sharedCtx {
env = append(env, "HOMELAB_BROWSER_SHARED=1")
}
if o.keepOpen {
env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1")
}
if o.mode == "open" {
shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid()))
env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot)
}
cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js"))
cmd.Env = env
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

106
cli/browser_runner.js Normal file

@ -0,0 +1,106 @@
// homelab browser — node CDP runner (auto-managed; regenerated each run from the
// homelab binary — DO NOT EDIT here). Connects to the port-forwarded
// chrome-service CDP endpoint, installs the stealth init script, then runs the
// user's Playwright script (run mode) or opens a URL (open mode). All inputs
// arrive via HOMELAB_* env vars set by the Go CLI.
'use strict';
const fs = require('fs');
const { chromium } = require('playwright-core');
async function main() {
const cdpURL = process.env.HOMELAB_CDP_URL;
if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set');
const mode = process.env.HOMELAB_BROWSER_MODE || 'run';
const stealthPath = process.env.HOMELAB_STEALTH_PATH || '';
const initURL = process.env.HOMELAB_BROWSER_URL || '';
const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || '';
const shared = process.env.HOMELAB_BROWSER_SHARED === '1';
const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1';
const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || '';
const browser = await chromium.connectOverCDP(cdpURL);
// Fresh isolated context by default (safe for the shared browser + concurrent
// callers); --shared-context reuses the warmed persistent profile.
let context;
let createdContext = false;
if (shared) {
const existing = browser.contexts();
if (existing.length) {
context = existing[0];
} else {
context = await browser.newContext();
createdContext = true;
}
} else {
context = await browser.newContext();
createdContext = true;
}
if (stealthPath) {
const stealth = fs.readFileSync(stealthPath, 'utf8');
if (stealth.trim()) await context.addInitScript(stealth);
}
const page = await context.newPage();
const log = (...a) => console.error('[browser]', ...a);
let exitCode = 0;
try {
if (initURL) {
await page.goto(initURL, { waitUntil: 'domcontentloaded' });
}
if (mode === 'open') {
console.log('url: ' + page.url());
console.log('title: ' + (await page.title()));
const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim();
console.log('--- visible text (truncated to 4000 chars) ---');
console.log(text.slice(0, 4000));
if (screenshotPath) {
await page.screenshot({ path: screenshotPath, fullPage: true });
console.log('screenshot: ' + screenshotPath);
}
} else {
if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT');
const src = fs.readFileSync(scriptPath, 'utf8');
// Run the user's source with page/context/browser/log in lexical scope.
// AsyncFunction body permits top-level await.
const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor;
const fn = new AsyncFunction('page', 'context', 'browser', 'log', src);
const result = await fn(page, context, browser, log);
if (result !== undefined) {
let out;
try {
out = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
} catch (_) {
out = String(result);
}
console.log(out);
}
}
} catch (e) {
console.error('homelab browser: script error:', e && e.stack ? e.stack : e);
exitCode = 1;
} finally {
if (!keepOpen) {
try {
// Close only what we created; never tear down the shared persistent context.
if (createdContext) {
await context.close();
} else {
await page.close();
}
} catch (_) { /* ignore */ }
}
// Disconnect from the CDP endpoint; this does NOT kill the remote browser.
try {
await browser.close();
} catch (_) { /* ignore */ }
}
process.exit(exitCode);
}
main().catch((e) => {
console.error('homelab browser: fatal:', e && e.stack ? e.stack : e);
process.exit(1);
});

54
cli/browser_stealth.js Normal file

@ -0,0 +1,54 @@
// Minimal stealth init script for Playwright-driven Chromium.
// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers:
// webdriver, chrome.runtime, navigator.plugins, navigator.languages,
// Permissions.query, WebGL getParameter (vendor + renderer spoof).
// Run via context.add_init_script() so it executes before any page script.
(() => {
// navigator.webdriver — most common detection, removed entirely.
Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
// window.chrome.runtime — many sites check that real Chrome exposes this.
if (!window.chrome) window.chrome = {};
window.chrome.runtime = window.chrome.runtime || {};
// navigator.plugins — headless reports zero; spoof a plausible PDF viewer.
Object.defineProperty(navigator, 'plugins', {
get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }],
});
// navigator.languages — headless returns empty array.
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions.query — headless returns 'denied' for notifications instead of 'default'.
const origQuery = window.navigator.permissions && window.navigator.permissions.query;
if (origQuery) {
window.navigator.permissions.query = (parameters) =>
parameters && parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(parameters);
}
// WebGL getParameter — spoof vendor + renderer strings to a real GPU.
const spoofGl = (proto) => {
if (!proto) return;
const orig = proto.getParameter;
proto.getParameter = function (parameter) {
if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL
if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL
return orig.apply(this, arguments);
};
};
spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype);
spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype);
// disable-devtool.js (theajack/disable-devtool) auto-inits via a script
// tag with `disable-devtool-auto`. Its Performance detector trips under
// Playwright (CDP adds console.log latency vs console.table) and the
// redirect URL is hard-coded — for hmembeds that's google.com.
// Hide the auto-init marker so the library's IIFE exits early.
const origQS = Document.prototype.querySelector;
Document.prototype.querySelector = function (sel) {
if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null;
return origQS.apply(this, arguments);
};
})();

117
cli/cmd_browser.go Normal file

@ -0,0 +1,117 @@
package main
import "fmt"
// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP
// from outside the cluster, for sites that detect/block headless automation.
// The headless @playwright/mcp browser can load such sites but their gated
// actions (submit/login) silently fail; this path submits first try. Mechanics
// only — the agent supplies the Playwright script. See docs/adr/0013.
func browserCommands() []Command {
return []Command{
{Path: []string{"browser"}, Tier: TierRead,
Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp},
{Path: []string{"browser", "run"}, Tier: TierWrite,
Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun},
{Path: []string{"browser", "open"}, Tier: TierWrite,
Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen},
}
}
func browserTopHelp([]string) error {
fmt.Print(browserHelp())
return nil
}
func browserRun(args []string) error {
o, err := parseBrowserArgs("run", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
func browserOpen(args []string) error {
o, err := parseBrowserArgs("open", args)
if err != nil {
return err
}
if o.help {
fmt.Print(browserHelp())
return nil
}
return runBrowser(o)
}
// browserHelp carries the discoverability payload: WHEN to reach for this, and
// the diagnostic cheat-sheet that lets the agent self-correct instead of
// retrying a deterministic form blind (the failure mode that motivated this).
func browserHelp() string {
	return `homelab browser: drive the cluster's HEADFUL Chrome (anti-bot) over CDP

The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under
Xvfb. This connects to it via a port-forward + Playwright connect_over_cdp,
injects the same stealth.js the in-cluster callers use, and runs your script.

USAGE
  homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]
  homelab browser open <url> [--shared-context] [--timeout S]

WHEN TO USE THIS: escalation only; DEFAULT to the headless/MCP browser
  Default to the Playwright MCP / headless browser for ALL routine browsing and
  automation: it's interactive (snapshot per step), fast to start, isolated.
  Reach for THIS command ONLY when headless is demonstrably blocked: a site
  LOADS fine but a gated action FAILS or HANGS (a submit/login/checkout spins
  forever, or ONE request errors while its siblings 200). That is the signature
  of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome",
  disable-devtool traps). It presents as a real Chrome and usually succeeds
  first try, but it's the shared cluster browser (slower startup, one batch
  run, no per-step feedback), so it's the escalation path, never the default.

ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying)
  ERR_FILE_NOT_FOUND (-6)    request intercepted/resolved locally by the
                             automation layer, NOT a network/egress problem.
                             (This is what silently broke the headless submit.)
  ERR_CONNECTION_REFUSED /   real egress failure (DNS/route/firewall). These also
  ERR_TIMED_OUT /            break the initial page load; if the page loaded,
  ERR_NAME_NOT_RESOLVED      egress is fine and the cause is elsewhere.
  one endpoint 500s while    server-side bot rejection of the automation, not
  its siblings 200           your payload.

HABITS
  - Inspect the network panel BEFORE retrying a deterministic form; a blind
    retry just repeats the same silent failure.
  - Don't park a half-filled multi-step form across a user pause; the session
    can expire. Re-run the whole flow from this command in one shot.
  - Uploads stream over CDP via setInputFiles from THIS host: no chmod/staging
    of $HOME needed; just point setInputFiles at a local path.

CONTEXT
  Default: a FRESH incognito context, closed on exit; safe for the shared
  browser and concurrent callers (e.g. tripit). Your script does its own login.
  --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual
  noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session.

SCRIPT CONTRACT (run mode)
  Your file's body runs with page, context, browser and log() already in scope
  (top-level await allowed). Return a value to print it. Example flow.js:

    await page.goto('https://portal.example.com/login');
    await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW);
    await page.click('button[type=submit]');
    await page.waitForURL('**/dashboard');
    return 'logged in: ' + page.url();

  Run it: homelab browser run flow.js

NOTES
  - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the
    chrome-service image (Chrome 130); installed once into ~/.cache/homelab/.
  - The port-forward is always torn down, on success and on error.
`
}

172
cli/cmd_browser_test.go Normal file

@ -0,0 +1,172 @@
package main
import (
"os"
"reflect"
"strings"
"testing"
)
func TestParseBrowserArgsRun(t *testing.T) {
got, err := parseBrowserArgs("run", []string{
"flow.js", "--url", "https://example.com", "--shared-context",
"--port", "19999", "--timeout", "45", "--keep-open",
})
if err != nil {
t.Fatalf("parseBrowserArgs run: unexpected err: %v", err)
}
want := browserOpts{
mode: "run", script: "flow.js", url: "https://example.com",
sharedCtx: true, keepOpen: true, port: 19999, timeout: 45,
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want)
}
}
func TestParseBrowserArgsRunDefaults(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 {
t.Fatalf("defaults wrong: %+v", got)
}
if got.timeout != defaultBrowserTimeout {
t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout)
}
}
func TestParseBrowserArgsRunRequiresScript(t *testing.T) {
if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil {
t.Fatalf("run without a script path should error")
}
}
func TestParseBrowserArgsOpenRequiresURL(t *testing.T) {
got, err := parseBrowserArgs("open", []string{"https://example.com"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://example.com" || got.mode != "open" {
t.Fatalf("open parse wrong: %+v", got)
}
if _, err := parseBrowserArgs("open", []string{}); err == nil {
t.Fatalf("open without a URL should error")
}
}
func TestParseBrowserArgsHelp(t *testing.T) {
for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} {
got, err := parseBrowserArgs("run", a)
if err != nil {
t.Fatalf("help parse %v: %v", a, err)
}
if !got.help {
t.Fatalf("args %v should set help", a)
}
}
}
func TestParseBrowserArgsEqualsForm(t *testing.T) {
got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"})
if err != nil {
t.Fatalf("unexpected err: %v", err)
}
if got.url != "https://x" || got.port != 8123 || got.timeout != 10 {
t.Fatalf("--flag=value form not parsed: %+v", got)
}
}
func TestCDPHealthy(t *testing.T) {
real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`)
browser, ok, err := cdpHealthy(real)
if err != nil || !ok {
t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err)
}
if !strings.HasPrefix(browser, "Chrome/") {
t.Fatalf("browser = %q, want Chrome/ prefix", browser)
}
headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`)
if _, ok, _ := cdpHealthy(headless); ok {
t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)")
}
if _, _, err := cdpHealthy([]byte("not json")); err == nil {
t.Fatalf("malformed /json/version body should error")
}
}
func TestBuildPortForwardArgs(t *testing.T) {
got := buildPortForwardArgs(18080)
want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want)
}
}
func TestBrowserClientPackageJSONPinsVersion(t *testing.T) {
pj := browserClientPackageJSON()
if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) {
t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj)
}
}
func TestPlaywrightVersionPinnedToServerMinor(t *testing.T) {
// chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP
// client minor MUST match (protocol changes between minors).
if !strings.HasPrefix(playwrightVersion, "1.48.") {
t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion)
}
}
func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) {
h := browserHelp()
for _, want := range []string{
"homelab browser run",
"ERR_FILE_NOT_FOUND",
"ERR_CONNECTION_REFUSED",
"network panel",
"headless",
"--shared-context",
} {
if !strings.Contains(h, want) {
t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want)
}
}
}
func TestBrowserHelpIsTiered(t *testing.T) {
// --help must frame this as the ESCALATION path (default to headless first),
// matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent
// instructions. Guard against a regression to "co-equal choice" wording.
h := browserHelp()
for _, want := range []string{"Default to the", "escalation"} {
if !strings.Contains(h, want) {
t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want)
}
}
}
func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) {
// The embedded copy must never drift from the source of truth that the
// in-cluster callers use, else the CLI's stealth and the cluster's diverge.
canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js")
if err != nil {
t.Fatalf("read canonical stealth.js: %v", err)
}
if stealthJS != string(canonical) {
t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it")
}
}
func TestFreePortReturnsUsablePort(t *testing.T) {
p, err := freePort()
if err != nil {
t.Fatalf("freePort: %v", err)
}
if p <= 1024 || p > 65535 {
t.Fatalf("freePort returned %d, want an ephemeral port", p)
}
}

99
cli/cmd_ci.go Normal file

@ -0,0 +1,99 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func ciCommands() []Command {
return []Command{
{Path: []string{"ci", "status"}, Tier: TierRead,
Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus},
{Path: []string{"ci", "watch"}, Tier: TierRead,
Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch},
}
}
func short(s string) string {
if len(s) > 8 {
return s[:8]
}
return s
}
func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] }
// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo).
func currentHEAD() string {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return ""
}
sha, _ := gitOutput(root, "rev-parse", "HEAD")
return sha
}
func ciStatus(args []string) error {
commit, _ := firstPositional(args)
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
p, err := c.findPipeline(id, commit)
if err != nil {
return err
}
fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message))
return nil
}
func ciWatch(args []string) error {
commit, _ := firstPositional(args)
if commit == "" {
commit = currentHEAD()
}
if commit == "" {
return fmt.Errorf("no commit given and not in a git repo")
}
c, err := newWPClient()
if err != nil {
return err
}
id, err := c.repoID()
if err != nil {
return err
}
timeout := 20 * time.Minute
deadline := time.Now().Add(timeout)
last := ""
for time.Now().Before(deadline) {
p, err := c.findPipeline(id, commit)
if err != nil {
if last != "waiting" {
fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit))
last = "waiting"
}
} else {
if p.Status != last {
fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status)
last = p.Status
}
if isTerminalStatus(p.Status) {
fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit))
if isFailureStatus(p.Status) {
return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status)
}
return nil
}
}
time.Sleep(15 * time.Second)
}
return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit))
}
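
The poll-to-terminal loop in `ciWatch` can be exercised without a Woodpecker server by injecting the lookup. This is a sketch under assumptions: `isTerminal` and its status names are hypothetical stand-ins for `isTerminalStatus` (not shown in this diff), and `scripted` is a test double.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// isTerminal is a hypothetical stand-in for isTerminalStatus; the status names
// here are assumptions, not taken from the diff.
func isTerminal(s string) bool {
	switch s {
	case "success", "failure", "error", "killed":
		return true
	}
	return false
}

// scripted returns a fetch function that replays the given statuses in order
// and keeps returning the last one. An empty entry simulates "not found yet".
// It expects at least one status.
func scripted(statuses ...string) func() (string, error) {
	i := 0
	return func() (string, error) {
		s := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		if s == "" {
			return "", errors.New("pipeline not found")
		}
		return s, nil
	}
}

// watch polls fetch until it reports a terminal status or timeout elapses.
// Lookup errors are retried rather than returned, mirroring ciWatch.
func watch(fetch func() (string, error), interval, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if s, err := fetch(); err == nil && isTerminal(s) {
			return s, nil
		}
		time.Sleep(interval)
	}
	return "", fmt.Errorf("timed out after %s", timeout)
}

func main() {
	s, err := watch(scripted("", "pending", "running", "success"), time.Millisecond, time.Second)
	fmt.Println(s, err)
}
```

Treating lookup errors as "not created yet" keeps the loop tolerant of the gap between a push and the pipeline appearing. The cost is that a persistent auth or network error only surfaces as a timeout.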

cli/cmd_claim.go (new file, 56 lines)
@@ -0,0 +1,56 @@
package main
import (
"fmt"
"strings"
)
func claimCommands() []Command {
return []Command{
{Path: []string{"claim"}, Tier: TierWrite,
Summary: "claim a shared infra resource on the presence board",
Run: runClaim},
{Path: []string{"release"}, Tier: TierWrite,
Summary: "release a presence claim",
Run: runRelease},
}
}
// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence
// script takes the label first, so we can't rely on Go's flag package which
// stops at the first positional).
func runClaim(args []string) error {
var label, purpose string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--purpose" || a == "-purpose":
if i+1 < len(args) {
purpose = args[i+1]
i++
}
case strings.HasPrefix(a, "--purpose="):
purpose = strings.TrimPrefix(a, "--purpose=")
case !strings.HasPrefix(a, "-") && label == "":
label = a
}
}
if label == "" {
return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`)
}
return presenceClaim(label, purpose)
}
func runRelease(args []string) error {
var label string
for _, a := range args {
if !strings.HasPrefix(a, "-") {
label = a
break
}
}
if label == "" {
return fmt.Errorf("usage: homelab release <kind>:<name>")
}
return presenceRelease(label)
}
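
The comment on `runClaim` notes that Go's `flag` package stops at the first positional. A small sketch of that behavior, using a hypothetical label `vm:devvm` and a throwaway `FlagSet`:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseWithFlagPkg parses args with the standard flag package and returns the
// --purpose value plus whatever the package left unparsed.
func parseWithFlagPkg(args []string) (purpose string, rest []string) {
	fs := flag.NewFlagSet("claim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the demo quiet on parse errors
	p := fs.String("purpose", "", "what + why")
	_ = fs.Parse(args)
	return *p, fs.Args()
}

func main() {
	// Label first: parsing stops at "vm:devvm", so --purpose is never seen.
	p, rest := parseWithFlagPkg([]string{"vm:devvm", "--purpose", "upgrade"})
	fmt.Printf("purpose=%q rest=%v\n", p, rest)
	// Flags first: the flag is parsed and the label is left as a positional.
	p, rest = parseWithFlagPkg([]string{"--purpose", "upgrade", "vm:devvm"})
	fmt.Printf("purpose=%q rest=%v\n", p, rest)
}
```

In the label-first case the purpose comes back empty and all three tokens remain unparsed, which is why `runClaim` scans the arguments by hand instead.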

cli/cmd_deploy.go (new file, 51 lines)
@@ -0,0 +1,51 @@
package main
import (
"fmt"
"os"
"strings"
"time"
)
func deployCommands() []Command {
return []Command{
{Path: []string{"deploy", "wait"}, Tier: TierRead,
Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait},
}
}
// deployWait closes the "did the NEW code land" gap: rollout status alone returns
// success on the OLD ReplicaSet, so we first wait for the deployment image to
// reference the expected sha, THEN block on rollout status.
func deployWait(args []string) error {
target, _ := firstPositional(args)
if target == "" || !strings.Contains(target, "/") {
return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]")
}
parts := strings.SplitN(target, "/", 2)
ns, deploy := parts[0], parts[1]
sha := flagValue(args, "--sha")
if sha == "" {
sha = short(currentHEAD())
}
deadline := time.Now().Add(10 * time.Minute)
if sha != "" {
fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha)
matched := false
for time.Now().Before(deadline) {
img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}")
if strings.Contains(img, sha) {
matched = true
break
}
time.Sleep(10 * time.Second)
}
if !matched {
return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha)
}
}
fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy)
return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s")
}
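
The image gate in `deployWait` boils down to a substring check over the space-separated image list that the jsonpath query prints. A minimal sketch of that check in isolation; the helper name `imagesReference` and the registry names are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// imagesReference reports whether any image in a space-separated list mentions
// sha. An empty sha never matches: strings.Contains(x, "") is always true,
// which would otherwise make the gate pass trivially.
func imagesReference(images, sha string) bool {
	if sha == "" {
		return false
	}
	for _, img := range strings.Fields(images) {
		if strings.Contains(img, sha) {
			return true
		}
	}
	return false
}

func main() {
	out := "registry.example/app:abcdef12 registry.example/sidecar:1.4"
	fmt.Println(imagesReference(out, "abcdef12"))
	fmt.Println(imagesReference(out, "0123abcd"))
}
```

The gate only works when image tags embed the commit sha. With mutable tags such as `latest` it can never match, and the wait runs until the deadline.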

cli/cmd_ha.go (new file, 172 lines)
@@ -0,0 +1,172 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"path/filepath"
"strings"
)
// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving
// the long-lived API token out of the cluster, and SSH to the HA host for
// host-level work (config files, docker, add-ons). Entity state/control stays
// with the MCP — see docs/adr/0012.
//
// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per
// instance), split out of openclaw-secrets so non-admin operators (emo / "Home
// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
// `ha token` resolves it on demand via the ambient kubeconfig, so it never
// depends on a pre-set env var (the gap that made agents re-derive the
// kubectl|base64|jq pipeline every session).
type haInstance struct {
name string // sofia | london
sshUser string // SSH login on the HA host
sshHost string // host reachable from the devvm (Sofia LAN)
secretKey string // key inside the openclaw/ha-tokens Secret holding this token
}
const (
haDefaultInstance = "sofia"
haSecretNamespace = "openclaw"
haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf
)
// haInstances maps instance name → connection/secret facts. sofia is the default
// because the devvm is on the Sofia LAN; london is documented but its host
// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london`
// generally won't connect from here (token resolution still works).
var haInstances = map[string]haInstance{
"sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"},
"london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"},
}
func haCommands() []Command {
return []Command{
{Path: []string{"ha", "token"}, Tier: TierRead,
Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken},
{Path: []string{"ha", "ssh"}, Tier: TierWrite,
Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH},
}
}
// resolveHAInstance looks up an instance by name; "" yields the default (sofia).
func resolveHAInstance(name string) (haInstance, error) {
if name == "" {
name = haDefaultInstance
}
inst, ok := haInstances[name]
if !ok {
return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name)
}
return inst, nil
}
// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned
// by kubectl jsonpath (trailing whitespace tolerated).
func decodeSecretValue(b64 string) (string, error) {
raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64))
if err != nil {
return "", fmt.Errorf("base64-decode secret value: %w", err)
}
return string(raw), nil
}
func haToken(args []string) error {
name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia`
for i := 0; i < len(args); i++ {
if args[i] == "--instance" && i+1 < len(args) {
name = args[i+1]
} else if strings.HasPrefix(args[i], "--instance=") {
name = strings.TrimPrefix(args[i], "--instance=")
}
}
inst, err := resolveHAInstance(name)
if err != nil {
return err
}
b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName,
"-o", "jsonpath={.data."+inst.secretKey+"}")
if err != nil {
return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err)
}
if b64 == "" {
return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey)
}
tok, err := decodeSecretValue(b64)
if err != nil {
return err
}
fmt.Println(tok)
return nil
}
// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user
// rather than tied to whoever first wrote the workflow.
func defaultHAKeyPath() string {
if home, err := os.UserHomeDir(); err == nil && home != "" {
return filepath.Join(home, ".ssh", "id_ed25519")
}
return filepath.Join("~", ".ssh", "id_ed25519")
}
// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. Tokens after
// `--` are taken verbatim; bare tokens before it are also the remote command.
func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) {
name := haDefaultInstance
keyPath = defaultHAKeyPath()
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
remote = append(remote, args[i+1:]...)
i = len(args)
case a == "--instance":
if i+1 < len(args) {
name = args[i+1]
i++
}
case strings.HasPrefix(a, "--instance="):
name = strings.TrimPrefix(a, "--instance=")
case a == "--key" || a == "-i":
if i+1 < len(args) {
keyPath = args[i+1]
i++
}
case strings.HasPrefix(a, "--key="):
keyPath = strings.TrimPrefix(a, "--key=")
default:
remote = append(remote, a)
}
}
inst, err = resolveHAInstance(name)
return inst, keyPath, remote, err
}
// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit
// key, no user ssh config, and no known_hosts prompt/record — so it runs
// unattended in an agent session without hanging on a host-key prompt.
func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string {
args := []string{
"-F", "/dev/null",
"-o", "IdentityFile=" + keyPath,
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
inst.sshUser + "@" + inst.sshHost,
}
return append(args, remote...)
}
func haSSH(args []string) error {
inst, keyPath, remote, err := parseHASSH(args)
if err != nil {
return err
}
if len(remote) == 0 {
return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`)
}
return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...)
}
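
`haToken` and `parseHASSH` each hand-roll the same two spellings of a flag (`--instance X` and `--instance=X`). A sketch of how that could be factored into one helper, assuming last-one-wins semantics; `lookupFlag` is a hypothetical name and is not the repo's `flagValue`, whose definition is not shown in this diff:

```go
package main

import (
	"fmt"
	"strings"
)

// lookupFlag is a hypothetical helper: it returns the value of name given as
// either "name value" or "name=value". Scanning stops at a bare "--" so tokens
// meant for the remote command are never mistaken for local flags.
func lookupFlag(args []string, name string) string {
	val := ""
	for i := 0; i < len(args); i++ {
		a := args[i]
		switch {
		case a == "--":
			return val
		case a == name && i+1 < len(args):
			val = args[i+1]
			i++ // skip the consumed value
		case strings.HasPrefix(a, name+"="):
			val = strings.TrimPrefix(a, name+"=")
		}
	}
	return val
}

func main() {
	fmt.Println(lookupFlag([]string{"--instance", "london"}, "--instance"))
	fmt.Println(lookupFlag([]string{"--instance=sofia", "--", "uptime"}, "--instance"))
}
```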

cli/cmd_ha_test.go (new file, 92 lines)
@@ -0,0 +1,92 @@
package main
import (
"encoding/base64"
"reflect"
"strings"
"testing"
)
func TestResolveHAInstance(t *testing.T) {
// empty defaults to sofia (the devvm sits on the Sofia LAN)
if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" {
t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err)
}
if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" {
t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err)
}
if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" {
t.Fatalf("london = %+v, %v", got, err)
}
if _, err := resolveHAInstance("paris"); err == nil {
t.Fatalf("resolveHAInstance(paris) should error on unknown instance")
}
}
func TestDecodeSecretValue(t *testing.T) {
// k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}`
// returns that base64, which decodeSecretValue turns back into the raw token.
enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia"))
if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err)
}
// trailing whitespace/newline from jsonpath output must be tolerated
if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" {
t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err)
}
if _, err := decodeSecretValue("not-base64!!"); err == nil {
t.Fatalf("decodeSecretValue should error on undecodable base64")
}
}
func TestBuildHASSHArgs(t *testing.T) {
inst, _ := resolveHAInstance("sofia")
got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"})
want := []string{
"-F", "/dev/null",
"-o", "IdentityFile=/home/u/.ssh/id_ed25519",
"-o", "StrictHostKeyChecking=no",
"-o", "UserKnownHostsFile=/dev/null",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
"vbarzin@192.168.1.8",
"cat", "/config/configuration.yaml",
}
if !reflect.DeepEqual(got, want) {
t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want)
}
}
func TestParseHASSH(t *testing.T) {
// instance flag + everything after `--` is the verbatim remote command
inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if inst.name != "sofia" {
t.Errorf("instance = %q, want sofia", inst.name)
}
if !strings.HasSuffix(key, "/.ssh/id_ed25519") {
t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key)
}
if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) {
t.Errorf("remote = %v, want [docker ps -a]", remote)
}
// bare args (no `--`) are also taken as the remote command; -i overrides the key
_, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"})
if err != nil {
t.Fatalf("parseHASSH err: %v", err)
}
if key2 != "/tmp/k" {
t.Errorf("key = %q, want /tmp/k", key2)
}
if !reflect.DeepEqual(remote2, []string{"uptime"}) {
t.Errorf("remote = %v, want [uptime]", remote2)
}
// unknown instance surfaces as an error
if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil {
t.Errorf("parseHASSH should error on unknown instance")
}
}

cli/cmd_k8s.go (new file, 288 lines)
@@ -0,0 +1,288 @@
package main
import (
"fmt"
"os"
"strings"
)
func k8sCommands() []Command {
return []Command{
{Path: []string{"k8s", "status"}, Tier: TierRead,
Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus},
{Path: []string{"k8s", "get"}, Tier: TierRead,
Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet},
{Path: []string{"k8s", "logs"}, Tier: TierRead,
Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs},
{Path: []string{"k8s", "describe"}, Tier: TierRead,
Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe},
{Path: []string{"k8s", "debug"}, Tier: TierRead,
Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug},
{Path: []string{"k8s", "pf"}, Tier: TierRead,
Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward},
{Path: []string{"k8s", "db"}, Tier: TierWrite,
Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB},
{Path: []string{"k8s", "exec"}, Tier: TierWrite,
Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec},
{Path: []string{"k8s", "rm-pod"}, Tier: TierWrite,
Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod},
{Path: []string{"k8s", "rollout-status"}, Tier: TierRead,
Summary: "rollout status of deploy/<app>", Run: k8sRolloutStatus},
{Path: []string{"k8s", "restart"}, Tier: TierWrite,
Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart},
{Path: []string{"k8s", "probe"}, Tier: TierRead,
Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe},
}
}
func k8sStatus(args []string) error {
t := parseK8sTarget(args)
ns := t.namespace() // "" when no app/ns given → cluster-wide
get := []string{"get", "pods", "-o", "wide"}
ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"}
if ns == "" {
get = append(get, "-A")
ev = append(ev, "-A")
}
if err := kubectlStream(ns, get...); err != nil {
return err
}
fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---")
_ = kubectlStream(ns, ev...) // best-effort
return nil
}
func k8sGet(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]")
}
return kubectlStream(t.app, append([]string{"get"}, t.rest...)...)
}
func k8sLogs(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]")
}
a := []string{"logs"}
if t.selector != "" {
a = append(a, "-l", t.selector)
} else {
a = append(a, t.objectRef())
}
if t.container != "" {
a = append(a, "-c", t.container)
}
if !containsPrefix(t.rest, "--tail") {
a = append(a, "--tail=200")
}
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sDescribe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s describe <app> [resource]")
}
if len(t.rest) > 0 {
return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...)
}
return kubectlStream(t.namespace(), "describe", t.objectRef())
}
func k8sDebug(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s debug <app>")
}
ns := t.namespace()
sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) }
sec("pods")
_ = kubectlStream(ns, "get", "pods", "-o", "wide")
sec("workloads")
_ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide")
sec("describe "+t.objectRef())
_ = kubectlStream(ns, "describe", t.objectRef())
sec("recent logs (--tail=50)")
_ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50")
sec("events (type!=Normal)")
_ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp")
return nil
}
func k8sPortForward(args []string) error {
t := parseK8sTarget(args)
if t.app == "" || len(t.rest) == 0 {
return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]")
}
ports := t.rest[0]
target := "svc/" + t.app
if len(t.rest) > 1 {
target = t.rest[1]
}
return kubectlStream(t.namespace(), "port-forward", target, ports)
}
func k8sDB(args []string) error {
var app, dbName, sql string
mysql := false
for i := 0; i < len(args); i++ {
a := args[i]
if a == "--" {
sql = strings.Join(args[i+1:], " ")
break
}
switch {
case a == "--mysql":
mysql = true
case a == "--db":
if i+1 < len(args) {
dbName = args[i+1]
i++
}
case strings.HasPrefix(a, "--db="):
dbName = strings.TrimPrefix(a, "--db=")
case !strings.HasPrefix(a, "-") && app == "":
app = a
}
}
if app == "" {
return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`)
}
p := planDBExec(app, dbName, sql, mysql)
pod := p.pod
if pod == "" && p.selector != "" {
resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}")
if err != nil || resolved == "" {
return fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err)
}
pod = resolved
}
exec := []string{"exec"}
if sql == "" {
exec = append(exec, "-it") // interactive client when no SQL given
}
exec = append(exec, pod)
if p.container != "" {
exec = append(exec, "-c", p.container)
}
exec = append(exec, "--")
exec = append(exec, p.argv...)
return kubectlStream(p.ns, exec...)
}
func k8sExec(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>")
}
if len(t.rest) == 0 {
return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app)
}
a := []string{"exec"}
if t.tty {
a = append(a, "-it")
}
a = append(a, t.objectRef())
if t.container != "" {
a = append(a, "-c", t.container)
}
a = append(a, "--")
a = append(a, t.rest...)
return kubectlStream(t.namespace(), a...)
}
func k8sRmPod(args []string) error {
var pod, ns, grace string
force, job := false, false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "-n" || a == "--namespace":
if i+1 < len(args) {
ns = args[i+1]
i++
}
case a == "--force":
force = true
case a == "--job":
job = true
case a == "--grace":
if i+1 < len(args) {
grace = args[i+1]
i++
}
case !strings.HasPrefix(a, "-") && pod == "":
pod = a
}
}
if pod == "" || ns == "" {
return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)")
}
kind := "pod"
if job {
kind = "job"
}
a := []string{"delete", kind, pod}
if grace != "" {
a = append(a, "--grace-period="+grace)
}
if force {
a = append(a, "--force")
}
return kubectlStream(ns, a...)
}
func k8sRolloutStatus(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s rollout-status <app>")
}
return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app)
}
func k8sRestart(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s restart <app>")
}
ns := t.namespace()
if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil {
return err
}
return kubectlStream(ns, "rollout", "status", "deploy/"+t.app)
}
func k8sProbe(args []string) error {
t := parseK8sTarget(args)
if t.app == "" {
return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]")
}
ns := t.namespace()
url := "http://" + t.app + "." + ns + ".svc.cluster.local"
if port := flagValue(args, "--port"); port != "" {
url += ":" + port
}
if len(t.rest) > 0 {
p := t.rest[0]
if !strings.HasPrefix(p, "/") {
p = "/" + p
}
url += p
}
return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never",
"--image=curlimages/curl:latest", "--",
"curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url)
}
// containsPrefix reports whether any arg starts with prefix.
func containsPrefix(args []string, prefix string) bool {
for _, a := range args {
if strings.HasPrefix(a, prefix) {
return true
}
}
return false
}
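
The URL assembly inside `k8sProbe` is easier to test when pulled out as a pure function. A sketch under that assumption; `serviceURL` and the app and namespace names in `main` are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// serviceURL builds the in-cluster URL for a Service, following the
// <app>.<ns>.svc.cluster.local form that k8sProbe uses.
func serviceURL(app, ns, port, path string) string {
	u := "http://" + app + "." + ns + ".svc.cluster.local"
	if port != "" {
		u += ":" + port
	}
	if path != "" {
		if !strings.HasPrefix(path, "/") {
			path = "/" + path // tolerate "api/health" as well as "/api/health"
		}
		u += path
	}
	return u
}

func main() {
	fmt.Println(serviceURL("grafana", "monitoring", "3000", "api/health"))
	fmt.Println(serviceURL("grafana", "monitoring", "", ""))
}
```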

cli/cmd_memory.go (new file, 302 lines)
@@ -0,0 +1,302 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"strings"
)
func memoryCommands() []Command {
return []Command{
{Path: []string{"memory", "recall"}, Tier: TierRead,
Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall},
{Path: []string{"memory", "list"}, Tier: TierRead,
Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList},
{Path: []string{"memory", "categories"}, Tier: TierRead,
Summary: "list memory categories", Run: memorySimpleGet("/api/categories")},
{Path: []string{"memory", "tags"}, Tier: TierRead,
Summary: "list memory tags", Run: memorySimpleGet("/api/tags")},
{Path: []string{"memory", "stats"}, Tier: TierRead,
Summary: "memory store stats", Run: memorySimpleGet("/api/stats")},
{Path: []string{"memory", "secret"}, Tier: TierRead,
Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret},
{Path: []string{"memory", "store"}, Tier: TierWrite,
Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore},
{Path: []string{"memory", "update"}, Tier: TierWrite,
Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate},
{Path: []string{"memory", "delete"}, Tier: TierWrite,
Summary: "delete a memory: memory delete <id>", Run: memoryDelete},
}
}
// printMemories renders a {memories:[…]} response as compact lines, or raw JSON.
func printMemories(raw []byte, jsonOut bool) error {
if jsonOut {
fmt.Println(string(raw))
return nil
}
var r struct {
Memories []struct {
ID int `json:"id"`
Content string `json:"content"`
Category string `json:"category"`
Tags string `json:"tags"`
Importance float64 `json:"importance"`
} `json:"memories"`
}
if err := json.Unmarshal(raw, &r); err != nil {
fmt.Println(string(raw))
return nil
}
if len(r.Memories) == 0 {
fmt.Println("(no memories)")
return nil
}
for _, m := range r.Memories {
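c := strings.ReplaceAll(m.Content, "\n", " ")
// truncate by runes, not bytes, so a multi-byte character is never split
if rs := []rune(c); len(rs) > 240 {
c = string(rs[:240]) + "…"
}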
fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c)
if m.Tags != "" {
fmt.Printf(" tags: %s\n", m.Tags)
}
}
return nil
}
func memoryRecall(args []string) error {
req := memRecallReq{}
jsonOut := false
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--query":
if i+1 < len(args) {
req.ExpandedQuery = args[i+1]
i++
}
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--sort":
if i+1 < len(args) {
req.SortBy = args[i+1]
i++
}
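case a == "--limit":
if i+1 < len(args) {
// reject a non-numeric limit instead of silently sending 0
if _, err := fmt.Sscanf(args[i+1], "%d", &req.Limit); err != nil {
return fmt.Errorf("invalid --limit %q: %w", args[i+1], err)
}
i++
}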
case a == "--json":
jsonOut = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Context = strings.Join(pos, " ")
if req.Context == "" {
return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/recall", req)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memoryList(args []string) error {
q := url.Values{}
jsonOut := false
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
q.Set("category", args[i+1])
i++
}
case a == "--tag":
if i+1 < len(args) {
q.Set("tag", args[i+1])
i++
}
case a == "--limit":
if i+1 < len(args) {
q.Set("limit", args[i+1])
i++
}
case a == "--json":
jsonOut = true
}
}
c, err := newMemoryClient()
if err != nil {
return err
}
path := "/api/memories"
if len(q) > 0 {
path += "?" + q.Encode()
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
return printMemories(raw, jsonOut)
}
func memorySimpleGet(path string) func([]string) error {
return func(args []string) error {
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("GET", path, nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
}
func memorySecret(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory secret <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories/"+url.PathEscape(id)+"/secret", nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryStore(args []string) error {
req := memStoreReq{Category: "facts", Importance: 0.5}
var pos []string
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--category":
if i+1 < len(args) {
req.Category = args[i+1]
i++
}
case a == "--tags":
if i+1 < len(args) {
req.Tags = args[i+1]
i++
}
case a == "--keywords":
if i+1 < len(args) {
req.ExpandedKeywords = args[i+1]
i++
}
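case a == "--importance":
if i+1 < len(args) {
// reject a non-numeric importance instead of silently keeping the default
if _, err := fmt.Sscanf(args[i+1], "%f", &req.Importance); err != nil {
return fmt.Errorf("invalid --importance %q: %w", args[i+1], err)
}
i++
}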
case a == "--sensitive":
req.ForceSensitive = true
case !strings.HasPrefix(a, "-"):
pos = append(pos, a)
}
}
req.Content = strings.Join(pos, " ")
if req.Content == "" {
return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] [--importance 0.5] [--sensitive]`)
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("POST", "/api/memories", req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryUpdate(args []string) error {
var id string
req := memUpdateReq{}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--content":
if i+1 < len(args) {
v := args[i+1]
req.Content = &v
i++
}
case a == "--tags":
if i+1 < len(args) {
v := args[i+1]
req.Tags = &v
i++
}
case a == "--keywords":
if i+1 < len(args) {
v := args[i+1]
req.ExpandedKeywords = &v
i++
}
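case a == "--importance":
if i+1 < len(args) {
var f float64
// a parse failure must not overwrite the stored importance with 0
if _, err := fmt.Sscanf(args[i+1], "%f", &f); err != nil {
return fmt.Errorf("invalid --importance %q: %w", args[i+1], err)
}
req.Importance = &f
i++
}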
case !strings.HasPrefix(a, "-") && id == "":
id = a
}
}
if id == "" {
return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("PUT", "/api/memories/"+url.PathEscape(id), req)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
func memoryDelete(args []string) error {
id, _ := firstPositional(args)
if id == "" {
return fmt.Errorf("usage: homelab memory delete <id>")
}
c, err := newMemoryClient()
if err != nil {
return err
}
raw, err := c.do("DELETE", "/api/memories/"+url.PathEscape(id), nil)
if err != nil {
return err
}
fmt.Println(string(raw))
return nil
}
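
`memoryUpdate` fills pointer fields on `memUpdateReq`, which suggests the request distinguishes "not provided" from "set to zero". A sketch of why pointers matter for that, assuming `omitempty` JSON tags; the `updateReq` type and its tags are hypothetical, since the real struct is not shown in this diff:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// updateReq is a hypothetical shape for a partial-update body; the real
// memUpdateReq and its JSON tags are not shown in this diff.
type updateReq struct {
	Content    *string  `json:"content,omitempty"`
	Importance *float64 `json:"importance,omitempty"`
}

// encode marshals r, returning the error text in place of JSON on failure.
func encode(r updateReq) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	zero := 0.0
	fmt.Println(encode(updateReq{}))                  // nothing set: field omitted
	fmt.Println(encode(updateReq{Importance: &zero})) // explicit zero is still sent
}
```

With a plain `float64` and `omitempty`, an explicit `--importance 0` would be dropped from the body. A nil pointer is omitted, while a pointer to zero is serialized.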

cli/cmd_net.go (new file, 83 lines)
@@ -0,0 +1,83 @@
package main
import (
"fmt"
"strings"
"time"
)
func netCommands() []Command {
return []Command{
{Path: []string{"net", "check"}, Tier: TierRead,
Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck},
{Path: []string{"dns", "lookup"}, Tier: TierRead,
Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup},
}
}
func fmtProbe(code int, d time.Duration, err error) string {
if err != nil {
return "ERR " + err.Error()
}
return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds())
}
func netCheck(args []string) error {
host, rest := firstPositional(args)
if host == "" {
return fmt.Errorf("usage: homelab net check <host> [path]")
}
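path := "/"
if i := strings.Index(host, "/"); i >= 0 { // accept the <host>/path form the summary advertises
host, path = host[:i], host[i:]
}
if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") {
path = rest[0]
if !strings.HasPrefix(path, "/") {
path = "/" + path
}
}
u := "https://" + host + path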
fmt.Printf("%s\n", u)
// external leg: resolve via public DNS, dial the public IP (tests the real CF path)
pubOut, _ := dig(hostOnly(host), "1.1.1.1", "")
if pubIP := firstLine(pubOut); pubIP != "" {
c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u)
fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e))
} else {
fmt.Println(" external (public) no public A record")
}
// internal leg: dial the Traefik LB directly
c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u)
fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e))
return nil
}
func dnsLookup(args []string) error {
name, rest := firstPositional(args)
if name == "" {
return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]")
}
rr := ""
if len(rest) > 0 {
rr = rest[0]
}
tech, _ := dig(name, "10.0.20.201", rr)
pub, _ := dig(name, "1.1.1.1", rr)
fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech))
fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub))
if strings.TrimSpace(tech) != strings.TrimSpace(pub) {
fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap")
}
return nil
}
func hostOnly(h string) string { // strip any path accidentally included
return strings.SplitN(h, "/", 2)[0]
}
func oneLineList(s string) string {
s = strings.TrimSpace(s)
if s == "" {
return "(none)"
}
return strings.ReplaceAll(s, "\n", ", ")
}
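
`netCheck` relies on `clientDialingIP` to send a request for a hostname to a specific IP; that helper is not part of this hunk. A sketch of one way to do it with the standard library, overriding only the dial address. `pinnedClient` and `demo` are hypothetical names, and the demo talks to a local `httptest` server rather than a real endpoint:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinnedClient returns a client that connects to ip regardless of what the URL
// host resolves to. The URL host is still used for the Host header.
func pinnedClient(ip string, timeout time.Duration) *http.Client {
	d := &net.Dialer{Timeout: timeout}
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				_, port, err := net.SplitHostPort(addr)
				if err != nil {
					return nil, err
				}
				return d.DialContext(ctx, network, net.JoinHostPort(ip, port))
			},
		},
	}
}

// demo requests an unresolvable hostname and lands on a loopback test server.
func demo() (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	host, port, err := net.SplitHostPort(srv.Listener.Addr().String())
	if err != nil {
		return 0, err
	}
	resp, err := pinnedClient(host, 5*time.Second).Get("http://name.invalid:" + port + "/")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := demo()
	fmt.Println(code, err)
}
```

Because only the TCP destination changes, an `https` request through such a client still presents the original hostname for SNI and certificate verification. That is what lets the same URL be checked against both the public IP and the internal load balancer.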

cli/cmd_obs.go (new file, 197 lines)
@@ -0,0 +1,197 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
"strings"
"time"
)
const (
promHost = "prometheus-query.viktorbarzin.lan"
lokiHost = "loki.viktorbarzin.lan"
)
func obsCommands() []Command {
return []Command{
{Path: []string{"metrics", "query"}, Tier: TierRead,
Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery},
{Path: []string{"metrics", "alerts"}, Tier: TierRead,
Summary: "list currently firing Prometheus alerts", Run: metricsAlerts},
{Path: []string{"logs", "query"}, Tier: TierRead,
Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery},
}
}
// queryArg joins non-flag args into the query (PromQL/LogQL should normally be
// passed as a single quoted argument; this also tolerates unquoted multi-token).
func queryArg(args []string, valueFlags map[string]bool) string {
var parts []string
for i := 0; i < len(args); i++ {
a := args[i]
if valueFlags[a] {
i++
continue
}
if strings.HasPrefix(a, "-") {
continue
}
parts = append(parts, a)
}
return strings.Join(parts, " ")
}
func labelStr(m map[string]string) string {
name := m["__name__"]
var kv []string
for k, v := range m {
if k != "__name__" {
kv = append(kv, k+"="+v)
}
}
sort.Strings(kv)
return name + "{" + strings.Join(kv, ",") + "}"
}
func metricsQuery(args []string) error {
q := queryArg(args, nil)
if q == "" {
return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`)
}
v := url.Values{}
v.Set("query", q)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no series)")
return nil
}
for _, s := range r.Data.Result {
val := ""
if len(s.Value) == 2 {
val = fmt.Sprint(s.Value[1])
}
fmt.Printf("%-14s %s\n", val, labelStr(s.Metric))
}
return nil
}
func metricsAlerts(args []string) error {
// prometheus-query is a query-only frontend (no /api/v1/alerts); the firing
// set is exposed as the synthetic ALERTS series, queryable the normal way.
v := url.Values{}
v.Set("query", `ALERTS{alertstate="firing"}`)
body, err := lbGetBody(promHost, "/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
if len(r.Data.Result) == 0 {
fmt.Println("(no firing alerts)")
return nil
}
for _, a := range r.Data.Result {
m := a.Metric
scope := ""
for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} {
if v := m[k]; v != "" {
scope = k + "=" + v
break
}
}
fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope)
}
return nil
}
func logsQuery(args []string) error {
q := queryArg(args, map[string]bool{"--since": true, "--limit": true})
if q == "" {
return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`)
}
since := flagValue(args, "--since")
if since == "" {
since = "1h"
}
dur, err := time.ParseDuration(since)
if err != nil {
return fmt.Errorf("bad --since %q: %w", since, err)
}
limit := flagValue(args, "--limit")
if limit == "" {
limit = "100"
}
end := time.Now()
v := url.Values{}
v.Set("query", q)
v.Set("limit", limit)
v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10))
v.Set("end", strconv.FormatInt(end.UnixNano(), 10))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Values [][]string `json:"values"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
n := 0
for _, s := range r.Data.Result {
for _, val := range s.Values {
if len(val) == 2 {
fmt.Println(val[1])
n++
}
}
}
if n == 0 {
fmt.Println("(no log lines)")
}
return nil
}
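
The flag handling in `queryArg` is easiest to see on concrete input. The following is a minimal, standalone sketch that copies the helper from cmd_obs.go; the sample arguments are made up for illustration and are not taken from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// queryArg is copied from cmd_obs.go: it joins every non-flag argument into
// one query string, skipping value-taking flags together with their value.
func queryArg(args []string, valueFlags map[string]bool) string {
	var parts []string
	for i := 0; i < len(args); i++ {
		a := args[i]
		if valueFlags[a] {
			i++ // also skip the flag's value
			continue
		}
		if strings.HasPrefix(a, "-") {
			continue // boolean flags such as --json
		}
		parts = append(parts, a)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Illustrative inputs only.
	logFlags := map[string]bool{"--since": true, "--limit": true}
	fmt.Println(queryArg([]string{`{job="x"}`, "--since", "2h", "--json"}, logFlags))
	fmt.Println(queryArg([]string{"sum", "by", "(job)", "(up)"}, nil))
	// A token that starts with "-" is treated as a flag and dropped.
	fmt.Printf("%q\n", queryArg([]string{"-up"}, nil))
}
```

One consequence worth knowing: a query whose first character is `-` (for example a negated PromQL expression) is silently discarded unless it is wrapped, e.g. as `(-up)`.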

122
cli/cmd_tf.go Normal file

@ -0,0 +1,122 @@
package main
import (
"fmt"
"os"
"os/signal"
"path/filepath"
"strings"
"sync"
"syscall"
)
func tfCommands() []Command {
return []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead,
Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")},
{Path: []string{"tf", "validate"}, Tier: TierRead,
Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")},
{Path: []string{"tf", "fmt"}, Tier: TierRead,
Summary: "terraform fmt a stack's files", Run: tfFmt},
{Path: []string{"tf", "force-unlock"}, Tier: TierWrite,
Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock},
{Path: []string{"tf", "apply"}, Tier: TierWrite,
Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply},
}
}
// firstPositional returns the first non-flag arg and the remaining args with it removed.
func firstPositional(args []string) (string, []string) {
for i, a := range args {
if !strings.HasPrefix(a, "-") {
rest := append(append([]string{}, args[:i]...), args[i+1:]...)
return a, rest
}
}
return "", args
}
// resolveTfStack finds the infra root (from cwd) and the stack directory named
// by the first positional arg, returning the remaining args.
func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) {
stackName, rest = firstPositional(args)
if stackName == "" {
err = fmt.Errorf("missing <stack> argument")
return
}
cwd, e := os.Getwd()
if e != nil {
err = e
return
}
infraRoot, err = findInfraRoot(cwd)
if err != nil {
return
}
stackDir, err = resolveStack(infraRoot, stackName)
return
}
func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") }
// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory.
func tfPassthrough(verb string) func([]string) error {
return func(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...)
}
}
func tfFmt(args []string) error {
_, _, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".")
}
func tfForceUnlock(args []string) error {
infraRoot, _, stackDir, rest, err := resolveTfStack(args)
if err != nil {
return err
}
if len(rest) < 1 {
return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>")
}
return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0])
}
// tfApply applies a stack out-of-band: claim the stack on the presence board,
// ALWAYS release on exit (normal, error, or signal — fixing the claim leak),
// and warn that CI applies canonically on push.
func tfApply(args []string) error {
infraRoot, stackName, stackDir, _, err := resolveTfStack(args)
if err != nil {
return err
}
label := "stack:" + stackName
fmt.Fprintf(os.Stderr,
"homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName)
if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil {
return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err)
}
// Release exactly once, whether we exit normally, on error, or on signal —
// sync.Once makes the defer and the signal goroutine safe to both call it.
var once sync.Once
release := func() { once.Do(func() { _ = presenceRelease(label) }) }
defer release()
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
go func() {
<-sig
release()
os.Exit(130)
}()
return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive")
}
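
The release-exactly-once pattern used by `tfApply` can be sketched in isolation. This is a simplified illustration, not the repository's code: `onceReleaser` is a hypothetical helper name, and a counter stands in for `presenceRelease`.

```go
package main

import (
	"fmt"
	"sync"
)

// onceReleaser (hypothetical helper) wraps fn so that any number of callers,
// from any goroutine, trigger it at most once.
func onceReleaser(fn func()) func() {
	var once sync.Once
	return func() { once.Do(fn) }
}

func main() {
	released := 0
	release := onceReleaser(func() { released++ }) // stands in for presenceRelease

	// Simulate the deferred call and a signal goroutine racing to release.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			release()
		}()
	}
	wg.Wait()
	release() // the "defer" path after the work finishes

	fmt.Println("releases:", released) // prints "releases: 1"
}
```

`sync.Once` guarantees that every call to `Do` returns only after the single execution has completed, which is why the deferred call and the signal goroutine can share one `release` closure without extra locking.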

27
cli/cmd_tf_test.go Normal file

@ -0,0 +1,27 @@
package main
import (
"reflect"
"testing"
)
func TestFirstPositional(t *testing.T) {
cases := []struct {
args []string
wantName string
wantRest []string
}{
{[]string{"vault"}, "vault", []string{}},
{[]string{"--json", "vault"}, "vault", []string{"--json"}},
{[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}},
{[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}},
{[]string{"--only-flags"}, "", []string{"--only-flags"}},
}
for _, c := range cases {
gotName, gotRest := firstPositional(c.args)
if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) {
t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)",
c.args, gotName, gotRest, c.wantName, c.wantRest)
}
}
}

77
cli/cmd_usage.go Normal file

@ -0,0 +1,77 @@
package main
import (
"encoding/json"
"fmt"
"net/url"
"sort"
"strconv"
)
func usageCommands() []Command {
return []Command{
{Path: []string{"usage", "top"}, Tier: TierRead,
Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop},
}
}
// usageQuery builds the LogQL metric query that counts invocations per verb.
func usageQuery(since, user string) string {
sel := `job="` + usageJob + `"`
if user != "" {
sel += `, user="` + user + `"`
}
return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}
func usageTop(args []string) error {
since := flagValue(args, "--since")
if since == "" {
since = "30d"
}
v := url.Values{}
v.Set("query", usageQuery(since, flagValue(args, "--user")))
body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v)
if err != nil {
return err
}
if containsArg(args, "--json") {
fmt.Println(string(body))
return nil
}
var r struct {
Data struct {
Result []struct {
Metric map[string]string `json:"metric"`
Value []interface{} `json:"value"`
} `json:"result"`
} `json:"data"`
}
if err := json.Unmarshal(body, &r); err != nil {
fmt.Println(string(body))
return nil
}
type row struct {
verb string
n int
}
var rows []row
for _, s := range r.Data.Result {
n := 0
if len(s.Value) == 2 {
if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil {
n = int(f)
}
}
rows = append(rows, row{s.Metric["verb"], n})
}
if len(rows) == 0 {
fmt.Println("(no usage recorded yet)")
return nil
}
sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n })
for _, e := range rows { // e, not r, so the decoded response above is not shadowed
fmt.Printf("%6d %s\n", e.n, e.verb)
}
return nil
}
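
To make the generated LogQL concrete, here is a small standalone sketch of `usageQuery`. The function body is copied from cmd_usage.go; the value of `usageJob` is an assumption for illustration only, since the real constant is defined elsewhere in the repository.

```go
package main

import "fmt"

// usageJob is a hypothetical value; the real constant lives elsewhere.
const usageJob = "homelab-usage"

// usageQuery is copied from cmd_usage.go: it counts log lines per verb over
// the given range, optionally restricted to one user label.
func usageQuery(since, user string) string {
	sel := `job="` + usageJob + `"`
	if user != "" {
		sel += `, user="` + user + `"`
	}
	return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since)
}

func main() {
	fmt.Println(usageQuery("30d", ""))
	// sum by (verb) (count_over_time({job="homelab-usage"}[30d]))
	fmt.Println(usageQuery("7d", "emo"))
	// sum by (verb) (count_over_time({job="homelab-usage", user="emo"}[7d]))
}
```

Note that the user value is interpolated verbatim, so a `--user` argument containing a double quote would produce a malformed selector.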

663
cli/cmd_vault.go Normal file

@ -0,0 +1,663 @@
package main
import (
"bufio"
"encoding/base64"
"encoding/json"
"fmt"
"os"
"os/exec"
"strings"
"syscall"
)
// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault.
// Identity is the kernel UID; per-user creds live in that user's isolated Vault
// path (secret/workstation/claude-users/<user>) read via their scoped token, and
// decryption is done by the official `bw` CLI. See
// docs/superpowers/specs/2026-06-24-homelab-vault-design.md.
func vaultCommands() []Command {
return []Command{
{Path: []string{"vault", "setup"}, Tier: TierWrite,
Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup},
{Path: []string{"vault", "status"}, Tier: TierRead,
Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus},
{Path: []string{"vault", "list"}, Tier: TierRead,
Summary: "list your item names: vault list [--search Q]", Run: vaultList},
{Path: []string{"vault", "get"}, Tier: TierRead,
Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet},
{Path: []string{"vault", "search"}, Tier: TierRead,
Summary: "search your item names: vault search <query>", Run: vaultSearch},
{Path: []string{"vault", "code"}, Tier: TierRead,
Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode},
{Path: []string{"vault", "lock"}, Tier: TierWrite,
Summary: "lock/log out the local bw session", Run: vaultLock},
{Path: []string{"vault"}, Tier: TierRead,
Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)",
Run: func([]string) error { fmt.Print(vaultHelp()); return nil }},
}
}
// vaultHelp is shown for bare `homelab vault`.
func vaultHelp() string {
return `homelab vault read YOUR OWN Vaultwarden logins (no-HITL after one-time setup)
homelab vault setup one-time: store your master password + API key in your Vault path
homelab vault status configured / unlocked / reachable (no secrets)
homelab vault list [--search Q] list your item names (no secrets)
homelab vault get <name> [--field password|username|uri|notes|totp] [--json]
on a TTY: clipboard (auto-clears); when piped: stdout
homelab vault code <name> current TOTP code
homelab vault lock lock / log out the local bw session
Creds live only in your own Vault path; the admin never sees them. Identity is
your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md
(note: anything running as your user can decrypt your vault; this is the accepted no-HITL trade).
`
}
const vwUserPathPrefix = "secret/workstation/claude-users/"
// vwCreds is one user's Vaultwarden auth material, read from their Vault path.
type vwCreds struct {
Email string
MasterPassword string
ClientID string
ClientSecret string
}
// cmdRunner shells out to an external command with an explicit environment and
// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject
// a fake; realRunner is the production implementation.
type cmdRunner func(name string, argv, envv []string) (string, error)
func realRunner(name string, argv, envv []string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
out, err := cmd.Output()
// Trim only the trailing newline the tool appends — NOT all whitespace, so a
// fetched secret with significant leading/trailing spaces is preserved.
return strings.TrimRight(string(out), "\r\n"), err
}
// realRunnerStdin runs a command feeding `stdin` to it, for secret values that
// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID
// processes). Used by setup to write the master password / client_secret.
func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) {
cmd := exec.Command(name, argv...)
if envv != nil {
cmd.Env = envv
}
cmd.Stdin = strings.NewReader(stdin)
out, err := cmd.Output()
return strings.TrimRight(string(out), "\r\n"), err
}
func vwCredsPath(user string) string { return vwUserPathPrefix + user }
func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" }
// readVaultField returns one field from a KV-v2 path, "" if absent/error.
func readVaultField(run cmdRunner, field, path string) string {
out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil)
if err != nil {
return ""
}
return out
}
// loadCreds reads the four vaultwarden_* keys from the user's isolated path.
// A missing master password means the user hasn't onboarded.
func loadCreds(run cmdRunner, user string) (vwCreds, error) {
p := vwCredsPath(user)
c := vwCreds{
Email: readVaultField(run, "vaultwarden_email", p),
MasterPassword: readVaultField(run, "vaultwarden_master_password", p),
ClientID: readVaultField(run, "vaultwarden_client_id", p),
ClientSecret: readVaultField(run, "vaultwarden_client_secret", p),
}
if c.MasterPassword == "" {
return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`")
}
return c, nil
}
// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func).
var vaultCurrentUser = func() string { return os.Getenv("USER") }
var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) }
// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately
// do NOT inherit the full parent env (keeps stray secrets out of the child).
func bwBaseEnv(appdata string) []string {
path := os.Getenv("PATH")
if path == "" {
path = "/usr/local/bin:/usr/bin:/bin"
}
return []string{
"PATH=" + path,
"HOME=" + os.Getenv("HOME"),
"BITWARDENCLI_APPDATA_DIR=" + appdata,
"BW_NOINTERACTION=true",
}
}
// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock).
func bwSecretEnv(appdata string, c vwCreds, session string) []string {
env := bwBaseEnv(appdata)
env = append(env,
"BW_CLIENTID="+c.ClientID,
"BW_CLIENTSECRET="+c.ClientSecret,
"BW_PASSWORD="+c.MasterPassword,
)
if session != "" {
env = append(env, "BW_SESSION="+session)
}
return env
}
func bwLoginArgs() []string { return []string{"login", "--apikey"} }
func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} }
func bwGetArgs(field, name string) []string { return []string{"get", field, name} }
func bwStatusArgs() []string { return []string{"status"} }
// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is
// required. Unparseable/empty output → true (safer to attempt login).
func bwNeedsLogin(statusJSON string) bool {
var s struct {
Status string `json:"status"`
}
if err := json.Unmarshal([]byte(statusJSON), &s); err != nil {
return true
}
return s.Status == "unauthenticated" || s.Status == ""
}
func bwListArgs(search string) []string {
a := []string{"list", "items"}
if search != "" {
a = append(a, "--search", search)
}
return a
}
// bwUnlock runs `bw unlock` and returns the raw session key.
func bwUnlock(run cmdRunner, env []string) (string, error) {
out, err := run("bw", bwUnlockArgs(), env)
if err != nil {
return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err)
}
return out, nil
}
// bwGet fetches one field of one item; session must be present in env.
func bwGet(run cmdRunner, env []string, field, name string) (string, error) {
return run("bw", bwGetArgs(field, name), env)
}
func returnMode(isTTY bool) string {
if isTTY {
return "clipboard"
}
return "stdout"
}
// stdoutIsTTY reports whether stdout is a character device (a terminal).
func stdoutIsTTY() bool {
fi, err := os.Stdout.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written
// to stderr, so the clipboard path is only viable when stderr is a terminal).
func stderrIsTTY() bool {
fi, err := os.Stderr.Stat()
if err != nil {
return false
}
return fi.Mode()&os.ModeCharDevice != 0
}
// osc52 returns the OSC 52 escape that makes the local terminal copy payload to
// the system clipboard (works over SSH; no X11). osc52clear copies empty.
func osc52(payload string) string {
return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}
func osc52clear() string { return "\x1b]52;c;\a" }
// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes,
// else we'd dump the secret's base64 into scrollback on unsupported terminals.
func terminalAllowed(term, termProgram string) bool {
t := strings.ToLower(term)
p := strings.ToLower(termProgram)
for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
if strings.Contains(t, ok) || strings.Contains(p, ok) {
return true
}
}
// xterm proper supports it only when the program is a known-good emulator.
return false
}
// opRecord is one CLI operation. ItemName is accepted for the caller's
// convenience but is INTENTIONALLY never rendered into the log line — auditing
// which of your own logins you opened is itself sensitive, and per-item reads
// are invisible server-side anyway (spec §9a).
type opRecord struct {
User string
Verb string
PID int
PPID int
ParentComm string
ItemName string // never logged
}
func opLogLine(r opRecord) string {
return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s",
r.User, r.Verb, r.PID, r.PPID, r.ParentComm)
}
// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure).
func parentComm(ppid int) string {
b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid))
if err != nil {
return ""
}
return strings.TrimSpace(string(b))
}
// writeOpLog appends one privacy-aware line to the user's op-log (best-effort:
// errors are ignored, so it never fails the command). Goes to syslog so it
// ships to Loki.
func writeOpLog(r opRecord) {
exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort
}
func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" }
// hardenProcess disables core dumps so a bw/homelab crash can't spill the master
// password to a core file. Best-effort.
func hardenProcess() {
_ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0})
}
// withUserLock serializes bw mutations for this user (concurrent Claude sessions
// as the same user otherwise race bw's appdata). Returns an unlock func.
func withUserLock(uid string) (func(), error) {
f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600)
if err != nil {
return nil, err
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
f.Close()
return nil, err
}
return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil
}
// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`.
type session struct {
env []string
}
// openSession resolves creds, ensures login, unlocks, and returns a ready env.
// Caller must hold the user lock. appdata is created on tmpfs (0700).
func openSession(run cmdRunner, user, uid string) (session, error) {
creds, err := loadCreds(run, user)
if err != nil {
return session{}, err
}
appdata := bwAppDataDir(uid)
if err := os.MkdirAll(appdata, 0700); err != nil {
return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err)
}
loginEnv := bwSecretEnv(appdata, creds, "")
// Ensure server is set and we're logged in (idempotent; ignore "already").
_, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv)
st, _ := run("bw", bwStatusArgs(), loginEnv)
if bwNeedsLogin(st) {
if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil {
return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err)
}
}
sess, err := bwUnlock(run, loginEnv)
if err != nil {
return session{}, err
}
return session{env: bwSecretEnv(appdata, creds, sess)}, nil
}
type getOpts struct {
name string
field string
json bool
}
var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true}
func parseGetArgs(args []string) (getOpts, error) {
o := getOpts{field: "password"}
for i := 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--json":
o.json = true
case a == "--field" && i+1 < len(args):
o.field = args[i+1]
i++
case strings.HasPrefix(a, "--field="):
o.field = strings.TrimPrefix(a, "--field=")
case !strings.HasPrefix(a, "-") && o.name == "":
o.name = a
}
}
if o.name == "" {
return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]")
}
if !validGetFields[o.field] {
return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field)
}
return o, nil
}
// getValue opens a session and fetches one field. Every external command goes
// through the runner (openSession additionally creates the bw appdata dir), so
// it is unit-tested with a fake runner.
func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return "", err
}
return bwGet(run, s.env, o.field, o.name)
}
// clipboardDecision picks how to return a secret value. "stdout" prints it (a
// pipe/agent — the intended machine path); "clipboard" copies via OSC52;
// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
// base64 into scrollback, or silently fail because the OSC52 escape goes to a
// non-terminal stderr).
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
if !stdoutTTY {
return "stdout"
}
if terminalAllowed(term, termProgram) && stderrTTY {
return "clipboard"
}
return "refuse"
}
// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only
// when stdout is NOT a terminal (i.e. piped to a machine consumer).
func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
// emitSecret delivers a value in a TTY-aware way (see clipboardDecision). It
// never prints the secret to a terminal's stdout/scrollback.
func emitSecret(value string) {
switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
case "stdout":
fmt.Println(value)
case "clipboard":
fmt.Fprint(os.Stderr, osc52(value))
fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
clearClipboardAfter(30)
default: // refuse
fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
}
}
// clearClipboardAfter spawns a detached background clear so the secret doesn't
// linger in the clipboard. Best-effort.
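func clearClipboardAfter(seconds int) {
cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s'", seconds, osc52clear()))
// A nil Stdout is connected to /dev/null, so the clear escape would never
// reach the terminal. Send it to stderr, where the OSC 52 copy was written.
cmd.Stdout = os.Stderr
_ = cmd.Start() // best-effort
}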
// listNames extracts "name (id)" from `bw list items` JSON; never values.
func listNames(jsonOut string) []string {
var items []struct {
ID string `json:"id"`
Name string `json:"name"`
}
if err := json.Unmarshal([]byte(jsonOut), &items); err != nil {
return nil
}
out := make([]string, 0, len(items))
for _, it := range items {
out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID))
}
return out
}
func runList(run cmdRunner, user, uid, search string) ([]string, error) {
s, err := openSession(run, user, uid)
if err != nil {
return nil, err
}
out, err := run("bw", bwListArgs(search), s.env)
if err != nil {
return nil, err
}
return listNames(out), nil
}
func vaultList(args []string) error {
hardenProcess()
search := ""
for i := 0; i < len(args); i++ {
if args[i] == "--search" && i+1 < len(args) {
search = args[i+1]
i++
} else if strings.HasPrefix(args[i], "--search=") {
search = strings.TrimPrefix(args[i], "--search=")
}
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
names, err := runList(realRunner, vaultCurrentUser(), uid, search)
if err != nil {
return err
}
for _, n := range names {
fmt.Println(n)
}
return nil
}
func vaultSearch(args []string) error {
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault search <query>")
}
return vaultList([]string{"--search", strings.Join(args, " ")})
}
func vaultCode(args []string) error {
hardenProcess()
if len(args) == 0 {
return fmt.Errorf("usage: homelab vault code <name>")
}
name := args[0]
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"})
if err != nil {
return err
}
// TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d).
writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name})
exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run()
emitSecret(val)
return nil
}
// statusSummary reports config/reachability without revealing secrets.
func statusSummary(run cmdRunner, user, uid string) string {
if _, err := loadCreds(run, user); err != nil {
return "vault: not configured — run `homelab vault setup`"
}
s, err := openSession(run, user, uid)
if err != nil {
return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error()
}
if _, err := run("bw", []string{"sync"}, s.env); err != nil {
return "vault: configured + unlocked, but sync/reachability failed: " + err.Error()
}
return "vault: configured, unlocked, reachable ✓"
}
func vaultStatus(args []string) error {
hardenProcess()
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid))
return nil
}
func vaultLock(args []string) error {
uid := vaultCurrentUID()
unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list
if err != nil {
return err
}
defer unlock()
appdata := bwAppDataDir(uid)
_, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata))
_, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata))
if logoutErr == nil {
fmt.Println("locked")
}
return nil // lock/logout best-effort; never error the caller
}
// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the
// email nor the API client_id is a usable credential on its own.
func vaultPatchPublicArgs(user, email, clientID string) []string {
return []string{"kv", "patch", vwCredsPath(user),
"vaultwarden_email=" + email,
"vaultwarden_client_id=" + clientID,
}
}
// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so
// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed
// on stdin by realRunnerStdin.
func vaultPatchSecretArgs(user, key string) []string {
return []string{"kv", "patch", vwCredsPath(user), key + "=-"}
}
// writeCreds stores all four fields in the user's Vault path. The two real
// secrets (master password, API client_secret) go via stdin — never argv.
func writeCreds(user string, c vwCreds) error {
if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil {
return err
}
if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil {
return err
}
return nil
}
// promptNoEcho reads one line without terminal echo (for the master password).
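func promptNoEcho(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
// stty acts on the terminal attached to its stdin. exec.Command leaves Stdin
// as /dev/null by default, so it must be set explicitly; otherwise echo is
// never disabled and the secret is shown while typing.
stty := func(arg string) {
c := exec.Command("stty", arg)
c.Stdin = os.Stdin
_ = c.Run()
}
stty("-echo")
defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }()
r := bufio.NewReader(os.Stdin)
line, err := r.ReadString('\n')
// Trim only the line terminator: a master password / API secret may
// legitimately contain leading/trailing spaces.
return strings.TrimRight(line, "\r\n"), err
}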
func promptLine(prompt string) (string, error) {
fmt.Fprint(os.Stderr, prompt)
line, err := bufio.NewReader(os.Stdin).ReadString('\n')
return strings.TrimSpace(line), err
}
func vaultSetup(args []string) error {
hardenProcess()
fmt.Fprintln(os.Stderr, "One-time setup. Stored ONLY in your own Vault path; the admin never sees it.")
fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.")
email, err := promptLine("Vaultwarden email: ")
if err != nil {
return err
}
clientID, err := promptLine("API key client_id (user.xxxx): ")
if err != nil {
return err
}
clientSecret, err := promptNoEcho("API key client_secret: ")
if err != nil {
return err
}
master, err := promptNoEcho("Master password: ")
if err != nil {
return err
}
if master == "" || clientID == "" || clientSecret == "" {
return fmt.Errorf("all fields are required")
}
c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret}
if err := writeCreds(vaultCurrentUser(), c); err != nil {
return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err)
}
fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…")
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil {
return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err)
}
fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.")
return nil
}
func vaultGet(args []string) error {
hardenProcess()
o, err := parseGetArgs(args)
if err != nil {
return err
}
uid := vaultCurrentUID()
unlock, err := withUserLock(uid)
if err != nil {
return err
}
defer unlock()
user := vaultCurrentUser()
val, err := getValue(realRunner, user, uid, o)
if err != nil {
return err
}
writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name})
if o.json {
if !jsonToStdoutOK(stdoutIsTTY()) {
return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. | cat) or drop --json")
}
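// Marshal rather than %q: Go-syntax quoting can emit escapes such as \x1b or
// \a, which are not valid JSON.
b, err := json.Marshal(map[string]string{o.field: val})
if err != nil {
return err
}
fmt.Println(string(b))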
return nil
}
emitSecret(val)
return nil
}
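
The secret-delivery rules above (OSC 52 escape plus the stdout/clipboard/refuse decision) can be exercised without a real vault. The sketch below copies `osc52`, `terminalAllowed` and `clipboardDecision` from cmd_vault.go into a standalone program; the TERM values passed in `main` are illustrative inputs only.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// osc52 builds the escape: ESC ] 52 ; c ; <base64 payload> BEL.
func osc52(payload string) string {
	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}

// terminalAllowed mirrors the allowlist check from cmd_vault.go.
func terminalAllowed(term, termProgram string) bool {
	t, p := strings.ToLower(term), strings.ToLower(termProgram)
	for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} {
		if strings.Contains(t, ok) || strings.Contains(p, ok) {
			return true
		}
	}
	return false
}

// clipboardDecision mirrors the three-way choice from cmd_vault.go.
func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
	if !stdoutTTY {
		return "stdout" // piped: the machine path
	}
	if terminalAllowed(term, termProgram) && stderrTTY {
		return "clipboard"
	}
	return "refuse"
}

func main() {
	fmt.Printf("%q\n", osc52("hi")) // "\x1b]52;c;aGk=\a"

	fmt.Println(clipboardDecision(false, true, "xterm-256color", "")) // stdout
	fmt.Println(clipboardDecision(true, true, "xterm-kitty", ""))     // clipboard
	fmt.Println(clipboardDecision(true, false, "xterm-kitty", ""))    // refuse
	fmt.Println(clipboardDecision(true, true, "xterm-256color", ""))  // refuse
}
```

The third case shows why stderr is checked separately: the escape is written to stderr, so a redirected stderr would make the copy fail silently even on a supported terminal.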

368
cli/cmd_vault_test.go Normal file

@ -0,0 +1,368 @@
package main
import (
"encoding/base64"
"fmt"
"os"
"reflect"
"strings"
"testing"
)
func TestVaultCommandsRegistered(t *testing.T) {
want := map[string]Tier{
"vault setup": TierWrite,
"vault status": TierRead,
"vault list": TierRead,
"vault get": TierRead,
"vault search": TierRead,
"vault code": TierRead,
"vault lock": TierWrite,
}
got := map[string]Tier{}
for _, c := range vaultCommands() {
got[c.name()] = c.Tier
}
for name, tier := range want {
if got[name] != tier {
t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "")
}
}
}
func TestVaultGroupInRegistry(t *testing.T) {
if !isCommandGroup(buildRegistry(), "vault") {
t.Fatal("`vault` group not wired into buildRegistry()")
}
}
func TestVaultCredsPath(t *testing.T) {
if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" {
t.Fatalf("vwCredsPath = %q", got)
}
}
func TestBwAppDataDir(t *testing.T) {
if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" {
t.Fatalf("bwAppDataDir = %q", got)
}
}
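// fakeRunner records calls and returns canned stdout/err. Keys are the full
// command line (name + " " + space-joined argv) and are prefix-matched, so keep
// keys non-overlapping: Go map iteration order is unspecified.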
type fakeRunner struct {
calls [][]string
out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched
err map[string]error
lastEnv []string
}
func (f *fakeRunner) run(name string, argv, envv []string) (string, error) {
f.calls = append(f.calls, append([]string{name}, argv...))
f.lastEnv = envv
key := name + " " + strings.Join(argv, " ")
for k, v := range f.out {
if strings.HasPrefix(key, k) {
return v, f.err[k]
}
}
return "", f.err[key]
}
func TestLoadCredsReadsFourFields(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me",
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek",
}}
c, err := loadCreds(f.run, "emo")
if err != nil {
t.Fatalf("loadCreds: %v", err)
}
want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"}
if !reflect.DeepEqual(c, want) {
t.Fatalf("loadCreds = %+v want %+v", c, want)
}
}
func TestLoadCredsUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // every field empty
if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") {
t.Fatalf("want 'not configured' error, got %v", err)
}
}
func TestBwEnvCarriesSecretsNotArgv(t *testing.T) {
c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"}
env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY")
joined := strings.Join(env, "\n")
for _, want := range []string{
"BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2",
"BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw",
} {
if !strings.Contains(joined, want) {
t.Errorf("bwSecretEnv missing %q", want)
}
}
if !strings.Contains(joined, "PATH=") {
t.Error("bwSecretEnv must keep a PATH so node/bw resolve")
}
}
func TestBwGetArgsHasNoSessionInArgv(t *testing.T) {
argv := bwGetArgs("password", "github")
for _, a := range argv {
if strings.Contains(a, "SESSION") || a == "--session" {
t.Fatalf("session must travel via env, not argv: %v", argv)
}
}
if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) {
t.Fatalf("bwGetArgs = %v", argv)
}
}
func TestBwListArgs(t *testing.T) {
if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) {
t.Fatalf("bwListArgs('') = %v", got)
}
if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) {
t.Fatalf("bwListArgs('git') = %v", got)
}
}
func TestBwUnlockReturnsSession(t *testing.T) {
f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}}
env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "")
sess, err := bwUnlock(f.run, env)
if err != nil || sess != "THE-SESSION-KEY" {
t.Fatalf("bwUnlock = %q, %v", sess, err)
}
// argv must use --passwordenv + --raw, never the password literal
last := f.calls[len(f.calls)-1]
if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" {
t.Fatalf("unlock argv = %v", last)
}
}
func TestReturnMode(t *testing.T) {
if returnMode(true) != "clipboard" || returnMode(false) != "stdout" {
t.Fatal("returnMode wrong")
}
}
func TestOSC52Encode(t *testing.T) {
got := osc52("secret")
want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a"
if got != want {
t.Fatalf("osc52 = %q want %q", got, want)
}
if osc52clear() != "\x1b]52;c;\a" {
t.Fatalf("osc52clear wrong: %q", osc52clear())
}
}
func TestTerminalAllowed(t *testing.T) {
allow := []struct{ term, prog string }{
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""},
{"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"},
}
for _, c := range allow {
if !terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog)
}
}
deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}}
for _, c := range deny {
if terminalAllowed(c.term, c.prog) {
t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog)
}
}
}
func TestOpLogLineHasNoSecretOrItem(t *testing.T) {
line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"})
for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} {
if !strings.Contains(line, must) {
t.Errorf("op-log missing %q: %s", must, line)
}
}
for _, mustNot := range []string{"Chase", "password", "secret"} {
if strings.Contains(line, mustNot) {
t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line)
}
}
}
func TestLockPath(t *testing.T) {
if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" {
t.Fatalf("vaultLockPath = %q", got)
}
}
func TestParseGetArgs(t *testing.T) {
o, err := parseGetArgs([]string{"github", "--field", "username", "--json"})
if err != nil || o.name != "github" || o.field != "username" || !o.json {
t.Fatalf("parseGetArgs = %+v err=%v", o, err)
}
d, _ := parseGetArgs([]string{"github"})
if d.field != "password" || d.json {
t.Fatalf("defaults wrong: %+v", d)
}
if _, err := parseGetArgs([]string{}); err == nil {
t.Fatal("get with no name must error")
}
if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil {
t.Fatal("invalid --field must error")
}
}
func TestListNamesParsing(t *testing.T) {
// bw list items returns JSON; listNames extracts name + id only.
js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]`
names := listNames(js)
if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" {
t.Fatalf("listNames = %v", names)
}
}
func TestStatusSummaryUnconfigured(t *testing.T) {
f := &fakeRunner{out: map[string]string{}} // no creds
s := statusSummary(f.run, "emo", "1001")
if !strings.Contains(s, "not configured") {
t.Fatalf("status = %q", s)
}
}
func TestVaultPatchPublicArgs(t *testing.T) {
got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci")
want := []string{"kv", "patch", "secret/workstation/claude-users/emo",
"vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchPublicArgs = %v", got)
}
for _, a := range got {
if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") {
t.Fatalf("secret key leaked into public argv: %v", got)
}
}
}
func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) {
for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} {
got := vaultPatchSecretArgs("emo", key)
want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got)
}
if got[len(got)-1] != key+"=-" {
t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got)
}
}
}
// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the
// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret
// value may appear in any command's argv — secrets travel via env/stdin only.
func TestNoSecretInArgvAcrossFlow(t *testing.T) {
uid := fmt.Sprintf("%d", os.Getuid())
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESSIONXYZ",
"bw get password github": "p@ss",
}}
if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil {
t.Fatalf("getValue: %v", err)
}
for _, call := range f.calls {
for _, arg := range call {
for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} {
if strings.Contains(arg, s) {
t.Errorf("secret %q leaked into argv: %v", s, call)
}
}
}
}
if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") {
t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)")
}
}
func TestClipboardDecision(t *testing.T) {
cases := []struct {
stdoutTTY, stderrTTY bool
term, prog, want string
}{
{false, true, "xterm-kitty", "", "stdout"},
{true, true, "xterm-kitty", "", "clipboard"},
{true, true, "dumb", "", "refuse"},
{true, false, "xterm-kitty", "", "refuse"},
}
for _, c := range cases {
if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want {
t.Errorf("clipboardDecision(%v,%v,%q,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, c.prog, got, c.want)
}
}
}
func TestJSONToStdoutOK(t *testing.T) {
if jsonToStdoutOK(true) {
t.Error("must refuse JSON secret on a terminal")
}
if !jsonToStdoutOK(false) {
t.Error("must allow JSON when piped")
}
}
func TestBwNeedsLogin(t *testing.T) {
if !bwNeedsLogin(`{"status":"unauthenticated"}`) {
t.Error("unauthenticated → needs login")
}
if bwNeedsLogin(`{"status":"locked"}`) {
t.Error("locked → no login (just unlock)")
}
if bwNeedsLogin(`{"status":"unlocked"}`) {
t.Error("unlocked → no login")
}
if !bwNeedsLogin(`not json`) {
t.Error("unparseable → attempt login")
}
}
func TestVaultHelpMentionsSecurity(t *testing.T) {
h := vaultHelp()
for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} {
if !strings.Contains(h, want) {
t.Errorf("vault help missing %q", want)
}
}
}
func TestVaultBareGroupRegistered(t *testing.T) {
for _, c := range vaultCommands() {
if len(c.Path) == 1 && c.Path[0] == "vault" {
return
}
}
t.Fatal("bare `vault` help command not registered")
}
// getValue is the testable core: given a runner + opts, returns the secret value.
func TestGetValueFlow(t *testing.T) {
f := &fakeRunner{out: map[string]string{
"vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw",
"vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x",
"vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs",
"bw status": `{"status":"locked"}`,
"bw unlock": "SESS",
"bw get password github": "p@ss",
}}
// Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds.
uid := fmt.Sprintf("%d", os.Getuid())
val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"})
if err != nil || val != "p@ss" {
t.Fatalf("getValue = %q, %v", val, err)
}
}
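
Review sketch for `TestOSC52Encode` above: the test pins the clipboard escape sequence as `ESC ] 52 ; c ;` followed by the base64 payload and a BEL terminator, with an empty payload used for clearing. The functions below are a minimal reimplementation that would satisfy those assertions. They are an assumption about the shape of the real `osc52`/`osc52clear`, whose bodies are not shown in this part of the diff.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// osc52 wraps payload in an OSC 52 "set clipboard" sequence:
// ESC ] 52 ; c ; <base64(payload)> BEL
// Illustrative only; the PR's actual implementation may differ.
func osc52(payload string) string {
	return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a"
}

// osc52clear sends an empty payload, matching the string the test expects
// for clearing the clipboard.
func osc52clear() string {
	return "\x1b]52;c;\a"
}

func main() {
	// Print with %q so the escape bytes are visible instead of being
	// interpreted by the terminal.
	fmt.Printf("%q\n", osc52("secret"))
	fmt.Printf("%q\n", osc52clear())
}
```

Because the sequence is interpreted by the terminal emulator, whether it reaches the system clipboard depends on the terminal, which is presumably why `terminalAllowed` gates it.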

212
cli/cmd_work.go Normal file

@ -0,0 +1,212 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
func workCommands() []Command {
return []Command{
{Path: []string{"work", "start"}, Tier: TierWrite,
Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart},
{Path: []string{"work", "land"}, Tier: TierWrite,
Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand},
{Path: []string{"work", "clean"}, Tier: TierWrite,
Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean},
}
}
// flagValue extracts `--name value` or `--name=value` from args.
func flagValue(args []string, name string) string {
for i, a := range args {
if a == name && i+1 < len(args) {
return args[i+1]
}
if strings.HasPrefix(a, name+"=") {
return strings.TrimPrefix(a, name+"=")
}
}
return ""
}
func remotesOrEmpty(repoRoot string) []string {
r, _ := gitRemotes(repoRoot)
return r
}
// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master.
func workStart(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work start <topic>")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
branch := currentUser() + "/" + topic
wtRel := filepath.Join(".worktrees", topic)
ensureWorktreesIgnored(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch %s failed: %w", remote, err)
}
if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil {
return fmt.Errorf("worktree add failed: %w", err)
}
wtPath := filepath.Join(repoRoot, wtRel)
fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote)
fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath)
return nil
}
// workLand integrates the current branch into master: fetch, merge master in,
// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch
// fallback when the direct push is rejected (e.g. branch protection).
func workLand(args []string) error {
verifyCmd := flagValue(args, "--verify-cmd")
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD")
if err != nil {
return err
}
if branch == "master" || branch == "main" {
return fmt.Errorf("refusing to land: already on %s", branch)
}
remote := preferRemote(remotesOrEmpty(repoRoot))
if remote == "" {
return fmt.Errorf("no git remote configured in %s", repoRoot)
}
flags := cryptFlagsFor(repoRoot)
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return fmt.Errorf("fetch failed: %w", err)
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err)
}
if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil {
return fmt.Errorf("not landing: %w", err)
}
if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil {
return landFallback(repoRoot, flags, remote, branch, err)
}
fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote)
if containsArg(args, "--no-ci-watch") {
fmt.Println("homelab: --no-ci-watch set; not waiting for CI.")
return nil
}
landed, err := gitOutput(repoRoot, "rev-parse", "HEAD")
if err != nil {
return fmt.Errorf("landed, but could not resolve HEAD to watch CI: %w", err)
}
fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...")
if err := ciWatch([]string{landed}); err != nil {
return fmt.Errorf("landed, but CI did not go green: %w", err)
}
return nil
}
// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If
// neither is available it REFUSES (returns an error) unless allowSkip is set —
// landing to master unverified must be a deliberate choice (--no-verify).
func runVerify(repoRoot, verifyCmd string, allowSkip bool) error {
if verifyCmd != "" {
fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd)
return runStreamingIn(repoRoot, "sh", "-c", verifyCmd)
}
if isFile(filepath.Join(repoRoot, "go.mod")) {
fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...")
return runStreamingIn(repoRoot, "go", "test", "./...")
}
if allowSkip {
fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification")
return nil
}
return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying")
}
// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections
// by fetching + merging master and retrying.
func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error {
var lastErr error
for i := 0; i < attempts; i++ {
err := gitStream(repoRoot, flags, "push", remote, "HEAD:master")
if err == nil {
return nil
}
lastErr = err
if i < attempts-1 {
fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying")
if err := gitStream(repoRoot, flags, "fetch", remote); err != nil {
return err
}
if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil {
return err
}
}
}
return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr)
}
// landFallback pushes the feature branch when the direct master push is rejected
// (e.g. branch protection), so the work isn't lost and a PR can be opened.
func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error {
fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr)
fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch)
if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil {
return fmt.Errorf("fallback branch push also failed: %w", err)
}
fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote)
return nil
}
// workClean removes a task's worktree and branch. Run from the main checkout.
func workClean(args []string) error {
topic, _ := firstPositional(args)
if topic == "" {
return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)")
}
cwd, _ := os.Getwd()
repoRoot, err := gitRepoRoot(cwd)
if err != nil {
return fmt.Errorf("not in a git repository: %w", err)
}
flags := cryptFlagsFor(repoRoot)
wtRel := filepath.Join(".worktrees", topic)
branch := currentUser() + "/" + topic
if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil {
return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err)
}
if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil {
fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err)
fmt.Printf("homelab: removed worktree %s; branch %s was kept\n", wtRel, branch)
return nil
}
fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch)
return nil
}
// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored.
func ensureWorktreesIgnored(repoRoot string) {
if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil {
return
}
gi := filepath.Join(repoRoot, ".gitignore")
f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
if err != nil {
return
}
defer f.Close()
if _, err := f.WriteString("\n.worktrees/\n"); err == nil {
fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore")
}
}
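
Review sketch for `pushWithRetry` above: the loop tries an operation, and between failed attempts runs a recovery step (fetch + merge in the real code) before trying again, aborting early if recovery itself fails. The version below isolates that control flow with plain function values so it can be exercised without git. `retryWithResync` and `errRejected` are hypothetical names used only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errRejected simulates a non-fast-forward rejection. Illustrative only.
var errRejected = errors.New("push rejected (simulated)")

// retryWithResync calls try up to attempts times. After each failure except
// the last it calls resync; a resync failure aborts immediately.
func retryWithResync(attempts int, try, resync func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		err := try()
		if err == nil {
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			if rerr := resync(); rerr != nil {
				return rerr
			}
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	err := retryWithResync(3,
		func() error {
			calls++
			if calls < 3 {
				return errRejected // first two pushes are rejected
			}
			return nil
		},
		func() error {
			fmt.Println("resync: fetch + merge (simulated)")
			return nil
		},
	)
	fmt.Println("calls:", calls, "err:", err)
}
```

Skipping the resync after the final attempt mirrors the `i < attempts-1` guard in the diff: there is no point merging again when no further push will follow.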

32
cli/cmd_work_test.go Normal file

@ -0,0 +1,32 @@
package main
import "testing"
func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) {
dir := t.TempDir() // no go.mod, no verify cmd
if err := runVerify(dir, "", false); err == nil {
t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent")
}
if err := runVerify(dir, "", true); err != nil {
t.Fatalf("runVerify must skip when --no-verify set, got: %v", err)
}
}
func TestFlagValue(t *testing.T) {
cases := []struct {
args []string
name string
want string
}{
{[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."},
{[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"},
{[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"},
{[]string{"topic"}, "--verify-cmd", ""},
{[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value
}
for _, c := range cases {
if got := flagValue(c.args, c.name); got != c.want {
t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want)
}
}
}

104
cli/command.go Normal file

@ -0,0 +1,104 @@
package main
import (
"encoding/json"
"fmt"
"sort"
"strings"
)
// Tier classifies whether a command observes (read) or mutates (write) state.
// v0.1 allows everything; the tier is recorded so a classifier hook can gate
// writes later without restructuring (see docs/adr/0005).
type Tier string
const (
TierRead Tier = "read"
TierWrite Tier = "write"
)
// Command is one homelab verb. Path is the token sequence that selects it,
// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path.
type Command struct {
Path []string
Tier Tier
Summary string
Run func(args []string) error
}
// dispatch routes args to the command whose Path is the longest matching prefix
// of args, passing the remaining args to its Run.
func dispatch(reg []Command, args []string) error {
best := -1
bestLen := 0
for i, c := range reg {
if len(c.Path) > len(args) {
continue
}
match := true
for j, p := range c.Path {
if args[j] != p {
match = false
break
}
}
if match && len(c.Path) >= bestLen {
best = i
bestLen = len(c.Path)
}
}
if best < 0 {
return fmt.Errorf("unknown command: %q", strings.Join(args, " "))
}
matched := reg[best]
runErr := matched.Run(args[bestLen:])
emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command
return runErr
}
// name is the space-joined verb path, e.g. "tf plan".
func (c Command) name() string { return strings.Join(c.Path, " ") }
// sortedByName returns a copy of reg ordered by verb path for stable output.
func sortedByName(reg []Command) []Command {
out := make([]Command, len(reg))
copy(out, reg)
sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() })
return out
}
// manifestText renders one aligned line per command: "<path> <tier> <summary>".
// This is the cheap progressive-discovery entrypoint (see docs/adr/0004).
func manifestText(reg []Command) string {
cmds := sortedByName(reg)
width := 0
for _, c := range cmds {
if n := len(c.name()); n > width {
width = n
}
}
var b strings.Builder
for _, c := range cmds {
fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary)
}
return b.String()
}
// manifestJSON renders the registry as a JSON array of {command, tier, summary}
// so agents can parse the full surface in one call.
func manifestJSON(reg []Command) (string, error) {
type entry struct {
Command string `json:"command"`
Tier string `json:"tier"`
Summary string `json:"summary"`
}
entries := make([]entry, 0, len(reg))
for _, c := range sortedByName(reg) {
entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary})
}
b, err := json.MarshalIndent(entries, "", " ")
if err != nil {
return "", err
}
return string(b), nil
}
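
Review sketch for `dispatch` above: routing picks the registered path that is the longest prefix of the input tokens and hands the remainder to the command. The standalone function below shows only that selection step on bare string slices. `longestPrefix` is a hypothetical name, and unlike the diff it skips empty paths and breaks ties in favour of the earlier entry, which is an assumption made to keep the example unambiguous.

```go
package main

import "fmt"

// longestPrefix returns the index of the path in paths that is the longest
// prefix of args, plus that path's length. It returns (-1, 0) on no match.
// Illustrative only; not the dispatcher from the diff.
func longestPrefix(paths [][]string, args []string) (int, int) {
	best, bestLen := -1, 0
	for i, p := range paths {
		if len(p) == 0 || len(p) > len(args) {
			continue
		}
		match := true
		for j, tok := range p {
			if args[j] != tok {
				match = false
				break
			}
		}
		if match && len(p) > bestLen {
			best, bestLen = i, len(p)
		}
	}
	return best, bestLen
}

func main() {
	paths := [][]string{{"tf"}, {"tf", "plan"}, {"claim"}}
	args := []string{"tf", "plan", "vault", "--json"}

	idx, n := longestPrefix(paths, args)
	// args[n:] is what the selected command would receive as its own args.
	fmt.Println(idx, n, args[n:])
}
```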

73
cli/command_test.go Normal file

@ -0,0 +1,73 @@
package main
import (
"encoding/json"
"reflect"
"strings"
"testing"
)
// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the
// command whose Path is the longest matching prefix of the input tokens, and
// hand the command the remaining args.
func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) {
var gotArgs []string
ran := ""
reg := []Command{
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource",
Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }},
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack",
Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }},
}
if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil {
t.Fatalf("dispatch returned error: %v", err)
}
if ran != "tf plan" {
t.Fatalf("routed to %q, want %q", ran, "tf plan")
}
if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) {
t.Fatalf("command got args %v, want %v", gotArgs, want)
}
}
func TestDispatchUnknownCommandErrors(t *testing.T) {
reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}}
if err := dispatch(reg, []string{"bogus"}); err == nil {
t.Fatal("expected error for unknown command, got nil")
}
}
// The manifest is the progressive-discovery entrypoint: one line per command
// showing the full verb path, its tier, and summary, sorted for stable output.
func TestManifestTextListsEveryCommandWithTier(t *testing.T) {
reg := []Command{
{Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"},
{Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"},
}
out := manifestText(reg)
for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} {
if !strings.Contains(out, want) {
t.Errorf("manifest text missing %q\n---\n%s", want, out)
}
}
// sorted: claim (c) must appear before tf plan (t)
if strings.Index(out, "claim") > strings.Index(out, "tf plan") {
t.Errorf("manifest not sorted by path:\n%s", out)
}
}
func TestManifestJSONIsParsableAndTagged(t *testing.T) {
reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}}
out, err := manifestJSON(reg)
if err != nil {
t.Fatalf("manifestJSON error: %v", err)
}
var got []map[string]string
if err := json.Unmarshal([]byte(out), &got); err != nil {
t.Fatalf("manifest JSON not parsable: %v\n%s", err, out)
}
if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" {
t.Fatalf("unexpected manifest JSON: %v", got)
}
}

98
cli/homelab.go Normal file

@ -0,0 +1,98 @@
package main
import (
"fmt"
"strings"
)
// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z".
var version = "dev"
// buildRegistry returns every homelab verb. New verb-groups append here.
func buildRegistry() []Command {
var reg []Command
reg = append(reg, claimCommands()...)
reg = append(reg, tfCommands()...)
reg = append(reg, workCommands()...)
reg = append(reg, k8sCommands()...)
reg = append(reg, memoryCommands()...)
reg = append(reg, ciCommands()...)
reg = append(reg, deployCommands()...)
reg = append(reg, netCommands()...)
reg = append(reg, obsCommands()...)
reg = append(reg, usageCommands()...)
reg = append(reg, haCommands()...)
reg = append(reg, browserCommands()...)
reg = append(reg, vaultCommands()...)
return reg
}
// dispatchTop handles the homelab verb surface. handled=false means the args are
// not a homelab verb, so main() falls back to the legacy -use-case path.
func dispatchTop(args []string) (handled bool, err error) {
if len(args) == 0 {
fmt.Print(usage())
return true, nil
}
switch args[0] {
case "help", "-h", "--help":
fmt.Print(usage())
return true, nil
case "version", "--version":
fmt.Println("homelab " + version)
return true, nil
case "manifest":
reg := buildRegistry()
if containsArg(args[1:], "--json") {
out, err := manifestJSON(reg)
if err != nil {
return true, err
}
fmt.Println(out)
return true, nil
}
fmt.Print(manifestText(reg))
return true, nil
}
if strings.HasPrefix(args[0], "-") {
return false, nil
}
reg := buildRegistry()
if !isCommandGroup(reg, args[0]) {
return false, nil
}
return true, dispatch(reg, args)
}
func isCommandGroup(reg []Command, group string) bool {
for _, c := range reg {
if len(c.Path) > 0 && c.Path[0] == group {
return true
}
}
return false
}
func containsArg(args []string, want string) bool {
for _, a := range args {
if a == want {
return true
}
}
return false
}
func usage() string {
var b strings.Builder
fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version)
b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n")
for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") {
if line != "" {
b.WriteString(" " + line + "\n")
}
}
b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n")
b.WriteString(" version print version\n")
b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n")
return b.String()
}

138
cli/k8s.go Normal file

@ -0,0 +1,138 @@
package main
import (
"fmt"
"os/exec"
"strings"
)
// kubectl helpers use the ambient kubeconfig (no per-call auth flags).
func kubectlBase(ns string, args ...string) []string {
var full []string
if ns != "" {
full = append(full, "-n", ns)
}
return append(full, args...)
}
func kubectlStream(ns string, args ...string) error {
return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...)
}
// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods).
func kubectlCapture(ns string, args ...string) (string, error) {
out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output()
return strings.TrimSpace(string(out)), err
}
// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs.
type k8sTarget struct {
app string
ns string
pod string
container string
selector string
tty bool
rest []string // passthrough flags and, after `--`, the exec command
}
// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`.
// The first bare token is the app; unknown flags pass through in rest.
func parseK8sTarget(args []string) k8sTarget {
t := k8sTarget{}
i := 0
take := func() string {
if i+1 < len(args) {
i++
return args[i]
}
return ""
}
for i = 0; i < len(args); i++ {
a := args[i]
switch {
case a == "--":
t.rest = append(t.rest, args[i+1:]...)
return t
case a == "-n" || a == "--namespace":
t.ns = take()
case strings.HasPrefix(a, "--namespace="):
t.ns = strings.TrimPrefix(a, "--namespace=")
case a == "--pod":
t.pod = take()
case strings.HasPrefix(a, "--pod="):
t.pod = strings.TrimPrefix(a, "--pod=")
case a == "-c" || a == "--container":
t.container = take()
case strings.HasPrefix(a, "--container="):
t.container = strings.TrimPrefix(a, "--container=")
case a == "-l" || a == "--selector":
t.selector = take()
case strings.HasPrefix(a, "--selector="):
t.selector = strings.TrimPrefix(a, "--selector=")
case a == "--tty" || a == "-it" || a == "-ti":
t.tty = true
case !strings.HasPrefix(a, "-") && t.app == "":
t.app = a
default:
t.rest = append(t.rest, a)
}
}
return t
}
// namespace defaults to the app name (most namespaces hold exactly one app).
func (t k8sTarget) namespace() string {
if t.ns != "" {
return t.ns
}
return t.app
}
// objectRef is the kubectl object for logs/exec: an explicit pod, else
// deploy/<app> (kubectl resolves a pod from the Deployment).
func (t k8sTarget) objectRef() string {
if t.pod != "" {
return "pod/" + t.pod
}
return "deploy/" + t.app
}
// --- database access (the dbaas exec pattern) ---
type dbPlan struct {
ns string
pod string // explicit pod (e.g. mysql-standalone-0)
selector string // resolve the pod by this label when pod == "" (CNPG primary)
container string // "" = default container
argv []string // command + args to run inside the pod
}
// planDBExec builds the in-pod command to run sql against app's database.
// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a
// Service, not an exec target), psql -U postgres -d <db>.
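// MySQL: mysql-standalone-0, password taken from the pod's MYSQL_ROOT_PASSWORD
// env var, so the value never appears in the local kubectl argv (bash expands it inside the pod).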
// dbName defaults to app. sql empty => interactive client.
func planDBExec(app, dbName, sql string, mysql bool) dbPlan {
if dbName == "" {
dbName = app
}
if mysql {
inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName))
if sql != "" {
inner += " -e " + shellQuote(sql)
}
return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}}
}
argv := []string{"psql", "-U", "postgres", "-d", dbName}
if sql != "" {
argv = append(argv, "-tAc", sql)
}
return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv}
}
// shellQuote single-quotes s for safe embedding in a bash -c string.
func shellQuote(s string) string {
return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}
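
Review sketch for `shellQuote` above: the function wraps a value in single quotes and rewrites each embedded single quote as `'\''` (close the quote, emit an escaped quote, reopen). The program below copies that one-liner verbatim and prints the quoted form of a hostile-looking input. The sample SQL string is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote is copied from the diff: single-quote s for a bash -c string.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

func main() {
	// Made-up input containing quotes and a command substitution attempt.
	sql := "SELECT 'x'; $(whoami)"

	// Inside single quotes bash performs no expansion, so $(whoami) stays
	// literal text and the embedded quotes cannot terminate the argument.
	fmt.Println("mysql -e " + shellQuote(sql))
}
```

This is why `planDBExec` can splice `dbName` and `sql` into the `bash -c` string while leaving `$MYSQL_ROOT_PASSWORD` in double quotes to be expanded deliberately.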

65
cli/k8s_test.go Normal file

@ -0,0 +1,65 @@
package main
import (
"reflect"
"strings"
"testing"
)
func TestParseK8sTarget(t *testing.T) {
got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"})
want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}}
if !reflect.DeepEqual(got, want) {
t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want)
}
}
func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) {
if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" {
t.Errorf("namespace() = %q, want immich", ns)
}
if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" {
t.Errorf("namespace() = %q, want dbaas", ns)
}
}
func TestK8sTargetObjectRef(t *testing.T) {
if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" {
t.Errorf("objectRef() = %q, want deploy/tripit", r)
}
if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" {
t.Errorf("objectRef() = %q, want pod/tripit-abc", r)
}
}
func TestPlanDBExecPostgresDefault(t *testing.T) {
p := planDBExec("fire-planner", "", "SELECT 1", false)
// pg-cluster-rw is a Service, so the PG plan resolves the primary POD by
// label rather than naming an (un-exec-able) Service.
if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" {
t.Fatalf("unexpected pg target: %+v", p)
}
// db name defaults to the app; SQL passed via -tAc
joined := strings.Join(p.argv, " ")
if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") {
t.Fatalf("pg argv missing db/sql: %v", p.argv)
}
}
func TestPlanDBExecMysqlEnvPassword(t *testing.T) {
p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true)
if p.pod != "mysql-standalone-0" {
t.Fatalf("unexpected mysql pod: %+v", p)
}
inner := strings.Join(p.argv, " ")
// password must come from the env var, never inline
if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) {
t.Fatalf("mysql must use env password wrapper: %v", p.argv)
}
}
func TestShellQuoteEscapes(t *testing.T) {
if got := shellQuote("a'b"); got != `'a'\''b'` {
t.Fatalf("shellQuote = %q", got)
}
}

@ -26,8 +26,16 @@ var (
 )
 func main() {
-err := run()
-if err != nil {
+// homelab verb surface (work/tf/claim/...) is tried first; if the args are
+// not a homelab verb, fall through to the legacy webhook -use-case path.
+if handled, err := dispatchTop(os.Args[1:]); handled {
+if err != nil {
+fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
+os.Exit(1)
+}
+return
+}
+if err := run(); err != nil {
 glog.Errorf("run failed: %s", err.Error())
 os.Exit(255)
 }

103
cli/memory.go Normal file
@ -0,0 +1,103 @@
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
)
// defaultMemoryURL is used when no env override is present (agents normally have
// CLAUDE_MEMORY_API_URL set by the memory hooks).
const defaultMemoryURL = "https://claude-memory.viktorbarzin.me"
type memoryClient struct {
base string
key string
http *http.Client
}
func firstEnv(keys ...string) string {
for _, k := range keys {
if v := os.Getenv(k); v != "" {
return v
}
}
return ""
}
func resolveMemoryBase() string {
if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" {
return strings.TrimRight(b, "/")
}
return defaultMemoryURL
}
// newMemoryClient talks straight to the claude-memory HTTP API (the same backend
// the MCP wraps), so it works even when the MCP frontend is down.
func newMemoryClient() (*memoryClient, error) {
key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY")
if key == "" {
return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)")
}
return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil
}
func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) {
var r io.Reader
if body != nil {
b, err := json.Marshal(body)
if err != nil {
return nil, err
}
r = bytes.NewReader(b)
}
req, err := http.NewRequest(method, c.base+path, r)
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+c.key)
if body != nil {
req.Header.Set("Content-Type", "application/json")
}
resp, err := c.http.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
out, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out)))
}
return out, nil
}
// Request bodies mirror src/claude_memory/api/models.py.
type memRecallReq struct {
Context string `json:"context"`
ExpandedQuery string `json:"expanded_query,omitempty"`
Category string `json:"category,omitempty"`
SortBy string `json:"sort_by,omitempty"`
Limit int `json:"limit,omitempty"`
}
type memStoreReq struct {
Content string `json:"content"`
Category string `json:"category,omitempty"`
Tags string `json:"tags,omitempty"`
ExpandedKeywords string `json:"expanded_keywords,omitempty"`
Importance float64 `json:"importance"`
ForceSensitive bool `json:"force_sensitive,omitempty"`
}
type memUpdateReq struct {
Content *string `json:"content,omitempty"`
Tags *string `json:"tags,omitempty"`
Importance *float64 `json:"importance,omitempty"`
ExpandedKeywords *string `json:"expanded_keywords,omitempty"`
}

51
cli/memory_test.go Normal file
@ -0,0 +1,51 @@
package main
import (
"encoding/json"
"os"
"strings"
"testing"
)
func TestResolveMemoryBase(t *testing.T) {
old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL")
defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }()
os.Unsetenv("CLAUDE_MEMORY_API_URL")
os.Unsetenv("MEMORY_API_URL")
if got := resolveMemoryBase(); got != defaultMemoryURL {
t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL)
}
os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed
if got := resolveMemoryBase(); got != "https://m.example" {
t.Errorf("resolveMemoryBase() = %q, want https://m.example", got)
}
}
func TestMemStoreReqAlwaysSendsImportance(t *testing.T) {
b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5})
s := string(b)
if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) {
t.Fatalf("memStoreReq JSON missing fields: %s", s)
}
}
func TestMemUpdateReqOmitsUnsetFields(t *testing.T) {
tags := "a,b"
b, _ := json.Marshal(memUpdateReq{Tags: &tags})
s := string(b)
if strings.Contains(s, "content") || strings.Contains(s, "importance") {
t.Fatalf("unset update fields must be omitted: %s", s)
}
if !strings.Contains(s, `"tags":"a,b"`) {
t.Fatalf("set field missing: %s", s)
}
}
func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) {
b, _ := json.Marshal(memRecallReq{Context: "hi"})
s := string(b)
if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") {
t.Fatalf("empty optionals must be omitted: %s", s)
}
}

58
cli/presence.go Normal file
@ -0,0 +1,58 @@
package main
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// validPresenceKinds is the fixed label taxonomy accepted by the presence board.
var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"}
// presenceScript locates the presence CLI — homelab WRAPS it, it does not
// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence.
func presenceScript() string {
if p := os.Getenv("HOMELAB_PRESENCE"); p != "" {
return p
}
home, err := os.UserHomeDir()
if err != nil {
return "presence"
}
return filepath.Join(home, "code", "scripts", "presence")
}
// validateLabel checks a presence label is <kind>:<name> with a known kind.
func validateLabel(label string) error {
parts := strings.SplitN(label, ":", 2)
if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
return fmt.Errorf("label must be <kind>:<name> (e.g. stack:vault), got %q", label)
}
for _, k := range validPresenceKinds {
if parts[0] == k {
return nil
}
}
return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", "))
}
// presenceClaim claims label on the board with a purpose note.
func presenceClaim(label, purpose string) error {
if err := validateLabel(label); err != nil {
return err
}
args := []string{"claim", label}
if purpose != "" {
args = append(args, "--purpose", purpose)
}
return runStreaming(presenceScript(), args...)
}
// presenceRelease releases a prior claim on label.
func presenceRelease(label string) error {
if err := validateLabel(label); err != nil {
return err
}
return runStreaming(presenceScript(), "release", label)
}

24
cli/presence_test.go Normal file
@ -0,0 +1,24 @@
package main
import "testing"
func TestValidateLabelAcceptsTaxonomy(t *testing.T) {
good := []string{
"stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster",
"infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data",
}
for _, l := range good {
if err := validateLabel(l); err != nil {
t.Errorf("validateLabel(%q) = %v, want nil", l, err)
}
}
}
func TestValidateLabelRejectsBadLabels(t *testing.T) {
bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""}
for _, l := range bad {
if err := validateLabel(l); err == nil {
t.Errorf("validateLabel(%q) = nil, want error", l)
}
}
}

76
cli/probe.go Normal file
@ -0,0 +1,76 @@
package main
import (
"context"
"crypto/tls"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os/exec"
"strings"
"time"
)
// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it.
const internalLBIP = "10.0.20.203"
// clientDialingIP returns an http.Client that dials ip for ANY host while keeping
// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve
// host:443:ip`. TLS verification is skipped (these are reachability/observability
// probes, not security checks; internal .lan vhosts may serve a non-matching cert).
func clientDialingIP(ip string, timeout time.Duration) *http.Client {
d := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if i := strings.LastIndex(addr, ":"); i >= 0 {
addr = ip + addr[i:]
}
return d.DialContext(ctx, network, addr)
},
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}
return &http.Client{Timeout: timeout, Transport: tr}
}
// probeURL issues a GET and returns status code + elapsed time.
func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) {
start := time.Now()
resp, err := c.Get(rawurl)
dur := time.Since(start)
if err != nil {
return 0, dur, err
}
resp.Body.Close()
return resp.StatusCode, dur, nil
}
// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body.
func lbGetBody(host, path string, q url.Values) ([]byte, error) {
u := "https://" + host + path
if len(q) > 0 {
u += "?" + q.Encode()
}
resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
if resp.StatusCode >= 300 {
return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return body, nil
}
// dig runs `dig +short` against a resolver, optionally for a record type.
func dig(name, server, rrtype string) (string, error) {
args := []string{"+short", "+time=3", "+tries=1"}
if rrtype != "" {
args = append(args, rrtype)
}
args = append(args, name, "@"+server)
out, err := exec.Command("dig", args...).Output()
return strings.TrimSpace(string(out)), err
}

49
cli/probe_test.go Normal file
@ -0,0 +1,49 @@
package main
import "testing"
func TestQueryArg(t *testing.T) {
if got := queryArg([]string{"up"}, nil); got != "up" {
t.Errorf(`queryArg(["up"]) = %q, want "up"`, got)
}
if got := queryArg([]string{"up", "--json"}, nil); got != "up" {
t.Errorf(`--json should be dropped, got %q`, got)
}
// single quoted PromQL arrives as one token
if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" {
t.Errorf(`quoted query mangled: %q`, got)
}
// value-flags and their values are skipped, query survives
vf := map[string]bool{"--since": true, "--limit": true}
if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` {
t.Errorf(`value-flag skipping failed: %q`, got)
}
}
func TestLabelStr(t *testing.T) {
got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"})
if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted
t.Errorf("labelStr = %q", got)
}
if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" {
t.Errorf("labelStr (no __name__) = %q", got)
}
}
func TestOneLineList(t *testing.T) {
if got := oneLineList(" "); got != "(none)" {
t.Errorf("empty = %q, want (none)", got)
}
if got := oneLineList("a\nb"); got != "a, b" {
t.Errorf("multi = %q, want 'a, b'", got)
}
}
func TestHostOnly(t *testing.T) {
if got := hostOnly("foo.me/path"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
if got := hostOnly("foo.me"); got != "foo.me" {
t.Errorf("hostOnly = %q", got)
}
}

101
cli/repo.go Normal file
@ -0,0 +1,101 @@
package main
import (
"os"
"os/exec"
"os/user"
"path/filepath"
"strings"
)
// preferRemote picks the canonical remote: forgejo if present, else origin,
// else the first listed. (For infra, origin and forgejo both point at Forgejo.)
func preferRemote(remotes []string) string {
has := map[string]bool{}
for _, r := range remotes {
has[r] = true
}
switch {
case has["forgejo"]:
return "forgejo"
case has["origin"]:
return "origin"
case len(remotes) > 0:
return remotes[0]
default:
return ""
}
}
// hasGitCryptAttr reports whether .gitattributes content enables git-crypt.
func hasGitCryptAttr(gitattributes string) bool {
return strings.Contains(gitattributes, "filter=git-crypt")
}
// gitCryptFlags are the per-command flags that disable smudge/clean so git
// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config).
func gitCryptFlags() []string {
return []string{
"-c", "filter.git-crypt.smudge=cat",
"-c", "filter.git-crypt.clean=cat",
"-c", "filter.git-crypt.required=false",
}
}
// gitOutput runs `git -C dir <args>` and returns trimmed stdout.
func gitOutput(dir string, args ...string) (string, error) {
cmd := exec.Command("git", append([]string{"-C", dir}, args...)...)
out, err := cmd.Output()
return strings.TrimSpace(string(out)), err
}
func gitRepoRoot(dir string) (string, error) {
return gitOutput(dir, "rev-parse", "--show-toplevel")
}
// gitRemotes lists configured remote names for the repo at dir.
func gitRemotes(dir string) ([]string, error) {
out, err := gitOutput(dir, "remote")
if err != nil {
return nil, err
}
if out == "" {
return nil, nil
}
return strings.Split(out, "\n"), nil
}
// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt.
func isGitCryptRepo(repoRoot string) bool {
b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes"))
if err != nil {
return false
}
return hasGitCryptAttr(string(b))
}
// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted,
// else nil. These are injected per-command and never persisted.
func cryptFlagsFor(repoRoot string) []string {
if isGitCryptRepo(repoRoot) {
return gitCryptFlags()
}
return nil
}
// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output.
func gitStream(repoRoot string, cryptFlags []string, args ...string) error {
full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...)
return runStreamingIn("", "git", full...)
}
// currentUser returns the OS username for branch naming (<user>/<topic>).
func currentUser() string {
if u := os.Getenv("USER"); u != "" {
return u
}
if u, err := user.Current(); err == nil && u.Username != "" {
return u.Username
}
return "user"
}

37
cli/repo_test.go Normal file
@ -0,0 +1,37 @@
package main
import "testing"
func TestPreferRemote(t *testing.T) {
cases := []struct {
in []string
want string
}{
{[]string{"origin", "forgejo"}, "forgejo"},
{[]string{"forgejo"}, "forgejo"},
{[]string{"origin"}, "origin"},
{[]string{"upstream"}, "upstream"},
{nil, ""},
}
for _, c := range cases {
if got := preferRemote(c.in); got != c.want {
t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want)
}
}
}
func TestHasGitCryptAttr(t *testing.T) {
if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") {
t.Error("expected git-crypt detected")
}
if hasGitCryptAttr("*.md text\n*.png binary") {
t.Error("expected no git-crypt")
}
}
func TestGitCryptFlagsShape(t *testing.T) {
f := gitCryptFlags()
if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" {
t.Fatalf("unexpected git-crypt flags: %v", f)
}
}

23
cli/run.go Normal file
@ -0,0 +1,23 @@
package main
import (
"os"
"os/exec"
)
// runStreaming executes name with args, wiring std streams to this process so
// the caller sees live output, and returns the command's error (non-nil on
// non-zero exit — preserved so homelab's own exit code reflects the child's).
func runStreaming(name string, args ...string) error {
return runStreamingIn("", name, args...)
}
// runStreamingIn is runStreaming with a working directory (empty = inherit).
func runStreamingIn(dir, name string, args ...string) error {
cmd := exec.Command(name, args...)
cmd.Dir = dir
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.Stdin = os.Stdin
return cmd.Run()
}

54
cli/stack.go Normal file
@ -0,0 +1,54 @@
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strings"
)
// findInfraRoot walks up from start to the infra repo root — the directory
// holding both terragrunt.hcl and a stacks/ directory.
func findInfraRoot(start string) (string, error) {
dir := start
for {
if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) {
return dir, nil
}
parent := filepath.Dir(dir)
if parent == dir {
return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start)
}
dir = parent
}
}
// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks.
func resolveStack(infraRoot, name string) (string, error) {
dir := filepath.Join(infraRoot, "stacks", name)
if isDir(dir) {
return dir, nil
}
avail := listStacks(infraRoot)
return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", "))
}
// listStacks returns the sorted names of every directory under <infraRoot>/stacks.
func listStacks(infraRoot string) []string {
entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks"))
if err != nil {
return nil
}
var out []string
for _, e := range entries {
if e.IsDir() {
out = append(out, e.Name())
}
}
sort.Strings(out)
return out
}
func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() }
func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() }

52
cli/stack_test.go Normal file
@ -0,0 +1,52 @@
package main
import (
"os"
"path/filepath"
"testing"
)
func newInfraTree(t *testing.T, stacks ...string) string {
t.Helper()
root := t.TempDir()
if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil {
t.Fatal(err)
}
for _, s := range stacks {
if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil {
t.Fatal(err)
}
}
return root
}
func TestFindInfraRootWalksUp(t *testing.T) {
root := newInfraTree(t, "vault")
got, err := findInfraRoot(filepath.Join(root, "stacks", "vault"))
if err != nil {
t.Fatalf("findInfraRoot error: %v", err)
}
if got != root {
t.Fatalf("findInfraRoot = %q, want %q", got, root)
}
}
func TestFindInfraRootErrorsOutsideInfra(t *testing.T) {
if _, err := findInfraRoot(t.TempDir()); err == nil {
t.Fatal("expected error outside an infra checkout")
}
}
func TestResolveStack(t *testing.T) {
root := newInfraTree(t, "vault", "monitoring")
dir, err := resolveStack(root, "vault")
if err != nil {
t.Fatalf("resolveStack error: %v", err)
}
if want := filepath.Join(root, "stacks", "vault"); dir != want {
t.Fatalf("resolveStack = %q, want %q", dir, want)
}
if _, err := resolveStack(root, "nonesuch"); err == nil {
t.Fatal("expected error for unknown stack")
}
}

62
cli/telemetry.go Normal file
@ -0,0 +1,62 @@
package main
import (
"bytes"
"encoding/json"
"net/http"
"os"
"strconv"
"strings"
"time"
)
// usageJob is the Loki stream job label for homelab usage telemetry.
const usageJob = "homelab-usage"
// emitUsage best-effort records one verb invocation to Loki for cross-user
// usage analytics. Labels are low-cardinality (job/user/verb); the line carries
// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must
// never affect the command: all errors are swallowed and a tight timeout bounds
// the cost. Opt out with HOMELAB_TELEMETRY=0.
func emitUsage(verb string, runErr error) {
switch os.Getenv("HOMELAB_TELEMETRY") {
case "0", "off", "false", "no":
return
}
if verb == "" || strings.HasPrefix(verb, "usage") {
return // don't self-record the analytics reader
}
exit := 0
if runErr != nil {
exit = 1
}
body, err := json.Marshal(lokiPush{Streams: []lokiStream{{
Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb},
Values: [][2]string{{
strconv.FormatInt(time.Now().UnixNano(), 10),
"exit=" + strconv.Itoa(exit) + " ver=" + version,
}},
}}})
if err != nil {
return
}
req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body))
if err != nil {
return
}
req.Header.Set("Content-Type", "application/json")
resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req)
if err != nil {
return
}
resp.Body.Close()
}
type lokiPush struct {
Streams []lokiStream `json:"streams"`
}
type lokiStream struct {
Stream map[string]string `json:"stream"`
Values [][2]string `json:"values"`
}

@ -103,6 +103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error {
 if err != nil {
 return errors.Wrapf(err, "Error reading response")
 }
-glog.Infof("Response:", string(responseBody))
+glog.Infof("Response: %s", string(responseBody))
 return nil
 }

18
cli/usage_test.go Normal file
@ -0,0 +1,18 @@
package main
import (
"strings"
"testing"
)
func TestUsageQuery(t *testing.T) {
got := usageQuery("30d", "")
want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))`
if got != want {
t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want)
}
withUser := usageQuery("7d", "emo")
if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") {
t.Errorf("usageQuery with user missing filter/range: %q", withUser)
}
}

191
cli/woodpecker.go Normal file
@ -0,0 +1,191 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
)
// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik
// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`):
// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies.
const (
wpHost = "ci.viktorbarzin.me"
wpLBIP = "10.0.20.203"
)
type wpClient struct {
base string
token string
http *http.Client
}
// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path.
func wpToken() string {
if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" {
return t
}
out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output()
if err != nil {
return ""
}
return strings.TrimSpace(string(out))
}
func newWPClient() (*wpClient, error) {
tok := wpToken()
if tok == "" {
return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)")
}
ip := firstEnv("HOMELAB_WP_IP")
if ip == "" {
ip = wpLBIP
}
dialer := &net.Dialer{Timeout: 8 * time.Second}
tr := &http.Transport{
DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
if strings.HasPrefix(addr, wpHost+":") {
addr = ip + addr[strings.LastIndex(addr, ":"):]
}
return dialer.DialContext(ctx, network, addr)
},
}
return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil
}
// getJSON GETs path into v, retrying the transient empty/5xx responses the
// Woodpecker API intermittently returns under load.
func (c *wpClient) getJSON(path string, v interface{}) error {
var lastErr error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(2 * time.Second)
}
req, _ := http.NewRequest("GET", c.base+path, nil)
req.Header.Set("Authorization", "Bearer "+c.token)
resp, err := c.http.Do(req)
if err != nil {
lastErr = err
continue
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 {
lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode)
continue
}
if resp.StatusCode >= 300 {
return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body)))
}
return json.Unmarshal(body, v)
}
return lastErr
}
type wpPipeline struct {
Number int `json:"number"`
Status string `json:"status"`
Event string `json:"event"`
Commit string `json:"commit"`
Message string `json:"message"`
}
func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) {
var ps []wpPipeline
err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps)
return ps, err
}
// findPipeline returns the pipeline for commit (prefix match), or the latest when
// commit is empty.
func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) {
ps, err := c.recentPipelines(repoID, 25)
if err != nil {
return wpPipeline{}, err
}
if len(ps) == 0 {
return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID)
}
if commit == "" {
return ps[0], nil
}
for _, p := range ps {
if strings.HasPrefix(p.Commit, commit) {
return p, nil
}
}
return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps))
}
func (c *wpClient) repoID() (int, error) {
owner, repo, err := repoOwnerName()
if err != nil {
return 0, err
}
var r struct {
ID int `json:"id"`
}
if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil {
return 0, err
}
if r.ID == 0 {
return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo)
}
return r.ID, nil
}
// repoOwnerName derives <owner>/<repo> from the cwd git remote.
func repoOwnerName() (string, string, error) {
cwd, _ := os.Getwd()
root, err := gitRepoRoot(cwd)
if err != nil {
return "", "", fmt.Errorf("not in a git repository: %w", err)
}
remote := preferRemote(remotesOrEmpty(root))
url, err := gitOutput(root, "remote", "get-url", remote)
if err != nil {
return "", "", err
}
return parseOwnerRepo(url)
}
// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL.
func parseOwnerRepo(url string) (string, string, error) {
// Trim a trailing slash before ".git" so "owner/repo.git/" also parses cleanly.
u := strings.TrimSuffix(strings.TrimSpace(url), "/")
u = strings.TrimSuffix(u, ".git")
if i := strings.Index(u, "://"); i >= 0 {
u = u[i+3:]
}
u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo
parts := strings.Split(u, "/")
if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" {
return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url)
}
return parts[len(parts)-2], parts[len(parts)-1], nil
}
func isTerminalStatus(s string) bool {
switch s {
case "success", "failure", "error", "killed", "declined", "blocked":
return true
}
return false
}
func isFailureStatus(s string) bool {
return s == "failure" || s == "error" || s == "killed" || s == "declined"
}
func min(a, b int) int {
if a < b {
return a
}
return b
}

40
cli/woodpecker_test.go Normal file
@ -0,0 +1,40 @@
package main
import "testing"
func TestParseOwnerRepo(t *testing.T) {
cases := []struct{ in, owner, repo string }{
{"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"},
{"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"},
{"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"},
{"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"},
}
for _, c := range cases {
o, r, err := parseOwnerRepo(c.in)
if err != nil || o != c.owner || r != c.repo {
t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo)
}
}
if _, _, err := parseOwnerRepo("nonsense"); err == nil {
t.Error("expected error for unparseable remote")
}
}
func TestStatusClassification(t *testing.T) {
for _, s := range []string{"success", "failure", "error", "killed"} {
if !isTerminalStatus(s) {
t.Errorf("%q should be terminal", s)
}
}
for _, s := range []string{"running", "pending"} {
if isTerminalStatus(s) {
t.Errorf("%q should not be terminal", s)
}
}
if !isFailureStatus("failure") || !isFailureStatus("error") {
t.Error("failure/error should classify as failure")
}
if isFailureStatus("success") {
t.Error("success must not classify as failure")
}
}

@ -0,0 +1,30 @@
# homelab: a unified infra-ops CLI grown in place from infra/cli
Agents re-derive the same operational command boilerplate every session — mining
51,116 bash commands across 2,225 past sessions showed dense, repeated patterns
(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding
the deterministic, repeated **actions** (not judgment) agents run — composable in
bash, JSON-capable, and discovered progressively via `homelab manifest`. It is
grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups
alongside the preserved legacy webhook use-cases. It is versioned with a `cli/VERSION`
file (the infra repo deploys continuously and does not cut semver tags).
## Considered options
- **Its own top-level repo** (the original plan) — rejected in favour of keeping
it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the
Go source isn't git-crypt-encrypted and a provision-time build is unaffected by
GitOps continuous-deploy.
- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email
webhook use-cases.
- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the
recurring action surface (methodology skills; third-party/owned MCP such as
phpIPAM, which homelab does NOT duplicate).
## Consequences
- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the
in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs
and falls through to the legacy `-use-case` path verbatim.
- Distribution: built from source to `/usr/local/bin/homelab` during devvm
provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`.
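
The front-dispatch described in the first consequence can be sketched as below. This is a minimal sketch under stated assumptions, not the shipped code: `dispatchTop` matches the call site in `main()`, but the `verbs` table, the `stub` handler, and the exact matching rule are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// handler runs one homelab verb-group with its remaining args.
type handler func(args []string) error

// verbs is a hypothetical registry; the real verb set lives in cli/.
var verbs = map[string]handler{
	"work":    stub("work"),
	"tf":      stub("tf"),
	"claim":   stub("claim"),
	"release": stub("release"),
}

// stub returns a placeholder handler that only reports what it would run.
func stub(name string) handler {
	return func(args []string) error {
		fmt.Println("would run", name, args)
		return nil
	}
}

// dispatchTop reports handled=false when args[0] is not a homelab verb, so the
// caller can fall through to the legacy -use-case flag parsing untouched.
func dispatchTop(args []string) (handled bool, err error) {
	if len(args) == 0 {
		return false, nil
	}
	h, ok := verbs[args[0]]
	if !ok {
		return false, nil
	}
	return true, h(args[1:])
}

func main() {
	if handled, err := dispatchTop(os.Args[1:]); handled {
		if err != nil {
			fmt.Fprintln(os.Stderr, "homelab: "+err.Error())
			os.Exit(1)
		}
		return
	}
	fmt.Println("not a homelab verb; the legacy path would run here")
}
```

Returning a `handled` flag separately from the error keeps the two exit paths distinct: a failed homelab verb never falls into the legacy flag parser, and legacy invocations never see homelab error formatting.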

@ -0,0 +1,23 @@
# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded
v0.1 ships only the highest-volume surface — the infra inner-loop: `work`
(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/
force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined
commands and where agents lose the most time and leak the most presence claims.
v0.1 enforces **no** homelab-level permission gating: everything is allowed,
relying on existing gates (harness permission mode, presence claims, plan
approval). But every verb records a `read|write` tier (visible in `manifest`), so
a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added
later with zero restructuring.
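
The recorded tiers could feed a classifier along the following lines. This is a hedged sketch: the `tier` type, the `manifest` map, the `decide` function, and the tier assigned to each verb are all assumptions for illustration, and the real `manifest` output format is not shown in this change.

```go
package main

import "fmt"

// tier is the read|write classification each verb records.
type tier string

const (
	tierRead  tier = "read"
	tierWrite tier = "write"
)

// manifest is an illustrative verb table; the tier assignments are assumptions.
var manifest = map[string]tier{
	"tf validate": tierRead,
	"tf apply":    tierWrite,
	"claim":       tierWrite,
	"release":     tierWrite,
}

// decide is what a future classifier hook could do: auto-allow reads and
// prompt for writes. Unknown verbs are treated like writes.
func decide(verb string) string {
	if manifest[verb] == tierRead {
		return "allow"
	}
	return "prompt"
}

func main() {
	for _, v := range []string{"tf validate", "tf apply", "unknown"} {
		fmt.Printf("%-12s -> %s\n", v, decide(v))
	}
}
```

Because the tier is data attached to each verb rather than logic inside it, v0.1 can ignore it entirely and a later hook can consume it without touching the verbs themselves.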
## Considered options
- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad
value, but defers the toil that motivated the project.
- **One domain deep (k8s)** — cleanest template, narrow day-one value.
We chose the highest-volume-but-write-heavy infra loop deliberately, accepting
the extra complexity (worktree lifecycle, git-crypt flag injection, presence
coupling, branch-protection PR fallback) for the biggest immediate toil
reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions.

@ -0,0 +1,29 @@
# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply
Four behaviours of the infra-loop verbs are surprising enough to record:
1. **`work` owns worktree create/land/clean, but session *entry* delegates to the
native harness worktree tool.** A CLI is a child process and cannot change the
agent's working directory; `EnterWorktree` can. So `homelab work start <topic>`
creates the worktree + branch off `<remote>/master` (git-crypt-aware) and
prints the path — the agent enters it with native `EnterWorktree({path})`.
2. **`work land` is auto-land, but gated on verification.** It merges master in →
runs verification → pushes `HEAD:master` (fetch+merge+retry on
non-fast-forward) → falls back to pushing the feature branch for a PR when the
direct push is rejected (branch protection). It **refuses to push when it
cannot verify** (no `--verify-cmd` and no auto-detected suite) unless
`--no-verify` is passed — added after an accidental smoke-test land pushed
unverified WIP to master (benign: the infra CI applied 0 stacks because the
diff was `cli/`-only, but an unverified land must be deliberate, not default).
3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.**
Local applies are out-of-band (CI applies canonically on push) but happen
constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`,
delegates to `scripts/tg apply --non-interactive`, and **always releases on
exit** (normal, error, or signal via `sync.Once` + handler) — fixing the
documented ~200-claim leak — and prints an out-of-band reminder.
4. **Known v0.1 limitation:** `work land` does not yet wait for CI to go green; that
arrives with the ci/deploy watch verb-group. It prints a reminder to follow
the pipeline manually.
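The always-release behaviour in item 3 can be sketched as follows. This is a minimal illustration, not the CLI's actual code: `claimer`, `memClaimer`, `withClaim`, and `errApply` are hypothetical names, and the real presence backend is replaced by an in-memory counter.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// claimer is a hypothetical stand-in for the presence backend.
type claimer interface {
	Claim(name string) error
	Release(name string) error
}

// memClaimer counts calls so the release-once guarantee can be observed.
type memClaimer struct{ claims, releases int }

func (m *memClaimer) Claim(string) error   { m.claims++; return nil }
func (m *memClaimer) Release(string) error { m.releases++; return nil }

// errApply simulates a failing apply.
var errApply = errors.New("apply failed")

// withClaim claims name, runs fn, and releases exactly once whether fn
// returns normally, returns an error, or the process receives SIGINT/SIGTERM.
func withClaim(c claimer, name string, fn func() error) error {
	if err := c.Claim(name); err != nil {
		return err
	}
	var once sync.Once
	release := func() { once.Do(func() { _ = c.Release(name) }) }
	defer release() // normal and error exits

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigs)

	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-sigs: // signal exit: release, then stop with a non-zero code
			release()
			os.Exit(1)
		case <-done: // fn finished; the deferred release handles it
		}
	}()
	return fn()
}

func main() {
	m := &memClaimer{}
	err := withClaim(m, "stack:example", func() error { return errApply })
	fmt.Println(err, m.claims, m.releases)
}
```

The `sync.Once` is what makes it safe for the deferred path and the signal path to both call `release` without double-releasing.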

@ -0,0 +1,30 @@
# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw
v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far
(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more
than every other domain combined).
It is built on an **app→namespace→pod resolver**: most namespaces hold exactly
one app, so `<app>` defaults to the namespace, and the target defaults to
`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/
`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need
specificity. The CLI uses the ambient kubeconfig — no per-call auth flags.
Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage),
`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`.
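The resolver defaults described above can be sketched as a small pure function. This is an illustration under assumptions: the `Target` and `Overrides` types, the `resolve` and `logsArgs` names, and the app names used in `main` are hypothetical, and only the `-n` and `--pod` overrides are modelled.

```go
package main

import "fmt"

// Target is what a k8s verb would hand to kubectl.
type Target struct {
	Namespace string
	Resource  string // e.g. "deploy/immich" or "pod/mysql-standalone-0"
}

// Overrides mirrors the -n and --pod flags; zero values mean "use the default".
type Overrides struct {
	Namespace string
	Pod       string
}

// resolve applies the defaults: namespace = app, target = deploy/<app>,
// unless an override says otherwise.
func resolve(app string, o Overrides) Target {
	t := Target{Namespace: app, Resource: "deploy/" + app}
	if o.Namespace != "" {
		t.Namespace = o.Namespace
	}
	if o.Pod != "" {
		t.Resource = "pod/" + o.Pod
	}
	return t
}

// logsArgs builds the kubectl argv for a logs call against the target.
func logsArgs(t Target) []string {
	return []string{"logs", "-n", t.Namespace, t.Resource}
}

func main() {
	fmt.Println(logsArgs(resolve("immich", Overrides{})))
	fmt.Println(logsArgs(resolve("mysql", Overrides{Namespace: "dbaas", Pod: "mysql-standalone-0"})))
}
```

Keeping resolution pure (no cluster calls) is one way to make it unit-testable without touching live state.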
## Decisions worth recording
- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/
`scale`/`create`). They stay raw `kubectl`, by design, per the repo's
Terraform-only policy — the corpus confirms they're low-frequency, and a
friendly verb would normalise a policy violation.
- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is
config mutation and forbidden; the verb cannot target them.
- **`db` encodes the dbaas exec pattern** (the single highest-value k8s
sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`,
`psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a
`bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from
the pod env and never appears on the command line.
- Read verbs were smoke-tested against the live cluster; write verbs are
unit-tested (resolver, db-plan, shell-quoting) but not fired at live state.

@ -0,0 +1,30 @@
# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path
v0.3 adds the memory verb-group so agents can search and navigate memory from the
CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth,
ingress `auth = "none"` so programmatic clients work) — the **MCP is just one
frontend over it**. `homelab memory` is a thin HTTP client over the same API,
using the env the hooks already set (`CLAUDE_MEMORY_API_URL` +
`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP
API directly, it **works even when the MCP frontend is down** — the recurring
MCP-disconnect problem that motivated claude-memory HA (and that took the MCP
offline for the entire session this was built in).
Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`,
`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against
the live API including a store→recall→delete round-trip — full data-plane parity
with the MCP.
## Deprecation path (deliberate follow-up — NOT done in v0.3)
The MCP is more than tools: the **per-prompt auto-recall hook** and the
**auto-learn hook** run on every prompt for every agent. Deprecating it safely is
a separate, sequenced change:
1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook
to `homelab memory store`.
2. Update the CLAUDE.md memory policy to point at the CLI.
3. Uninstall the MCP.
Done CLI-first (verbs proven before touching the every-prompt path) so a
regression can't silently break auto-recall/auto-learn fleet-wide.

@ -0,0 +1,29 @@
# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration
v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching
a build/deploy to completion), proven during the session that built it (hours
spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and
retrigger logic for a single CI incident).
## Decisions
- **API, not DB.** The verbs query the Woodpecker REST API (version-stable),
not its Postgres schema (which drifts across upgrades — column renames bit us
mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203`
while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go
equivalent of the house `curl --resolve` pattern). Token from
`WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd
git remote via `/api/repos/lookup/<owner>/<repo>`.
- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx
under load (it flapped through the whole build session); `getJSON` retries
empties with backoff so `ci watch` is reliable exactly when it's needed.
- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch`
on the landed commit and fails if the pipeline does — closing the gap ADR-0005
deferred. `--no-ci-watch` opts out.
- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for
the deployment image to reference the expected sha, *then* blocks on rollout
status (kubectl-based; reuses the k8s helpers).
- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log
endpoints were the least reliable this session (often empty); `status`/`watch`
rely on the list endpoint that works. A DB-backed `ci logs` is a possible
follow-up if the API path stays flaky.
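The "dial the LB IP, keep the URL host" idea from the first decision can be sketched with a custom `DialContext`. This is a hedged illustration, not the CLI's implementation: `clientDialing` and `demoHost` are hypothetical names, `ci.example.test` is a placeholder host, and the demo uses plain HTTP against a local test server in place of the real LB.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// clientDialing returns a client whose TCP connections always go to dialAddr
// (host:port). The request URL is untouched, so the Host header (and, for
// https URLs, the TLS server name) still come from the URL host.
func clientDialing(dialAddr string) *http.Client {
	d := &net.Dialer{Timeout: 5 * time.Second}
	tr := &http.Transport{
		DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
			// Ignore the address derived from the URL; dial the fixed one.
			return d.DialContext(ctx, network, dialAddr)
		},
	}
	return &http.Client{Transport: tr, Timeout: 10 * time.Second}
}

// demoHost starts a local server that echoes the Host header it saw, then
// requests a placeholder hostname through a client pinned to that server.
func demoHost() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()

	c := clientDialing(srv.Listener.Addr().String())
	resp, err := c.Get("http://ci.example.test/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	host, err := demoHost()
	fmt.Println(host, err)
}
```

Because only the dial target is overridden, certificate verification for an https URL would still run against the URL's hostname, which is the property the `curl --resolve` pattern relies on.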

@ -0,0 +1,37 @@
# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value
v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit
test the user posed mid-build: *does the verb save reasoning, or only typing?* A
wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves
keystrokes but not thought. These four save thought — the reasoning they encode
is **which endpoint, reached how, with what auth/URL shape** — re-derived every
time otherwise. (That same test deprioritized `node ssh` aliasing and `secret
get`, which are thin wrappers; see the session discussion.)
## Decisions
- **Internal ingresses, reached via the LB.** Everything routes through the
Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the
Go form of the house `curl --resolve host:443:10.0.20.203` pattern
(`probe.go: clientDialingIP`). Verified live before building: Prometheus
(`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both
answer JSON over the LB with **no auth gate and no port-forward** — so these
stay clean HTTP clients, not kubectl wrappers.
- **`net check` is two-legged on purpose.** It resolves the host via public DNS
(→ Cloudflare) AND dials the internal LB, reporting both — because the useful
question is *where* a break is (CF edge vs the app vs the LB path), which a
single curl can't answer. The external leg forces public resolution (the devvm
resolver is split-horizon and would otherwise hit the LB for both).
- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.**
`prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and
Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing
alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series,
queryable through the working endpoint — so no new dependency.
- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2,
raw `*.svc` services) that would force port-forward/`kubectl run`. The
reasoning-savings there don't beat the added moving parts; kept out of scope.
- **No `node`/`secret` group.** Same test: their high-volume parts are
command-wrappers (low savings); only compound node ops (serial console, VM
wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt
unless a concrete pain surfaces — the high-value deterministic surface
(tf/work/ci/k8s/memory + these probes) is now covered.
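As a sketch of the `metrics alerts` decision above, firing alerts can be fetched through the ordinary Prometheus instant-query endpoint by querying the `ALERTS` series. The function name `firingAlertsURL` is hypothetical; the base URL is the Prometheus ingress named above.

```go
package main

import (
	"fmt"
	"net/url"
)

// firingAlertsURL builds an instant-query URL for the synthetic ALERTS series,
// avoiding /api/v1/alerts entirely.
func firingAlertsURL(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	q := url.Values{}
	q.Set("query", `ALERTS{alertstate="firing"}`)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := firingAlertsURL("https://prometheus-query.viktorbarzin.lan")
	fmt.Println(s, err)
}
```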

@ -0,0 +1,34 @@
# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction
v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It
exists to answer the question that drove the whole CLI — *which verbs are worth
adding next* — with data instead of one maintainer's habits (the earlier mining
covered a single user's ~51k commands, so the surface is shaped to that user).
## Decisions
- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows
the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs
don't go through `dispatch()` (`manifest`/`version`/`help` are handled in
`dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so
the analytics reader doesn't pollute its own data.
- **Payload is deliberately minimal: verb path + exit code only.** Labels
`{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`.
**No args, paths, flags, hostnames, or secrets** ever leave the process — the
emit sees only the matched verb name, not the arguments. This is what makes
cross-user aggregation safe.
- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's
CLI writes its own invocations (attributed to its OS user) to the shared Loki
push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads
back with a LogQL metric query. This is the privacy-preserving resolution to
"what does everyone (e.g. another user) use" — it never touches anyone's
`~/.claude`, which the org per-user policy bars (see the per-user red-line in
managed-settings; reading another user's home is off-limits even for an owner
in-session — a fresh session under changed MDM policy is the only legitimate
path, and even then this telemetry is the better answer).
- **Best-effort, never affects the command.** All errors swallowed; an 800ms
client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry
must never slow or break the tool it measures.
- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs`
path (same host, same LB dial). Presence MySQL was the alternative (queryable
SQL) but would add a write dependency and creds; Loki needs neither.
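The minimal payload described above might look like this when serialized for Loki's JSON push format. It is a sketch under assumptions: the type and function names are hypothetical, and the example user, verb, and version values are placeholders. Note that the function receives only the verb path and exit code, never arguments.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// lokiStream and lokiPush follow the shape of Loki's JSON push body:
// a label set plus [timestamp-in-nanoseconds, line] pairs.
type lokiStream struct {
	Stream map[string]string `json:"stream"`
	Values [][2]string       `json:"values"`
}

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

// usagePayload builds one telemetry record: low-cardinality labels and a
// line carrying only the exit code and CLI version.
func usagePayload(user, verb string, exit int, version string, tsNanos int64) ([]byte, error) {
	p := lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"job": "homelab-usage", "user": user, "verb": verb},
		Values: [][2]string{{
			strconv.FormatInt(tsNanos, 10),
			fmt.Sprintf("exit=%d ver=%s", exit, version),
		}},
	}}}
	return json.Marshal(p)
}

func main() {
	b, err := usagePayload("alice", "tf apply", 0, "v0.6", time.Now().UnixNano())
	fmt.Println(string(b), err)
}
```

A real emitter would POST this body to the Loki push endpoint with a short client timeout and discard any error, in line with the best-effort rule above.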

@ -0,0 +1,54 @@
# homelab Home Assistant verbs: token resolution + host SSH, not entity control
v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA
operator's sessions: across ~1,900 shell commands the single most-repeated line
(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline,
and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as
a shell function ~30× — both re-derived from scratch every session. The existing
`home-assistant-sofia.py` already covers the *API*, but it goes unused from an
arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a
cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that
gap for every user in every directory.
## Decisions
- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already
does entity state and control (`get_state`, `call_service`, history, logs).
Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004)
— we do **not** reimplement `on`/`off`/`list`/`state`. We add only token
*resolution* and host *SSH*, neither of which an API-only MCP can provide. The
value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010).
- **`ha token` resolves live from the cluster, not from an env var.** It reads
the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` /
`london`) via the ambient kubeconfig. This is robust to env drift — the precise
failure that made agents re-derive the pipeline. Read-tier, prints the bare
token to stdout so it composes in `$(…)`, mirroring `memory secret`.
- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`).
It was originally read from the `skill_secrets` key of `openclaw-secrets` (a JSON blob
also holding `slack_webhook` + `uptime_kuma_password`), which only cluster
admins can read — so the verb hung/failed for the non-admin operator it was
built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose
OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only
the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to
the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence
the separate object). openclaw's own deployment keeps reading `openclaw-secrets`
— this is purely additive.
- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended
use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` +
`UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no
TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key
is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to
whoever first wrote the workflow; that user's key must be enrolled on the HA
host. Write-tier (runs an arbitrary remote command).
- **sofia is the default; london is structural.** The devvm sits on the Sofia
LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london
(`hassio@192.168.8.103`) is in the instance map so `ha token --instance london`
works (a pure secret read), but `ha ssh --instance london` generally won't
connect from here — london is remote. We model it correctly rather than
pretend it's reachable.
- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for
the endpoints the MCP/script don't cover — `/api/template`, `/reload`,
`check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is
already unblocked, and a generic passthrough overlaps the MCP. Re-measure via
`usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are
still hand-rolled often.
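The fixed `ha ssh` flag set listed above can be sketched as an argv builder. The function name `sshArgs` and the `/home/alice` home directory are hypothetical; the flags, key path, and the Sofia host are the ones named in the decisions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// sshArgs returns the ssh argv for an unattended call: no user ssh-config,
// no host-key prompt or record, fail fast, and the invoking user's key.
func sshArgs(home, userAtHost string, remoteCmd ...string) []string {
	args := []string{
		"-F", "/dev/null",
		"-o", "StrictHostKeyChecking=no",
		"-o", "UserKnownHostsFile=/dev/null",
		"-o", "BatchMode=yes",
		"-o", "ConnectTimeout=10",
		"-i", filepath.Join(home, ".ssh", "id_ed25519"),
		userAtHost,
	}
	return append(args, remoteCmd...)
}

func main() {
	args := sshArgs("/home/alice", "vbarzin@192.168.1.8", "ha", "core", "info")
	fmt.Println("ssh " + strings.Join(args, " "))
}
```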

@ -0,0 +1,75 @@
# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome
v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a
capability that already existed but was undiscoverable: driving the cluster's
**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on
`svc/chrome-service:9222`) from the devvm, for sites that detect and block
headless automation.
## Motivating incident (2026-06-22)
Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant
portal: the headless `@playwright/mcp` browser loaded the site and filled the
entire multi-step form, but the **final submit silently failed** — Fixflo's
pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the
spinner hung, no issue was created. Root cause = headless-Chrome detection. The
fix was to drive the headful `chrome-service` over `connect_over_cdp` — it
submitted first try (Fixflo ref IS22657587). That capability was documented
(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so
it took ~40 min, three redundant full form re-runs, and a user hint. The agent
also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead
of inspecting the network panel.
## Decisions
- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was
rejected: the CLI is run every session (so the verb is *discoverable*), is
versioned, multi-user, and test-covered. A private, untested skill is none of
those. The command owns only the deterministic *mechanics* (port-forward,
stealth injection, lifecycle) — the agent supplies the Playwright script, so
*judgment* stays out of the CLI (the founding rule, ADR-0004/0005).
- **The failure was judgment, not setup friction**, so the CLI is paired with a
one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic
payload in `browser --help`: the *when-to-use* signature (a site loads but a
gated action fails/hangs, or one request 500s/aborts while siblings 200 →
suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND`
= request resolved/intercepted by the automation layer, **not** egress;
egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED`
and would break the page load too). A command the agent doesn't think to run is
useless; the cheat-sheet is the actual fix for the misdiagnosis.
- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to
localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222`
NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace
label. Readiness is asserted against `/json/version`: the endpoint must report
a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is
**always** torn down (process-group kill + signal handler), on success and on
error — an acceptance requirement.
- **Default to a fresh incognito context; `--shared-context` opts into the warmed
profile.** chrome-service is a single shared browser with a persistent profile.
A fresh, always-closed context is safe for concurrent callers (tripit's fare
scrape connects per-quote) and is what production already does. The warmed
persistent profile (cookies from a manual noVNC login) is opt-in for flows that
need a pre-logged-in session.
- **Pin the node CDP client to `playwright-core@1.48.2`** to match the
chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`,
Chromium 130). `connect_over_cdp` speaks the browser's CDP, and the protocol
changes between Playwright minors; the devvm's ambient Python Playwright was
1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet
regardless of local drift. `playwright-core` (not `playwright`) because no
browser binary is needed — we connect to the remote one.
- **Self-provision the client lazily, no per-user setup.** The pinned client is
installed once into `~/.cache/homelab/browser-client/` (idempotent, version-
guarded) on first use, alongside the embedded runner + stealth files. node is
already fleet-wide; this avoids coupling the feature to a provisioner change
and keeps it self-contained and self-healing. The client runs on the devvm, so
`setInputFiles` streams local files to the remote browser over CDP — no
`chmod`/staging-dir workaround on the CDP path.
- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte
copy of `stacks/chrome-service/files/stealth.js` (the source of truth the
in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts.
`go:embed` can't reach outside the package dir, hence the vendored copy rather
than a path reference.
- **Scope held at two action verbs + help.** `run` (arbitrary script — the
workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover
the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure
via `usage top` (ADR-0011) before adding more.
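The readiness assertion against `/json/version` can be sketched as a check on the `Browser` field of the response. This is an illustration only: `isHeadful` is a hypothetical name, and the sample version strings are made up for the demo.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// isHeadful parses a /json/version body and reports whether the browser
// identifies as real Chrome ("Chrome/...") rather than "HeadlessChrome/...".
func isHeadful(versionJSON []byte) (bool, error) {
	var v struct {
		Browser string `json:"Browser"`
	}
	if err := json.Unmarshal(versionJSON, &v); err != nil {
		return false, err
	}
	return strings.HasPrefix(v.Browser, "Chrome/"), nil
}

func main() {
	ok, err := isHeadful([]byte(`{"Browser":"Chrome/130.0.0.0"}`))
	fmt.Println(ok, err)
	ok, err = isHeadful([]byte(`{"Browser":"HeadlessChrome/130.0.0.0"}`))
	fmt.Println(ok, err)
}
```

A prefix match is used rather than a substring match because "HeadlessChrome/" contains "Chrome/" and would otherwise pass.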

@ -0,0 +1,35 @@
---
status: accepted
date: 2026-06-24
---
# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh
As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now.
## Considered options
- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted.
- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected.
- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed.
- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected.
- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected.
- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected.
- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static.
## Consequences
- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod.
- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost.
- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying).
- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list.
- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS).
- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency.
- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**.
- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary.
## As-built (2026-06-25)
Implemented across infra issues #57-#63. **One material deviation from the decision above:** the durable trail is NOT a Goldmane→Loki emitter (no such emitter exists in OSS Calico 3.30); it is the **`goldmane-edge-aggregator`** service, which streams Goldmane's gRPC `Flows.Stream` API over mTLS and upserts the unique namespace-pair **edge set** (`edge(src_ns,dst_ns,action,first_seen,last_seen,flow_count)`, self-edges + empty-namespace flows dropped) into **CNPG DB `goldmane_edges`**, plus a daily `goldmane-edges-digest` CronJob → `#alerts` (the shared webhook can't reach `#security`; see runbook). The mTLS client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`** rather than copying the CA private key into TF state (Goldmane verifies CA-chain only, not identity); re-apply `stacks/goldmane-edge-aggregator` if the operator rotates it. `service-identity` labels are live on the multi-Service namespaces (`monitoring`, `dbaas`). Whisker UI is Authentik-gated at `whisker.viktorbarzin.me`. Health: Prometheus alerts `AggregatorDown` + `DigestFailing` and cluster-health check #48.
Full as-built, query recipes (incl. the Wave-1 egress-allowlist derivation), and troubleshooting: [`docs/runbooks/goldmane-flow-trail.md`](../runbooks/goldmane-flow-trail.md). Stacks: `stacks/calico` (Goldmane/Whisker + Whisker ingress), `stacks/goldmane-edge-aggregator` (the trail). Code: `~/code/goldmane-edge-aggregator`.

@ -108,31 +108,6 @@ All new users must use an invitation link to register. The invitation-enrollment
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
### TripIt External self-signup (open enrollment, fenced)
Unlike every other app, **TripIt allows open public self-signup** for people
outside the homelab (ADR-0020 in the tripit repo; runbook
`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
flow (email + passkey, no password) creates the account and stamps it into the
parentless **`TripIt External`** group. Containment is two-layered:
- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
denies every other `auth="required"` host.
- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
External users are contained because every sensitive OIDC app already requires a
trusted group they do not hold — audited 2026-06-15:
Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
`Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
`Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
`default`-policy token) and is bound to **`Allow Login Users`** as part of this
change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
**Invariants**: keep `TripIt External` parentless (never under `Allow Login
Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
trusted/internal user; the `tripit-enrollment` user_write "Create users group"
setting is the keystone that tags every signup.
### OIDC Applications
Authentik provides OIDC for 10 applications:

@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes.
- `K8sVersionSkew`: kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout.
- `EtcdPreUpgradeSnapshotMissing`: `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently.
- `K8sUpgradeStalled`: `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor.
- `K8sUpgradeChainJobFailed``kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). - `K8sUpgradeChainJobFailed``(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge.
- **Pushgateway metrics**:
- `k8s_upgrade_in_flight` (set in preflight, cleared in postflight)
- `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB)


@ -112,17 +112,32 @@ External caller (dev box):
@playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json
```
## Browser binary — real Google Chrome (for proprietary codecs)
The chrome-service container runs **real Google Chrome**, not the bundled
Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser`
(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` +
`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`).
The launch resolves `CHROMIUM=/opt/google/chrome/chrome`.
**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**,
so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with
`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no
decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always
worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just
the lib stripped) and Chrome-for-Testing is also codec-less — only
`google-chrome-stable` carries them.
## Image pin
-Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in
-`stacks/chrome-service/main.tf`) and the Python client
-(`playwright==1.48.0` in callers' `requirements.txt`) **must match
-minor-versions**. Bump in lockstep — Playwright protocol changes between
-minors and the client cannot connect to a mismatched server.
-
-The harvester + snapshot-server sidecar use
-`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright
-minor, with Python-side bindings pre-installed.
+The Playwright base + the Python client (`playwright==1.48.0` in callers'
+`requirements.txt`) and the snapshot sidecars
+(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match
+minor-versions. The chrome-service browser is now real Google Chrome (a newer
+milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit
+fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is
+version-tolerant — verified working against this Chrome. If a future Chrome
+milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients.
## Storage
@ -167,7 +182,29 @@ minor, with Python-side bindings pre-installed.
`x11vnc` (connected to Xvfb on `localhost:6099`) bridged to
`websockify` on port 6080. Service `chrome` maps :80 → :6080 and is
exposed via `ingress_factory` at `chrome.viktorbarzin.me`,
Authentik-gated. The bare host serves `vnc.html` (image symlinks
`index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify`
to skip the Connect button. The view is **black when no browser window is
open** (idle) — that is normal, not a failed connection. Chrome is launched
with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen
(no window manager runs, so without it Chrome opens at its profile-persisted
size and the rest of the framebuffer shows as a black cut-off).
### noVNC fd-sweep gotcha (stuck "Connecting")
If the noVNC client hangs on **"Connecting" forever then times out**, the cause
is almost always x11vnc's fd-table sweep: containerd grants pods
`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on
every client connection, so the RFB handshake never completes (websockify
accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends
the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n
x11vnc)/limits` (huge = bad) and time the handshake from a sibling container
(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"`
healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts**
— done both in `files/novnc/entrypoint.sh` (root) and via the container `command`
wrapper in `main.tf` (so it applies deterministically even though the image is
`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix
as the android-emulator stack.
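
The handshake timing check above can be wrapped into a small reusable probe. This is a minimal sketch, not part of the stack: the function names `time_rfb_banner` and `classify` are hypothetical, and the 0.3s threshold is simply the rule of thumb quoted in the text.

```python
import socket
import time


def time_rfb_banner(host="127.0.0.1", port=5900, timeout=3.0):
    """Return (banner or None, elapsed seconds) for one RFB handshake attempt."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # An RFB server speaks first and sends a 12-byte version banner.
            banner = sock.recv(12)
    except OSError:
        # Covers refused connections and timeouts (socket.timeout is an OSError).
        banner = None
    return banner, time.monotonic() - start


def classify(banner, elapsed, healthy_under=0.3):
    """Hypothetical helper: a prompt RFB banner is healthy, anything else is broken."""
    if banner and banner.startswith(b"RFB ") and elapsed < healthy_under:
        return "healthy"
    return "broken"
```

Run from a sibling container, `classify(*time_rfb_banner())` would be expected to report `broken` while the fd sweep is stalling x11vnc, since no banner arrives before the timeout.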
- **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`)
serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`,
bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088
@ -180,6 +217,45 @@ minor, with Python-side bindings pre-installed.
See `stacks/chrome-service/README.md` for the recipe (label namespace,
inject `CHROME_CDP_URL`, vendor `stealth.js`).
## Driving from OUTSIDE the cluster (`homelab browser`)
Agents on the devvm reach this browser through the **`homelab browser`** CLI
(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc
`connect_over_cdp` recipe. It is the **escalation path, not the default**:
agents default to the Playwright MCP / headless browser for all routine
automation, and reach for `homelab browser` ONLY when headless is blocked — a
site loads but a gated action (submit/login) silently fails or hangs, the
signature of headless / anti-bot detection. (Same tiered rule lives in
`~/code/CLAUDE.md` and `homelab browser --help`.)
```text
devvm: homelab browser run flow.js
│ kubectl port-forward svc/chrome-service :9222 (random local port)
http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP)
│ assert /json/version Browser is "Chrome/…", not "HeadlessChrome"
│ node + playwright-core@1.48.2 → connectOverCDP
│ context.addInitScript(stealth.js) ← same vendored file as in-cluster
│ run the user's Playwright script with page/context/browser in scope
└─ port-forward always torn down (success or error)
```
Key facts:
- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels
API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client`
label — unlike in-cluster callers.
- **Client pinned to the image minor.** The node client is
`playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed
lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the
server image bumps (same rule as the in-cluster Python clients — see "Image
pin" above).
- **Default context is a fresh incognito one** (closed on exit), safe for the
shared browser; `--shared-context` reuses the warmed persistent profile.
- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a
byte-identical copy of `files/stealth.js`, guarded by a drift test — so the
CLI's stealth never diverges from the in-cluster callers'.
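
The `/json/version` assertion in the flow above can be sketched as follows. This is an illustration only: the CLI itself is a node client, and `is_headed` and `fetch_version` are hypothetical names. The sketch relies on the `Browser` field of the DevTools `/json/version` response, which starts with `Chrome/` for a headed build and `HeadlessChrome/` for a headless one.

```python
import json
from urllib.request import urlopen


def is_headed(version_info):
    """True when the CDP endpoint reports a headed Chrome, not HeadlessChrome."""
    browser = version_info.get("Browser", "")
    return browser.startswith("Chrome/")


def fetch_version(base_url, timeout=5.0):
    """Fetch /json/version from a CDP endpoint such as http://127.0.0.1:<port>."""
    with urlopen(f"{base_url}/json/version", timeout=timeout) as resp:
        return json.load(resp)
```

A caller would run `is_headed(fetch_version(base_url))` against the port-forwarded address and abort before attaching if it returns `False`.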
## Limits + risks
- **Anti-bot vs stealth arms race** — when an upstream beats us (DRM


@ -116,7 +116,7 @@ instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
-audiobook-search, council-complaints) now also land on ghcr.
+audiobook-search) now also land on ghcr.
### Infra-owned images (issues #29 / #30)


@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
**Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network.
-**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`.
+**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`.
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours.
@ -321,6 +321,17 @@ Detects the inverse of the K-series alerts: a service that **must work WITHOUT A
- **Alert**: `probe_failed_due_to_regex{job="blackbox-authentik-walloff"} == 1` for 10m → `severity=warning`, `lane=security` → **`#security` Slack** (Slack-only, no paging). `probe_failed_due_to_regex` (not bare `probe_success==0`) is the signal: it isolates the Authentik-redirect from unrelated 5xx/DNS/TLS failures already covered by reachability alerts. Inhibited by `TraefikDown` and `AuthentikDown` (symptom, not regression, during those outages).
- **Target list + how to add one**: `local.authentik_walloff_targets` in `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` — a map of `service → URL`. To guard a NEW carve-out, add ONE line. Verify it does NOT already 302 to Authentik first: `curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' '<url>'`. The map key becomes the `service` label on the metric + alert. (Note: openclaw `task-webhook` is intentionally NOT probed — no public DNS record.)
#### East-west flow observability (Goldmane edge-aggregator) — `AggregatorDown` / `DigestFailing` (ADR-0014)
Health for the durable "who-talks-to-whom" trail (Calico Goldmane → `goldmane-edge-aggregator` → CNPG `goldmane_edges` → daily `#alerts` digest; full trail in security.md + [runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md)). The aggregator pod exposes **no `/metrics`**, so health is inferred from kube-state-metrics. Alert group `Network Observability (Goldmane)` in `prometheus_chart_values.tpl`; both alerts route through the default `slack-warning` receiver → **`#alerts`**.
| Alert | Expr (abridged) | For | Severity |
|---|---|---|---|
| `AggregatorDown` | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` (+ Prometheus-restart guard) | 15m | warning |
| `DigestFailing` | `kube_job_status_failed{namespace="goldmane-edge-aggregator",job_name=~"goldmane-edges-digest.*"} > 0` within 24h | 30m | warning |
The two layers are **complementary**: `AggregatorDown` ⇒ no new edges land in the DB; `DigestFailing` ⇒ edges still land but nobody is told. (`< 1` requires the metric series to exist — a fully-deleted Deployment is instead caught by cluster-health check #48 below as "deployment missing".) A freshness probe (#61b) was deliberately skipped — `AggregatorDown` is the agreed floor. **Cluster-health check #48** (`check_goldmane_aggregator` in `scripts/cluster_healthcheck.sh`) reads the Deployment's `Available` condition independently (human / `--quiet` / `--json`; JSON key `goldmane_aggregator`).
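
The difference between "unavailable" and "deployment missing" can be illustrated with a short sketch. It is not the real check, which lives in `scripts/cluster_healthcheck.sh`; the function name `deployment_state` is hypothetical, and the input is assumed to be the dict parsed from `kubectl get deployment <name> -o json`, or `None` when the object does not exist.

```python
def deployment_state(deployment):
    """Classify a Deployment as 'ok', 'unavailable', or 'missing'.

    `deployment` is the parsed JSON of the Deployment object, or None
    when the object was not found (the case the `< 1` alert cannot see).
    """
    if deployment is None:
        return "missing"
    conditions = deployment.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Available":
            # Kubernetes reports condition status as the strings "True"/"False"/"Unknown".
            return "ok" if cond.get("status") == "True" else "unavailable"
    # No Available condition reported yet: treat as not available.
    return "unavailable"
```

Keeping `missing` as its own state mirrors the split in the text: the Prometheus rule covers a Deployment that exists with zero available replicas, and the health check covers a Deployment that is gone.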
#### Backup Alerts
- **PostgreSQLBackupStale**: >36h since last backup
- **MySQLBackupStale**: >36h since last backup


@ -543,10 +543,16 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.)
**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose` → `diagnosing-bugs`, `write-a-skill` → `writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`.
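
The linking rules in this paragraph (relative target, heal a stale symlink, never clobber a real directory) can be sketched as below. This is only an illustration under assumptions: the real `install_skills` is part of the shell provisioner and is not shown here, `link_own_skill` is a hypothetical name, and the sketch assumes the user's own copy has already been placed under `~/.agents/skills/<name>`.

```python
import os


def link_own_skill(home, name):
    """Point ~/.claude/skills/<name> at the user's own copy via a relative symlink."""
    own = os.path.join(home, ".agents", "skills", name)
    link_dir = os.path.join(home, ".claude", "skills")
    link = os.path.join(link_dir, name)
    # The own copy must be a real directory, not itself a symlink.
    if os.path.islink(own) or not os.path.isdir(own):
        return "no-own-copy"
    os.makedirs(link_dir, exist_ok=True)
    # Relative to the link's directory: ../../.agents/skills/<name>
    target = os.path.relpath(own, link_dir)
    if os.path.islink(link):
        if os.readlink(link) == target:
            return "unchanged"  # steady state: nothing to do
        os.remove(link)  # stale or cross-user symlink: heal it
    elif os.path.exists(link):
        return "skipped-real-dir"  # a real dir squatting the name is left alone
    os.symlink(target, link)
    return "linked"
```

A relative target keeps the link valid if the home directory is moved or mounted elsewhere, which an absolute path into another user's home would not.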
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
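
The merge-only re-assert can be sketched in Python as follows. This is an analogue for illustration, not the launcher's code (the launcher uses jq in shell): `reassert_onboarding` is a hypothetical name, the `version` argument stands in for whatever value the launcher writes to `lastOnboardingVersion`, and the write-to-temp-then-rename step is an assumption added here to avoid a torn write.

```python
import json
import os
import tempfile


def reassert_onboarding(path, version=None):
    """Re-add onboarding keys to a JSON state file without touching other keys.

    Returns True if the file was rewritten, False if it was left alone.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
    except (OSError, ValueError):
        return False  # missing, empty, or corrupt file: do nothing
    if not isinstance(state, dict):
        return False
    wanted = {"hasCompletedOnboarding": True}
    if version is not None:
        wanted["lastOnboardingVersion"] = version
    if all(state.get(key) == value for key, value in wanted.items()):
        return False  # already correct: idempotent no-op
    state.update(wanted)  # merge-only: every other key is preserved
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return True
```

Because concurrent `claude` processes still read-modify-write the same file, a sketch like this narrows the window but does not remove the race, which is why the text describes the re-assert as running on every launch.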
**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`.
**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`.
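
"Sticky" port allocation can be illustrated with a short sketch. The actual algorithm in `roster_engine.py` is not shown in this document, so everything here is an assumption except the base port value: `allocate_ports` is a hypothetical function that keeps every existing assignment and gives each new user the lowest free port at or above `PLAYWRIGHT_BASE_PORT`.

```python
PLAYWRIGHT_BASE_PORT = 8931  # value quoted in the text above


def allocate_ports(users, existing, base=PLAYWRIGHT_BASE_PORT):
    """Return user -> port, never changing a port that was already assigned."""
    assigned = dict(existing)  # sticky: previous assignments are kept as-is
    used = set(assigned.values())
    for user in users:
        if user in assigned:
            continue
        port = base
        while port in used:
            port += 1  # first free port at or above the base
        assigned[user] = port
        used.add(port)
    return assigned
```

Stickiness matters because each user's `~/.claude.json` entry embeds the port: if a roster change renumbered existing users, their already-wired MCP URL would point at someone else's server.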
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
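The fast-forward guard described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions: `CloneState` and `refresh_action` are hypothetical names, the `fetch-only` outcome for other branches is a guess, and the real `refresh_user_clone` in the provisioner may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class CloneState:
    """Hypothetical snapshot of one managed clone (not the provisioner's real type)."""
    branch: str          # currently checked-out branch
    dirty: bool          # uncommitted changes in the working tree
    has_upstream: bool   # master tracks an upstream branch
    local_commits: int   # commits on the branch that its upstream lacks


def refresh_action(state: CloneState) -> str:
    """Return 'fast-forward', 'warn-skip', or 'fetch-only' for one clone."""
    # Dirty trees and local commits are left alone with a WARN.
    if state.dirty or state.local_commits > 0:
        return "warn-skip"
    # Fast-forward ONLY when on master with a clean tree and an upstream.
    if state.branch == "master" and state.has_upstream:
        return "fast-forward"
    # Assumption: remotes are still fetched, but no branch is moved.
    return "fetch-only"
```

Keeping the decision separate from the git calls makes the "never touch user work" rule easy to check in isolation.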
@ -561,7 +567,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring.
**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and **per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (`ha` + `claude_memory` + `.credentials.json` + beads Dolt cred — **per-user playwright browser MCP done 2026-06-16**, see above), and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning. **Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. 
**Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
## Related


@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS
## Overview
The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs.
## Architecture Diagram
@ -16,12 +16,14 @@ graph TB
Traefik[Traefik Ingress<br/>3 replicas + PDB]
subgraph "Middleware Chain"
CS[CrowdSec Bouncer<br/>fail-open] AntiAI[Anti-AI bot-block<br/>fail-open]
Auth[Authentik Forward-Auth<br/>3 replicas + PDB]
RL[Rate Limiter<br/>429 response]
Retry[Retry<br/>2 attempts, 100ms]
end
CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik]
subgraph "Proxmox Host (eno1)"
vmbr0[vmbr0 Bridge<br/>192.168.1.127/24]
vmbr1[vmbr1 Internal<br/>VLAN-aware]
@ -53,8 +55,9 @@ graph TB
Internet -->|DNS query| CF
CF -->|CNAME to tunnel| CFD
CFD --> Traefik
Traefik --> CS CSdrop -.->|banned IPs dropped before Traefik| Traefik
CS --> Auth Traefik --> AntiAI
AntiAI --> Auth
Auth --> RL
RL --> Retry
Retry --> Service
@ -82,7 +85,7 @@ graph TB
| Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me |
| Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding |
| Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled |
| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer | | CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open |
| Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware |
| MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 |
| Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 |
@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up
### Ingress Flow
CrowdSec is **not** a step in this chain — banned IPs are dropped before the
request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host
nftables on direct hosts). The flow below is for a request that survives that
out-of-band gate.
```mermaid
sequenceDiagram
participant Client
participant Cloudflare participant CFedge as Cloudflare (edge WAF: crowdsec_ban block)
participant Cloudflared
participant Traefik
participant CrowdSec participant AntiAI
participant Authentik
participant RateLimit
participant Retry
participant Service
participant Pod
Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me Client->>CFedge: HTTPS request to blog.viktorbarzin.me
Cloudflare->>Cloudflared: Forward via tunnel (QUIC) Note over CFedge: banned IP → blocked here (proxied hosts)
CFedge->>Cloudflared: Forward via tunnel (QUIC)
Cloudflared->>Traefik: HTTP to LoadBalancer IP
Traefik->>CrowdSec: Apply bouncer middleware Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook)
CrowdSec->>Authentik: If allowed, check auth (protected=true) Traefik->>AntiAI: anti-AI bot-block (fail-open)
AntiAI->>Authentik: If allowed, check auth (protected=true)
Authentik->>RateLimit: If authenticated, check rate limit
RateLimit->>Retry: If within limit, continue
Retry->>Service: Forward to Service
@ -234,24 +244,27 @@ sequenceDiagram
Service-->>Retry: Response
Retry-->>RateLimit: Response
RateLimit-->>Authentik: Response (strip auth headers)
Authentik-->>CrowdSec: Response Authentik-->>AntiAI: Response
CrowdSec-->>Traefik: Response AntiAI-->>Traefik: Response
Traefik-->>Cloudflared: Response
Cloudflared-->>Cloudflare: Response via tunnel Cloudflared-->>CFedge: Response via tunnel
Cloudflare-->>Client: HTTPS response CFedge-->>Client: HTTPS response
```
### Middleware Chain
Every ingress created by the `ingress_factory` module follows this chain: CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band
(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on
proxied hosts), so banned IPs never reach the chain and there is no per-request
CrowdSec hop. Every ingress created by the `ingress_factory` module follows this
Traefik chain:
1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages. 1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`).
2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend.
3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load).
4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors).
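To see why the default `rate-limit` values stalled ActualBudget's ~70 parallel requests, the average/burst pair can be modelled as a token bucket. This is a rough sketch, not Traefik's code: `TokenBucket` and `count_allowed` are hypothetical names, and timing is simplified to explicit timestamps.

```python
class TokenBucket:
    """Toy token bucket: `average` tokens refill per second, capped at `burst`."""

    def __init__(self, average: float, burst: int) -> None:
        self.average = average
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # timestamp of the previous request (seconds)

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, never above the burst capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a real middleware would answer 429 here


def count_allowed(bucket: TokenBucket, requests: int, now: float) -> int:
    """Fire `requests` simultaneous requests at time `now`; count those admitted."""
    return sum(bucket.allow(now) for _ in range(requests))
```

In this model, 70 simultaneous requests against the default 10/50 bucket admit 50 and reject 20, while the dedicated 50/300 bucket admits all 70, which matches the page-load stall described in item 3.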
Additional middleware:
- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents.
- **HTTP/3 (QUIC)**: Enabled globally on Traefik.
### Entrypoint Transport Timeouts
@ -348,7 +361,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
| pfSense | `stacks/pfsense/` | VM + cloud-init config |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs |
| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer | | CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) |
| Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs |
| MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool |
| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) |
@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare.
### Why Fail-Open on CrowdSec Bouncer? ### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open)
**Alternatives considered**: CrowdSec used to enforce inline as a Traefik middleware (the
1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic. `crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was
2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages. never invoked, so it enforced nothing; the plugin was removed and enforcement
moved off the request path entirely (full history in
`docs/architecture/security.md`). It now runs on two surfaces:
**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on. - **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host
nftables, in **both the `input` and `forward` hooks**. The `forward` hook is
the load-bearing one: with Traefik on a dedicated LB IP at
`externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod**
and transit the node's `forward` chain (not `input`) — which is exactly why the
ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2
for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real
client IP the firewall-bouncer (and the CF edge rule) would have nothing to
match on.
- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed
by the `crowdsec-cf-sync` CronJob.
Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops
receiving new decisions (existing drops persist) and the CF sync skips a run —
neither ever blocks legitimate traffic. Availability > strict bot blocking, and
out-of-band enforcement adds **zero per-request latency** (no Traefik hop).
### Why HTTP/3 (QUIC)?
@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac
**Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available.
**Diagnosis**: Middleware chain is blocking traffic. Check: **Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the
1. Authentik status: `kubectl get pod -n authentik` chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check:
2. CrowdSec LAPI status: `kubectl get pod -n crowdsec` 1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable)
2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down)
3. Traefik logs: `kubectl logs -n kube-system deploy/traefik` 3. Traefik logs: `kubectl logs -n kube-system deploy/traefik`
**Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware.


@ -2,40 +2,50 @@
## Overview
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation. The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation.
## Architecture Diagram
CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The
Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry;
CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that
chain entirely.
```mermaid
graph LR graph TB
Internet[Internet]
CF[Cloudflare WAF]
subgraph "Proxied hosts (orange-cloud)"
CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block]
end
subgraph "Direct hosts (grey-cloud / internal)"
NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward]
end
Tunnel[Cloudflared Tunnel]
CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin] Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry]
AntiAI[Anti-AI Check<br/>poison-fountain]
ForwardAuth[Authentik ForwardAuth]
RateLimit[Rate Limit Middleware]
Retry[Retry Middleware<br/>2 attempts, 100ms]
Backend[Backend Service]
LAPI[CrowdSec LAPI<br/>3 replicas]
Agent[CrowdSec Agent] Agent[CrowdSec Agent<br/>parses Traefik logs]
FWB[cs-firewall-bouncer<br/>DaemonSet, every node]
CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min]
Internet -->|1| CF Internet -->|proxied| CFedge
CF -->|2| Tunnel Internet -->|direct| NFT
Tunnel -->|3| CrowdSec CFedge -->|allowed| Tunnel
CrowdSec -.->|Query| LAPI Tunnel --> Traefik
Agent -.->|Report| LAPI NFT -->|allowed| Traefik
CrowdSec -->|4. Pass/Block| AntiAI Traefik --> Backend
AntiAI -->|5. Human/Bot| ForwardAuth
ForwardAuth -->|6. Authenticated| RateLimit
RateLimit -->|7. Under Limit| Retry
Retry -->|8. Success/Retry| Backend
style CrowdSec fill:#f9f,stroke:#333 Agent -.->|report| LAPI
style AntiAI fill:#ff9,stroke:#333 LAPI -.->|all decisions incl. CAPI| FWB
style ForwardAuth fill:#9f9,stroke:#333 FWB -.->|program drop rules| NFT
style RateLimit fill:#99f,stroke:#333 LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync
CFsync -.->|push IP list| CFedge
style CFedge fill:#f9f,stroke:#333
style NFT fill:#f9f,stroke:#333
```
## Components
@ -44,7 +54,8 @@ graph LR
|-----------|---------|----------|---------|
| CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) |
| CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection |
| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check | | cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` |
| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` |
| Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control |
| poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service |
| cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management |
@ -54,11 +65,15 @@ graph LR
### Request Security Layers
Every incoming request passes through 6 security layers: CrowdSec IP-reputation enforcement happens **before** a request reaches the
Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at
the Cloudflare edge on proxied hosts — see CrowdSec Threat Intelligence below).
A request that survives that out-of-band gate then passes through the Traefik
middleware chain:
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external) 1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only)
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP 2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts)
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error) 3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency)
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
@ -80,11 +95,71 @@ CrowdSec operates in a hub-and-agent model:
- Reports malicious IPs to LAPI
- Shares threat intel with CrowdSec community (anonymized)
**Traefik Bouncer Plugin**: Enforcement is split across **two out-of-band surfaces**, neither of which adds
- Integrated as Traefik middleware any per-request latency. (See "Why the Traefik bouncer plugin was removed" below
- Queries LAPI for IP reputation on each request for the supersession history — there is no longer an inline Traefik bouncer.)
- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation)
- Blocks IPs on ban list, allows others **Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop**
(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`):
- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip
crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND
the `forward` hooks**. The `forward` hook is required because Traefik is a
LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the
Traefik **pod** and transits the node's `forward` hook (not `input`) with the
real client IP preserved. Chains use `policy accept` (only set members drop —
it can never blackhole normal traffic).
- Pulls **all** decisions from LAPI, **including the CAPI community blocklist
(~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching
Traefik** → zero per-request hops, no Traefik involvement at all.
- **Packaging**: cs-firewall-bouncer publishes no container image, so the
**v0.0.34** static binary is fetched at runtime by an initContainer onto a
`debian:bookworm-slim` runtime container. Needs `hostNetwork` +
`NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key:
**`firewall`**.
- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions
(existing drop rules persist); it never blocks legitimate traffic.
**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block**
(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`):
- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop
would never see them. Enforcement is instead a single Cloudflare Rules List
**`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)`
**block** action, which covers every proxied host in the zone.
- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min,
pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped**
decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI
community blocklist** — that set is far too large for a CF Rules List (the CF
account hard-limits to **one** list), and CAPI is already covered in-kernel on
direct hosts and by Cloudflare's own managed protections on proxied hosts.
Registered bouncer key: **`kvsync`**.
- **Block-only**: the single-list limit precludes a separate
captcha/managed-challenge list, so both ban and captcha decisions are enforced
as a plain block at the edge.
- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` +
`public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit
must never wall a user out of the login / WebAuthn flow they authenticate
through; auth keeps `traefik-rate-limit` for brute-force protection.
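The selection rule for the edge list (local ban/captcha decisions, IP scope only, CAPI excluded) can be sketched as a pure filter. This is an illustrative sketch, not the contents of `lapi_kv_sync.py`: `edge_ban_ips` is a hypothetical helper, and the `origin`, `scope`, `type`, and `value` keys are assumed to match the decision objects LAPI returns.

```python
def edge_ban_ips(decisions: list[dict]) -> list[str]:
    """Pick the IPs that would be pushed into the Cloudflare `crowdsec_ban` list."""
    keep: set[str] = set()
    for decision in decisions:
        # The CAPI community blocklist is excluded from the edge list.
        if decision.get("origin", "").upper() == "CAPI":
            continue
        # Only ip-scoped decisions are synced (ranges and other scopes are skipped).
        if decision.get("scope", "").lower() != "ip":
            continue
        # Ban and captcha are both enforced as a plain block at the edge.
        if decision.get("type", "").lower() not in {"ban", "captcha"}:
            continue
        keep.add(decision["value"])
    return sorted(keep)
```

Returning a sorted, de-duplicated list keeps each sync run deterministic, so an unchanged decision set produces an unchanged list.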
**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers
RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so
internal users are never enforced. Internal access uses split-horizon DNS
straight to Traefik, and direct internal clients are RFC1918 — both whitelisted.
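The effect of the whitelist can be approximated with the standard `ipaddress` module. This is a sketch under assumptions: the CIDR list below is illustrative only (the three RFC1918 ranges plus `100.64.0.0/10` as an assumed tailnet range), and the authoritative list lives in `stacks/crowdsec/whitelist.yaml`.

```python
import ipaddress

# Illustrative CIDRs only; the real whitelist file may differ.
WHITELIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",      # RFC1918
        "172.16.0.0/12",   # RFC1918
        "192.168.0.0/16",  # RFC1918
        "100.64.0.0/10",   # assumed tailnet (CGNAT) range
    )
]


def is_whitelisted(ip: str) -> bool:
    """True when `ip` falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in net for net in WHITELIST)
```

For example, `is_whitelisted("10.0.20.200")` is true while a public address such as `8.8.8.8` is not, so only the latter could ever be subject to a ban.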
#### Why the Traefik bouncer plugin was removed
Enforcement used to run as an inline Traefik middleware — the
`crowdsec-bouncer-traefik-plugin` (a Go plugin run by Traefik's Yaegi interpreter), which queried LAPI on every
request and could serve a Cloudflare Turnstile captcha for soft remediations.
On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was
registered but enforced **nothing** despite appearing healthy. Rather than chase
the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin
static config + initContainer download, the `crowdsec` Middleware CRD, the
`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare
Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was
replaced by the two out-of-band surfaces above, which add zero per-request
latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination /
IP-List-capacity issues are also moot now that CAPI is excluded from the edge
list and dropped in-kernel instead.)
**Metabase** (disabled by default):
- Dashboard for CrowdSec analytics
@ -289,6 +364,67 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
- Rare-event misses: a Sunday-only CronJob's egress won't appear in 7 days of flow logs. Mitigation: extend observation to 2 weeks for namespaces with weekly CronJobs.
- Mass-rollout cascade: the 26h March 2026 outage (memory id=390) was a mass-change cascade. Mitigation: phased per-namespace with health-check pauses, similar to the 2026-05-17 Keel phased rollout (memory id=1972).
#### Deriving the per-namespace egress allowlist from the edge trail (Wave 1 W1.7)
The durable **east-west flow trail** (below) is now the preferred data source for
the *internal* (namespace-to-namespace) half of each Wave-1 egress allowlist —
faster and identity-stamped vs the original iptables-`LOG`→journald→Loki path
(ADR-0014: "Enforcement gains a better data source"). The unique observed
namespace pairs live in CNPG DB `goldmane_edges`, table `edge`. To derive the
namespaces a source is observed talking to (the `allow` set that seeds its
NetworkPolicy):
```sql
SELECT DISTINCT dst_ns FROM edge WHERE src_ns='<ns>' AND action='allow' ORDER BY dst_ns;
```
The full SQL recipe (whole-cluster matrix, deny sanity-checks, the ≥7-day
observation caveat) is in
[runbooks/goldmane-flow-trail.md → Deriving the Wave-1 egress allowlist](../runbooks/goldmane-flow-trail.md#deriving-the-wave-1-egress-allowlist-from-the-edge-table-infra-62).
**External / public-internet egress is NOT in this table** (empty-namespace flows
are dropped) — for those destinations keep using the Calico flow-log observation
(the W1.6 snapshot, `wave1-egress-observation-2026-05-22.md`). This feeds the
existing observe-then-enforce effort (beads `code-8ywc`); **enforce-flips remain
out of scope** of the trail — it is observe-and-derive only.
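The same derivation can be sketched for the whole cluster at once, working on rows already fetched from the `edge` table. This is a minimal sketch under stated assumptions: rows arrive as `(src_ns, dst_ns, action)` tuples, `action` uses the literal `allow` seen in the SQL above, and `derive_allowlists` is a hypothetical helper name.

```python
from collections import defaultdict


def derive_allowlists(edges):
    """Map each source namespace to the sorted namespaces it was allowed to reach.

    edges: iterable of (src_ns, dst_ns, action) tuples, e.g. rows from the
    `edge` table. Deny rows are ignored; they are useful for sanity checks
    but do not seed an allowlist.
    """
    allow = defaultdict(set)
    for src_ns, dst_ns, action in edges:
        if action != "allow":
            continue
        if not src_ns or not dst_ns or src_ns == dst_ns:
            continue  # defensive: the aggregator already drops these
        allow[src_ns].add(dst_ns)
    return {src: sorted(dsts) for src, dsts in sorted(allow.items())}
```

The result covers only in-cluster destinations; external egress still has to come from the flow-log observation, as noted above.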
### East-west flow observability (Goldmane / Whisker + edge trail) (ADR-0014)
The "who-talks-to-whom" data plane that succeeds raw iptables-`LOG` lines (which
carried no identity). **Service identity = the workload's namespace** (primary),
refined by a `service-identity` label in the few multi-Service namespaces
(`monitoring`, `kube-system`, `dbaas`). End-to-end trail, three layers:
1. **Calico Goldmane + Whisker** (`calico-system`) — Goldmane aggregates
identity-stamped flows (ns/pod/workload/labels + allow-deny + policy-trace)
streamed from Felix over gRPC into a **~60-min in-memory ring buffer** (no
etcd/API writes — the etcd-cost constraint that drove the design). **Whisker**
is its live web UI at `whisker.viktorbarzin.me` (Authentik-gated,
`auth = "required"` — Whisker has no own login; an additive NetworkPolicy ORs
Traefik past the operator's default-deny `whisker` NP). The ring buffer is
**not** a trail (lost on Goldmane restart). Enabled via operator CRs in
`stacks/calico/main.tf`; reversible toggle (Goldmane is OSS tech-preview).
2. **`goldmane-edge-aggregator`** (`stacks/goldmane-edge-aggregator`) — streams
Goldmane's gRPC `Flows.Stream` over **mTLS** and upserts the low-cardinality
namespace-pair edge set (`edge(src_ns,dst_ns,action,first_seen,last_seen,
flow_count)`) into CNPG DB `goldmane_edges`. Self-edges and empty-namespace
(public-internet) flows are dropped — in-cluster relationships only. The mTLS
client cert **reuses the operator's Tigera-CA-signed `whisker-backend-key-pair`**
(Goldmane verifies CA-chain only, not identity) rather than copying the CA
private key into TF state — **re-apply the stack if the operator rotates that
Secret**.
3. **`goldmane-edges-digest`** CronJob — posts first-seen edges daily to
**`#alerts`** (reuses the alert-digest webhook; a `#security` override 404s —
that webhook's Slack app isn't a member of `#security`; see runbook).
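The upsert semantics of layer 2 and the "first-seen" notion used by layer 3 can be sketched in memory as follows. This is an illustrative sketch, not the aggregator's code: it assumes the edge key is `(src_ns, dst_ns, action)` and that `flow_count` grows by one per observed flow (the real service may batch), and `Edge` / `upsert_edge` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Edge:
    first_seen: datetime
    last_seen: datetime
    flow_count: int


def upsert_edge(table, src_ns, dst_ns, action, seen_at):
    """Record one observed flow; return True only when the edge is new."""
    # Self-edges and empty-namespace (public-internet) flows are dropped.
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False
    key = (src_ns, dst_ns, action)  # assumed uniqueness key
    edge = table.get(key)
    if edge is None:
        table[key] = Edge(first_seen=seen_at, last_seen=seen_at, flow_count=1)
        return True
    edge.last_seen = max(edge.last_seen, seen_at)
    edge.flow_count += 1
    return False
```

Because the table is keyed by namespace pair rather than by flow, its size is bounded by the number of distinct relationships, which is what keeps the edge set low-cardinality.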
The trail is **attribution-grade, not cryptographic** (reconstructs events in a
trusted cluster; cannot prove identity against a spoofing pod — accepted trust-model
limit; east-west stays plaintext, no mTLS between app pods). Health is covered by
the **`AggregatorDown`** + **`DigestFailing`** alerts and cluster-health check #48
(see monitoring.md). Full as-built, query recipes, and troubleshooting:
[runbooks/goldmane-flow-trail.md](../runbooks/goldmane-flow-trail.md). Decision:
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md); glossary
`CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
### TLS & HTTP/3
**Traefik** handles TLS termination:
@ -330,10 +466,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.**
| Path | Purpose |
|------|---------|
| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` |
| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) |
| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) |
| `stacks/kyverno/` | Kyverno deployment + policies |
| `stacks/poison-fountain/` | Anti-AI service + CronJob |
| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) |
| `stacks/platform/modules/ingress_factory/` | Per-service security toggles |
### Vault Paths
@ -443,7 +581,11 @@ spec:
**Fix**:
1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list`
2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>`. The in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct
hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the
`crowdsec_ban` CF list within ~2 min.
3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet
+ internal CIDRs are already whitelisted, so internal clients are never banned).
### Kyverno Policy Blocking Deployment ### Kyverno Policy Blocking Deployment


@ -110,7 +110,7 @@ The Config base / machine-wide managed layer is **secret-free**. Everything carr
| Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) |
|---|---|---|
| **Claude OAuth** | `~/.claude/.credentials.json` + isolated Vault backup | own Enterprise SSO login; Claude refreshes locally and `claude-auth-sync@<user>.timer` validates/backs up/recovers `claudeAiOauth` at `secret/workstation/claude-users/<os_user>`; shared token injection is forbidden |
| **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED: not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write; NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. |
| **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file), only if HA-eligible |
| **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret |


@ -0,0 +1,243 @@
# External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc
> **Status:** **COMPLETE (2026-06-22).** ESO at chart/app **2.6.0**; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0), so the last k8s-1.35 blocker is cleared. Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332); reconciled via `init -upgrade` + committed.
> **Scope:** Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`.
> **Owner:** Viktor Barzin. **Author:** Claude (research + design only — no changes applied).
>
> **EXECUTION CORRECTION + STATUS (2026-06-21 — "let's do the ESO migration"):** The cluster is already on **k8s 1.34.9** (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative *tested* ranges, not hard limits). **The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34 — there is NO k8s interleave; IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3.** Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears). Exact hop sequence (latest patch per minor): **0.13.0 → 0.14.4 → 0.15.1 → 0.16.2** [rewrite all 104 CRs to `v1` here] → **0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0**. Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed).
>
> **EXECUTION LOG:**
> - **✅ Phase 1 DONE (2026-06-21):** ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → **0.16.2**, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead — `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 **both `v1beta1` and `v1` are served** (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`.
> - **⏳ Phase 2 PENDING — findings confirmed (decisive for execution):** (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 **forces a REPLACE** (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. (b) Our ExternalSecrets use **`creationPolicy=Owner`** (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can **cascade-GC the Secret** before ESO recreates it. → **Phase 2 must be done carefully, NOT a blind bulk apply:** (1) snapshot ALL target Secrets first (backstop); (2) **empirically validate on the FIRST live stack** — migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied).
> - **Phase 3 PENDING:** hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing **0.17 is the point of no return**.
---
## 1. Goal & why
ESO is the **last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade** (Kyverno was cleared to 1.18.1 earlier today). The installed ESO `0.12.x` supports only Kubernetes **1.19 → 1.31** ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The `2.x` series supports **k8s 1.34 → 1.35**, which clears the gate.
The hard part is not the chart bump itself — it is that **ESO removed the `external-secrets.io/v1beta1` API**, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared `v1beta1`. If we upgrade past the removal version without first rewriting the manifests to `v1`, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break).
**Downtime tolerance:** brief, recoverable downtime of the ESO *controller* is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes `Secret` objects that apps mount (DB creds, API keys). Those must survive continuously.
---
## 2. Current state
### 2.1 Versions
| Component | Current | Target |
|---|---|---|
| Helm chart `external-secrets` | **0.12.1** | **2.6.0** |
| App / controller image | **v0.12.1** | **v2.6.0** |
| API version of all CRs | **`external-secrets.io/v1beta1`** | **`external-secrets.io/v1`** |
| Repo: `https://charts.external-secrets.io` | (unchanged) | (unchanged) |
ESO stack: `stacks/external-secrets/main.tf`. `helm_release.external_secrets` pins `version = "0.12.1"`, namespace `external-secrets` (separate `kubernetes_namespace` resource, not `create_namespace`), and the **only** chart value set is `installCRDs = true` (via `yamlencode({ installCRDs = true })`). No webhook/replica/resource overrides.
### 2.2 Inventory (live, from `stacks/`)
| Kind | Count | apiVersion | Where |
|---|---|---|---|
| **ExternalSecret** (`kubernetes_manifest`) | **104** | all `v1beta1` (0 mismatches) | 73 `.tf` files |
| **ClusterSecretStore** (definitions) | **2** | both `v1beta1` | `stacks/external-secrets/main.tf` |
| SecretStore | 0 | — | — |
| PushSecret | 0 | — | — |
| ClusterExternalSecret | 0 | — | — |
- **Only ONE apiVersion string exists in the whole tree:** `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite.
- **`secretStoreRef` split:** 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106 — only **2** are real store definitions.
- **22 files carry >1 ExternalSecret** (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2). The 104-vs-73 gap is these multi-secret files.
- **Nested-module ExternalSecrets** (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`.
- **Docs are STALE:** `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is **104 ExternalSecrets / 73 files / 26 db-refs**. Fix in the migration PR.
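The definition-versus-reference distinction in the inventory above (106 `kind` matches, only 2 real store definitions) can be checked with a small counter. This is a rough sketch, assuming each `kubernetes_manifest` writes `kind` on the line directly after `apiVersion`; `count_definitions` is a hypothetical helper, and files that lay the attributes out differently would need a real HCL parser.

```python
import re
from collections import Counter

# Assumption: a resource definition has `kind = "..."` immediately after
# `apiVersion = "external-secrets.io/<version>"`. A `kind` nested inside
# `secretStoreRef` has no apiVersion in front of it, so it is not counted.
DEFINITION_RE = re.compile(
    r'apiVersion\s*=\s*"external-secrets\.io/(?P<version>\w+)"'
    r'\s+kind\s*=\s*"(?P<kind>\w+)"'
)


def count_definitions(tf_text):
    """Count (kind, apiVersion) pairs for ESO resources defined in tf_text."""
    return Counter(
        (match["kind"], match["version"])
        for match in DEFINITION_RE.finditer(tf_text)
    )
```

Summing the counters over every `.tf` file under `stacks/` gives a per-kind, per-version inventory that can be compared against the table above before and after the rewrite.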
### 2.3 The two ClusterSecretStores (`stacks/external-secrets/main.tf`)
Both `kubernetes_manifest`, both `external-secrets.io/v1beta1`, both `depends_on = [helm_release.external_secrets]`:
- **`vault-kv`** → Vault KV **v2** at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`.
- **`vault-database`** → identical except `path = "database"`, **`version = "v1"`** (Vault DB engine, KV-v1-style).
ESO's Vault auth role `eso` (`stacks/vault/main.tf:486-511`): policy `eso-reader` (`secret/data/*` read+list, deny `secret/data/vault`, `database/static-creds/*` read), `token_ttl = token_period = 864000` (10d, periodic/auto-renew).
### 2.4 Tier-0 / state
ESO is **Tier-0 (bootstrap)** (`.claude/CLAUDE.md` "Terraform State — Two-Tier Backend"; root `terragrunt.hcl` `tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]`). Tier-0 ⇒ **local SOPS-encrypted state in git** (`state/stacks/external-secrets/terraform.tfstate`), NOT the PG backend. Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`; SOPS decrypt via Vault Transit (primary) → age fallback. **Tier-0 must apply before PG is reachable**, so the ESO upgrade cannot depend on PG.
### 2.5 Provider versions (`stacks/external-secrets/providers.tf`)
- `required_providers` declares **only** `vault = hashicorp/vault, ~> 4.0`.
- `provider "kubernetes"` and `provider "helm"` are declared **without version constraints** (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the **v3-style nested `kubernetes = {…}` argument** (not the legacy `kubernetes {}` block) ⇒ helm provider is **v3.x or v4.x** in the lockfile. **No `kubectl` provider** in this stack. No `required_version` pinned here.
- ⚠️ **Verify the resolved helm provider version** in `.terraform.lock.hcl` before starting — the prompt referenced `~> 4.0` for helm; the *stack* only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5).
### 2.6 Plan-time coupling (the cross-cutting risk)
**15 stacks read ESO-created Secrets at plan time** via `data "kubernetes_secret"` (avoids a Vault dependency at plan): `actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium)`.
The documented **first-apply gotcha** (`.claude/CLAUDE.md`, `docs/architecture/secrets.md:360`, `stacks/fire-planner/main.tf:574`): the Secret must exist before the `data "kubernetes_secret"` plans, so on first creation you must `terragrunt apply -target=kubernetes_manifest.<external_secret>` first, then full apply. **Why this matters for the migration:** the `kubernetes_manifest` provider treats `apiVersion` as part of resource identity, so bumping `v1beta1` → `v1` **forces a replace** of all 104 ExternalSecrets. During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's `data "kubernetes_secret"` plans → the two-phase `-target` apply is needed **fleet-wide for the v1 rewrite step, not just fire-planner.**
### 2.7 Vault DB rotation (rotation interplay)
`stacks/vault/main.tf`: **25 `vault_database_secret_backend_static_role`, every one `rotation_period = 604800` (7 days)** (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via `vault-database` → `remoteRef.key = "static-creds/<role>"`. Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. **Implication:** any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly.
### 2.8 git-crypt landmine (adjacent, not in ESO stack)
`.claude/CLAUDE.md:146` + `docs/architecture/ci-cd.md:108` + `stacks/kyverno/modules/kyverno/tls-secret-sync.tf`: on a **git-crypt-locked clone**, `kubernetes_secret.tls_secret` reads `secrets/fullchain.pem`/`privkey.pem` via `file()` which returns **ciphertext**, corrupting the wildcard TLS secret Kyverno clones cluster-wide. **The ESO stack itself has NO `file()` reads of git-crypt secrets** — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an **unlocked** checkout.
---
## 3. Target
- Helm chart **`external-secrets` 2.6.0** (app **v2.6.0**), repo `https://charts.external-secrets.io`.
- All ExternalSecret + ClusterSecretStore CRs on **`external-secrets.io/v1`**.
- Cluster ESO compatible with **k8s 1.34 → 1.35** ⇒ unblocks the autonomous 1.35 upgrade.
---
## 4. Key findings (the decisive facts)
> Sourced from ESO official docs + GitHub release notes; verbatim quotes below.
### 4.1 Chart version == app version (premise check)
The chart version and app version are released **in lockstep and are the same number**. `Chart.yaml`: `version: 0.12.1 / appVersion: v0.12.1`; `version: 2.6.0 / appVersion: v2.6.0`. The app series ran `…0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0`. **Crucially, the `v1.0.0` and `v2.0.0` APP releases are NOT the `external-secrets.io/v1` API**: `v1.0.0` is just "continuation after 0.20.4" (release diff `v0.20.4...v1.0.0`, no API change), and `v2.0.0`'s only breaking change is removing the unmaintained **Alibaba + Device42** providers (we use neither, only Vault). The API migration happened back at **0.16/0.17**. Source: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) · [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0).
### 4.2 Version path: **NO skipping minors — step one minor at a time**
Official policy, verbatim ([stability-support](https://external-secrets.io/latest/introduction/stability-support/)):
> "**Upgrade version by version** — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions."
Maintainer (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @gusfcarvalho): *"We are pre release… Every minor bump should be treated as a major bump until we go 1.0."* **You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly.** You must step each minor: `0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x`.
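A planned hop list can be checked mechanically against the one-minor-at-a-time rule before any apply. This is a minimal sketch: `skipped_minors` is a hypothetical helper, and it treats a jump to `<major+1>.0.x` as a valid rollover without verifying that the previous version was the last minor of its major, which it has no way to know.

```python
def parse(version):
    """Split 'X.Y.Z' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def skipped_minors(path):
    """Return the (prev, next) hops that skip a minor version.

    Allowed hops: a patch bump within the same minor, the next minor within
    the same major, or a rollover to the .0 minor of the next major.
    """
    bad = []
    for prev, nxt in zip(path, path[1:]):
        (p_major, p_minor, _), (n_major, n_minor, _) = parse(prev), parse(nxt)
        same_major_step = n_major == p_major and n_minor - p_minor in (0, 1)
        major_rollover = n_major == p_major + 1 and n_minor == 0
        if not (same_major_step or major_rollover):
            bad.append((prev, nxt))
    return bad
```

Run against the hop sequence recorded in the execution correction at the top of this doc, it should report no skipped hops, while a direct `0.12.1` to `2.6.0` jump is flagged.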
### 4.3 k8s ↔ ESO must advance roughly in lockstep
Each ESO release targets a **narrow** k8s band ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)):
| ESO | k8s band |
|---|---|
| 0.12.x | 1.19 → 1.31 |
| 0.16.x | 1.32 |
| 0.17.x | 1.33 |
| 2.0 → 2.5 | 1.34 → 1.35 |
| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34 → 1.35; see Open Questions) |
**This is the single most important sequencing constraint.** ESO doesn't "support only ≤ its max k8s" in a wide range — older ESO may not run cleanly on a *much newer* k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be **interleaved**, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a **coordinated ESO+k8s climb**, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. (The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling — see Open Questions for who drives the interleave.)
### 4.4 API migration: **must rewrite manifests to `v1` FIRST — there is NO v1beta1→v1 conversion webhook**
- **`external-secrets.io/v1` promoted to STORAGE version: v0.16.0.** v0.16.0 release notes "BREAKING CHANGES": *"Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts"* and *"Removal of Conversion Webhooks and …/v1alpha1…"*. From 0.16, **etcd stores `v1`**. Source: [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0).
- **`external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0.** Verbatim ([v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0)):
> "v0.17.0 Stops serving `v1beta1` apis. You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth."
- **No v1beta1→v1 conversion webhook.** The only conversion webhook that ever existed was v1alpha1→v1beta1, **removed in 0.16**. Maintainer (issue [#5478](https://github.com/external-secrets/external-secrets/issues/5478), @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1 — *"This isn't really a conversion issue."* ⇒ **old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.**
- **Verdict: MUST-REWRITE-FIRST.** Rewrite all CRs to `v1` while on **0.16.x** (which serves *both* v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @Dutchy-): *"I was able to change v1beta1 to v1 on 0.16 without issues. After that I was able to upgrade to 0.17."*
- There is a deprecated escape hatch in chart 2.6.0 — `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers — but its own values comment says *"This flag will be removed on 2026.05.01"* (i.e. **already past**, do not rely on it).
- **Schema change is a PURE apiVersion string bump — ZERO field changes.** CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have **byte-identical** spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): *"Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do."* PushSecret only ever had `v1alpha1` (no v1beta1) — **unaffected** (we have 0 anyway).
### 4.5 Helm chart values + CRD handling (0.12 → 2.6)
- **No top-level values removed or renamed.** `values.yaml` diff 0.12.1↔2.6.0 is **additive only** (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives.
- **`installCRDs` still works** in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are **templated into the single `external-secrets` chart** and **upgraded by `helm upgrade`** automatically — there is **no separate CRDs subchart**, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). We don't set it, so the new default is fine.
- **CRD storedVersions bookkeeping (the one real pre-flight check):** v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion.
- **Helm provider:** `Chart.yaml apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; **no minimum Helm version declared** (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. **The 2.x chart does NOT require a newer helm provider than 0.12 did** — the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.)
### 4.6 Data migration: **downstream Secrets survive**
The synced Kubernetes `Secret` objects are **not deleted or force-resynced** by these upgrades. The change is an apiVersion bump on the *custom resources*, whose `spec` is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal **reconcile (re-assert, not delete)**. Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. **Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step** (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-**generators** change — we use no generators, so N/A.
---
## 5. Migration strategy (ordered, do-this-then-that)
> **Pre-reqs every step:** run from an **unlocked** infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is **Tier-0** so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and **`git push`** after each apply (SOPS state). Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop.
### Phase 0 — Pre-flight (no changes)
1. Confirm cluster k8s version and the version-check chain's current target; **coordinate with the k8s climb** (see §4.3 / Open Questions). Decide who drives the interleave.
2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'` — confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this *before* reaching 0.16).
3. **Snapshot all ESO-managed Secrets** (rollback safety net):
`kubectl get externalsecrets -A` (record the 104) and `for ns/secret in <targets>: kubectl get secret -n <ns> <name> -o yaml > backup/<ns>-<name>.yaml`. Keep outside git-crypt or encrypt.
4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` — record resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first.
5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step.
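The snapshot loop in step 3 can be sketched by deriving the target Secret names from the ExternalSecret list instead of typing them by hand. This is a sketch under assumptions: the input is the parsed output of `kubectl get externalsecrets -A -o json`, the target Secret name falls back to the ExternalSecret's own name when `spec.target.name` is unset (ESO's documented default), and `target_secrets` / `backup_commands` are hypothetical helpers that only build command strings.

```python
import shlex


def target_secrets(es_list):
    """Return sorted (namespace, secret_name) pairs for all ExternalSecrets."""
    pairs = set()
    for item in es_list.get("items", []):
        meta = item["metadata"]
        target = item.get("spec", {}).get("target") or {}
        # ESO names the Secret after the ExternalSecret when target.name is unset.
        name = target.get("name") or meta["name"]
        pairs.add((meta["namespace"], name))
    return sorted(pairs)


def backup_commands(pairs, out_dir="backup"):
    """Build one `kubectl get secret ... -o yaml` command per target Secret."""
    commands = []
    for namespace, name in pairs:
        out_path = f"{out_dir}/{namespace}-{name}.yaml"
        commands.append(
            f"kubectl get secret -n {shlex.quote(namespace)} {shlex.quote(name)}"
            f" -o yaml > {shlex.quote(out_path)}"
        )
    return commands
```

The generated files contain plaintext Secret data, so the same handling as in step 3 applies: keep them out of the repo or encrypt them.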
### Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet)
ESO `0.16.x` is the **transition version** that serves *both* v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as `v1beta1`:
6. For `v` in `0.13.0, 0.14.0, 0.15.x, 0.16.2` (use latest patch of each minor): set `helm_release.external_secrets.version = "<v>"`, `scripts/tg plan` (expect: chart upgrade + CRD upgrade in place; **no `kubernetes_manifest` replacements** — apiVersion unchanged), `scripts/tg apply`, `git push`, wait for rollout, verify `kubectl get externalsecrets -A` all `SecretSynced=True`.
- **Interleave k8s as required:** before/at 0.16 the cluster should be on **k8s 1.32** (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point.
- Watch the **0.14.0** notes (generators) — N/A for us, but eyeball the plan diff anyway.
7. **Land on 0.16.2 and STOP.** Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work.
### Phase 2 — Rewrite all 104 CRs + 2 stores to `v1` (while on 0.16.2)
This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served.
8. **Mechanical rewrite** across `stacks/`: replace the apiVersion string `external-secrets.io/v1beta1` → `external-secrets.io/v1` in every ExternalSecret and ClusterSecretStore `kubernetes_manifest` (104 + 2 = 106 occurrences across 73 files, **including the 6 nested-module files** in §2.2). **No other field changes** (schema identical). Do this in a worktree, committed file-by-file.
- Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion — unaffected).
9. **Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:**
a. **Stores first:** `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`.
b. **ExternalSecrets, per stack:** for each of the 73 stacks, `scripts/tg apply -target=kubernetes_manifest.<external_secret_name>` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the 15 plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The **15 plan-time-coupled stacks** (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. `git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0).
- Because the spec is identical, the *replace* re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout.
10. **Verify the rewrite fully landed:** `grep -rc 'external-secrets.io/v1beta1' stacks/` returns **0**; `kubectl get externalsecrets -A` with a `-o jsonpath` query confirms all are served as `v1`; all `SecretSynced=True`; spot-check a rotated DB cred (e.g. `nextcloud-db-creds`) still valid.
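
The mechanical rewrite in step 8 is a single string substitution, which might look like the sketch below. It assumes GNU `grep`, `xargs`, and `sed` (for `-Z`, `-0 -r`, and `-i`); the function name `rewrite_eso_api_version` is illustrative and not part of the repo.

```bash
#!/usr/bin/env bash
# Sketch only: rewrite the ESO apiVersion string in place under a directory tree.
# rewrite_eso_api_version is a hypothetical helper; GNU grep/xargs/sed assumed.
rewrite_eso_api_version() {
  local root="${1:-stacks}"
  # -l lists only files containing the old string; -Z / -0 keep odd filenames safe;
  # xargs -r skips sed entirely when nothing matches.
  grep -rlZ 'external-secrets\.io/v1beta1' "$root" |
    xargs -0 -r sed -i 's#external-secrets\.io/v1beta1#external-secrets.io/v1#g'
}
```

Running it inside the worktree and reviewing `git diff --stat` before committing file-by-file keeps the change auditable; only files that contained the old string are touched.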
### Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0
Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd):
11. Bump chart `0.16.2 → 0.17.x`. `scripts/tg plan` (expect chart/CRD upgrade; **no manifest replacements** — already v1), apply, push, rollout, verify all synced. **k8s should be 1.33** (0.17 band) around here.
12. Continue one minor at a time: `0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0`. At each: bump `version`, plan, apply, push, rollout, verify synced. **k8s reaches 1.34 then 1.35** across the 2.x steps.
- **At 2.0.0:** confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault — should be a no-op).
13. **Land on 2.6.0.** Verify: controller image `v2.6.0`, all 104 ExternalSecrets `SecretSynced=True`, both ClusterSecretStores `Valid=True`.
### Phase 4 — Close the gate + docs
14. Advance k8s to **1.35** via the version-check chain if not already; confirm the **compat-gate now lists ESO as compatible** and 1.35 is unblocked.
15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (**104 ExternalSecrets / 73 files / 26 db-refs**), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail).
---
## 6. Risks & mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| **Secret-sync outage → app DB/API auth failures** during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces **per stack** (small blast radius); the 15 plan-time stacks use `-target` first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. |
| **Crossing 0.17 with any CR still v1beta1** → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -rc v1beta1 stacks/` **must be 0** AND `kubectl get …v1beta1…` returns nothing live before Phase 3. Do not skip 0.16. |
| **CRD removal/replace by helm dropping data** | Low | Chart manages CRDs in-place via `installCRDs=true` (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD *upgrade*. Snapshot anyway. Never `helm uninstall` (that can GC CRDs). |
| **No conversion webhook safety net** (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated `unsafeServeV1Beta1` is already past its 2026-05-01 removal — do NOT rely on it. |
| **`kubernetes_manifest` forces replace on apiVersion bump** → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. |
| **Vault 7-day DB rotation lands mid-migration** → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. |
| **git-crypt tls-secret-sync landmine** | Low (not in ESO stack) | ESO stack has no `file()` git-crypt reads; run from an **unlocked** checkout; do **not** piggyback kyverno applies during this work. |
| **helm/k8s provider in lockfile too old for 2.x chart** | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). |
| **k8s/ESO band mismatch** (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. |
| **Many small applies = long, error-prone session** | Med | Script the per-stack `-target`-then-full loop; checkpoint with `kubectl get externalsecrets -A` after each; the rewrite itself is a single `sed`-class change so low semantic risk. |
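
The per-stack `-target`-then-full loop named in the last row could be scripted along these lines. This is a sketch under stated assumptions: it assumes the `tg` wrapper is run from inside the stack directory and is reachable via an absolute path or `PATH` through the `TG` variable, neither of which the text above confirms. `apply_stack_two_phase` is a hypothetical name, and the `git push` after each apply is deliberately left to the caller.

```bash
#!/usr/bin/env bash
# Sketch only: two-phase apply for one stack.
# ASSUMPTIONS: $TG is the tg wrapper (absolute path or on PATH) and is run from
# inside the stack directory. apply_stack_two_phase is a hypothetical helper.
: "${TG:=tg}"

apply_stack_two_phase() {
  local stack_dir="$1"; shift
  local -a targets=()
  local addr
  for addr in "$@"; do targets+=("-target=$addr"); done
  (
    cd "$stack_dir" || exit 1
    # Phase 1: materialize only the replaced ExternalSecret CRs (and their Secrets).
    if [ "${#targets[@]}" -gt 0 ]; then
      "$TG" apply "${targets[@]}" || exit 1
    fi
    # Phase 2: full apply; plan-time data "kubernetes_secret" reads can now resolve.
    "$TG" apply
  )
}
```

A caller would invoke it once per stack, for example `apply_stack_two_phase stacks/<stack> kubernetes_manifest.<external_secret_name>`, and checkpoint with `kubectl get externalsecrets -A` before moving on.
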
---
## 7. Rollback plan (per hop)
- **During Phase 1 (chart climb, still v1beta1):** revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean.
- **During Phase 2 (v1 rewrite, on 0.16.2):** 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply — the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the **last point of easy rollback**.
- **After Phase 3 (≥0.17, v1beta1 no longer served):** **rollback is HARD** — once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back ([general guidance + maintainer position](https://github.com/external-secrets/external-secrets/issues/5478)). Treat **crossing 0.17 as the point of no return.** If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback — hence the Phase-2 gate must be airtight.
- **Always available:** the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO.
---
## 8. Verification
**Per hop:**
- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target.
- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`.
- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`.
**After Phase 2 (v1 rewrite):**
- `grep -rc 'external-secrets.io/v1beta1' stacks/` → **0**.
- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check.
- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates.
**Final (2.6.0):**
- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid.
- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof).
- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer): pods running, no auth errors in logs.
- **Compat-gate:** run the upgrade-state / k8s-version-check audit — ESO no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds.
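
The per-hop "all synced" check can be reduced to a filter that prints only the offenders. The sketch below assumes ExternalSecrets expose a `Ready` condition in `.status.conditions` (the source of the `READY=True` column); `unsynced_externalsecrets` is an illustrative name.

```bash
#!/usr/bin/env bash
# Sketch only: list ExternalSecrets whose Ready condition is not "True".
# Exits non-zero if any are found. unsynced_externalsecrets is a hypothetical helper.
unsynced_externalsecrets() {
  # One line per object: "<namespace>/<name> <Ready status or empty>".
  kubectl get externalsecrets -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
  awk 'NF && $2 != "True" { print; bad = 1 } END { exit bad + 0 }'
}
```

An empty output with exit status 0 means every ExternalSecret reports ready, which makes the function usable as a gate between hops.
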
---
## 9. Open questions
1. **k8s/ESO interleave ownership.** §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. **Who drives the interleave** — does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.)
2. **2.6.0 ↔ k8s 1.35 explicit support.** The support matrix table currently ends at **2.5** (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a *strong inference* not a quoted row. Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. ([matrix](https://external-secrets.io/latest/introduction/stability-support/))
3. **Resolved helm provider version.** The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. (Chart needs only Helm SDK v3 — likely a no-op, but verify.)
4. **Intermediate-minor patch selection.** Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.**2** specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch.
5. **Per-stack apply automation.** 73 stacks × (target + full) apply is large. Acceptable to script a loop, or prefer manual per-stack with checkpoints? Some stacks have other in-flight drift that a full apply would also push — needs a clean-plan check per stack first.
6. **Stateful generators / advanced features.** Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A — but re-confirm no generator usage crept in before Phase 3.
---
## 10. Sources (decisive facts)
- Skip-version policy + k8s support matrix: <https://external-secrets.io/latest/introduction/stability-support/>
- `v1` promoted to storage version (0.16.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0>
- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0>
- No conversion webhook / "not a conversion issue" (#5478): <https://github.com/external-secrets/external-secrets/issues/5478>
- v1beta1↔v1 schema identical / "nothing fancy" (#4785): <https://github.com/external-secrets/external-secrets/issues/4785>
- App v1.0.0 ≠ API v1: <https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0>
- v2.0.0 only removes Alibaba/Device42: <https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0>
- Chart 2.6.0 on ArtifactHub: <https://artifacthub.io/packages/helm/external-secrets-operator/external-secrets>


@ -0,0 +1,140 @@
# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design
- **Date:** 2026-06-21
- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending)
- **Owner:** Viktor (wizard)
- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`).
## Goal
When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely; their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns.
## Background — why the skew persists (root cause, verified 2026-06-21)
- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6).
- Its idle check is coarse — `unit_busy()`:
```sh
pid=$(systemctl show -p MainPID --value "$unit")
pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode'
```
i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window.
- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then.
- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`.
## Decisions (from brainstorm 2026-06-21)
1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart.
2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing.
3. **Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic.
4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*.
## Constraints (load-bearing)
1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery.
2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL.
3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt.
4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today.
5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim.
## Design
### Components
Four new files in `scripts/` + a one-line addition to the existing job:
1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`:
pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`.
**Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical.
2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below).
3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.)
4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks:
```ini
[Timer]
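# Fires 01:00, 01:20, …, 05:40; none at/after 06:00. System TZ (UTC); tune the window.
# (systemd only treats "#" as a comment at the start of a line, so comments sit on their own lines.)
OnCalendar=*-*-* 01..05:00/20
# Never replay a missed migrate-restart at an unpredictable time.
Persistent=false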
RandomizedDelaySec=120
```
5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral:
```sh
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW
deferred=$((deferred+1)); continue
```
where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction.
### Why a deferral marker (not version-introspection)
The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
### Control flow of `t3-migrate-idle` (per tick)
```
for marker in $DEFER_DIR/*: # nothing deferred → no-op
user = basename(marker); unit = t3-serve@<user>.service
[ unit is an active running service ] or { rm marker; continue } # gone
if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear
if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick
target = contents(marker)
if safe_restart_unit(unit, target): rm marker # success: verified on new binary
else: # helper already restored DB + rolled back binary + froze + alerted
break # frozen: stop draining; a human investigates
```
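The "already restarted" branch above compares the unit's start time with the marker's mtime. A minimal sketch of that predicate follows, assuming the caller has already converted the unit's `ActiveEnterTimestamp` to epoch seconds and that GNU `stat` is available; `marker_already_satisfied` is a hypothetical name, not the plan's function.

```bash
#!/usr/bin/env bash
# Sketch only: true when the unit started AFTER the deferral marker was written,
# i.e. something else already restarted it and the marker can simply be cleared.
# marker_already_satisfied is a hypothetical helper; GNU stat assumed.
marker_already_satisfied() {
  local marker="$1" unit_start_epoch="$2" marker_mtime
  marker_mtime="$(stat -c %Y "$marker" 2>/dev/null)" || return 1
  # Fail closed on a missing or non-numeric start time: keep the marker.
  case "$unit_start_epoch" in ''|*[!0-9]*) return 1 ;; esac
  [ "$unit_start_epoch" -gt "$marker_mtime" ]
}
```

Returning non-zero on any doubt keeps the marker in place, so the instance is re-evaluated on the next tick rather than silently dropped.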
### `safe_to_restart(user)` — the gate
Single read-only query, run as the user:
```sh
runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" "
SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now')
- julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
```
- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.)
- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing.
- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3).
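
The three rules above translate into a small fail-closed predicate over the query's single output row. The sketch below assumes `sqlite3`'s default list output (`<active>|<idle>`, with NULL printed as an empty field); `gate_decision_sketch` is an illustrative name and not necessarily how the job's own gate is written.

```bash
#!/usr/bin/env bash
# Sketch only: decide "safe to restart" from one "<active_turns>|<idle_seconds>" row.
# Returns 0 = safe, non-zero = not safe. gate_decision_sketch is a hypothetical name.
gate_decision_sketch() {
  local row="$1" quiet="${2:-900}" active idle
  case "$row" in *'|'*) ;; *) return 1 ;; esac        # not a two-column row: fail closed
  active="${row%%|*}"; idle="${row#*|}"
  case "$active" in ''|*[!0-9]*) return 1 ;; esac     # non-numeric count: fail closed
  [ "$active" -eq 0 ] || return 1                     # a turn is in flight
  [ -z "$idle" ] && return 0                          # NULL idle = no threads at all
  case "$idle" in *[!0-9]*) return 1 ;; esac          # garbage or negative idle: fail closed
  [ "$idle" -ge "$quiet" ]                            # quiet buffer elapsed?
}
```

Keeping the decision separate from the `sqlite3` call lets it be tested without a database: `gate_decision_sketch '0|1200'` is safe, while `gate_decision_sketch '0|60'`, `gate_decision_sketch '2|5000'`, and any malformed row are not.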
### Failure recovery
Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option.
### Observability
- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert.
- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped.
### Delivery
- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units:
- `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh`
- `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle`
- add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`)
- add `t3-migrate-idle.timer` to the `systemctl enable --now` list
- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm.
- No Terraform (hand-managed VM 102).
## Testing
- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible.
- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live.
- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor.
## Out of scope (YAGNI)
- Daytime restarts / "around the clock" cadence (de-scoped: overnight only).
- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility).
- Per-user opt-out file (not needed — the job is self-limiting via markers).
- Any change to how `t3-autoupdate` *installs/gates* a build.
## Open questions
None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard).


@ -0,0 +1,729 @@
# t3 idle-migrate Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days.
**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed.
**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform).
**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`.
---
## File structure
- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery.
- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged.
- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests.
- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer.
- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats).
- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files.
- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job.
**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden.
---
## Task 1: Shared library `t3-safe-restart.sh`
**Files:**
- Create: `scripts/t3-safe-restart.sh`
- [ ] **Step 1: Create the library**
```bash
#!/usr/bin/env bash
# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
#
# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
# recover (restore DB + roll global binary back to last-good + freeze) — extracted
# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
# The only change from the inline original: safe_restart_unit RETURNS non-zero on
# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
# decides what to do (the daily job exits; the idle job stops draining).
#
# Callers must set, before calling safe_restart_unit: $target (version being moved
# TO, for log lines + the prebump filename) and $last_good (rollback target).
# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
# ---- shared config defaults (override via env before sourcing) ------------------
: "${LOG_TAG:=t3-safe-restart}"
: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}"
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"
: "${BACKUP_DIR:=/var/backups/t3-state}"
: "${DISPATCH:=127.0.0.1:3780}"
: "${USER_MAP:=/etc/ttyd-user-map}"
: "${T3_BACKUP_TIMEOUT:=900}"
LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
# authentik username for an OS user (reverse map; first match) — for dispatch verify.
ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
# WAL stays owned; never stops the serve). Uses global $target for the filename.
# Echoes the backup path on success; non-zero on failure.
backup_user() {
local u="$1" src out dst ts
src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
ts="$(date +%Y%m%d-%H%M%S)"
out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
printf '%s\n' "$dst"; return 0
fi
rm -f "$dst"; return 1
}
# newest pre-bump backup for a user taken for the current $target (restore source).
prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
# so this is a harmless no-op reinstall (does NOT downgrade other users).
rollback_binary() {
LOG "rolling back binary $target -> $last_good"
if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
}
# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
verify_pairing() {
local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
}
# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
# Assumes a pre-restart backup already exists for <user> at the current $target
# (the daily job's backup_all, or the idle job's backup_user, takes it first).
# Returns 0 on verified success, non-zero after recovery+freeze on failure.
safe_restart_unit() {
local unit="$1" u="$2" ok=0 _ bak
systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
for _ in $(seq 1 15); do
if verify_pairing "$u"; then ok=1; break; fi
sleep 2
done
if [ "$ok" = "1" ]; then
LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
fi
LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
rollback_binary
bak="$(prebump_of "$u")"
if [ -n "$bak" ]; then
systemctl stop "$unit" 2>/dev/null
if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
LOG "restored $u state.sqlite from $bak"
fi
systemctl start "$unit" 2>/dev/null
fi
touch "$FREEZE_FILE" 2>/dev/null
LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
return 1
}
```
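The verify loop in `safe_restart_unit` (up to 15 attempts, 2 s apart) is a bounded poll. A minimal standalone sketch of that pattern follows, assuming a hypothetical helper name `poll_until` that is not part of the library; unlike the loop above, it skips the sleep after the final failed attempt.

```bash
#!/usr/bin/env bash
# Hypothetical helper (NOT part of t3-safe-restart.sh): run a check up to
# <tries> times, sleeping <delay> seconds between failed attempts.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
poll_until() {
  local tries="$1" delay="$2" i; shift 2
  for ((i = 1; i <= tries; i++)); do
    if "$@"; then return 0; fi
    [ "$i" -lt "$tries" ] && sleep "$delay"
  done
  return 1
}

# Demo check: succeeds on its third call. State lives in a shell variable,
# which works because poll_until invokes the check in the current shell.
attempts=0
flaky() { attempts=$((attempts + 1)); [ "$attempts" -ge 3 ]; }

if poll_until 5 0 flaky; then echo "ok after $attempts attempts"; fi
```

With the plan's numbers the equivalent call would be `poll_until 15 2 verify_pairing "$u"`, giving the restarted unit roughly 30 seconds to come up before recovery starts.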
- [ ] **Step 2: Syntax + lint check**
Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.)
- [ ] **Step 3: Source-and-define smoke test**
Run:
```bash
bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"'
```
Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo).
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-safe-restart.sh
git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate
Pull the per-unit backup->restart->verify->recover routine (and the small
helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second
job (the upcoming idle migrator) can reuse the exact same audited recovery path
instead of forking safety-critical code. safe_restart_unit returns non-zero on
failure (after recovery+freeze) rather than exiting, so callers control flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals
**Files:**
- Modify: `scripts/t3-autoupdate.sh` (config block 32-42, helpers 44-165, step 6 loop 194-225)
- [ ] **Step 1: Source the library; drop the now-shared helpers**
Replace lines 32-52 (the `T3_*` config block through the `newer()` helper) with the block below, which keeps the autoupdate-only config and sources the lib for the shared bits:
```bash
# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
DRY_RUN="${T3_DRY_RUN:-0}"
TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
LOG_TAG=t3-autoupdate
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
mkdir -p "$STATE_DIR" 2>/dev/null || true
```
(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.)
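`newer()` stays in the autoupdate script, so it is worth seeing how it behaves at the edges. The sketch below copies the function verbatim and exercises it with made-up version strings (the numbers are illustrative, not taken from a real release list); it assumes GNU `sort -V`.

```bash
#!/usr/bin/env bash
# Copy of the plan's newer(): true only when $1 sorts strictly after $2
# under GNU `sort -V` (version sort).
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

# Illustrative version strings (invented for this sketch).
check() { if newer "$1" "$2"; then echo "$1 is newer than $2"; else echo "$1 is not newer than $2"; fi; }
check 0.0.613 0.0.605   # higher patch number
check 0.0.605 0.0.613   # reversed
check 0.0.613 0.0.613   # equal versions are never "newer"
check 0.0.10  0.0.9     # version sort, not string sort
```

The last case is the reason for `sort -V`: a plain lexical comparison would rank `0.0.9` above `0.0.10`.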
- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`**
Replace the `backup_all()` definition (lines 90-105) with:
```bash
ADMIN_SEED=""
backup_all() {
local u dst
for u in $(osusers); do
if dst="$(backup_user "$u")"; then
LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
[ "$u" = "wizard" ] && ADMIN_SEED="$dst"
else
LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
fi
done
[ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
}
```
Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107-108, 146-152, 160-165); they now come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only).
- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6**
Replace the step-6 loop body (lines 196-225) with:
```bash
for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
if unit_busy "$unit"; then
LOG "deferring $unit (active agent) — migrates on its next idle restart"
mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
deferred=$((deferred+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
restarted=$((restarted+1))
rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
else
exit 1 # frozen by safe_restart_unit — preserve today's behavior
fi
done
```
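The marker is just a one-line file named after the user and holding the deferred target version. A minimal sketch of its lifecycle in a throwaway directory follows; the wrapper names `defer_user`, `clear_marker` and `deferred_target` and the user `alice` are hypothetical, and `DEFER_DIR` is a temp dir here rather than the lib-provided path.

```bash
#!/usr/bin/env bash
# Sketch of the deferral-marker lifecycle. DEFER_DIR normally comes from the
# shared lib; here it is a scratch directory so nothing real is touched.
DEFER_DIR="$(mktemp -d)"
trap 'rm -rf "$DEFER_DIR"' EXIT

# Hypothetical wrappers around the two marker lines in the step-6 loop.
defer_user()      { mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"; }
clear_marker()    { rm -f "$DEFER_DIR/$1"; }
# Read a marker back the way the drain loop does (whitespace stripped).
deferred_target() { [ -f "$DEFER_DIR/$1" ] && tr -d '[:space:]' <"$DEFER_DIR/$1"; }

defer_user alice 0.0.613       # unit busy at the daily window
echo "alice deferred to: $(deferred_target alice)"
defer_user alice 0.0.614       # busy again the next day: the marker is overwritten
echo "alice deferred to: $(deferred_target alice)"
clear_marker alice             # a later restart succeeded
[ -e "$DEFER_DIR/alice" ] || echo "alice marker cleared"
```

Because each deferral overwrites the file, a marker always names the most recent target, and its mtime records when the user was last deferred.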
- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff**
Run:
```bash
bash -n scripts/t3-autoupdate.sh
# Confirm the defer/restart decisions are unchanged vs HEAD (this file's edits are not committed until Step 5):
git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40
```
Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-autoupdate.sh
git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals
Behavior-preserving refactor: the per-unit restart/recover body and small helpers
now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is
deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/
so the new idle migrator can drain it later; clear the marker on a successful
restart. Install/health-gate/canary logic is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe`
**Files:**
- Create: `tests/t3-migrate-idle-gate.test.sh`
- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task)
- [ ] **Step 1: Write the failing test**
Create `tests/t3-migrate-idle-gate.test.sh`:
```bash
#!/usr/bin/env bash
# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker.
# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree.
set -uo pipefail
HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down)
export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh"
# shellcheck source=/dev/null
. "$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running
pass=0; fail=0
ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; }
notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; }
# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 ---
QUIET_SECONDS=900
ok gate_is_safe 0 1000 # idle, quiet long enough -> safe
notok gate_is_safe 1 1000 # a turn in flight -> unsafe
notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe
ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe
notok gate_is_safe x 1000 # unparseable active -> unsafe
notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe
# --- gate_query <db> against fixture SQLite DBs ---
TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT
mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin
local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done
}
NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)"
# active turn present -> "1|<small idle>"
printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db"
res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1"
# all idle, last activity 1h ago -> "0|>=3500"
printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db"
res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500
# empty table -> "0|" (NULL idle)
sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);"
res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0"
echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ]
```
- [ ] **Step 2: Run it to verify it fails**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error).
- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton**
```bash
#!/usr/bin/env bash
# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight
# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively
# working in that instance (no in-flight turn + a quiet buffer), restart it onto the
# current binary using the shared safe_restart_unit, then clear the marker.
# Why this exists: t3-autoupdate defers a user with an active agent at its single
# daily window; a user busy every night never migrates and their client shows
# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*.
set -uo pipefail
LOG_TAG=t3-migrate-idle
# shellcheck source=scripts/t3-safe-restart.sh
. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min)
DRY_RUN="${T3_DRY_RUN:-0}"
# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed.
gate_is_safe() {
local active="$1" idle="$2"
case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe
[ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe
[ -z "$idle" ] && return 0 # no threads at all -> safe
case "$idle" in -*|*[!0-9]*) return 1;; esac # negative (clock skew) or non-numeric -> unsafe
[ "$idle" -ge "$QUIET_SECONDS" ] # quiet for less than the buffer -> unsafe
}
# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>".
# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
gate_query() {
local db="$1"
sqlite3 -batch -noheader -separator '|' "$db" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;"
}
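# safe_to_restart <user>: run the gate query read-only AS the DB owner and feed the
# result to gate_is_safe. The SQL is repeated inline because runuser cannot call the
# gate_query shell function; keep the two copies in sync.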
safe_to_restart() {
local u="$1" db row
db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1
row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \
"SELECT
(SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL),
CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT)
FROM projection_thread_sessions;" 2>/dev/null)" || return 1
gate_is_safe "${row%%|*}" "${row##*|}"
}
main() {
: # drain loop added in Task 4
}
# main-guard: run only when executed, not when sourced (tests source this file).
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
```
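The `julianday` arithmetic in `gate_query` can be cross-checked outside SQLite. The sketch below is a shell-side equivalent under stated assumptions: GNU `date`, a hypothetical helper name `idle_seconds`, and an optional explicit "now" so the result is deterministic. It applies the same `T`/`Z` normalisation as the SQL and additionally drops fractional seconds.

```bash
#!/usr/bin/env bash
# Hypothetical cross-check of the SQL idle arithmetic (assumes GNU date).
# idle_seconds <iso_utc_timestamp> [now_epoch]  -> whole seconds since the timestamp
idle_seconds() {
  local ts="$1" now="${2:-$(date -u +%s)}" then_s
  ts="${ts/T/ }"      # 'T' separator -> space, as the SQL replace() does
  ts="${ts%Z}"        # strip the trailing 'Z'
  ts="${ts%%.*}"      # drop fractional seconds
  then_s="$(date -u -d "$ts UTC" +%s)" || return 1
  echo $((now - then_s))
}

# A timestamp 60 s after the epoch, evaluated at "now" = 160 s.
idle_seconds '1970-01-01T00:01:00.000Z' 160
```

Passing the value to the gate would look like `gate_is_safe 0 "$(idle_seconds "$ts")"`; the SQL version may differ by a second because of floating-point day fractions and the `CAST` truncation.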
- [ ] **Step 4: Run the test to verify it passes**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (exit 0).
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh
git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD
The gate reads t3's state.sqlite: safe to restart only when zero threads have an
active_turn_id AND the most-recent thread activity is older than the quiet buffer
(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover
the boundaries against fixture DBs (no root/bats/Docker).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 4: The marker-drain loop in `t3-migrate-idle.sh`
**Files:**
- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton)
- [ ] **Step 1: Implement `main()` (the drain loop)**
Replace the `main() { : ; }` skeleton with:
```bash
main() {
# a frozen build must not be auto-migrated (shared switch with t3-autoupdate)
if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi
[ -d "$DEFER_DIR" ] || exit 0 # nothing deferred
last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper
local marker u unit started mwritten migrated=0 skipped=0
for marker in "$DEFER_DIR"/*; do
[ -e "$marker" ] || continue # empty-dir glob
u="$(basename "$marker")"; unit="t3-serve@$u.service"
if ! systemctl is-active --quiet "$unit"; then
LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue
fi
started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)"
mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)"
if [ "$started" -gt "$mwritten" ]; then
LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue
fi
if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi
target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)"
if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi
if ! backup_user "$u" >/dev/null; then
LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue
fi
if safe_restart_unit "$unit" "$u"; then
LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1))
else
LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1
fi
done
LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)"
}
```
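The "already restarted after the deferral" branch compares the unit's start time with the marker's mtime. A minimal sketch of that comparison in isolation follows, using a hypothetical helper name `marker_is_stale` and fixed epochs instead of `systemctl show`; it assumes GNU `touch` and `stat`.

```bash
#!/usr/bin/env bash
# Sketch: did the unit start AFTER the marker was written? Mirrors the
# started-vs-mwritten test in main(), with the start time passed in.
# marker_is_stale <unit_start_epoch> <marker_path>   (hypothetical helper)
marker_is_stale() {
  local started="$1" mwritten
  mwritten="$(stat -c %Y "$2" 2>/dev/null || echo 0)"
  [ "$started" -gt "$mwritten" ]
}

tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT
touch -d '@1000' "$tmp/marker"     # marker written at epoch 1000

if marker_is_stale 1500 "$tmp/marker"; then echo "started after deferral: clear the marker"; fi
if ! marker_is_stale 900 "$tmp/marker"; then echo "started before deferral: still needs migrating"; fi
```

Note the fallback direction in `main()`: when the start time cannot be parsed it becomes `0`, which is never greater than a real mtime, so the unit falls through to the idle gate rather than having its marker cleared.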
- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop).
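Why sourcing stays side-effect free can be shown with a throwaway script. This sketch (the file name `guarded.sh` and its two functions are made up) runs the same guard both ways and records what happens.

```bash
#!/usr/bin/env bash
# Sketch of the main-guard idiom using a scratch script.
tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT
cat >"$tmp/guarded.sh" <<'EOF'
#!/usr/bin/env bash
helper() { echo "helper called"; }
main()   { echo "main ran"; }
# BASH_SOURCE[0] is this file; $0 is the invoked program.
# They are equal only when the file is executed directly.
if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi
EOF

# Direct execution: the guard matches, so main runs.
executed="$(bash "$tmp/guarded.sh")"
# Sourcing: $0 is "_" here, so the guard fails and only the functions are defined.
sourced="$(bash -c '. "$1"; declare -F helper >/dev/null && echo "defined only"' _ "$tmp/guarded.sh")"

echo "executed: $executed"
echo "sourced:  $sourced"
```

This is the property the gate tests rely on: they source `t3-migrate-idle.sh` to get `gate_is_safe` and `gate_query` without triggering the drain loop.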
- [ ] **Step 3: Syntax + lint**
Run: `bash -n scripts/t3-migrate-idle.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-migrate-idle.sh || echo "shellcheck absent — skipped")`
Expected: no syntax errors.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.sh
git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe
For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is
gone or was already restarted after the deferral; otherwise, when the idle gate is
satisfied, take a pre-restart backup and restart via the shared safe_restart_unit,
clearing the marker on verified success. DRY_RUN logs decisions without acting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 5: systemd units
**Files:**
- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer`
- [ ] **Step 1: Create the service unit**
`scripts/t3-migrate-idle.service`:
```ini
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Create the timer unit**
`scripts/t3-migrate-idle.timer`:
```ini
[Unit]
Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration)
[Timer]
OnCalendar=*-*-* 01..05:00/20
RandomizedDelaySec=120
Persistent=false
[Install]
WantedBy=timers.target
```
- [ ] **Step 3: Validate unit syntax**
Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"`
Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree).
- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots**
Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5`
Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01-05).
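For a quick sanity check without systemd, the slots the expression is expected to match can be enumerated by hand. This sketch only restates the intended schedule (hours 01 through 05, minutes 00 stepping by 20); it does not parse the `OnCalendar` syntax itself.

```bash
#!/usr/bin/env bash
# Enumerate the slots `01..05:00/20` is intended to cover:
# hours 01..05, minutes 00, 20, 40.
slots=()
for h in 01 02 03 04 05; do
  for m in 00 20 40; do
    slots+=("$h:$m")
  done
done
echo "count=${#slots[@]} first=${slots[0]} last=${slots[${#slots[@]}-1]}"
```

Fifteen ticks per night, the last at 05:40, which matches the window named in the Task 5 commit message.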
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer
git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 6: Wire into `setup-devvm.sh`
**Files:**
- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218)
- [ ] **Step 1: Install the lib + the new script (section 9a)**
After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add:
```bash
install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
```
- [ ] **Step 2: Install the unit files (section 9d loop)**
Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line):
```bash
t3-migrate-idle.service t3-migrate-idle.timer \
```
- [ ] **Step 3: Enable the timer (section 9 enable line)**
Append `t3-migrate-idle.timer` to the `systemctl enable --now` list:
```bash
systemctl enable --now t3-dispatch.service \
t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \
log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)"
```
- [ ] **Step 4: Syntax check**
Run: `bash -n scripts/workstation/setup-devvm.sh`
Expected: no syntax errors.
- [ ] **Step 5: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add scripts/workstation/setup-devvm.sh
git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 7: Deploy to the devvm + validate (dry-run first)
**Files:** none (operational). Presence-claimed, shared-host mutation.
- [ ] **Step 1: Claim the host**
Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"`
Expected: claim acquired (if already held by another session, defer per CLAUDE.md).
- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)**
Run:
```bash
W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts
sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh
sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle
sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service
sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer
sudo systemctl daemon-reload
```
Expected: no errors.
- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)**
The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib:
```bash
sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate
sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do"
```
Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean).
- [ ] **Step 3: DRY-RUN the idle migrator against live state**
Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"`
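Expected: NO restart in any case. With no markers yet (they are seeded in Step 4) the pass has nothing to consider: it exits 0 silently if the deferral directory does not exist, or logs `idle-migrate pass complete (migrated=0 skipped=0)` if it is empty. If a marker for wizard is already present and wizard is busy (mid-turn during the day), it is counted as skipped: `idle-migrate pass complete (migrated=0 skipped=N)`. (If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act.)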
- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again**
The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt:
```bash
sudo install -d -m755 /var/lib/t3-autoupdate/deferred
printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null
sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"
```
Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting.
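The seeding command depends on how `t3 --version` prints its version. The sketch below runs the same `awk`/`sed` pipeline on sample strings; both input formats are assumptions for illustration, since the real output format is not shown in this plan.

```bash
#!/usr/bin/env bash
# Same pipeline as the seeding command, fed sample strings instead of the real
# `t3 --version` (whose exact output format is ASSUMED here).
parse_version() { awk '{print $NF}' | sed 's/^v//'; }

echo 't3 v0.0.613' | parse_version    # last field, leading "v" stripped
echo '0.0.613'     | parse_version    # already bare: unchanged
```

If the real output has trailing text after the version, the last-field assumption breaks, so it is worth eyeballing `cat /var/lib/t3-autoupdate/deferred/wizard` after seeding.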
- [ ] **Step 5: Enable the timer (live)**
Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager`
Expected: timer active, next elapse in the 01:00-05:40 window.
- [ ] **Step 6: Release the claim**
Run: `homelab release host:devvm`
> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).)
---
## Task 8: Docs
**Files:**
- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section)
- Modify: `.claude/reference/service-catalog.md` (add the unit)
- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented)
- [ ] **Step 1: Runbook** — add a section after the autoupdate description:
```markdown
## Idle migrator (`t3-migrate-idle.timer`)
`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent
at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`.
`t3-migrate-idle` (overnight, every 20 min 01:00-05:40) drains those markers:
it restarts a deferred instance onto the current binary only when that user's
`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via
the shared `safe_restart_unit` (same backup→verify→recover as the daily canary).
- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated).
- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`.
- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs.
- **Rare-tail failure:** a forward-migration failure at idle restart restores the
user's DB + freezes + alerts (the binary rollback is a no-op since the build was
already accepted); the user's server may crashloop on the restored DB until the
freeze is cleared. Investigate per the rollback section above.
```
- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`).
- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`.
- [ ] **Step 4: Commit**
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md
git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>"
```
---
## Task 9: Land
- [ ] **Step 1: Merge latest master into the branch**
Run:
```bash
GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false)
git "${GC[@]}" fetch forgejo
git "${GC[@]}" merge --no-edit forgejo/master
```
Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any.
- [ ] **Step 2: Re-run the gate tests post-merge**
Run: `bash tests/t3-migrate-idle-gate.test.sh`
Expected: `PASS=10 FAIL=0`.
- [ ] **Step 3: Push to master**
Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master`
Expected: accepted. Non-fast-forward → fetch/merge/retry.
- [ ] **Step 4: Watch CI to completion**
Run: `homelab ci watch`
Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it).
- [ ] **Step 5: Clean up the worktree**
Run (from the main checkout):
```bash
git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate
git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate
```
---
## Self-review
- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism).
- **Placeholders:** none — every file has complete content; every command has expected output.
- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions.

@ -0,0 +1,131 @@
# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop
## Impact
- devvm (VM 102, the shared multi-user Claude Code workstation) became
unresponsive under combined memory + IO pressure and had to be **hard-killed +
rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for
wizard/emo/anca lost, in-flight agents killed.
- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM
22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible
IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES /
64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP
instances across three users on top.
## This is the "crawl" class, not the QEMU-stall class
The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a
*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI
controller. That fix shipped (verified 2026-06-22: the guest now boots on
`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem
explicitly deferred **this** class:
> The recurring *crawl* class (agent storms → swap-thrash; journald
> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux
> sessions remain memory-uncontained by **explicit decision (swap-only,
> 2026-06-10)**.
That explicit decision is the root cause closed here.
## Root cause
Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only
one was capped:
| Tree | cgroup | Cap before today |
|---|---|---|
| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ |
| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ |
The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`;
stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and
swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the
overload chain:
```
uncapped tmux growth → disk-swap thrash on a throttled spindle
→ IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill
```
i.e. **memory pressure becomes the IO storm**. There was also **no global OOM
backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the
kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely
(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*.
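The "caps don't sum safely" point is simple arithmetic, sketched here with a hypothetical helper `oversubscribed` and the figures quoted in the paragraph above (three users at `MemoryMax=16G` against 32G).

```bash
#!/usr/bin/env bash
# Do the per-cgroup hard caps add up to more than the box has?
# oversubscribed <total_gib> <cap_gib>...  -> exit 0 when the caps exceed the total
oversubscribed() {
  local total="$1" sum=0 c; shift
  for c in "$@"; do sum=$((sum + c)); done
  echo "caps sum to ${sum}G against ${total}G"
  [ "$sum" -gt "$total" ]
}

# Figures from the text: three t3 users, 16G hard cap each.
if oversubscribed 32 16 16 16; then echo "oversubscribed"; fi
```

Per-cgroup ceilings bound each user but not the box as a whole, which is why a global backstop is still needed once every user can approach their own cap at the same time.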
## Fix (`setup-devvm.sh` §10, applied live 2026-06-22)
Design decisions (interviewed with the admin via `/grill-me`): **soft-generous
per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising
single-user utilisation while making a box-wide wedge impossible. (The backstop
was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd
proved inert with `swap=0` — see Verification + Lessons.)
| Layer | What |
|---|---|
| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. |
| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. |
| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. |
| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. |
| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. |
Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to
`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone*
heavy user between 12-16G even with RAM free; bump to 16/20 if that bites.
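As a rough illustration of the per-user layer, the sketch below renders a slice drop-in with the three limits from the table into a scratch directory. The file name `50-memory.conf` and the `OUT` variable are assumptions; the actual `setup-devvm.sh` §10 code is not shown here and may differ.

```bash
#!/usr/bin/env bash
# Sketch: render the per-user slice drop-in into a scratch directory.
# File name and OUT are assumptions; on a real host the drop-in would live
# under /etc/systemd/system/user-.slice.d/ and need `systemctl daemon-reload`.
OUT="${OUT:-$(mktemp -d)}"
dropin_dir="$OUT/user-.slice.d"
mkdir -p "$dropin_dir"
cat >"$dropin_dir/50-memory.conf" <<'EOF'
[Slice]
MemoryHigh=12G
MemoryMax=16G
MemorySwapMax=0
EOF
cat "$dropin_dir/50-memory.conf"
```

The `user-.slice.d` directory name (with the trailing dash) is what makes the settings apply to every `user-<uid>.slice` rather than to one UID.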
## Verification (live, 2026-06-22)
- **Caps live on running cgroups**: all three `user-<uid>.slice` report
`memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`;
daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered
under `docker.slice`.
- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was
killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with
**swap flat at 0MB throughout** — no thrash. Same mechanism protects every user
slice (16G) and `docker.slice` (8G).
- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99%
memory.pressure, throttled to a crawl, making no progress and harming nothing —
a runaway is throttled, not just killed.
- **systemd-oomd disproven, then dropped**: a self-policed balloon held
`memory.pressure full avg10 = 96-99%` (≫ its 20% limit) for >70s but oomd never
killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active
reclaim, which a `swap=0` anon workload never does. oomd was purged.
- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs
`low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects
`SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live
earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`.
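The systemd-oomd observation above can be restated as a toy model. This sketch only encodes the behaviour reported in this write-up (a pressure kill needs both high `full avg10` and rising `pgscan`); it is not oomd's actual algorithm, and `parse_psi` / `pressure_kill_candidate` are hypothetical names.

```python
def parse_psi(text: str) -> dict:
    """Parse a PSI file such as memory.pressure into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {key: float(value) for key, value in (p.split("=") for p in pairs)}
    return result


def pressure_kill_candidate(psi_text: str, pgscan_before: int, pgscan_after: int,
                            limit_pct: float = 20.0) -> bool:
    """True only if pressure exceeds the limit AND the cgroup is actively reclaiming.

    The pgscan condition is the assumption taken from the bullet above: with
    swap=0 and anonymous memory, pgscan stays flat, so this never returns True.
    """
    full_avg10 = parse_psi(psi_text)["full"]["avg10"]
    return full_avg10 > limit_pct and pgscan_after > pgscan_before
```

Under this model the balloon test (`full avg10` near 99%, `Pgscan: 0`) is not a kill candidate, which matches what was observed.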
## Out of scope / follow-ups
- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min
detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure
early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill;
`-N /script` can push a metric). devvm node-exporter is already scraped
(`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a
monitoring-stack Terraform change).
- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in
compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix.
- **Per-user docker isolation**: containers share one `docker.slice` budget, not
per-user. Fine for current usage (krr + short-lived tools).
- **Host-side IO**: the 60/60 mbps cap + the shared `sdc` HDD IO domain are
host-level (bead `code-oflt`); unchanged here.
## Lessons
- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.**
Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean
local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns
the failure back into a contained, local kill.
- **Cap the box, not one surface.** t3 sessions were capped for months while the
same user's tmux was unbounded — and the caps that existed didn't sum to < RAM.
Containment has to reason about every tree and the aggregate.
- **A backstop must protect the operator's way in.** earlyoom `--avoid`s
sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays
reachable to recover; only the agent/browser hogs are eligible victims.
- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.**
oomd's memory-pressure killer only fires on cgroups doing active reclaim
(`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to
reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
correct pairing. A famous tool that "does OOM" still has to be proven to fire
under *your* configuration.


@ -0,0 +1,97 @@
# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
> drift was a real *separate* latent bug fixed in the same change.
**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
the master control-plane phase for the first time — preflight passed, etcd
snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
static-pod-hash window across all internal retries, then auto-rolled-back to
v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
No data loss; no user-facing outage (the master carries control-plane taints, so
no workloads were displaced).
**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
## Root cause — etcd IO starvation on the shared HDD
The new kube-apiserver could not establish/keep a working connection to etcd
during the upgrade because **etcd was IO-starved**. etcd's surviving container log
from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04-23:20 UTC) shows:
- **1,180** `apply request took too long` warnings in 16 minutes;
- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms),
clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying
to bring the new apiserver up.
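Counts like these can be pulled from the container log with a short script. The sketch below assumes etcd's JSON log lines carry `msg` and `took` fields and that the container runtime prefixes each line with a timestamp and stream tag; both are assumptions to check against the real file, and `slow_applies` is a hypothetical helper.

```python
import json
import re

_DURATION = re.compile(r"([0-9.]+)(ms|µs|us|ns|s|m|h)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration string such as '4.3s' or '1m2.5s' to seconds."""
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in _DURATION.findall(text))


def slow_applies(log_lines, threshold_s: float = 1.0):
    """Return (total warnings, durations >= threshold_s) for slow-apply log entries."""
    durations = []
    for line in log_lines:
        start = line.find("{")  # skip any runtime prefix before the JSON payload
        if start < 0:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if entry.get("msg") == "apply request took too long":
            durations.append(go_duration_seconds(entry.get("took", "0s")))
    return len(durations), [d for d in durations if d >= threshold_s]
```

Usage would be along the lines of `slow_applies(open(path, encoding="utf-8"))`, with `path` pointing at the etcd container log.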
A reproduced 1.35.6 apiserver with no etcd dies with
`F instance.go:233 Error creating leases: error creating storage factory: context
deadline exceeded` — the same failure mode a multi-second etcd produces. etcd
lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on
shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto
that spindle:
1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected);
2. kubeadm dumping a full **~400MB etcd DB backup** to
`/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the
etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never
cleans them up), pushing master root fs to **73%**, above the 70% kubelet
image-GC threshold, so image GC churned during the drain too;
3. master-drain pod evictions.
### Correction — it was NOT the OIDC flag swap
`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps
`--authentication-config` (structured multi-issuer OIDC) back to legacy
single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That
was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with
those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly
(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test
etcd. So the auth swap does **not** crash the apiserver; it was a red herring for
the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full
were also ruled out.
## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift
apiserver auth is configured in three places that must agree:
(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes`
+ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest
(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM —
which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates
the manifest from (3), so it would have reverted structured auth → **dashboard +
kubectl SSO break after a successful upgrade** (recoverable: the chain's
post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash.
## Resolution
1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%.
2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps.
3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run).
## Prevention (landed in this change)
| Gap | Fix |
|-----|-----|
| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. |
| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
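The prune rule in the first row can be sketched as follows. The real fix lives in `upgrade-step.sh`; this Python version is only an illustration of the same rule (two name patterns, older than 3 days, best-effort), with `stale_scratch` and `prune_scratch` as hypothetical names and the entries assumed to be directories.

```python
import fnmatch
import os
import shutil
import time

PATTERNS = ("kubeadm-backup-*", "kubeadm-upgraded-manifests*")


def stale_scratch(tmp_dir: str, max_age_days: float = 3.0, now: float = None) -> list:
    """List matching entries under tmp_dir whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    try:
        entries = list(os.scandir(tmp_dir))
    except OSError:
        return stale  # best-effort: a missing or unreadable directory is not an error
    for entry in entries:
        if not any(fnmatch.fnmatch(entry.name, pattern) for pattern in PATTERNS):
            continue
        try:
            if entry.stat().st_mtime < cutoff:
                stale.append(entry.path)
        except OSError:
            continue
    return sorted(stale)


def prune_scratch(tmp_dir: str, max_age_days: float = 3.0) -> list:
    """Remove stale scratch directories; never raises, returns what it targeted."""
    removed = stale_scratch(tmp_dir, max_age_days)
    for path in removed:
        shutil.rmtree(path, ignore_errors=True)
    return removed
```

Keeping the listing separate from the deletion makes a dry run trivial: call `stale_scratch` alone to see what a preflight would reclaim.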
## Lessons
- **Capture the failing component's own logs before concluding.** The `kubeadm
upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
"what config changes," not "why it crashed."
- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
backup copy + drain) onto that spindle. code-oflt is the real fix.
- **Tools that leave per-operation scratch must be reaped.** kubeadm's
`/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
GC'd; 28GB had silently accumulated.
- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
`kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).


@ -0,0 +1,95 @@
# Workstation Claude authentication renewal
## Scope
Every roster user authenticates Claude Code with their own Enterprise identity.
Credentials are never shared between OS users. Claude refreshes its normal OAuth
access token; `claude-auth-sync@<user>.timer` verifies that refresh using real
inference every six hours and backs up only the `claudeAiOauth` object to:
```text
secret/workstation/claude-users/<os-user>
```
The user's unrelated `mcpOAuth` credentials never leave their home directory.
Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at
`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's
path. The service renews the Vault token on every run.
## Normal lifecycle
1. Add the user to `scripts/workstation/roster.yaml` and apply the Vault stack.
2. Run `scripts/workstation/setup-devvm.sh` as root with the admin Vault token.
Its foreground provisioner mints the isolated periodic token and enables the
user's timer. Routine hourly provisioning never needs an admin token.
3. The user completes one initial Enterprise login:
```bash
claude auth login --claudeai --sso --email <enterprise-email>
```
4. Start the first sync immediately instead of waiting for the timer:
```bash
systemctl start claude-auth-sync@<os-user>.service
systemctl status claude-auth-sync@<os-user>.service
```
Success writes no secrets to the journal. The user's private log records `OK` in
`~/.local/state/claude-auth-sync/sync.log`; journald receives the same status with
`identifier=claude-auth-sync` for Loki alerting.
## Automatic recovery
`claude auth status` is not a sufficient health check: it can report logged in
while inference returns HTTP 401. The service therefore runs a minimal Haiku
inference with no session persistence. On failure it:
1. reads the user's latest OAuth object from Vault;
2. atomically merges it into `.credentials.json`, preserving MCP OAuth state;
3. retries inference once;
4. stores the newly refreshed OAuth object back in Vault on success.
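The four recovery steps can be sketched as below. This is an illustration, not the service's code: the inference check and the Vault read/write are injected as callables, and `merge_oauth` / `recover` are hypothetical names. Only the `claudeAiOauth` key and the preservation of other keys such as `mcpOAuth` come from this runbook.

```python
import json
import os
import tempfile


def merge_oauth(credentials_path: str, oauth_object: dict) -> None:
    """Replace only the claudeAiOauth key, keeping every other key (e.g. mcpOAuth)."""
    try:
        with open(credentials_path, encoding="utf-8") as fh:
            current = json.load(fh)
    except FileNotFoundError:
        current = {}
    current["claudeAiOauth"] = oauth_object

    # Write to a temp file in the same directory, then rename over the target,
    # so readers never see a half-written file. mkstemp creates it with mode 0600.
    directory = os.path.dirname(os.path.abspath(credentials_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".credentials.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(current, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, credentials_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def recover(credentials_path: str, inference_ok, read_vault, write_vault) -> str:
    """Run the failure path: restore from Vault, retry once, store back on success."""
    if inference_ok():
        return "OK"
    merge_oauth(credentials_path, read_vault())   # steps 1 and 2
    if not inference_ok():                        # step 3: retry once only
        return "FAIL"
    with open(credentials_path, encoding="utf-8") as fh:
        write_vault(json.load(fh)["claudeAiOauth"])  # step 4
    return "OK"
```

The single retry mirrors the decision described next: the sketch never walks back through older Vault versions.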
Vault KV version history remains available for audit, but the service deliberately
does not cycle through old refresh tokens: providers commonly invalidate rotated
refresh tokens, so replaying old versions can make recovery less deterministic.
## Recovery requiring a person
If both local state and the latest Vault copy fail, the refresh token was revoked,
invalidated, or the Enterprise session requires reauthorization. Run the login as
the affected OS user, then rerun the service:
```bash
claude auth login --claudeai --sso --email <enterprise-email>
systemctl start claude-auth-sync@$(id -un).service
```
If the scoped Vault token expired or drift protection rejected it, rerun the root
provisioner with an admin Vault token after confirming the matching policy exists:
```bash
export VAULT_ADDR=https://vault.viktorbarzin.me
export VAULT_TOKEN="$(cat /home/wizard/.vault-token)"
sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users
```
Never copy another user's `.credentials.json` or scoped Vault token. Never restore
the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user
login and would silently collapse all users onto one identity.
## Verification
```bash
systemctl list-timers 'claude-auth-sync@*'
systemctl status claude-auth-sync@<os-user>.service
journalctl -t claude-auth-sync --since today
```
Inspect Vault metadata, not secret values:
```bash
vault kv metadata get secret/workstation/claude-users/<os-user>
```
Alert `WorkstationClaudeAuthInvalid` fires when any renewal agent logs `FAIL`.


@ -0,0 +1,168 @@
# Runbook: Forgejo open self-service signups
Last updated: 2026-06-19
`forgejo.viktorbarzin.me` allows **open native self-registration** (anyone can
create a local Forgejo account from the web form), gated against bots by two
layers:
1. **Cloudflare Turnstile** captcha on the registration form.
2. **Mandatory email confirmation** — a new account stays inactive until the
user clicks an activation link emailed to the address they registered with.
Two external login sources also work alongside local accounts: the pre-existing
**Sign in with GitHub** OAuth2 login (the **Authentik OAuth2 source is now DISABLED**; see the GitHub
section below). Opening local signups was additive: it did not touch SSO.
Most of this is Terraform-managed in `stacks/forgejo/`. The one exception is the
OAuth2 login *sources* (Authentik, GitHub), which live in Forgejo's own DB and
are added via `forgejo admin auth` — there is no clean Terraform resource for
them (their secrets are mirrored to Vault for recovery).
## What is configured (and where)
All on the `kubernetes_deployment.forgejo` container env in
`stacks/forgejo/main.tf` (Forgejo reads `app.ini` keys from `FORGEJO__<section>__<KEY>`
env vars):
| Setting | Value | Effect |
|---|---|---|
| `service.DISABLE_REGISTRATION` | `false` | Registration is enabled |
| `service.ALLOW_ONLY_EXTERNAL_REGISTRATION` | `false` | Native local sign-up allowed (was `true` = OAuth-only) |
| `service.ENABLE_CAPTCHA` | `true` | Captcha required on the signup form |
| `service.CAPTCHA_TYPE` | `cfturnstile` | Cloudflare Turnstile |
| `service.CF_TURNSTILE_SITEKEY` | widget id | Public; rendered in the page |
| `service.CF_TURNSTILE_SECRET` | from `forgejo-turnstile` Secret | Server-side verification |
| `service.REGISTER_EMAIL_CONFIRM` | `true` | Account inactive until email is confirmed |
| `mailer.*` | see below | Sends the activation email |
| `oauth2_client.ENABLE_AUTO_REGISTRATION` | `true` | First GitHub (OAuth2) sign-in auto-creates the account |
Captcha guards **registration only**: `REQUIRE_CAPTCHA_FOR_LOGIN` is left at the
default `false`, so existing users are not captcha'd on every login.
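The mapping from the `section.KEY` names in the table to container env vars can be sketched as a small helper. `forgejo_env` is a hypothetical name; the sketch covers only plain section and key names like those above and rejects anything else instead of guessing at an escaping scheme.

```python
import re

_SAFE = re.compile(r"^[A-Za-z0-9_]+$")


def forgejo_env(setting: str, value) -> tuple:
    """Turn 'service.ENABLE_CAPTCHA' into ('FORGEJO__service__ENABLE_CAPTCHA', 'true')."""
    section, _, key = setting.partition(".")
    if not key or not _SAFE.match(section) or not _SAFE.match(key):
        raise ValueError(f"unsupported setting name: {setting!r}")
    if isinstance(value, bool):
        value = "true" if value else "false"  # app.ini booleans are lowercase
    return f"FORGEJO__{section}__{key}", str(value)
```

For example, `forgejo_env("service.REGISTER_EMAIL_CONFIRM", True)` yields the name/value pair that would go on the container.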
## Cloudflare Turnstile widget — `turnstile.tf`
- The widget is a Terraform resource: `cloudflare_turnstile_widget.forgejo_signup`
(mode `managed`, domain `forgejo.viktorbarzin.me`), created with the CF Global
API Key already wired in `cloudflare_provider.tf`. The account id is resolved
via `data.cloudflare_accounts`.
- `.id` is the **public sitekey** (passed as a plain env value). `.secret` is the
**secret key**, stored in the `forgejo-turnstile` K8s Secret and injected via
`secret_key_ref`. The secret also lives in TF state (Tier-1 PG, encrypted at
rest) — same trust level as the CF API key already in state.
- Forgejo is **non-proxied** (direct A record to Traefik), but Turnstile is a
client-side JS widget served from `challenges.cloudflare.com`, so proxy status
is irrelevant — the widget works regardless.
**Rotate the widget secret** (e.g. if it leaks):
```
cd stacks/forgejo && vault login -method=oidc
../../scripts/tg apply --non-interactive -replace=cloudflare_turnstile_widget.forgejo_signup
```
This mints a new sitekey+secret, updates the `forgejo-turnstile` Secret, and (via
the Reloader annotation) rolls the Forgejo pod. Verify the new sitekey appears in
the `/user/sign_up` HTML afterwards.
## Mailer — `email-secret.tf` + `[mailer]` env
- Forgejo sends as **`noreply@viktorbarzin.me`** via **`mail.viktorbarzin.me:587`**
with `PROTOCOL=smtp+starttls`. This reuses the same mailserver SASL account
Authentik uses (`stacks/authentik/email-secret.tf`) — one credential, one
rotation point.
- **The host MUST be `mail.viktorbarzin.me`, not `mailserver.mailserver.svc`**:
the mailserver serves the `*.viktorbarzin.me` wildcard cert, which does not
cover the `.svc` DNS name, so STARTTLS cert verification would fail.
`mail.viktorbarzin.me` resolves in-cluster (→ `10.0.20.1`) and matches the cert.
- The password is synced from Vault `secret/authentik` → `smtp_password` by the
`forgejo-email` ExternalSecret (ESO `ClusterSecretStore vault-kv`) into the
`forgejo-email` K8s Secret (key `PASSWD`), referenced by `FORGEJO__mailer__PASSWD`.
- The deployment carries `reloader.stakater.com/auto: "true"`, so a rotation of
either secret rolls the pod automatically.
## GitHub sign-in (OAuth2 source)
People can **sign up / sign in with GitHub** — the active Forgejo OAuth2 source. GitHub sign-up is **zero-click** (auto-registration creates the account on first login).
> **Authentik is DISABLED on purpose** (2026-06-19). `ENABLE_AUTO_REGISTRATION` is GLOBAL across OAuth sources, and Authentik's `preferred_username` claim is the user's **email** — invalid as a Forgejo username, which 500'd auto-create. Viktor's Forgejo email (`me@viktorbarzin.me`) does not match his Authentik email (`vbarzin@gmail.com`), so account-linking can't bridge it. Per his directive GitHub was prioritised; the Authentik source was deactivated via `UPDATE login_source SET is_active=0 WHERE name='Authentik'` in the forgejo MySQL DB. **Re-enable** with `is_active=1` after fixing Authentik's username claim.
- **Source** (Forgejo DB, *not* Terraform — added via CLI, same as Authentik):
```
forgejo admin auth add-oauth --name github --provider github --key <client-id> --secret <client-secret>
```
The source **name must stay `github`** — it is part of the callback URL
(`/user/oauth2/github/callback`) registered on the GitHub side, so renaming it
breaks the callback. `forgejo admin auth list` shows it (ID 2).
- **GitHub OAuth App**: a classic OAuth App under the ViktorBarzin GitHub account
(Settings → Developer settings → OAuth Apps). Homepage
`https://forgejo.viktorbarzin.me`, callback
`https://forgejo.viktorbarzin.me/user/oauth2/github/callback`. GitHub has **no
API to create OAuth Apps** — creating it is a browser-only step.
- **Credentials**: Vault `secret/viktor` → `forgejo_github_oauth_client_id` /
`forgejo_github_oauth_client_secret` (kept for recovery; the live values are in
Forgejo's DB).
- **Auto-registration**: `FORGEJO__oauth2_client__ENABLE_AUTO_REGISTRATION=true`
(`main.tf`) makes a first GitHub login create the account directly. The GitHub
identity is the trust gate for this path — the Turnstile captcha + email
confirmation only apply to the **native** signup form, not OAuth.
**Rotate the GitHub client secret** — generate a new one in the GitHub OAuth App, then:
```
vault kv patch secret/viktor forgejo_github_oauth_client_secret=<new>
POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}')
kubectl -n forgejo exec "$POD" -- su-exec git forgejo admin auth update-oauth --id 2 --secret <new>
```
(Source id from `forgejo admin auth list`.)
**Recreate after a Forgejo DB loss**: the source is not in Terraform, so after a
from-scratch restore, re-run the `add-oauth` command above with the Vault creds.
## Re-closing / tightening signups
Edit `stacks/forgejo/main.tf` and `scripts/tg apply` (or commit + push — CI
applies):
- **OAuth-only again** (revert this change): set
`FORGEJO__service__ALLOW_ONLY_EXTERNAL_REGISTRATION` back to `"true"`.
- **No new accounts at all** (admins create them): set
`FORGEJO__service__DISABLE_REGISTRATION` to `"true"`.
- **Require admin approval per signup** (strongest, instead of email confirm):
set `REGISTER_MANUAL_CONFIRM=true` **and** `REGISTER_EMAIL_CONFIRM=false`
(Forgejo makes the two mutually exclusive). New accounts then queue under Site
Administration → Identity & Access → Accounts until an admin activates them.
## Handling spam / abuse accounts
A signup that clears Turnstile + email confirmation is still a real, low-privilege
Forgejo user. To deal with abuse:
- **Ban/delete** via Site Administration → Identity & Access → Accounts, or
`forgejo admin user delete --username <name>` inside the pod
(`kubectl -n forgejo exec deploy/forgejo -- forgejo admin user ...`).
- New users get Forgejo defaults (they can create repos/orgs). If abuse warrants,
tighten with `[service].DEFAULT_ALLOW_CREATE_ORGANIZATION=false` and/or
`[repository].MAX_CREATION_LIMIT` (add as env vars; out of scope for the initial
open-signups change).
## Operational notes
- The Forgejo deployment is **single-replica with `Recreate` strategy**, so any
config apply briefly restarts the pod (git remote + OCI registry unavailable for
a few seconds). Expected, not an incident.
- The signup page is **not** behind Cloudflare's bot-fight (Forgejo is
non-proxied) — Turnstile + email confirmation are the bot gate. CrowdSec +
Traefik rate limiting still front the host.
## Verify it's working
```
POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}')
# Env present:
kubectl -n forgejo exec "$POD" -- env | grep -E 'ALLOW_ONLY_EXTERNAL|ENABLE_CAPTCHA|CAPTCHA_TYPE|CF_TURNSTILE_SITEKEY|REGISTER_EMAIL_CONFIRM|mailer__ENABLED'
# Turnstile widget rendered on the form:
kubectl -n forgejo exec "$POD" -- wget -qO- http://localhost:3000/user/sign_up | grep -oE 'cf-turnstile|data-sitekey="[^"]*"'
# Secrets healthy:
kubectl -n forgejo get externalsecret forgejo-email
kubectl -n forgejo get secret forgejo-email forgejo-turnstile
```
A full real-world check is to register a throwaway account and confirm the
activation email arrives. The mailer transport (server/port/cert/cred) is shared
with Authentik, which is already in production for external user enrollment.


@ -0,0 +1,301 @@
# Goldmane Flow Trail — east-west "who-talks-to-whom" observability
> As-built runbook for the Calico Goldmane + Whisker flow plane and the
> `goldmane-edge-aggregator` durable audit trail. Design + rationale:
> [ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> Glossary: `CONTEXT.md` → **Service identity**, **Goldmane / Whisker**.
> Implements infra issues #57 (Whisker ingress), #58 (aggregator), #61
> (monitoring), #62 (egress allowlist queries), #63 (these docs).
## What the trail is
Three layers turn raw east-west traffic into a queryable, durable record of
which Service talks to which. **Service identity = the workload's namespace**
(primary), refined by a `service-identity` label in the few multi-Service
namespaces (`monitoring`, `kube-system`, `dbaas`) — see ADR-0014.
| Layer | Component | Lifetime | Where it lives |
|---|---|---|---|
| **Live map** | Calico **Goldmane** + **Whisker** | ~60-min in-memory ring buffer (lost on Goldmane restart) | `calico-system`; Whisker UI at `whisker.viktorbarzin.me` |
| **Durable trail** | `goldmane-edge-aggregator` (`aggregate` mode) | persistent | CNPG Postgres DB `goldmane_edges`, table `edge` |
| **Notification** | `goldmane-edges-digest` CronJob (`digest` mode) | daily | Slack `#alerts` |
**Goldmane** aggregates identity-stamped flows (namespace / pod / workload /
labels + allow-deny + policy-trace) streamed from Felix (the existing
`calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer —
**nothing is written to etcd or the K8s API** (the etcd-cost constraint that
drove the whole design). **Whisker** is its live web UI. Because the ring
buffer is *not* a trail (a Goldmane restart loses the window), the
`goldmane-edge-aggregator` consumes Goldmane's gRPC `Flows.Stream` API over
mTLS and upserts the unique **namespace-pair edge set** into Postgres; a daily
CronJob posts first-seen edges to Slack.
The edge set is deliberately **low-cardinality** — one row per
`(src_ns, dst_ns, action)`, *not* per-pod or per-port — so the table stays
small no matter how much traffic flows.
## Where the data lives
### Whisker UI — live, ~60 min
- `https://whisker.viktorbarzin.me` (Authentik-gated — Whisker ships no own
login; `auth = "required"`). Shows the live flow stream + a service graph for
roughly the last hour. Use it for "what is talking right now"; it is **not**
history.
- In-cluster: `Service goldmane:7443` (gRPC/mTLS), `Service whisker:8081`
(HTTP), both in `calico-system`.
### CNPG `goldmane_edges` — durable
- Postgres DB `goldmane_edges` on the CNPG cluster
(`pg-cluster-rw.dbaas.svc.cluster.local:5432`). One table:
```
edge(src_ns text, dst_ns text, action text,
first_seen timestamptz, last_seen timestamptz, flow_count bigint,
PRIMARY KEY (src_ns, dst_ns, action))
```
- `action` is one of `allow` / `deny` / `pass` / `unspecified` (normalised Goldmane
action).
- **Self-edges (`src_ns == dst_ns`) and empty-namespace flows** (host-endpoint
/ public-internet) are **dropped** — the trail is about in-cluster service
relationships only. (Egress to the public internet is therefore NOT in this
table; it lives in the Wave-1 Calico flow-log path — see security.md.)
- A **"new edge"** = a row whose `first_seen` falls inside the digest window.
- Role `goldmane_edges` (Vault-rotated, 7-day) owns the DB. The `edge` table
is created idempotently by the aggregator at startup (canonical DDL also in
the repo at `migrations/0001_edge.sql`).
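The edge semantics above (one row per namespace pair and action, self-edges and empty namespaces dropped, "new" meaning `first_seen` inside the window) can be sketched with an in-memory SQLite database standing in for the CNPG Postgres. This is not the aggregator's code: `record_flow` and `new_edges` are hypothetical names, timestamps are ISO-8601 strings rather than `timestamptz`, and incrementing `flow_count` by one per call is an assumption.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS edge (
    src_ns TEXT NOT NULL, dst_ns TEXT NOT NULL, action TEXT NOT NULL,
    first_seen TEXT NOT NULL, last_seen TEXT NOT NULL, flow_count INTEGER NOT NULL,
    PRIMARY KEY (src_ns, dst_ns, action)
)
"""

# first_seen is set only on insert; a repeat sighting bumps last_seen and the counter.
UPSERT = """
INSERT INTO edge (src_ns, dst_ns, action, first_seen, last_seen, flow_count)
VALUES (?, ?, ?, ?, ?, 1)
ON CONFLICT (src_ns, dst_ns, action) DO UPDATE SET
    last_seen = excluded.last_seen,
    flow_count = edge.flow_count + 1
"""


def record_flow(conn, src_ns: str, dst_ns: str, action: str, seen_at: str) -> bool:
    """Upsert one observed flow; return False if the flow is dropped."""
    if not src_ns or not dst_ns or src_ns == dst_ns:
        return False  # empty-namespace flows and self-edges are not part of the trail
    conn.execute(UPSERT, (src_ns, dst_ns, action, seen_at, seen_at))
    return True


def new_edges(conn, since: str) -> list:
    """Edges whose first_seen is later than `since` (what a digest would report)."""
    rows = conn.execute(
        "SELECT src_ns, dst_ns, action FROM edge WHERE first_seen > ? ORDER BY first_seen DESC",
        (since,),
    )
    return rows.fetchall()
```

Because the primary key is the `(src_ns, dst_ns, action)` triple, the row count is bounded by the number of namespace pairs, not by traffic volume.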
### Slack `#alerts` — daily digest
> **Channel note (2026-06-25):** posts to **`#alerts`**, not `#security`. The shared `alertmanager_slack_api_url` incoming webhook's Slack app is not a member of `#security`, so a channel override there returns HTTP `404 channel_not_found` (this almost certainly also breaks alertmanager's `slack-security` receiver — verify separately). To route the digest (and security alerts) to `#security`: invite that webhook's Slack app to `#security`, then set `SLACK_CHANNEL=#security` in `stacks/goldmane-edge-aggregator` and re-apply.
- CronJob `goldmane-edges-digest` (08:00 Europe/London) posts edges first seen
in the last 24h. Quiet when there are none. Reuses the existing alert-digest
Slack incoming webhook (Vault `secret/viktor` → `alertmanager_slack_api_url`)
— no new webhook was created.
## How to enable / disable
### Goldmane + Whisker (the flow plane)
Operator CRs in **`stacks/calico/main.tf`** — NOT the Helm `goldmane`/`whisker`
flags (those stay `false`; the operator's own `installation`/`apiServer` are
operator-managed via the `goldmanes`/`whiskers.operator.tigera.io` CRDs):
- `kubectl_manifest.goldmane` (kind `Goldmane`) — creating it makes the operator
re-render `calico-node` with the `FELIX_FLOWLOGSGOLDMANESERVER` env (the
operator auto-wires Felix — **do NOT patch FelixConfiguration**), triggering a
supervised `calico-node` DaemonSet roll. Yields `Deployment` + `Service
goldmane:7443`.
- `kubectl_manifest.whisker` (kind `Whisker`, `depends_on` goldmane;
`notifications = Disabled`). Yields `Deployment` + `Service whisker:8081`.
**To disable:** delete those two CRs and re-apply `stacks/calico`. Reversible
toggle (Goldmane is tech-preview in OSS Calico 3.30 — the main standing risk per
ADR-0014).
### Whisker public ingress (infra #57)
Also in `stacks/calico/main.tf`:
- `module "ingress_whisker"` (`ingress_factory`, `auth = "required"`,
`dns_type = "proxied"`) → `whisker.viktorbarzin.me`.
- `kubernetes_network_policy_v1.whisker_allow_traefik` — **required alongside the
ingress**: the operator's own `whisker` NetworkPolicy (owned by the Whisker CR)
is `policyTypes: [Ingress]` with no rules = default-deny ingress to the pod.
This additive NP ORs in an allow for `namespaceSelector
kubernetes.io/metadata.name=traefik` on TCP 8081. Without it Traefik 502s.
### The aggregator + digest (the durable trail) — `stacks/goldmane-edge-aggregator`
A Tier-1 stack (PG state) mirroring the claude-memory pattern. `scripts/tg
apply` from `stacks/goldmane-edge-aggregator/`. It provisions: the namespace,
the mTLS client material, the Postgres DB-init Job, the `DATABASE_URL`
ExternalSecret (Vault static role `pg-goldmane-edges`), the Slack ExternalSecret,
the `aggregate` Deployment, and the `digest` CronJob. **To disable the trail
without touching the flow plane:** scale `deployment/goldmane-edge-aggregator` to
0 (transient) or remove the stack (permanent) — Goldmane/Whisker keep running.
Image: `ghcr.io/viktorbarzin/goldmane-edge-aggregator` (PRIVATE) — the
`goldmane-edge-aggregator` namespace must be in the `ghcr-credentials` Kyverno
allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`,
`local.ghcr_private_namespaces`) or pulls 401. Code repo:
`~/code/goldmane-edge-aggregator` (see its `README.md` + `DEPLOY.md`).
## mTLS cert — the REUSE decision (cert-reuse gotcha)
The aggregator dials `goldmane:7443` over **mutual TLS**. Goldmane requires the
client cert to chain to the **Tigera CA**, but it does **NOT authorize by client
identity** — any Tigera-CA-signed cert is accepted.
Rather than copy the Tigera CA **private key** into Terraform state to mint our
own cert (a needless CA-key exposure; the `hashicorp/tls` provider also clashes
with this repo's global generate-providers/lockfile pattern), the stack
**REUSES the operator-minted, Tigera-CA-signed `whisker-backend-key-pair`
Secret** (`calico-system`), copying its `tls.crt`/`tls.key` into the
`goldmane-client-tls` Secret in the aggregator namespace. The CA *bundle* that
verifies Goldmane's serving cert (`tigera-ca-bundle` ConfigMap, key
`tigera-ca-bundle.crt`) is likewise copied verbatim (a ConfigMap can't be
cross-namespace-mounted).
> **GOTCHA — if the operator rotates `whisker-backend-key-pair`, re-apply
> `stacks/goldmane-edge-aggregator`** to re-sync the copied cert. Symptom of a
> stale copy: the `aggregate` pod logs TLS handshake / `Flows.Stream` failures
> and no `last_seen` updates land in the `edge` table. Hardening follow-up
> (noted in the stack): mint an own-identity cert in-namespace if Whisker is ever
> removed (which would delete the reused source Secret).
The Deployment leaves `GOLDMANE_HOST=goldmane.calico-system.svc.cluster.local:7443`
and the default cert/CA paths; the default ServerName (host sans port) is a SAN
on Goldmane's live serving cert, so no `GOLDMANE_SERVER_NAME` /
`GOLDMANE_TLS_INSECURE` override is needed.
## How to query who-talks-to-whom
`psql` into the DB (creds: Vault static role `static-creds/pg-goldmane-edges`, or
exec a CNPG pod). All queries are against the single `edge` table.
```sql
-- Everything talking to a namespace (inbound), most-active first
SELECT src_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE dst_ns = '<ns>' ORDER BY flow_count DESC;
-- Everything a namespace talks TO (outbound)
SELECT dst_ns, action, flow_count, first_seen, last_seen
FROM edge WHERE src_ns = '<ns>' ORDER BY last_seen DESC;
-- New edges in the last 24h (what the digest reports)
SELECT src_ns, dst_ns, action, flow_count, first_seen
FROM edge WHERE first_seen > now() - interval '24 hours'
ORDER BY first_seen DESC;
-- Any DENIED edges (policy is dropping this pair)
SELECT src_ns, dst_ns, flow_count, last_seen
FROM edge WHERE action = 'deny' ORDER BY last_seen DESC;
-- Full edge set as a graph adjacency list
SELECT src_ns, dst_ns, action, flow_count FROM edge ORDER BY src_ns, dst_ns;
```
For the **live** (sub-hour) view including pod/port detail, use the Whisker UI —
the `edge` table intentionally aggregates that away.
## Deriving the Wave-1 egress allowlist from the edge table (infra #62)
The durable edge set is a faster, identity-stamped data source for the existing
**observe-then-enforce** egress effort (beads `code-8ywc`; snapshot
`docs/architecture/wave1-egress-observation-2026-05-22.md`) than the original
iptables-`LOG` → journald → Loki path (ADR-0014 consequence: "Enforcement gains
a better data source"). It replaces the *internal* (namespace-to-namespace) leg
of the allowlist; **external/public-internet egress is NOT in this table** (empty
dst namespace, dropped) — for those destinations keep using the Calico flow-log
path described in security.md.
**Per-namespace internal egress allowlist** — the set of in-cluster namespaces a
given source is *observed* talking to with `action='allow'`:
```sql
-- Internal egress allowlist for one namespace (feeds its NetworkPolicy)
SELECT DISTINCT dst_ns
FROM edge
WHERE src_ns = '<ns>' AND action = 'allow'
ORDER BY dst_ns;
```
```sql
-- Full internal egress matrix for all namespaces at once
SELECT src_ns, array_agg(DISTINCT dst_ns ORDER BY dst_ns) AS allowed_dst_ns
FROM edge
WHERE action = 'allow'
GROUP BY src_ns
ORDER BY src_ns;
```
```sql
-- Sanity: namespaces with a DENY edge already (policy is biting; investigate
-- before tightening further)
SELECT DISTINCT src_ns, dst_ns FROM edge WHERE action = 'deny';
```
**How this feeds enforcement (scope):** the derived `dst_ns` set is the
*internal* half of a namespace's egress allowlist — it tells you which
in-cluster namespaces to permit before flipping that namespace to default-deny.
The universal baseline (kube-dns :53, often dbaas :3306/:5432, redis :6379) and
the external destinations still come from the Wave-1 observation snapshot.
**Enforce-flips remain OUT OF SCOPE** here — this is observe-and-derive only;
the phased per-namespace default-deny rollout (starting `recruiter-responder`)
is tracked under `code-8ywc`. Cross-links:
[security.md → NetworkPolicy Default-Deny Egress](../architecture/security.md#networkpolicy-default-deny-egress-wave-1--observe-then-enforce-tier-34),
[wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md),
[ADR-0014](../adr/0014-service-identity-and-east-west-observability.md).
> **Caveat (same as the Wave-1 snapshot):** an edge only exists if it was
> *observed*. A weekly CronJob or a 7-day Vault rotation may not have fired yet —
> collect ≥7 days of edges before treating a namespace's `allow` set as
> complete. The `first_seen` column tells you how long an edge has been known;
> the digest surfaces brand-new ones daily.
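As an illustration of the derivation and the ≥7-day caveat together, here is a small sketch that builds the per-namespace allowlist from rows shaped like the `edge` columns used above. It is an assumption-laden example, not repo code: the tuple layout, the function name, and the use of the oldest `first_seen` as a proxy for "how long this namespace has been observed" are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Observation window recommended by the caveat above.
MIN_OBSERVATION = timedelta(days=7)


def derive_allowlist(rows, now):
    """Build {src_ns: {"allowed_dst_ns": [...], "observed_long_enough": bool}}.

    rows: iterable of (src_ns, dst_ns, action, first_seen) tuples, where
          first_seen is a timezone-aware datetime.
    now:  timezone-aware datetime used as the reference point.
    """
    destinations = {}
    oldest_seen = {}
    for src_ns, dst_ns, action, first_seen in rows:
        if action != "allow":
            continue  # deny edges never belong in an allowlist
        destinations.setdefault(src_ns, set()).add(dst_ns)
        if src_ns not in oldest_seen or first_seen < oldest_seen[src_ns]:
            oldest_seen[src_ns] = first_seen
    return {
        src_ns: {
            "allowed_dst_ns": sorted(dsts),
            # Proxy only: the oldest allow edge is at least 7 days old.
            "observed_long_enough": now - oldest_seen[src_ns] >= MIN_OBSERVATION,
        }
        for src_ns, dsts in destinations.items()
    }
```

A namespace flagged `observed_long_enough: False` should be treated as still in the observe phase, whatever its current `allow` set looks like.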
## Monitoring & health (infra #61)
The aggregator pod has **no `/metrics` endpoint** — health is inferred from
kube-state-metrics. Three complementary signals (memory ids 6598, 6599;
see also [monitoring.md → Security Alerts](../architecture/monitoring.md#security-alerts-wave-1--planned-beads-code-8ywc)):
| Signal | What | Where |
|---|---|---|
| **`AggregatorDown`** | `kube_deployment_status_replicas_available{namespace="goldmane-edge-aggregator",deployment="goldmane-edge-aggregator"} < 1` for 15m → warning | Prometheus alert group `Network Observability (Goldmane)` in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`; routes `slack-warning` → `#alerts` |
| **`DigestFailing`** | `kube_job_status_failed{...job_name=~"goldmane-edges-digest.*"} > 0` within 24h, for 30m → warning | same alert group → `#alerts` |
| **cluster-health #48** | `check_goldmane_aggregator` reads the Deployment's `Available` condition (missing or not-Available → FAIL) | `scripts/cluster_healthcheck.sh` (human / `--quiet` / `--json` modes; emits `goldmane_aggregator`) |
The two alert layers are deliberately complementary: `AggregatorDown` →
**no new edges land** in the DB; `DigestFailing` → **edges still land but nobody
is told**. A freshness probe (#61b) was intentionally skipped — `AggregatorDown`
is the agreed floor.
## Troubleshooting
**Whisker UI 502 / unreachable.** The additive
`kubernetes_network_policy_v1.whisker_allow_traefik` is missing or the
operator's default-deny `whisker` NP regenerated — re-apply `stacks/calico`. A
brand-new ingress host is also invisible to LAN split-horizon until the hourly
`technitium-ingress-dns-sync` runs (memory #5349); test meanwhile with
`curl -sSI --resolve whisker.viktorbarzin.me:443:10.0.20.203 https://whisker.viktorbarzin.me`
(expect a 302 to Authentik — the gate working).
**No new `last_seen` updates / `AggregatorDown` firing.** Check the `aggregate`
pod logs (`kubectl logs -n goldmane-edge-aggregator deploy/goldmane-edge-aggregator`).
Common causes, in order:
1. **Stale mTLS cert** — the operator rotated `whisker-backend-key-pair`; re-apply
`stacks/goldmane-edge-aggregator` (see cert-reuse gotcha above). Symptom: TLS
handshake / `Flows.Stream` errors.
2. **Stale DB password** — the 7-day Vault rotation bounced the credential but
the pod kept the old one. The Deployment carries
`secret.reloader.stakater.com/reload: goldmane-edges-db-creds`; if it's not
restarting on rotation, verify the Reloader annotation and the ExternalSecret.
3. **Goldmane restarted** — the in-memory window was lost (expected); the stream
reconnects automatically and resumes upserting. No data loss in the DB
(only the sub-hour live window in Whisker is gone).
**Digest never posts / `DigestFailing` firing.** Inspect the most recent
`goldmane-edges-digest-*` Job (`kubectl get jobs -n goldmane-edge-aggregator`;
`kubectl logs job/<name>`). The CronJob's `ttl_seconds_after_finished=86400` GCs
pods after a day, so check soon after a failed run. With `SLACK_WEBHOOK_URL`
empty the binary forces a dry-run (no post) — verify the `goldmane-edges-slack`
ExternalSecret resolved. A dry run / smoke test: run the image with `args:
["digest"]` + `DRY_RUN=1` to print the message instead of POSTing.
> Known state (2026-06-25): the digest CronJob's first Job **failed** and it has
> never successfully posted (`lastSuccessfulTime` empty) — the digest leg is the
> live gap; `DigestFailing` is catching it. Edges still land in the DB via the
> `aggregate` Deployment; only the `#security` notification is affected.
> Investigation/fix belongs to the aggregator slice (#58/#60), not monitoring.
**No edges at all in the table.** Confirm Goldmane is enabled
(`kubectl get goldmane,whisker -A`) and `calico-node` rolled with the
`FELIX_FLOWLOGSGOLDMANESERVER` env; confirm the `goldmane-edges-db-init` Job
completed; confirm the aggregator pod is `Running` and not `ImagePullBackOff`
(ghcr allowlist).
## Related
- [ADR-0014 — Service identity & east-west observability](../adr/0014-service-identity-and-east-west-observability.md)
- [security.md — NetworkPolicy Default-Deny Egress + east-west flow observability](../architecture/security.md)
- [monitoring.md — east-west flow observability + alerts](../architecture/monitoring.md)
- [wave1-egress-observation-2026-05-22.md](../architecture/wave1-egress-observation-2026-05-22.md)
- `CONTEXT.md` glossary — **Service identity**, **Goldmane / Whisker**
- Code: `~/code/goldmane-edge-aggregator` (`README.md`, `DEPLOY.md`); stacks
`stacks/goldmane-edge-aggregator`, `stacks/calico`


@ -2,9 +2,9 @@
## Overview ## Overview
Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s
VMs are upgraded automatically by a weekly detection CronJob that seeds a nodes (k8s-master + k8s-node1..6) are upgraded automatically by a nightly
chain of small phase Jobs. Each Job is **pinned to a node that is NOT its detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its
drain target** — so no pod in the chain can preempt itself. drain target** — so no pod in the chain can preempt itself.
The chain (23:00 UTC nightly): The chain (23:00 UTC nightly):
@ -36,14 +36,17 @@ envsubst on /template/job-template.yaml | kubectl apply -f -
Job 0 — preflight (pinned: k8s-node1) Job 0 — preflight (pinned: k8s-node1)
├── compat-gate: addon/API/containerd support for target (else BLOCK+alert)
├── All nodes Ready + no Mem/Disk pressure ├── All nodes Ready + no Mem/Disk pressure
├── halt-on-alert (kured-style ignore-list) ├── halt-on-alert (kured-style ignore-list)
├── 24h-quiet baseline (no Ready transitions <24h ago) ├── 24h-quiet baseline (no Ready transitions <24h ago)
├── kubeadm upgrade plan matches target ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume)
├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block)
├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups)
├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s)
├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── Trigger backup-etcd Job, wait, verify snapshot byte count
├── SSH master: containerd skew fix (if master < workers) ├── SSH master: containerd skew fix (if master < workers)
├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor) ├── SSH all 7 nodes: apt repo URL rewrite (only kind=minor)
└── spawn_next → k8s-upgrade-master-<target_version> └── spawn_next → k8s-upgrade-master-<target_version>
@ -87,6 +90,59 @@ Job 6 — postflight (no pinning)
**adding a node needs no change** — the chain upgrades every worker still **adding a node needs no change** — the chain upgrades every worker still
off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed). off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed).
### Auto-upgrade compat gate
The chain now attempts **patch AND minor** upgrades autonomously — but before any
mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks)
the upgrade** if any of these hold for the detected target:
- a **critical addon's running version doesn't support the target k8s minor**
(target minor > the addon's highest-supported minor in the compat matrix),
- an **in-use deprecated API is removed at/before the target** — measured live
from `apiserver_requested_deprecated_apis` (something is still calling a
group/version that the target k8s drops), or
- a **node's containerd is below the target's floor** (the minimum containerd the
target k8s requires).
The addon check is **scoped to minor jumps**: a target **at or below the running
k8s minor** (a patch) crosses into no new minor, so the running cluster is itself
proof the installed addons work there — `compat-gate.py` skips the addon ceilings
when `target_minor <= running_minor`. (Without this a conservative ceiling such as
ESO 0.12 → 1.31 would false-block a 1.34.x **patch** on a cluster already running
1.34 — fixed 2026-06-20.) The deprecated-API and containerd checks are naturally
inert for a patch (no API removal or containerd floor occurs inside a minor).
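The addon part of that rule can be sketched in a few lines. This is an illustration only, not the contents of `compat-gate.py`: it assumes the matrix is a flat `addon -> "MAJOR.MINOR"` mapping, and the function names are hypothetical.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Reduce 'v1.34.9' or '1.34' to a comparable (major, minor) pair."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def addon_blockers(target: str, running: str, addon_ceilings: dict) -> list:
    """Return human-readable reasons why addons block the target, if any.

    addon_ceilings maps an addon name to the highest k8s minor it supports
    (assumed shape of the compat matrix).
    """
    target_minor = parse_minor(target)
    if target_minor <= parse_minor(running):
        # Patch (or same-minor) upgrade: the running cluster already proves
        # the installed addons work on this minor, so skip the ceilings.
        return []
    return [
        f"{addon} supports up to {ceiling}, target is {target}"
        for addon, ceiling in sorted(addon_ceilings.items())
        if parse_minor(ceiling) < target_minor
    ]
```

With a ceiling of `1.31` for one addon, a `1.34.x` patch on a cluster already running 1.34 yields no blockers, while a jump to `1.35.0` reports that addon.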
This is the **"auto-upgrade when we can, halt + alert when we can't"** contract.
**On a block**, the gate:
- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked`
Prometheus alert),
- Slacks the **specific reasons** (which addon/API/node, current vs required), and
- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet,
this is not a failure). Because the block happens **before any mutation, no
rollback is involved**; nothing was changed.
**To clear a block**: upgrade the named addon (or migrate the API caller off the
deprecated group/version, or bump containerd on the named node) so the offending
condition no longer holds. The **next nightly run then proceeds automatically**;
no manual chain restart needed.
The **compat matrix** lives in
`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest
supported k8s minor`, populated from each addon's own compatibility docs. **Keep
it current**; the gate reads it on every run. Gate logic:
`stacks/k8s-version-upgrade/scripts/compat-gate.py`.
> **Both** detector probes against `pkgs.k8s.io` follow the 302 redirect via `-L`:
> the next-minor *availability* probe (`HEAD .../v<NEXT_MINOR>/deb/Release`) **and**
> the next-minor *patch* probe (`GET .../v<NEXT_MINOR>/deb/Packages`, which resolves
> the exact `X.Y.Z`). The Packages probe lacked `-L` until 2026-06-20 — `pkgs.k8s.io`
> 302-redirects every request, so without it curl returned an empty body,
> `NEXT_MINOR_PATCH` came back empty, and the detector silently fell through to
> "No upgrade needed". That is why the **2026-06-19 nightly run no-op'd** instead of
> resolving the 1.35 target. With both probes on `-L`, **minor versions are detected**
> and gated behind the compat check above before the chain acts on them.
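To make the failure mode concrete, here is a hedged sketch of resolving the exact `X.Y.Z` from a Debian `Packages` index body. It is not the detector's actual code (which, per the note above, uses curl); the function name and the choice of `kubeadm` as the package to inspect are assumptions. The point is that an empty body, which is what a non-followed 302 produces, resolves to nothing.

```python
import re


def latest_patch(packages_text: str, package: str = "kubeadm"):
    """Return the highest 'X.Y.Z' for `package` in a Packages index, or None.

    A Packages index is a series of 'Key: value' stanzas separated by blank
    lines. An empty body (e.g. an unfollowed redirect) yields None.
    """
    versions = []
    for stanza in packages_text.strip().split("\n\n"):
        fields = dict(
            line.split(": ", 1)
            for line in stanza.splitlines()
            if ": " in line and not line.startswith(" ")
        )
        if fields.get("Package") != package:
            continue
        match = re.match(r"(\d+)\.(\d+)\.(\d+)", fields.get("Version", ""))
        if match:
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        return None
    return ".".join(str(part) for part in max(versions))
```

A caller should treat `None` as "probe failed", not as "no upgrade needed", which is the distinction the pre-fix detector lost.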
## Components ## Components
### Shared resources (one-time, Terraform-managed) ### Shared resources (one-time, Terraform-managed)
@ -117,8 +173,26 @@ Pushed by upgrade-step.sh during phase execution; observed by the
- **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout.
- **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently.
- **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor.
- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). - **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires.
- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. - **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert.
- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade.
### Nightly upgrade report (Slack)
CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`,
default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London
alert-digest) posts ONE Slack summary each morning of the previous night's run:
running version, detector freshness, detected target + kind, the outcome
(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded /
🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads
the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh
blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap.
Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`.
This is the day-to-day visibility layer (it does NOT replace the alerts above —
those fire on problems; this reports the outcome every night). Manual run:
`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test`
(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip
`K8sUpgradeChainJobFailed`).
### CoreDNS is NOT upgraded by kubeadm here ### CoreDNS is NOT upgraded by kubeadm here
@ -150,27 +224,54 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names
## Common Operations ## Common Operations
### Post-upgrade: restore apiserver OIDC (REQUIRED after any control-plane bump) ### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24)
`kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml`
and drops the `--authentication-config` flag**, silently disabling apiserver from kubeadm-config**. apiserver auth uses a structured multi-issuer
OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get `--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to
401). This is not auto-detected (the `rbac` stack's `null_resource` trigger is a still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade
content hash that doesn't change). After any control-plane upgrade, re-apply: reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does
NOT crash on this — verified by isolated repro; it's recoverable via the restore
script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue —
etcd IO starvation**, not this drift; post-mortem:
`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`.
**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now
**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting
`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of
its remote script. So kubeadm regenerates a **correct** manifest and the apiserver
upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the
image change. Zero live impact (the CM is read only during an upgrade).
**Backstops:**
- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does
NOT block — the drift only breaks SSO, which is recoverable) if
`--authentication-config` would still be dropped.
- The `rbac` stack still publishes its restore script to the
`kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on
master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with
auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also*
re-reconciles kubeadm-config. Self-skips when master is already at target.
**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the
chain logged `WARN: --authentication-config absent after re-apply`:
```bash ```bash
cd stacks/rbac cd stacks/rbac
TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \ TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \
VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \ VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \
--non-interactive -target=module.rbac.null_resource.apiserver_oidc_config --non-interactive -target=module.rbac.null_resource.apiserver_oidc_config \
-replace=module.rbac.null_resource.apiserver_oidc_config
``` ```
(`ssh_private_key` must be a key authorized for `wizard@<master>`; it is not yet (`-replace` is **required** — the `null_resource` trigger is a content hash that
wired from Vault.) The provisioner re-writes `/etc/kubernetes/pki/auth-config.yaml` doesn't change, so a plain `-target` apply is a no-op. `ssh_private_key` must be a
(both `kubernetes` + `k8s-dashboard` issuers), re-adds the flag, and key authorized for `wizard@<master>`.) The provisioner re-writes
health-gates `/livez` with auto-rollback. Verify: `curl -sk `/etc/kubernetes/pki/auth-config.yaml` (both `kubernetes` + `k8s-dashboard`
https://localhost:6443/livez` on the master = `ok`, and the apiserver manifest issuers), re-adds the flag, and health-gates `/livez` with auto-rollback. Verify:
contains `--authentication-config`. See `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`. `curl -sk https://localhost:6443/livez` on the master = `ok`, and the apiserver
manifest contains `--authentication-config`. See
`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`.
### Verify the pipeline is healthy ### Verify the pipeline is healthy
```bash ```bash
@ -356,6 +457,13 @@ kill %1
## Past Incidents ## Past Incidents
### 2026-06-18 — Preflight gate-4 wedged a partial (master-ahead) chain
- A prior 1.34.9 run upgraded k8s-master + k8s-node1, then stopped; node2-6 stayed on 1.34.8.
- Every nightly preflight then aborted at the **kubeadm-plan-target gate**: `kubeadm upgrade plan` runs on k8s-master, already on 1.34.9, so it emitted no `kubeadm upgrade apply vX.Y.Z` line → empty `plan_target` → `'' != '1.34.9'` → `exit 1`. Deterministic, not transient (gates 1-3 all green; no critical alert was firing). The failed preflight self-cleaned each night (2026-06-17 retry-on-failure) but re-failed identically.
- The two `in_flight`-based alerts stayed blind (preflight aborts pre-metric); `K8sUpgradeChainJobFailed` (warning) surfaced it.
- **Collateral**: the earlier master bump had also dropped apiserver `--authentication-config` (SSO broke); restored separately via the `rbac` stack's `apiserver_oidc_config`.
- **Mitigation**: `phase_preflight` now **skips the kubeadm-plan-target gate when k8s-master is already on TARGET_VERSION** (mirrors the at-target self-skip already in `phase_master`/`phase_worker`). Remaining workers are validated by their own phases; the detector's apt-cache probe already confirmed the target is installable.
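The wedge and its mitigation can be reproduced in a short sketch. This is a simplified model of the gate, not the code in `upgrade-step.sh`; the function names are hypothetical, and only the `kubeadm upgrade apply vX.Y.Z` line quoted above is assumed about the plan output.

```python
import re


def plan_target(plan_output: str) -> str:
    """Extract X.Y.Z from a 'kubeadm upgrade apply vX.Y.Z' line, else ''."""
    match = re.search(r"kubeadm upgrade apply v(\d+\.\d+\.\d+)", plan_output)
    return match.group(1) if match else ""


def plan_gate_passes(master_version: str, target_version: str, plan_output: str) -> bool:
    """Model of the kubeadm-plan-target gate with the partial-resume skip."""
    if master_version.lstrip("v") == target_version:
        # Master already at target: kubeadm has nothing to propose, so the
        # plan output carries no apply line. Skip the gate instead of failing.
        return True
    return plan_target(plan_output) == target_version
```

Without the first branch, a master already on 1.34.9 produces an empty `plan_target`, and the comparison against `1.34.9` fails on every run.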
### 2026-05-11 — Self-preemption (agent → Job-chain rewrite) ### 2026-05-11 — Self-preemption (agent → Job-chain rewrite)
- The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4. - The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4.
- During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself. - During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself.
@ -369,6 +477,8 @@ kill %1
|------|-------| |------|-------|
| Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` | | Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` |
| Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | | Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` |
| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` |
| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` |
| Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` | | Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` |
| Per-node upgrade script | `infra/scripts/update_k8s.sh` | | Per-node upgrade script | `infra/scripts/update_k8s.sh` |
| Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | | Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") |


@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing
`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` `T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen`
Alertmanager → Slack. Alertmanager → Slack.
## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`)
Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00-05:40) drains those markers:
- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle**: `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity. It then verifies pairing and clears the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick.
- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too.
- **Force / preview:**
```bash
sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated)
sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing
```
- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below).
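The idle gate described in the list above reduces to a small predicate. The sketch below is illustrative only (the real check lives in `scripts/t3-migrate-idle.sh` and reads `state.sqlite`); the function name and the idea of passing pre-queried values in as arguments are assumptions.

```python
DEFAULT_QUIET_SECONDS = 900  # mirrors the documented T3_MIGRATE_QUIET_SECONDS default


def is_idle(active_turns, last_activity_epoch, now_epoch,
            quiet_seconds=DEFAULT_QUIET_SECONDS) -> bool:
    """True only when no turn is in flight AND the quiet period has elapsed.

    Fail-closed: a missing or unparseable value means "not idle", so the
    caller skips this user and retries on the next timer tick.
    """
    try:
        if active_turns is None or last_activity_epoch is None:
            return False
        if int(active_turns) != 0:
            return False  # an in-flight turn always blocks the restart
        return int(now_epoch) - int(last_activity_epoch) >= int(quiet_seconds)
    except (TypeError, ValueError):
        return False
```

Returning `False` on any doubt is the design choice that matters here: a wrongly skipped user costs one 20-minute tick, while a wrongly restarted one interrupts live work.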
## Operations ## Operations
**Freeze / revert (stop tracking right now — the fast "make it stop"):** **Freeze / revert (stop tracking right now — the fast "make it stop"):**


@ -1,226 +0,0 @@
# Runbook — TripIt external user self-signup (email + passkey)
Implements ADR-0020 (tripit repo): people outside the homelab self-register to
TripIt with **email + a passkey** (no password), are auto-tagged into the
**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me`
only. Audience: people Viktor knows; open public registration.
> **Safety model.** Containment is two-layered. (1) **Forward-auth apps** — the
> branch in `stacks/authentik/admin-services-restriction.tf` admits `TripIt
> External` to `tripit.viktorbarzin.me` and denies every other `auth="required"`
> host. (2) **OIDC apps** — the branch does NOT cover OIDC (it bypasses
> forward-auth); External users are contained because every sensitive OIDC app
> already requires a trusted group they do not hold (audit below). The no-lockout
> guarantee is that the group is created **empty**, so the new branch matches
> zero existing users on day one.
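The forward-auth layer of that safety model can be modelled as a tiny decision function. This is a plain-Python sketch of the intended behaviour, not the expression in `admin-services-restriction.tf` and not Authentik's policy API; the function name and the three-valued return are hypothetical.

```python
FENCED_GROUP = "TripIt External"
ALLOWED_HOST = "tripit.viktorbarzin.me"


def fence_decision(user_groups, host):
    """Model of the prepended fence branch.

    Returns True/False for members of the fenced group, and None when the
    branch does not apply and the pre-existing rules decide instead.
    """
    if FENCED_GROUP not in user_groups:
        return None  # not an External user: fall through, nothing changes
    return host == ALLOWED_HOST
```

The no-lockout argument is visible in the first branch: while the group is empty, every existing user gets `None`, so the new branch cannot change any current decision.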
## OIDC app authorization audit (2026-06-15, read-only)
A parentless `TripIt External` user holds NONE of these groups, so:
| OIDC app | Requires | External user |
|---|---|---|
| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ |
| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ |
| Headscale | `Headscale Users` | DENIED ✓ |
| wrongmove | `Wrongmove Users` | DENIED ✓ |
| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 |
| Kubernetes, Kubernetes Dashboard | OPEN | harmless — apiserver rejects OIDC tokens (idle) |
| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) |
Vault's JWT `default` role grants only Vault's built-in `default` policy (token
self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a
near-powerless token; Step 3 closes it anyway.
---
## Pre-flight gates (STOP if any fails)
1. **`TripIt External` is net-new / empty** (no-lockout precondition):
```
kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY'
from authentik.core.models import Group
g = Group.objects.filter(name="TripIt External").first()
print("exists:", bool(g), "members:", g.users.count() if g else 0)
PY
```
Expect `exists: False`. If it exists with members → STOP.
2. **Authentik image pin matches live (B5)** — the policy edit auto-applies the
whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade
boot-storm:
```
kubectl -n authentik get deploy -o custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image
```
Every `goauthentik`/`ak-outpost` image tag MUST equal
`stacks/authentik/modules/authentik/values.yaml` `global.image.tag`
(currently `2026.2.4`). If they differ → refresh the pin first.
---
## Step 1 — Terraform (group + fence branch)
Already written on this branch:
- `stacks/authentik/tripit-external.tf` — the empty, parentless group.
- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch.
**Local plan gate (B4 — CI auto-applies on push with `-auto-approve`, so there is
NO human plan review in the apply path; do it here):**
```
vault login -method=oidc
cd stacks/authentik && ../../scripts/tg plan
```
Confirm the plan is **exactly**:
- `+ authentik_group.tripit_external` (create)
- `~ authentik_policy_expression.admin_services_restriction` (update in place — the
`expression` body gains ONLY the new branch; every other line byte-identical)
- **`Plan: 1 to add, 1 to change, 0 to destroy.`**
ABORT if the plan shows any destroy/replace, any `authentik_provider_*` /
`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression
change.
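Because there is no human review in the apply path, the summary-line check can be scripted as a first filter. The sketch below assumes the plan was captured as plain text (no colour codes) and only validates the `Plan:` summary line; it does not inspect individual resources, so the manual check for forbidden resource types above still applies. The function name is hypothetical.

```python
import re


def plan_is_expected(plan_output: str, add: int = 1, change: int = 1, destroy: int = 0) -> bool:
    """True when the plan summary matches the expected add/change/destroy counts.

    A missing summary line (e.g. 'No changes.') counts as unexpected.
    """
    match = re.search(
        r"Plan: (\d+) to add, (\d+) to change, (\d+) to destroy\.", plan_output
    )
    if not match:
        return False
    return tuple(int(count) for count in match.groups()) == (add, change, destroy)
```

Treat a `False` result as an abort signal and read the full plan before doing anything else.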
**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5):
```
~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch"
# push the branch to master (this triggers CI tg apply on the authentik stack)
```
Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy
(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2
IPs; an anonymous request to any `auth=required` host still 302s to Authentik).
The branch is inert (empty group) so no access changes yet.
---
## Step 2 — Authentik SMTP (B1, BLOCKER before any flow)
Email verification is the **entire identity boundary** (TripIt trusts the
Authentik email verbatim). Authentik currently has the **default/unconfigured**
transport (`email.host = localhost`), so verification/recovery mail cannot send.
Add to **both** `server.env` and `worker.env` in
`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret;
the cluster mailserver is what TripIt already relays through —
`mailserver.mailserver.svc`):
```yaml
- { name: AUTHENTIK_EMAIL__HOST, value: "mailserver.mailserver.svc" }
- { name: AUTHENTIK_EMAIL__PORT, value: "587" }
- { name: AUTHENTIK_EMAIL__USE_TLS, value: "true" }
- { name: AUTHENTIK_EMAIL__FROM, value: "noreply@viktorbarzin.me" }
- { name: AUTHENTIK_EMAIL__USERNAME, value: "<relay user>" } # confirm relay creds
- { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: <secret>, key: <key> } } }
```
**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) →
**Send test email**; it must arrive. Then prove enrollment cannot complete for an
address you do NOT control.
---
## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap)
Authentik UI → Applications → **Vault** → bind an authorization policy requiring
group **`Allow Login Users`** (the base group every real homelab user inherits;
parentless `TripIt External` is excluded). This changes nothing for existing
users and denies External users at the Vault consent step.
Verify: an External test account (Step 6) cannot complete Vault OIDC login.
---
## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split)
All three flows: designation as noted, no password stage.
**Flow `tripit-enrollment`** (Enrollment):
| Order | Stage | Key settings |
|---|---|---|
| 5 | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail — see `crowdsec-recaptcha-key-type`) |
| 10 | Identification | email only; **no** `password_stage`; `sources` optional |
| 20 | Email (verification) | activate, blocking — **before** user_write |
| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` |
| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` |
| 50 | User Login | session as default (`weeks=4`) |
**Flow `tripit-login`** (Authentication, passwordless):
Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator
Validation (`device_classes = [webauthn]`, `user_verification = required`) → User
Login. Prefer routing a passkey-less email to recovery over minting a credential.
**Flow `tripit-recovery`** (Recovery):
Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn
authenticator setup → User Login. Notify the account on recovery + new-passkey.
> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows —
> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate.
---
## Step 5 — Surface "Sign up"
Recommended: a **TripIt-scoped** signup link / share-invite rather than a global
login-screen button (narrower bot surface). Enrollment URL:
`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`.
---
## Step 6 — Verification (before/after — "all access keeps working")
Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT
`auth="app"` apps like immich/nextcloud which bypass the catch-all):
`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only).
**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY):
```
COOKIE='authentik_session=<paste for this user>'
for H in tripit family hackmd health terminal; do
  printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" "https://$H.viktorbarzin.me/")"
done
```
```
Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW.
**After Step 1 apply — regression:** re-run identically; both users' results MUST
be unchanged (diff empty).
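
One way to make "diff empty" concrete is to save each run of the matrix to a file per user and compare the files. The helpers below are a sketch with hypothetical names (`verdict`, `matrix_unchanged`) and hypothetical file names. Note that `verdict` treats every 302 as DENY, whereas the rule above only counts a 302 that points at Authentik.

```shell
# verdict CODE  (hypothetical helper)
# Maps an HTTP status to the vocabulary used in this step.
# Assumption: any 302 is a redirect to Authentik.
verdict() {
  case "$1" in
    200)     echo ALLOW ;;
    302|403) echo DENY ;;
    *)       echo "UNKNOWN($1)" ;;
  esac
}

# matrix_unchanged BEFORE AFTER  (hypothetical helper)
# Succeeds only when the two saved captures are byte-identical.
matrix_unchanged() {
  if diff -u "$1" "$2"; then
    echo "no regression"
  else
    echo "REGRESSION: matrix changed"
    return 1
  fi
}

# Usage, with the loop output redirected to a file on each run:
#   matrix_unchanged kadir.before kadir.after
#   matrix_unchanged admin.before admin.after
```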
**After flows — external smoke test (the security proof):** enrol a throwaway
account via the enrollment URL (email verify + passkey). Confirm it is tagged
`TripIt External`, then with its cookie:
```
for H in tripit family hackmd health terminal frigate; do
  printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=<external>" "https://$H.viktorbarzin.me/")"
done
```
Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for
kadir — the contrast is the fence proof). Then:
- **OIDC containment:** with the external account, attempt OIDC login to Vault,
Immich, Forgejo, Grafana → each must be DENIED at the app's own login.
- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`:
`select id,email from tripit.users where email='<throwaway>'`).
- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green.
**Any 200 on a non-tripit host, or any OIDC app admitting the external account →
ROLLBACK.**
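
The host half of that rollback rule can be evaluated mechanically from the smoke-test output. A minimal sketch, assuming the loop output is piped in as `host code` lines; `fence_holds` is a hypothetical name, and the OIDC checks still have to be done by hand.

```shell
# fence_holds  (hypothetical helper, not in the repo)
# Reads "host code" lines on stdin. Succeeds only when tripit returned 200
# and no other host returned 200.
fence_holds() {
  local host code rc=0 seen_tripit=0
  while read -r host code; do
    [ -n "$host" ] || continue
    if [ "$host" = "tripit" ]; then
      seen_tripit=1
      [ "$code" = "200" ] || { echo "FAIL: tripit returned $code"; rc=1; }
    elif [ "$code" = "200" ]; then
      echo "ROLLBACK: $host admitted the external account"
      rc=1
    fi
  done
  [ "$seen_tripit" -eq 1 ] || { echo "FAIL: no tripit row in input"; rc=1; }
  return "$rc"
}

# Usage: pipe the smoke-test loop into it, e.g.
#   for H in ...; do ...; done | fence_holds
```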
---
## Step 7 — Standing regression probe (recommended)
Add a permanent `TripIt External` identity to the `blackbox-exporter` guard
(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on
`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. This converts the
"branch stays first" and "user_write keeps the keystone tag" invariants into
automated `#security` alerts.
---
## Rollback
Revert the `admin-services-restriction.tf` expression (delete the branch) and push
(= apply); removing a prepended `if g: return …` is behaviour-preserving on
non-members, restoring prior authz. Disable/delete the throwaway external account
(with the branch gone, a tagged account falls into default-allow). The empty group
may stay (harmless). Plan-gate the revert too.
## Operational invariants
- `TripIt External` stays **parentless** (never under `Allow Login Users`).
- The fence branch stays **first** in `admin-services-restriction`.
- **Never** co-assign `TripIt External` to a trusted/internal user.
- The `tripit-enrollment` user_write **`create_users_group`** setting is the
keystone — re-verify after any flow edit (clearing it makes UNtagged accounts
that fall into default-allow).
- Authentik SMTP is a live dependency of enrollment + recovery.


@ -8,6 +8,13 @@ users:
 sudo: ALL=(ALL) NOPASSWD:ALL
 ssh_authorized_keys:
 - ${authorized_ssh_key}
+# k8s-upgrade pipeline key (matches Vault secret/k8s-upgrade/ssh_key_pub).
+# The automated k8s-version-upgrade chain SSHes in as `wizard` to drain +
+# upgrade each node; WITHOUT this a freshly-provisioned node is invisible
+# to the upgrade pipeline (node4/5/6 hit exactly this — Permission denied —
+# 2026-06-17). Hardcoded: it's a public key and the keypair is stable; if
+# it's ever rotated, update this line and Vault together.
+- ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIElH9x76UNA8UNxrxTjREYz4hz1fbCdRwAXbOkJ5FnSM k8s-upgrade-pipeline
 passwd: ${passwd}
 lock_passwd: false # enable passwd login
 shell: /bin/bash


@ -107,10 +107,6 @@ variable "custom_content_security_policy" {
 type = string
 default = null
 }
-variable "exclude_crowdsec" {
-type = bool
-default = false
-}
 variable "full_host" {
 type = string
 default = null
@ -310,7 +306,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" {
 "traefik-error-pages@kubernetescrd",
 var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd",
 var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null,
-var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd",
 local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null,
 local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null,
 local.auth_middleware,


@ -0,0 +1,20 @@
[Unit]
Description=Validate and back up Claude OAuth credentials for %i
Documentation=https://github.com/ViktorBarzin/infra/blob/master/docs/runbooks/claude-auth-renew-workstation.md
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
User=%i
Group=%i
Environment=HOME=/home/%i
Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin
ExecStart=/usr/local/bin/claude-auth-sync
# Credential and Vault access are required; keep the remaining host surface narrow.
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=-/home/%i/.claude -/home/%i/.claude.json -/home/%i/.config/claude-auth-sync -/home/%i/.local/state/claude-auth-sync


@ -0,0 +1,12 @@
[Unit]
Description=Keep Claude OAuth credentials valid and recoverable for %i
[Timer]
OnBootSec=10m
OnUnitActiveSec=6h
Persistent=true
RandomizedDelaySec=10m
Unit=claude-auth-sync@%i.service
[Install]
WantedBy=timers.target


@ -27,7 +27,7 @@ KUBECONFIG_PATH="${KUBECONFIG:-${HOME}/.kube/config}"
 [[ -f "$KUBECONFIG_PATH" ]] || KUBECONFIG_PATH="$(pwd)/config"
 KUBECTL=""
 JSON_RESULTS=()
-TOTAL_CHECKS=47
+TOTAL_CHECKS=48
 # Parallel execution settings. Each check function is self-contained — it
 # only reads cluster state and mutates the in-memory counters / JSON_RESULTS
@ -3156,6 +3156,44 @@ PYEOF
 esac
 }
+# --- 48. Goldmane edge-aggregator availability ---
+#
+# The goldmane-edge-aggregator Deployment (ADR-0014 / infra #58) streams Calico
+# Goldmane flows into the goldmane_edges CNPG DB — the durable who-talks-to-whom
+# trail. The pod has NO /metrics endpoint, so its liveness can't be scraped;
+# this check reads the Deployment's Available condition directly so the trail
+# silently dying surfaces in the health board (mirrors the AggregatorDown
+# Prometheus alert). Missing Deployment / not-Available -> FAIL.
+check_goldmane_aggregator() {
+section 48 "Goldmane Edge-Aggregator"
+local ns="goldmane-edge-aggregator" dep="goldmane-edge-aggregator"
+local avail desired ready
+# One get; absent Deployment is a hard fail (the trail isn't deployed).
+if ! $KUBECTL get deploy "$dep" -n "$ns" >/dev/null 2>&1; then
+[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+fail "Deployment $ns/$dep not found — who-talks-to-whom edge trail is not running"
+json_add "goldmane_aggregator" "FAIL" "deployment missing"
+return 0
+fi
+avail=$($KUBECTL get deploy "$dep" -n "$ns" \
+-o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
+ready=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
+desired=$($KUBECTL get deploy "$dep" -n "$ns" -o jsonpath='{.spec.replicas}' 2>/dev/null)
+ready=${ready:-0}
+desired=${desired:-0}
+if [[ "$avail" == "True" ]]; then
+pass "Edge-aggregator Available ($ready/$desired ready)"
+json_add "goldmane_aggregator" "PASS" "${ready}/${desired} ready"
+else
+[[ "$QUIET" == true ]] && section_always 48 "Goldmane Edge-Aggregator"
+fail "Edge-aggregator NOT Available ($ready/$desired ready) — edge trail has stopped recording"
+json_add "goldmane_aggregator" "FAIL" "${ready}/${desired} ready; Available=${avail:-unknown}"
+fi
+}
 # --- Summary ---
 print_summary() {
 if [[ "$JSON" == true ]]; then
@ -3224,7 +3262,7 @@ main() {
 check_monitoring_prom_am check_monitoring_vault check_monitoring_css
 check_external_replicas check_external_divergence check_pve_thermals
 check_pve_load check_external_traefik_5xx check_ha_status_dashboard
-check_immich_search check_csi_ghost_drift
+check_immich_search check_csi_ghost_drift check_goldmane_aggregator
 )
 # Auto-fix mutates cluster state inside individual checks — keep that


@ -21,7 +21,7 @@
 # - canary rollout: restart idle instances ONE AT A TIME, verifying pairing
 # through the real dispatch after each, and roll back (binary + that user's DB)
 # + self-freeze on the first failure — active-agent instances are deferred,
-# never killed;
+# never killed (deferred instances are recorded for t3-migrate-idle to drain);
 # - rollback target is the recorded LAST-GOOD build, not "whatever was installed".
 # Detection backstop (real-user pairing failure/fallback) lives in the dispatch
 # logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*).
@ -29,24 +29,17 @@
 # Full procedure + manual rollback: docs/runbooks/t3-version-bump.md.
 set -uo pipefail
+# ---- autoupdate-specific config (shared config + helpers come from the lib) -----
 T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest)
 T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking)
-FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}"
-STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}"
-LAST_GOOD_FILE="$STATE_DIR/last-good"
-BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}"
 SMOKE_PORT="${T3_SMOKE_PORT:-3799}"
-DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}"
-USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}"
 DRY_RUN="${T3_DRY_RUN:-0}"
 TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it
-LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; }
-ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
-# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
-osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
-# authentik username for an OS user (reverse map; first match) — for dispatch verify.
-ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
+LOG_TAG=t3-autoupdate
+# shellcheck source=scripts/t3-safe-restart.sh
+. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}"
 # is $1 a strictly-newer version than $2 (version-sort)?
 newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }
@ -86,27 +79,21 @@ LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_
 # ---- helpers: backup, health-check, rollback, restart-verify --------------------
 # Online consistent per-user snapshot (run AS the owner so WAL stays owned; never
 # stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health
-# check. Mirrors t3-backup-state.sh.
+# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.)
 ADMIN_SEED=""
 backup_all() {
-local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)"
+local u dst
 for u in $(osusers); do
-src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue
-out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
-install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
-if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
+if dst="$(backup_user "$u")"; then
 LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)"
 [ "$u" = "wizard" ] && ADMIN_SEED="$dst"
 else
-LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst"
+LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)"
 fi
 done
 [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)"
 }
-# newest pre-bump backup taken THIS run for a user (for restore-on-rollback).
-prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
 # health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a
 # real populated DB if given, so the forward migration runs on real data), then do
 # the real mint -> credential-exchange -> t3_session pairing handshake with the
@ -143,27 +130,12 @@ health_check() {
 rm -rf "$dir"; return 1
 }
-# roll the GLOBAL binary back to last-good. Pre-restart failures need only this
-# (no real DB migrated yet); post-restart failures also restore the user's DB.
-rollback_binary() {
-LOG "rolling back binary $target -> $last_good"
-if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
-LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
-}
 # is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those.
 unit_busy() {
 local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)"
 [ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'
 }
-# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
-verify_pairing() {
-local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
-out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
-printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
-}
 # ---- 3. DRY RUN: preview only (install candidate to temp prefix, gate it) -------
 if [ "$DRY_RUN" = "1" ]; then
 LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)"
@ -196,31 +168,15 @@ restarted=0; deferred=0
 for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do
 u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue
 if unit_busy "$unit"; then
-LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue
+LOG "deferring $unit (active agent) — migrates on its next idle restart"
+mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle
+deferred=$((deferred+1)); continue
 fi
-systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
-ok=0
-for _ in $(seq 1 15); do
-if verify_pairing "$u"; then ok=1; break; fi
-sleep 2
-done
-if [ "$ok" = "1" ]; then
-LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1))
+if safe_restart_unit "$unit" "$u"; then
+restarted=$((restarted+1))
+rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker
 else
-LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
-rollback_binary
-bak="$(prebump_of "$u")"
-if [ -n "$bak" ]; then
-systemctl stop "$unit" 2>/dev/null
-if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
-rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
-LOG "restored $u state.sqlite from $bak"
-fi
-systemctl start "$unit" 2>/dev/null
-fi
-touch "$FREEZE_FILE" 2>/dev/null
-LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
-exit 1
+exit 1 # frozen by safe_restart_unit — preserve today's behavior
 fi
 done


@ -0,0 +1,8 @@
[Unit]
Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle
Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md
After=network.target t3-dispatch.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/t3-migrate-idle
